<!--- SPDX-License-Identifier: Apache-2.0 -->

# Overview

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for 8-bit quantization on NNPA. If these steps are not followed, models are still accelerated when targeting Telum systems, using a mixture of 16-bit floating-point numbers for computations mapped to the Telum Integrated AI accelerator and 32-bit floating-point numbers for computations mapped to the Telum CPUs.

There are two approaches to using quantization in the onnx-mlir compiler, depending on the ONNX model given to the compiler:
- The input model is a quantized model that was quantized by another framework such as ONNX Runtime. In this case, the input ONNX model already contains 8-bit operations, and the onnx-mlir compiler selects suitable 8-bit operations to run on NNPA. No special compile flags are needed to enable quantization when compiling such a model, so this case is not discussed further in this document.
  - In this approach, the compiler supports both statically and dynamically quantized models.
- The input model is a non-quantized model, i.e. its operations operate on float32 data types. In this case, the onnx-mlir compiler provides several quantization options to quantize the model during compilation and then run the compiled model on NNPA. The remainder of this document describes this approach.
  - In this approach, the compiler only supports dynamic quantization.

In both approaches, the following constraints apply:
- Only per-tensor quantization is supported, meaning `scale` and `zero_point` are computed per tensor and are scalar values.
- The target quantization data type is 8-bit signed integer.

Quantization requires NNPA in IBM Telum II, meaning that the following compile flags must be specified to enable quantization: `-maccel=NNPA -march=arch15`.
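
As a minimal sketch, a compile invocation with NNPA enabled could look like the following; `model.onnx` is a placeholder for the input model, and only the two flags are prescribed by this document:
```
# `model.onnx` is a placeholder for the input ONNX model.
onnx-mlir -maccel=NNPA -march=arch15 model.onnx
```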

# Dynamic quantization by the compiler

Again, it is important to note that the onnx-mlir compiler currently:
- supports per-tensor dynamic quantization, and
- quantizes data tensors from float32 to 8-bit signed integer. If a data tensor in the input model is already in 8-bit signed integer, the compiler will not quantize it again.

The compiler provides two compile flags for dynamically quantizing a model at compile time:
- `--nnpa-quant-dynamic` to enable dynamic quantization.
- `--nnpa-quant-op-types` to manually specify the types of ONNX operations to quantize, e.g. `MatMul,Conv`.

Users can specify whether or not to symmetrize data for activations and weights by passing the values `symActivation`, `asymActivation`, `symWeight`, and `asymWeight` to `--nnpa-quant-dynamic`.
For example, to use asymmetric quantization for activations and symmetric quantization for weights, one can use `--nnpa-quant-dynamic=asymActivation,symWeight`.

By specifying `--nnpa-quant-dynamic` alone, the compiler will choose the quantization options and operation types by itself.
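
Putting the flags together, a hedged example that quantizes `MatMul` operations with asymmetric activations and symmetric weights could look like the following; `model.onnx` is again a placeholder, and the flag values are the ones discussed above:
```
# `model.onnx` is a placeholder for the input ONNX model.
onnx-mlir -maccel=NNPA -march=arch15 \
  --nnpa-quant-dynamic=asymActivation,symWeight \
  --nnpa-quant-op-types=MatMul \
  model.onnx
```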

## Computing `scale` and `zero_point`

The compiler uses the following equations to compute `scale` and `zero_point` for 8-bit signed-integer quantization.

Asymmetric quantization
```
scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
zero_point = cast(round(saturate(qmin - min(x)/scale)))
```
where
- `x` is the input tensor to quantize,
- the data range is adjusted to include 0,
- `qmax=127` and `qmin=-128` are the maximum and minimum values of the quantization range, and
- `saturate` saturates its argument to `[-128, 127]`.
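
The following Python/NumPy snippet is a small illustration of these equations; it is not taken from the compiler sources, and the helper name and sample tensor are made up for this example.
```
import numpy as np

QMIN, QMAX = -128, 127  # 8-bit signed-integer quantization range

def asym_scale_zero_point(x):
    # Illustrative only: adjust the data range to include 0, as described above.
    rmin = min(0.0, float(x.min()))
    rmax = max(0.0, float(x.max()))
    scale = (rmax - rmin) / (QMAX - QMIN)
    # zero_point = cast(round(saturate(qmin - min(x)/scale)))
    zp = np.clip(QMIN - rmin / scale, QMIN, QMAX)
    return scale, np.int8(np.round(zp))

x = np.array([-0.5, 0.2, 1.3, 2.7], dtype=np.float32)
scale, zero_point = asym_scale_zero_point(x)  # scale ~= 0.0125, zero_point = -88
```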

Symmetric quantization
```
scale = max(abs(x)) / 127
zero_point = 0
```
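
A corresponding sketch for the symmetric case, again only an illustration rather than compiler code:
```
import numpy as np

def sym_scale_zero_point(x):
    # Illustrative only: scale = max(abs(x)) / 127, zero_point fixed at 0.
    scale = float(np.abs(x).max()) / 127.0
    return scale, np.int8(0)

x = np.array([-0.5, 0.2, 1.3, 2.7], dtype=np.float32)
scale, zero_point = sym_scale_zero_point(x)  # scale = 2.7 / 127, zero_point = 0
```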

Given `scale` and `zero_point`, the input `x` is quantized to
```
quantized_x = x/scale + zero_point
```
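
To make the mapping concrete, a small illustrative helper is shown below. The equation above gives the real-valued grid position; rounding and saturating to `[-128, 127]` before casting to int8 is an assumption about the final step for this sketch, not a statement about the compiler's exact implementation.
```
import numpy as np

def quantize(x, scale, zero_point):
    # quantized_x = x/scale + zero_point; rounding and saturating to int8 are
    # assumed final steps for this illustration.
    q = np.round(x / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

x = np.array([-0.5, 0.2, 1.3, 2.7], dtype=np.float32)
q = quantize(x, scale=2.7 / 127, zero_point=0)  # -> [-24, 9, 61, 127]
```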

# Performance notes

Symmetric quantization often leads to better inference performance but poorer accuracy than asymmetric quantization.
Users may want to experiment with different quantization schemes to find the best combination for their own model.

# Resources
- [A visual guide to quantization](https://www.maartengrootendorst.com/blog/quantization/)