Merge with develop WIP
daniil-lyakhov committed Nov 17, 2023
2 parents 3c0e659 + b6910ea commit 2d9e939
Showing 144 changed files with 1,680,117 additions and 3,911 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/post_pr_merge.yml
@@ -24,6 +24,7 @@ jobs:
merge_commit_sha: ${{ github.event.pull_request.merge_commit_sha }}
last_sha_in_pr: ${{ github.event.pull_request.head.sha }}
coverage_artifact_name_in_pr: coverage_common
coverage_flags: COMMON
secrets:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
upload-coverage-onnx:
@@ -33,5 +34,6 @@ jobs:
merge_commit_sha: ${{ github.event.pull_request.merge_commit_sha }}
last_sha_in_pr: ${{ github.event.pull_request.head.sha }}
coverage_artifact_name_in_pr: coverage_onnx
coverage_flags: ONNX
secrets:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
2 changes: 2 additions & 0 deletions .github/workflows/precommit.yml
@@ -33,6 +33,7 @@ jobs:
with:
token: ${{ secrets.CODECOV_TOKEN }}
name: coverage_common
flags: COMMON
onnx:
runs-on: ubuntu-20.04
steps:
@@ -58,4 +59,5 @@ jobs:
with:
token: ${{ secrets.CODECOV_TOKEN }}
name: coverage_onnx
flags: ONNX

5 changes: 4 additions & 1 deletion .github/workflows/upload_coverage_for_develop.yml
@@ -12,6 +12,9 @@ on:
coverage_artifact_name_in_pr:
required: true
type: string
coverage_flags:
required: true
type: string
secrets:
CODECOV_TOKEN:
required: true
@@ -37,4 +40,4 @@ jobs:
# github.event.pull_request.merge_commit_sha is the fresh commit in the develop,
# provided that github.event.pull_request.merged == true
./codecov -f ./coverage.xml -t ${{ secrets.CODECOV_TOKEN }} -C ${{ inputs.merge_commit_sha }} -B develop -n "${{ inputs.coverage_artifact_name_in_pr }}"
./codecov -f ./coverage.xml -t ${{ secrets.CODECOV_TOKEN }} -F ${{ inputs.coverage_flags }} -C ${{ inputs.merge_commit_sha }} -B develop -n "${{ inputs.coverage_artifact_name_in_pr }}"
1 change: 1 addition & 0 deletions Makefile
@@ -50,6 +50,7 @@ test-examples-onnx:
install-openvino-test:
pip install -U pip
pip install -e .[openvino]
pip install tensorflow==2.12.0
pip install -r tests/openvino/requirements.txt
pip install -r tests/cross_fw/install/requirements.txt
pip install -r tests/cross_fw/examples/requirements.txt
47 changes: 47 additions & 0 deletions ReleaseNotes.md
@@ -1,5 +1,52 @@
# Release Notes

## New in Release 2.7.0

Post-training Quantization:

- Features:
- (OpenVINO) Added support for data-free 4-bit weights compression through NF4 and INT4 data types (`compress_weights(…)` pipeline).
- (OpenVINO) Added support for [IF operation](https://docs.openvino.ai/latest/openvino_docs_ops_infrastructure_If_8.html) quantization.
- (OpenVINO) Added `dump_intermediate_model` parameter support for AccuracyAwareAlgorithm (`quantize_with_accuracy_control(…)` pipeline).
- (OpenVINO) Added support for SmoothQuant and ChannelAlignment algorithms for HyperparameterTuner algorithm (`quantize_with_tune_hyperparams(…)` pipeline).
- (PyTorch) Post-training Quantization is now supported with the `quantize(…)` pipeline and the common implementation of quantization algorithms. The `create_compressed_model()` method is deprecated for Post-training Quantization. A minimal usage sketch of the new pipeline is shown after this list.
- Added new types (AvgPool, GroupNorm, LayerNorm) to the ignored scope for `ModelType.Transformer` scheme.
- `QuantizationPreset.Mixed` was set as the default for `ModelType.Transformer` scheme.
- Fixes:
- (OpenVINO, ONNX, PyTorch) Aligned/added patterns between backends (SE block, MVN layer, multiple activations, etc.) to restore performance/metrics.
- Fixed patterns for `ModelType.Transformer` to align with the [quantization scheme](https://docs.openvino.ai/latest/openvino_docs_OV_UG_lpt.html).
- Improvements:
- Improved UX with the new progress bar for pipeline, new exceptions, and .dot graph visualization updates.
- (OpenVINO) Optimized WeightsCompression algorithm (`compress_weights(…)` pipeline) execution time for LLM's quantization, added ignored scope support.
- (OpenVINO) Optimized AccuracyAwareQuantization algorithm execution time with multi-threaded approach while calculating ranking score (`quantize_with_accuracy_control(…)` pipeline).
- (OpenVINO) Added [extract_ov_subgraph tool](tools/extract_ov_subgraph.py) for large IR subgraph extraction.
- (ONNX) Optimized quantization pipeline (up to 1.15x speed up).
- Tutorials:
- [Post-Training Optimization of BLIP Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/233-blip-visual-language-processing)
- [Post-Training Optimization of DeepFloyd IF Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/238-deepfloyd-if)
- [Post-Training Optimization of Grammatical Error Correction Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/214-grammar-correction)
- [Post-Training Optimization of Dolly 2.0 Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/240-dolly-2-instruction-following)
- [Post-Training Optimization of Massively Multilingual Speech Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/255-mms-massively-multilingual-speech)
- [Post-Training Optimization of OneFormer Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/249-oneformer-segmentation)
- [Post-Training Optimization of InstructPix2Pix Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/231-instruct-pix2pix-image-editing)
- [Post-Training Optimization of LLaVA Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/257-llava-multimodal-chatbot)
- [Post-Training Optimization of Latent Consistency Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/263-latent-consistency-models-image-generation)
- [Post-Training Optimization of Distil-Whisper Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/267-distil-whisper-asr)
- [Post-Training Optimization of FastSAM Model](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/261-fast-segment-anything)
- Known issues:
- (ONNX) `quantize(...)` method can generate inaccurate int8 results for models with the BatchNormalization layer that contains biases. To get the best accuracy, use the `do_constant_folding=True` option during export from PyTorch to ONNX.
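
Below is a minimal sketch of the PyTorch post-training quantization flow referenced in the Features list above, assuming the public `nncf.Dataset`/`nncf.quantize` API. The model, calibration data, and transform function here are placeholders chosen for illustration and are not part of the release itself.

```python
import nncf
import torch
import torchvision

# Placeholder model and calibration data; replace with your own model and a representative dataset.
model = torchvision.models.resnet18(weights="DEFAULT").eval()
calibration_loader = torch.utils.data.DataLoader(
    torchvision.datasets.FakeData(size=300, transform=torchvision.transforms.ToTensor()),
    batch_size=1,
)

def transform_fn(data_item):
    images, _ = data_item  # keep only the model inputs, drop the labels
    return images

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)

# Post-training quantization via the common quantize(...) pipeline.
quantized_model = nncf.quantize(model, calibration_dataset)
```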

Compression-aware training:

- Fixes:
- (PyTorch) Fixed Hessian trace calculation to solve [#2155](https://github.com/openvinotoolkit/nncf/issues/2155) issue.
- Requirements:
- Updated PyTorch version (2.1.0).
- Updated numpy version (<1.27).
- Deprecations/Removals:
- (PyTorch) Removed legacy external quantizer storage names.
- (PyTorch) Removed torch < 2.0 version support.

## New in Release 2.6.0

Post-training Quantization:
3 changes: 2 additions & 1 deletion docs/Installation.md
@@ -69,7 +69,8 @@ as well as the supported versions of Python:

| NNCF | OpenVINO | PyTorch | ONNX | TensorFlow | Python |
|-----------|------------|----------|----------|------------|--------|
| `develop` | `2023.1.0` | `2.1` | `1.13.1` | `2.12.0` | `3.8` |
| `develop` | `2023.2.0` | `2.1` | `1.13.1` | `2.12.0` | `3.8` |
| `2.7.0` | `2023.2.0` | `2.1` | `1.13.1` | `2.12.0` | `3.8` |
| `2.6.0` | `2023.1.0` | `2.0.1` | `1.13.1` | `2.12.0` | `3.8` |
| `2.5.0` | `2023.0.0` | `1.13.1` | `1.13.1` | `2.11.1` | `3.8` |
| `2.4.0` | `2022.1.0` | `1.12.1` | `1.12.0` | `2.8.2` | `3.8` |
203 changes: 191 additions & 12 deletions docs/compression_algorithms/CompressWeights.md
@@ -6,14 +6,13 @@

The Weights Compression algorithm is aimed at compressing the weights of the models and can be used to optimize the model footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLM). The algorithm compresses weights only for Linear and Embedding layers.

##### INT8 and NF4 modes
#### Supported modes

By default, weights are compressed to 8-bit integer data type - "INT8" mode.
OpenVINO backend has also an experimental support for "NF4" mode - compression to [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type.
It goes with a grouped quantization, when small group of weights (e.g. 128) in the channel dimension share quantization parameters (scale).
First embedding and last linear layers are always compressed to 8-bit integer data type in the "NF4" mode.
Percent of the rest layers compressed to NF4 can be configured by "ratio" parameter.
E.g. ratio=0.9 means 90% of layers compressed to nf4 and the rest to 8-bit integer data type.
The OpenVINO backend also supports 3 modes of mixed-precision weight quantization with a 4-bit data type as the primary precision: INT4_SYM, INT4_ASYM and NF4. In INT4_SYM mode, the primary precision is an unsigned 4-bit integer and weights are quantized to it [symmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) with a fixed zero point equal to 8. In INT4_ASYM mode, the primary precision is also an unsigned 4-bit integer, but weights are quantized to it [asymmetrically](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization) with a typical, non-fixed zero point. In NF4 mode, weights are quantized to the [nf4](https://arxiv.org/pdf/2305.14314v1.pdf) data type without a zero point.
All 4-bit modes support grouped quantization, where a small group of weights (e.g. 128) in the channel dimension shares quantization parameters (scale).
The first embedding and the last linear layers are always compressed to the 8-bit integer data type.
The share of the remaining layers compressed to 4-bit can be configured with the "ratio" parameter, e.g. ratio=0.9 means that 90% of layers are compressed to the corresponding 4-bit data type and the rest to the 8-bit integer data type.
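
To make the difference between the 4-bit schemes concrete, here is a small NumPy sketch of per-group symmetric INT4 quantization (fixed zero point of 8) and asymmetric INT4 quantization (data-derived zero point), following the description above. This is an illustration only, not how NNCF implements the algorithm internally, and the function names are invented for the example.

```python
import numpy as np

def int4_sym_quantize(weights: np.ndarray, group_size: int = 128):
    """Illustrative per-group symmetric INT4 quantization with a fixed zero point of 8."""
    w = weights.reshape(-1, group_size)                    # each row is one group sharing a scale
    scale = np.abs(w).max(axis=1, keepdims=True) / 7
    scale = np.maximum(scale, np.finfo(np.float32).tiny)   # guard against all-zero groups
    q = np.clip(np.round(w / scale) + 8, 0, 15)            # unsigned 4-bit codes in [0, 15]
    return q.astype(np.uint8), scale

def int4_asym_quantize(weights: np.ndarray, group_size: int = 128):
    """Illustrative per-group asymmetric INT4 quantization with a data-derived zero point."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / 15, np.finfo(np.float32).tiny)
    zero_point = np.clip(np.round(-w_min / scale), 0, 15)
    q = np.clip(np.round(w / scale) + zero_point, 0, 15)
    return q.astype(np.uint8), scale, zero_point

# Dequantization for both schemes is (q - zero_point) * scale, with zero_point == 8 in the symmetric case.
weights = np.random.randn(4096, 4096).astype(np.float32)
q_sym, s_sym = int4_sym_quantize(weights)
q_asym, s_asym, zp_asym = int4_asym_quantize(weights)
```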

#### User guide

@@ -24,26 +23,206 @@ from nncf import compress_weights
compressed_model = compress_weights(model)
```

- Compress weights to nf4 data type with group size = 128, except first embedding and last linear layers - they are compressed to 8-bit integer data type.
- Compress weights symmetrically to the 4-bit integer data type with group size = 128, except the first embedding and last linear layers, which are compressed to the 8-bit integer data type.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM)
```

- Compress weights of 90% of layers to nf4 with the group size 64, and the rest of layers to 8-bit integer data type.
- Generally, `INT4_SYM` mode is the fastest mixed-precision mode, but it may lead to a significant accuracy degradation or perplexity increase.
Compressing weights asymmetrically (`INT4_ASYM` mode) is a way to increase accuracy; however, it in turn slows down inference a bit.
If the accuracy or perplexity is still not satisfactory, there are 2 more hyper-parameters to tune: `group_size` and `ratio`.
A lower group size and a smaller ratio of 4-bit layers usually improve accuracy at the cost of inference speed.
Below is an example of how to compress the weights of 90% of layers to the 4-bit integer asymmetrically with group size 64, and
the rest of the layers to the 8-bit integer data type. The same parametrization is applicable for `INT4_SYM` mode.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4, group_size=64, ratio=0.9)
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9)
```

##### Limitations
- `NF4` mode can be considered for improving accuracy, but models quantized to nf4 currently should not be expected to be faster than models
quantized to the 8-bit integer data type. Here is an example of how to compress weights to the nf4 data type with group size = 128.
Different `group_size` and `ratio` values are also supported.

```python
from nncf import compress_weights
from nncf import CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
```

#### Evaluation results

Here are the perplexity and model size before and after weight compression for different language models on the [Lambada OpenAI dataset](https://github.com/openai/gpt-2/issues/131#issuecomment-497136199).
`g32` refers to a group size equal to 32, `r60` to a ratio equal to 0.6.
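
As an illustration of this notation, a row such as `int4_asym_g64_r80` corresponds roughly to the following call (a hedged mapping based on the parameters described above, not an exact reproduction of the evaluation setup; `model` is assumed to be loaded as in the snippets above).

```python
from nncf import compress_weights
from nncf import CompressWeightsMode

# int4_asym_g64_r80: asymmetric 4-bit primary precision, group size 64, ~80% of layers in 4-bit.
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.8)
```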

<table>
<thead>
<tr>
<th class="tg-0pky">Model</th>
<th class="tg-0pky">Mode</th>
<th class="tg-0pky">Perplexity</th>
<th class="tg-0pky">Perplexity <br>Increase</th>
<th class="tg-0pky">Model Size <br>(Gb)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">fp32</td>
<td class="tg-0pky">5.01</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">10.3</td>
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">5.07</td>
<td class="tg-0pky">0.05</td>
<td class="tg-0pky">2.6</td>
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">int4_asym_g32_r50</td>
<td class="tg-0pky">5.28</td>
<td class="tg-0pky">0.26</td>
<td class="tg-0pky">2.2</td>
</tr>
<tr>
<td class="tg-0pky">databricks/dolly-v2-3b</td>
<td class="tg-0pky">nf4_g128_r60</td>
<td class="tg-0pky">5.19</td>
<td class="tg-0pky">0.18</td>
<td class="tg-0pky">1.9</td>
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">fp32</td>
<td class="tg-0pky">4.25</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">24.8</td>
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">4.27</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.2</td>
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">int4_asym_g64_r80</td>
<td class="tg-0pky">4.32</td>
<td class="tg-0pky">0.07</td>
<td class="tg-0pky">4.1</td>
</tr>
<tr>
<td class="tg-0pky">facebook/opt-6.7b</td>
<td class="tg-0pky">nf4_g64</td>
<td class="tg-0pky">4.35</td>
<td class="tg-0pky">0.1</td>
<td class="tg-0pky">3.6</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">fp32</td>
<td class="tg-0pky">3.28</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">25.1</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">3.29</td>
<td class="tg-0pky">0.01</td>
<td class="tg-0pky">6.3</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">int4_asym_g128_r80</td>
<td class="tg-0pky">3.41</td>
<td class="tg-0pky">0.14</td>
<td class="tg-0pky">4.0</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-7b-chat-hf</td>
<td class="tg-0pky">nf4_g128</td>
<td class="tg-0pky">3.41</td>
<td class="tg-0pky">0.13</td>
<td class="tg-0pky">3.5</td>
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">fp32</td>
<td class="tg-0pky">4.15</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">25.6</td>
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">6.4</td>
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">nf4_ov_g32_r60</td>
<td class="tg-0pky">4.28</td>
<td class="tg-0pky">0.13</td>
<td class="tg-0pky">5.1</td>
</tr>
<tr>
<td class="tg-0pky">togethercomputer/RedPajama-INCITE-7B-Instruct</td>
<td class="tg-0pky">int4_asym_g128</td>
<td class="tg-0pky">4.17</td>
<td class="tg-0pky">0.02</td>
<td class="tg-0pky">3.6</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">fp32</td>
<td class="tg-0pky">2.92</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">48.5</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">int8</td>
<td class="tg-0pky">2.91</td>
<td class="tg-0pky">0</td>
<td class="tg-0pky">12.1</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">int4_sym_g64_r80</td>
<td class="tg-0pky">2.98</td>
<td class="tg-0pky">0.06</td>
<td class="tg-0pky">8.0</td>
</tr>
<tr>
<td class="tg-0pky">meta-llama/Llama-2-13b-chat-hf</td>
<td class="tg-0pky">nf4_g128</td>
<td class="tg-0pky">2.95</td>
<td class="tg-0pky">0.04</td>
<td class="tg-0pky">6.6</td>
</tr>
</tbody>
</table>

#### Limitations

- The algorithm is supported for OpenVINO and PyTorch models.
- The compression applies in-place.
- The compressed model is not trainable.
- NF4 mode, grouped quantization and mixed nf4-int8 precision selection is available for OpenVINO backend only.
- The INT4_SYM, INT4_ASYM and NF4 modes, grouped quantization and mixed-precision selection are available for the OpenVINO backend only.
- NF4 support is experimental - models quantized to nf4 should not be expected to be faster than models quantized to the 8-bit integer data type.

#### Additional resources

- [LLM Weight Compression](https://docs.openvino.ai/nightly/weight_compression.html)
- [Optimize and Deploy Generative AI Models using Hugging Face Optimum Intel](https://docs.openvino.ai/nightly/gen_ai_guide.html)
- [Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/inference)