v2.8.0
Post-training Quantization:
Breaking changes:
- `nncf.quantize` signature has been changed to add `mode: Optional[nncf.QuantizationMode] = None` as its 3rd argument, between the original `calibration_dataset` and `preset` arguments (see the sketch after this list).
- (Common) `nncf.common.quantization.structs.QuantizationMode` has been renamed to `nncf.common.quantization.structs.QuantizationScheme`.
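A minimal sketch of the updated call, assuming NNCF 2.8.0 and a toy PyTorch model (the model and calibration data below are illustrative, not part of the release); because `mode` now sits third, `preset` and later arguments are safest passed as keywords:

```python
import torch
import nncf

# Toy model and calibration data, purely illustrative.
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())
calibration_dataset = nncf.Dataset([torch.randn(1, 8) for _ in range(10)])

# `mode` is the new optional 3rd parameter; positional calls that previously
# passed `preset` third should switch to keyword arguments.
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    mode=None,  # None keeps the default INT8 quantization scheme
    preset=nncf.QuantizationPreset.PERFORMANCE,
)
```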
General:
- (OpenVINO) Changed default OpenVINO opset from 9 to 13.
Features:
- (OpenVINO) Added 4-bit data-aware weights compression. For that, an optional `dataset` parameter has been added to `nncf.compress_weights()` and can be used to minimize accuracy degradation of compressed models (note that this option increases the compression time); see the first sketch after this list.
- (PyTorch) Added support for PyTorch models with shared weights and custom PyTorch modules in `nncf.compress_weights()`. The weights compression algorithm for PyTorch models is now based on tracing the model graph. The `dataset` parameter is now required in `nncf.compress_weights()` for the compression of PyTorch models.
- (Common) Renamed `nncf.CompressWeightsMode.INT8` to `nncf.CompressWeightsMode.INT8_ASYM` and introduced `nncf.CompressWeightsMode.INT8_SYM`, which can be efficiently used with dynamic 8-bit quantization of activations. The original `nncf.CompressWeightsMode.INT8` enum value is now deprecated.
- (OpenVINO) Added support for quantizing the ScaledDotProductAttention operation from OpenVINO opset 13.
- (OpenVINO) Added FP8 quantization support via the `nncf.QuantizationMode.FP8_E4M3` and `nncf.QuantizationMode.FP8_E5M2` enum values, invoked by passing one of these values as the optional `mode` argument to `nncf.quantize`; see the second sketch after this list. Currently, OpenVINO supports inference of FP8-quantized models in reference mode only, with no performance benefits, so it can be used for accuracy projections.
- (Common) Post-training Quantization with Accuracy Control: `nncf.quantize_with_accuracy_control()` has been extended with an optional `restore_mode` parameter to revert weights to int8 instead of the original precision. This parameter helps to reduce the size of the quantized model and improves its performance. By default it is disabled, and model weights are reverted to the original precision in `nncf.quantize_with_accuracy_control()`.
- (Common) Added an `all_layers: Optional[bool] = None` argument to `nncf.compress_weights` to indicate whether the embeddings and last layers of the model should be compressed to the primary precision. This is relevant to 4-bit quantization only.
- (Common) Added a `sensitivity_metric: Optional[nncf.parameters.SensitivityMetric] = None` argument to `nncf.compress_weights` for finer control over the sensitivity metric used to assign quantization precision to layers. It defaults to the weight quantization error if a dataset is not provided for weight compression, and to the maximum variance of the layers' inputs multiplied by the inverted 8-bit quantization noise if a dataset is provided. By default, the backup precision is assigned to the embeddings and last layers.
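A minimal sketch of data-aware 4-bit weight compression with the new `dataset` parameter, assuming an OpenVINO IR of an LLM is already on disk; the model path, the placeholder calibration list, and the choice of `INT4_SYM` / `MAX_ACTIVATION_VARIANCE` are illustrative assumptions, not prescribed by this release:

```python
import nncf
import openvino as ov

# Load an OpenVINO IR to compress (the path is an illustrative assumption).
model = ov.Core().read_model("llm/openvino_model.xml")

# Calibration data for the data-aware algorithm; in practice each item would be
# a dict of tokenized model inputs (placeholder left hypothetical here).
calibration_dataset = nncf.Dataset([...])

# Data-aware 4-bit compression: the dataset drives the sensitivity metric, and
# embeddings/last layers stay in the backup precision unless all_layers=True.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration_dataset,
    all_layers=False,
    sensitivity_metric=nncf.parameters.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
)
ov.save_model(compressed_model, "llm/openvino_model_int4.xml")
```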
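And a sketch of FP8 quantization through the new `mode` argument, assuming an OpenVINO model and an `nncf.Dataset` of calibration samples (`ov_model` and `calibration_dataset` below) are already prepared elsewhere; remember that FP8 inference currently runs in reference mode only:

```python
import nncf

# `ov_model` and `calibration_dataset` are assumed to be prepared elsewhere.
# FP8 E4M3 quantization; nncf.QuantizationMode.FP8_E5M2 is selected the same way.
fp8_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    mode=nncf.QuantizationMode.FP8_E4M3,
)
```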
Fixes:
- (OpenVINO) Models with embeddings (e.g. `gpt-2`, `stable-diffusion-v1-5`, `stable-diffusion-v2-1`, `opt-6.7b`, `falcon-7b`, `bloomz-7b1`) are now more accurately quantized.
- (PyTorch) `nncf.strip(..., do_copy=True)` now actually returns a deepcopy (stripped) of the model object.
- (PyTorch) Post-hooks can now be set up on operations that return `torch.return_type` (such as `torch.max`).
- (PyTorch) Improved dynamic graph tracing for various tensor operations from the `torch` namespace.
- (PyTorch) More robust handling of models with disjoint traced graphs when applying PTQ.
Improvements:
- Reformatted the tutorials section in the top-level `README.md` for better readability.
Deprecations/Removals:
- (Common) The original `nncf.CompressWeightsMode.INT8` enum value is now deprecated.
- (PyTorch) The Git patch for integration with the HuggingFace `transformers` repository is marked as deprecated and will be removed in a future release. Developers are advised to use optimum-intel instead.
- Dockerfiles in the NNCF Git repository are deprecated and will be removed in a future release.