EQuant provides a collection of algorithms to improve post-training quantization (PTQ) accuracy and extends the set of PyTorch observers and quantizers with new ones. It also implements basic fusion methods and extends the PyTorch backend with new fusion recipes. All quantized models support quantization-aware training (QAT) mode.
pip install -e .
from equant import generate_qconfig_mapping, quantize, convert
# define quantization recipe, for more details see below
qconfig = [
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'quantile',
            'observer_kwargs': {
                'quantile': 0.99999
            },
            'quantizer': 'lsq'
        },
        'layers': ['*'] # quantize all layers
    }
]
# convert qconfig to PyTorch format
qconfig_mapping = generate_qconfig_mapping(model, qconfig)
# convert model to fake-quantized mode
qmodel = quantize(model, qconfig_mapping, example_inputs)
# calibrate model
for data in dataloader:
_ = qmodel(data)
# convert a fake-quantized model to the quantized model
model_int = convert(qmodel)
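As noted above, the fake-quantized model also supports QAT. A minimal fine-tuning sketch before conversion, assuming a user-defined loss_fn and train_dataloader (this is a generic PyTorch training loop, not an EQuant-specific API):

import torch

# fine-tune the fake-quantized model; learnable quantizers (e.g. lsq) are
# updated together with the weights
optimizer = torch.optim.SGD(qmodel.parameters(), lr=1e-4)
qmodel.train()
for data, target in train_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(qmodel(data), target)
    loss.backward()
    optimizer.step()

# then convert as above
model_int = convert(qmodel)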
QConfig is a list of dictionaries where each dictionary contains its own quantization recipe for specific layers:
qconfig = [
    # scheme 1
    {
        'weight': {
            # quantization recipe for weights
        },
        'activation': {
            # quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    },
    # scheme 2
    {
        'weight': {
            # another quantization recipe for weights
        },
        'activation': {
            # another quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    },
    # scheme 3
    ...
    # scheme n
]
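For illustration, a hypothetical two-scheme configuration: the module-name patterns 'features*' and 'classifier*' are assumptions about the model being quantized, wildcard matching follows the '*' pattern used in the quick start, and the recipe fields are described below.

qconfig = [
    # scheme 1: per-channel weights for the backbone
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'fixed_qparams'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'moving_average',
            'quantizer': 'fixed_qparams'
        },
        'layers': ['features*']
    },
    # scheme 2: a different recipe for the classifier head
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_tensor_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'histogram',
            'quantizer': 'lsq'
        },
        'layers': ['classifier*']
    }
]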
Each quantization recipe (for both weights and activations) contains information about:
- Data type — in the format [s/u][n_bits], where s stands for a signed data type and u for an unsigned one; in special cases n_bits may be a non-integer. For instance, s7.7, s5.9 and u6 are all valid data types.
- Quantization scheme — one of the following:
  - per_tensor_symmetric
  - per_tensor_affine
  - per_channel_symmetric
  - per_channel_affine
- Quantizer — one of the following:
  - fixed_qparams — scales and offsets are frozen
  - lsq — enables learnable scales, based on Learned Step Size Quantization
  - lsq+ — enables learnable scales and offsets, based on LSQ+: Improving low-bit quantization through learnable offsets and better initialization
- Observer — one of the following:
  - min_max
  - moving_average
  - quantile
  - mse
  - histogram (supports only per-tensor granularity)
- Observer parameters (observer_kwargs) — keyword arguments passed to the chosen observer, as in the quantile example above
Note: when using the mse observer, it is highly recommended to use as large a batch size as possible during the calibration phase.
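For illustration, a single weight recipe combining the fields above; the particular choices (a fractional s7.7 data type, the mse observer, learnable scales) are only examples, and whether a given backend can execute fractional bit-widths is not addressed here:

weight_recipe = {
    'dtype': 's7.7',                     # fractional bit-width, listed as valid above
    'qscheme': 'per_channel_symmetric',
    'observer': 'mse',                   # per the note above, prefer large calibration batches
    'quantizer': 'lsq'
}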
QConfigMapping is the PyTorch format for storing a quantization recipe. It can also be saved in YAML format so that it does not have to be regenerated every time, and so that you can check that it was generated correctly and contains the right information:
# generate qconfig mapping only once...
qconfig_mapping = generate_qconfig_mapping(...)
# ...and save it
qconfig_mapping.save('qconfig.yaml')
# then create it from existing configuration
qconfig_mapping = QConfigMapping.from_file('qconfig.yaml')
You are free to edit the configuration file, as long as the edited values remain valid.
There may be a large accuracy drop after the PTQ stage, so numerous methods exist to improve the quality of a quantized model. EQuant provides implementations for some of them:
- Cross-Layer Equalization
from equant.algorithms import cross_layer_equalization
model = cross_layer_equalization(model)
- Smooth Quant
from equant.algorithms import smooth_quant
qmodel = smooth_quant(qmodel, dataloader)
- Bias Correction
from equant.algorithms import bias_correction
qmodel = bias_correction(qmodel, dataloader)
- AdaRound
from equant.algorithms import adaround
qmodel = adaround(qmodel, dataloader)
Cross-Layer Equalization migrates quantization difficulty between two consecutive linear layers. For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
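A sketch of the idea, in the cited paper's notation rather than EQuant's internals: for two consecutive layers with weights $W^{(1)}$ and $W^{(2)}$ (with a ReLU-like activation in between, for which $f(Sx) = S f(x)$ holds for a positive diagonal $S$), the network output is preserved under the per-channel rescaling

$$\widehat{W}^{(1)} = S^{-1} W^{(1)}, \qquad \widehat{b}^{(1)} = S^{-1} b^{(1)}, \qquad \widehat{W}^{(2)} = W^{(2)} S, \qquad S = \mathrm{diag}(s),$$

and the paper chooses $s_i = \frac{1}{r_i^{(2)}} \sqrt{r_i^{(1)} r_i^{(2)}}$, where $r_i^{(j)}$ is the range of channel $i$ in $W^{(j)}$, so that both layers end up with similar per-channel ranges and quantize more accurately.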
Smooth Quant migrates quantization difficulty between activations and weights. For more details see SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
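A sketch of the underlying transformation from the cited paper (notation is the paper's, not EQuant's): for a linear layer $Y = XW$, a per-channel smoothing factor $s$ shifts difficulty from activations to weights while keeping the output unchanged,

$$Y = \left(X \,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\, W\right), \qquad s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}},$$

where $\alpha$ controls how much difficulty is migrated (the paper uses $\alpha = 0.5$ as a default).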
Bias Correction reduces the quantization error by correcting the bias with the difference between the expected outputs of the full-precision and quantized models. For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
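A sketch of that correction (our notation, not EQuant's API): writing the quantized weights as $\widetilde{W} = W + \epsilon$, the expected output shifts by $\mathbb{E}[\epsilon x]$, which can be absorbed into the bias,

$$\hat{b} = b + \left(\mathbb{E}[W x] - \mathbb{E}[\widetilde{W} x]\right) = b - \mathbb{E}[\epsilon x],$$

with the expectations estimated on calibration data (hence the dataloader argument above).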
AdaRound treats rounding during quantization as an optimization task and finds better parameters by choosing the appropriate rounding direction (up or down):

$$\min_{V} \ \left\lVert W x - \widetilde{W} x \right\rVert_F^2 + \lambda f_{reg}(V),$$

where $\widetilde{W} = s \cdot \mathrm{clip}\left(\left\lfloor \frac{W}{s} \right\rfloor + h(V),\ n,\ p\right)$ is the soft-quantized weight, $h(V) \in [0, 1]$ is a rectified sigmoid of the continuous rounding variables $V$, and $f_{reg}$ pushes $h(V)$ towards a hard 0/1 rounding decision. For more details see Up or Down? Adaptive Rounding for Post-Training Quantization.
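These algorithms can be combined in a single PTQ pipeline. A minimal sketch, reusing qconfig, example_inputs and dataloader from the quick start; the ordering below is one reasonable choice, not a requirement of EQuant:

from equant import generate_qconfig_mapping, quantize
from equant.algorithms import (
    cross_layer_equalization,
    smooth_quant,
    bias_correction,
    adaround,
)

# cross-layer equalization runs on the float model, before quantization
model = cross_layer_equalization(model)

# insert fake quantization as in the quick start
qconfig_mapping = generate_qconfig_mapping(model, qconfig)
qmodel = quantize(model, qconfig_mapping, example_inputs)

# data-driven refinements on the fake-quantized model
qmodel = smooth_quant(qmodel, dataloader)
qmodel = adaround(qmodel, dataloader)
qmodel = bias_correction(qmodel, dataloader)

# then calibrate and convert as in the quick start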
As a rule, some layers need to be fused before quantization (for example, convolution or linear layers followed by batch normalization), and for this purpose EQuant provides several fusion methods:
- Batch Normalization fusion (the standard folding formula is sketched after this list)
from equant.fuse import fuse_conv_bn
model = fuse_conv_bn(model)
- One-kernel convolution fusion
from equant.fuse import fuse_conv_conv1x1
model = fuse_conv_conv1x1(model)
- One-step residuals fusion
from equant.fuse import fuse_residuals
model = fuse_residuals(model)
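For reference, Batch Normalization fusion conventionally folds the BN statistics and affine parameters into the preceding convolution; a standard formulation (not specific to EQuant's implementation) is

$$W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W, \qquad b' = \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma^2$ are the running statistics and $\gamma$, $\beta$ the affine parameters of the BN layer, applied per output channel.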
- Data-Free Quantization Through Weight Equalization and Bias Correction
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Up or Down? Adaptive Rounding for Post-Training Quantization
- LSQ+: Improving low-bit quantization through learnable offsets and better initialization