EQuant

EQuant is a toy quantization library for PyTorch models. It provides a collection of algorithms to improve post-training quantization (PTQ) accuracy, extends the PyTorch observers and quantizers with new ones, implements basic fusion methods, and extends the PyTorch backend with new fusion recipes. All quantized models support quantization-aware training (QAT) mode.

Installation

git clone https://github.com/Mr6one/EQuant.git && cd EQuant
pip install -e .

Basic Usage

from equant import generate_qconfig_mapping, quantize, convert


# define the quantization recipe; see the QConfig section below for details
qconfig = [
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'quantile',
            'observer_kwargs': {
                'quantile': 0.99999
            },
            'quantizer': 'lsq'
        },
        'layers': ['*'] # quantize all layers
    }
]

# convert qconfig to PyTorch format
qconfig_mapping = generate_qconfig_mapping(model, qconfig)

# convert model to fake-quantized mode
qmodel = quantize(model, qconfig_mapping, example_inputs)

# calibrate model
for data in dataloader:
    _ = qmodel(data)

# convert the fake-quantized model to a quantized model
model_int = convert(qmodel)
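
The intro above notes that all quantized models support QAT. Since quantize returns a fake-quantized model that still runs in floating point, QAT presumably amounts to fine-tuning qmodel between calibration and convert with an ordinary training loop. A minimal sketch, assuming a dataloader that yields (inputs, labels) pairs and a classification loss (both hypothetical):

import torch

optimizer = torch.optim.SGD(qmodel.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

qmodel.train()
for data, target in dataloader:       # assumed (inputs, labels) pairs
    optimizer.zero_grad()
    output = qmodel(data)             # forward pass through the fake-quant ops
    loss = criterion(output, target)
    loss.backward()                   # gradients flow through the quantizers (e.g. lsq)
    optimizer.step()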

QConfig

QConfig is a list of dictionaries where each dictionary contains its own quantization recipe for specific layers:

qconfig = [
    # scheme 1
    {
        'weight': {
            # quantization recipe for weights
        },
        'activation': {
            # quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    },
    # scheme 2
    {
        'weight': {
            # another quantization recipe for weights
        },
        'activation': {
            # another quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    }
    # scheme 3
    ...
    # scheme n
]

Each quantization recipe (for both weights and activations) contains information about:

  • Data type — in the format [s/u][n_bits], where s means a signed data type and u an unsigned one; in special cases n_bits may be a float. For instance, s7.7, s5.9, u6, etc. are all valid data types.

  • Quantization scheme — one of the following:

    • per_tensor_symmetric
    • per_tensor_affine
    • per_channel_symmetric
    • per_channel_affine
  • Quantizer — one of the supported quantizers (e.g. lsq, as used in the example above)

  • Observer — one of the following:

    • min_max
    • moving_average
    • quantile
    • mse
    • histogram (supports only per tensor granularity)
  • Observer parameters — passed via observer_kwargs (see the quantile example above)

Note: when using the mse observer, it is highly recommended to use as large a batch size as possible during calibration. An example qconfig combining these options is shown below.
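
For example, here is a minimal sketch of a two-scheme qconfig that keeps most of the network at 8 bits but quantizes one block more aggressively. The layer name pattern 'classifier.*' is hypothetical, and the exact rules for resolving overlapping patterns are an assumption:

qconfig = [
    # scheme 1: 4-bit weights and mse calibration for the (hypothetical) classifier block
    {
        'weight': {
            'dtype': 's4',
            'qscheme': 'per_channel_symmetric',
            'observer': 'mse',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'histogram',  # per-tensor granularity only
            'quantizer': 'lsq'
        },
        'layers': ['classifier.*']
    },
    # scheme 2: default 8-bit recipe for all remaining layers
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'moving_average',
            'quantizer': 'lsq'
        },
        'layers': ['*']
    }
]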

QConfigMapping

QConfigMapping is the PyTorch format for storing a quantization recipe. EQuant also lets you save it in YAML format so you do not have to regenerate it every time, and you can inspect the saved file to make sure it was generated correctly and contains the right information:

# generate qconfig mapping only once...
qconfig_mapping = generate_qconfig_mapping(...)

# ...and save it
qconfig_mapping.save('qconfig.yaml')

# then create it from the existing configuration
from equant import QConfigMapping  # import path assumed; adjust if QConfigMapping is exported elsewhere
qconfig_mapping = QConfigMapping.from_file('qconfig.yaml')

You are free to edit the configuration file by hand, as long as it still contains valid values.

Algorithms

There may be a large accuracy drop after the PTQ stage, so numerous methods exist to improve the quality of the quantized model. EQuant provides implementations of some of them:

Pre-calibration algorithms

  • Cross-Layer Equalization

from equant.algorithms import cross_layer_equalization

model = cross_layer_equalization(model)

  • Smooth Quant

from equant.algorithms import smooth_quant

qmodel = smooth_quant(qmodel, dataloader)

Post-calibration algorithms

  • Bias Correction

from equant.algorithms import bias_correction

qmodel = bias_correction(qmodel, dataloader)

  • AdaRound

from equant.algorithms import adaround

qmodel = adaround(qmodel, dataloader)

Cross-Layer Equalization

Migrates quantization difficulty between two consecutive linear layers.

For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
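
The core transformation can be illustrated with a small sketch (not EQuant's implementation): for y = W2 · ReLU(W1 · x + b1) + b2, dividing output channel i of W1 (and b1) by s_i and multiplying the matching input channel of W2 by s_i leaves the function unchanged, since ReLU is positive-homogeneous; choosing s_i ≈ sqrt(r1_i / r2_i) from the per-channel weight ranges equalizes those ranges:

import torch

def cle_pair_sketch(w1, b1, w2, eps=1e-8):
    # w1: (out1, in1), b1: (out1,), w2: (out2, out1) -- two consecutive linear layers
    r1 = w1.abs().amax(dim=1)                     # weight range per output channel of layer 1
    r2 = w2.abs().amax(dim=0)                     # weight range per input channel of layer 2
    s = torch.sqrt(r1 * r2 + eps) / (r2 + eps)    # equalizing scale, s_i ~ sqrt(r1_i / r2_i)
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

# the rescaled pair computes exactly the same function
w1, b1, w2 = torch.randn(16, 8), torch.randn(16), torch.randn(4, 16)
x = torch.randn(5, 8)
w1e, b1e, w2e = cle_pair_sketch(w1, b1, w2)
assert torch.allclose(torch.relu(x @ w1.T + b1) @ w2.T,
                      torch.relu(x @ w1e.T + b1e) @ w2e.T, atol=1e-5)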

SmoothQuant

Migrates quantization difficulty between activations and weights.


For more details see SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
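
A minimal sketch of the underlying per-channel smoothing (not EQuant's implementation): each input channel j of the activations is divided by a factor s_j while the matching weight column is multiplied by it, with s_j = max|X_j|^alpha / max|W_j|^(1-alpha) and alpha typically 0.5:

import torch

def smooth_quant_sketch(x_absmax, weight, alpha=0.5, eps=1e-8):
    # x_absmax: (in_features,) per-channel activation absmax gathered on calibration data
    # weight:   (out_features, in_features) weight of the following linear layer
    w_absmax = weight.abs().amax(dim=0)
    s = (x_absmax + eps).pow(alpha) / (w_absmax + eps).pow(1 - alpha)   # smoothing factor
    return s, weight * s[None, :]   # at runtime the activations are divided by s (X / s)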

Bias Correction

To reduce quantization error, we can correct the bias by adding the difference between the expected outputs of the full-precision and quantized layers:

$$ b^{'} = b + E[W_{fp}x_{fp}] - E[W_{q}x_{q}] $$

For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
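
A minimal sketch of this correction for a single linear layer (not EQuant's implementation), assuming a set of calibration batches and an already (fake-)quantized copy of the weight:

import torch

def bias_correction_sketch(w_fp, w_q, bias, calib_batches):
    # w_fp, w_q: (out_features, in_features) full-precision and quantized-dequantized weights
    # calib_batches: iterable of (batch_size, in_features) calibration inputs
    err_sum, n = torch.zeros(w_fp.shape[0]), 0
    for x in calib_batches:
        err_sum += (x @ w_fp.T - x @ w_q.T).sum(dim=0)   # accumulate per-output-channel error
        n += x.shape[0]
    return bias + err_sum / n   # b' = b + E[W_fp x] - E[W_q x]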

AdaRound

We can treat rounding during quantization as an optimization task and find better parameters by choosing the appropriate rounding direction (up or down) for each weight:

$$ W_{fake-q} = s \cdot \text{clip}\left(\left \lfloor \frac{W_{fp}}{s}\right \rfloor + h(V) + Z, n, p\right) $$

where $h(V)$ is any differentiable function that takes values between $0$ and $1$. For more details see Up or Down? Adaptive Rounding for Post-Training Quantization.
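
In the AdaRound paper, h is a rectified sigmoid: a sigmoid stretched slightly beyond [0, 1] and then clipped, so it can saturate exactly at 0 or 1 while still providing gradients during optimization. A minimal sketch of that choice (not necessarily the exact function EQuant uses):

import torch

GAMMA, ZETA = -0.1, 1.1   # stretch parameters from the AdaRound paper

def h(v: torch.Tensor) -> torch.Tensor:
    # rectified sigmoid: stretch sigmoid(v) to (GAMMA, ZETA), then clip to [0, 1]
    return torch.clamp(torch.sigmoid(v) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)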

Fusion

As a rule, some layers need to be fused before quantization (for example, convolution or linear layers followed by batch normalization), and EQuant provides several fusion methods for this purpose (a sketch of batch-normalization folding follows the list):

  • Batch Normalization fusion

from equant.fuse import fuse_conv_bn

model = fuse_conv_bn(model)

  • One-kernel convolution fusion

from equant.fuse import fuse_conv_conv1x1

model = fuse_conv_conv1x1(model)

  • One-step residuals fusion

from equant.fuse import fuse_residuals

model = fuse_residuals(model)
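
For reference, batch-normalization fusion folds the BN running statistics and affine parameters into the preceding convolution's weight and bias; a minimal sketch of that folding (not EQuant's implementation):

import torch

def fold_bn_sketch(conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    # conv_w: (out_ch, in_ch, kh, kw), conv_b: (out_ch,) or None; BN params are all (out_ch,)
    if conv_b is None:
        conv_b = torch.zeros(conv_w.shape[0])
    scale = bn_gamma / torch.sqrt(bn_var + eps)       # per-output-channel BN scale
    w_fused = conv_w * scale[:, None, None, None]     # fold the scale into the conv weight
    b_fused = (conv_b - bn_mean) * scale + bn_beta    # fold the shift into the conv bias
    return w_fused, b_fused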

References

  • Data-Free Quantization Through Weight Equalization and Bias Correction
  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
  • Up or Down? Adaptive Rounding for Post-Training Quantization
