EQuant provides a collection of algorithms to improve post-training quantization (PTQ) accuracy and extends the set of PyTorch observers and quantizers with new ones. It also implements basic fusion methods and extends the PyTorch backend with new fusion recipes. All quantized models support quantization-aware training (QAT) mode.
pip install -e .
from equant import generate_qconfig_mapping, quantize, convert
# define quantization recipe, for more details see below
qconfig = [
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'quantile',
            'observer_kwargs': {
                'quantile': 0.99999
            },
            'quantizer': 'lsq'
        },
        'layers': ['*'] # quantize all layers
    }
]
# convert qconfig to PyTorch format
qconfig_mapping = generate_qconfig_mapping(model, qconfig)
# convert model to fake-quantized mode
qmodel = quantize(model, qconfig_mapping, example_inputs)
# calibrate model
for data in dataloader:
_ = qmodel(data)
# convert a fake-quantized model to the quantized model
model_int = convert(qmodel)
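As noted above, the fake-quantized model also supports QAT. A minimal fine-tuning sketch before conversion, assuming a user-defined loss_fn and train_dataloader (this is a generic PyTorch training loop, not an EQuant-specific API):

import torch

# fine-tune the fake-quantized model; learnable quantizers (e.g. lsq) are
# updated together with the weights
optimizer = torch.optim.SGD(qmodel.parameters(), lr=1e-4)
qmodel.train()
for data, target in train_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(qmodel(data), target)
    loss.backward()
    optimizer.step()

# then convert as above
model_int = convert(qmodel)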
QConfig is a list of dictionaries where each dictionary contains its own quantization recipe for specific layers:
qconfig = [
    # scheme 1
    {
        'weight': {
            # quantization recipe for weights
        },
        'activation': {
            # quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    },
    # scheme 2
    {
        'weight': {
            # another quantization recipe for weights
        },
        'activation': {
            # another quantization recipe for activations
        },
        'layers': [...] # list of layers for this scheme
    },
    # scheme 3
    ...
    # scheme n
]
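For illustration, a hypothetical two-scheme configuration: the module-name patterns 'features*' and 'classifier*' are assumptions about the model being quantized, wildcard matching follows the '*' pattern used in the quick start, and the recipe fields are described below.

qconfig = [
    # scheme 1: per-channel weights for the backbone
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_channel_symmetric',
            'observer': 'min_max',
            'quantizer': 'fixed_qparams'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'moving_average',
            'quantizer': 'fixed_qparams'
        },
        'layers': ['features*']
    },
    # scheme 2: a different recipe for the classifier head
    {
        'weight': {
            'dtype': 's8',
            'qscheme': 'per_tensor_symmetric',
            'observer': 'min_max',
            'quantizer': 'lsq'
        },
        'activation': {
            'dtype': 'u8',
            'qscheme': 'per_tensor_affine',
            'observer': 'histogram',
            'quantizer': 'lsq'
        },
        'layers': ['classifier*']
    }
]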
Each quantization recipe (for both weights and activations) contains information about:
- Data type — in the format [s/u][n_bits], where s stands for a signed data type and u for an unsigned one; in special cases n_bits may be a non-integer. For instance, s7.7, s5.9 and u6 are all valid data types.
- Quantization scheme — one of the following:
  - per_tensor_symmetric
  - per_tensor_affine
  - per_channel_symmetric
  - per_channel_affine
- Quantizer — one of the following:
  - fixed_qparams — scales and offsets are frozen
  - lsq — enables learnable scales, based on Learned Step Size Quantization
  - lsq+ — enables learnable scales and offsets, based on LSQ+: Improving low-bit quantization through learnable offsets and better initialization
- Observer — one of the following:
  - min_max
  - moving_average
  - quantile
  - mse
  - histogram (supports only per-tensor granularity)
- Observer parameters (observer_kwargs) — keyword arguments passed to the chosen observer, as in the quantile example above
Note: when using the mse observer, it is highly recommended to use as large a batch size as possible during the calibration phase.
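For illustration, a single weight recipe combining the fields above; the particular choices (a fractional s7.7 data type, the mse observer, learnable scales) are only examples, and whether a given backend can execute fractional bit-widths is not addressed here:

weight_recipe = {
    'dtype': 's7.7',                     # fractional bit-width, listed as valid above
    'qscheme': 'per_channel_symmetric',
    'observer': 'mse',                   # per the note above, prefer large calibration batches
    'quantizer': 'lsq'
}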
QConfigMapping is the PyTorch format for storing a quantization recipe. It can also be saved in YAML format so that it does not have to be regenerated every time, and so that you can check that it was generated correctly and contains the right information:
# generate qconfig mapping only once...
qconfig_mapping = generate_qconfig_mapping(...)
# ...and save it
qconfig_mapping.save('qconfig.yaml')
# then create it from existing configuration
qconfig_mapping = QConfigMapping.from_file('qconfig.yaml')
You are free to edit the configuration file, as long as the edited values remain valid.
There may be a large accuracy drop after the PTQ stage, so numerous methods exist to improve the quality of a quantized model. EQuant provides implementations for some of them:
- Cross-Layer Equalization
from equant.algorithms import cross_layer_equalization
model = cross_layer_equalization(model)
- Smooth Quant
from equant.algorithms import smooth_quant
qmodel = smooth_quant(qmodel, dataloader)
- Bias Correction
from equant.algorithms import bias_correction
qmodel = bias_correction(qmodel, dataloader)
- AdaRound
from equant.algorithms import adaround
qmodel = adaround(qmodel, dataloader)
Cross-Layer Equalization migrates quantization difficulty between two consecutive linear layers. For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
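A sketch of the idea, in the cited paper's notation rather than EQuant's internals: for two consecutive layers with weights $W^{(1)}$ and $W^{(2)}$ (with a ReLU-like activation in between, for which $f(Sx) = S f(x)$ holds for a positive diagonal $S$), the network output is preserved under the per-channel rescaling

$$\widehat{W}^{(1)} = S^{-1} W^{(1)}, \qquad \widehat{b}^{(1)} = S^{-1} b^{(1)}, \qquad \widehat{W}^{(2)} = W^{(2)} S, \qquad S = \mathrm{diag}(s),$$

and the paper chooses $s_i = \frac{1}{r_i^{(2)}} \sqrt{r_i^{(1)} r_i^{(2)}}$, where $r_i^{(j)}$ is the range of channel $i$ in $W^{(j)}$, so that both layers end up with similar per-channel ranges and quantize more accurately.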
Smooth Quant migrates quantization difficulty between activations and weights. For more details see SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
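A sketch of the underlying transformation from the cited paper (notation is the paper's, not EQuant's): for a linear layer $Y = XW$, a per-channel smoothing factor $s$ shifts difficulty from activations to weights while keeping the output unchanged,

$$Y = \left(X \,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\, W\right), \qquad s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}},$$

where $\alpha$ controls how much difficulty is migrated (the paper uses $\alpha = 0.5$ as a default).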
Bias Correction reduces the quantization error by correcting the bias with the difference between the expected outputs of the full-precision and quantized models. For more details see Data-Free Quantization Through Weight Equalization and Bias Correction.
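A sketch of that correction (our notation, not EQuant's API): writing the quantized weights as $\widetilde{W} = W + \epsilon$, the expected output shifts by $\mathbb{E}[\epsilon x]$, which can be absorbed into the bias,

$$\hat{b} = b + \left(\mathbb{E}[W x] - \mathbb{E}[\widetilde{W} x]\right) = b - \mathbb{E}[\epsilon x],$$

with the expectations estimated on calibration data (hence the dataloader argument above).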
AdaRound treats rounding during quantization as an optimization task and finds better parameters by choosing the appropriate rounding direction (up or down):

$$\min_{V} \ \left\lVert W x - \widetilde{W} x \right\rVert_F^2 + \lambda f_{reg}(V),$$

where $\widetilde{W} = s \cdot \mathrm{clip}\left(\left\lfloor \frac{W}{s} \right\rfloor + h(V),\ n,\ p\right)$ is the soft-quantized weight, $h(V) \in [0, 1]$ is a rectified sigmoid of the continuous rounding variables $V$, and $f_{reg}$ pushes $h(V)$ towards a hard 0/1 rounding decision. For more details see Up or Down? Adaptive Rounding for Post-Training Quantization.
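These algorithms can be combined in a single PTQ pipeline. A minimal sketch, reusing qconfig, example_inputs and dataloader from the quick start; the ordering below is one reasonable choice, not a requirement of EQuant:

from equant import generate_qconfig_mapping, quantize
from equant.algorithms import (
    cross_layer_equalization,
    smooth_quant,
    bias_correction,
    adaround,
)

# cross-layer equalization runs on the float model, before quantization
model = cross_layer_equalization(model)

# insert fake quantization as in the quick start
qconfig_mapping = generate_qconfig_mapping(model, qconfig)
qmodel = quantize(model, qconfig_mapping, example_inputs)

# data-driven refinements on the fake-quantized model
qmodel = smooth_quant(qmodel, dataloader)
qmodel = adaround(qmodel, dataloader)
qmodel = bias_correction(qmodel, dataloader)

# then calibrate and convert as in the quick start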
As a rule, some layers need to be fused before quantization (for example, convolution or linear layers followed by batch normalization), and for this purpose EQuant provides several fusion methods:
- Batch Normalization fusion (the standard folding formula is sketched after this list)
from equant.fuse import fuse_conv_bn
model = fuse_conv_bn(model)
- One-kernel convolution fusion
from equant.fuse import fuse_conv_conv1x1
model = fuse_conv_conv1x1(model)
- One-step residuals fusion
from equant.fuse import fuse_residuals
model = fuse_residuals(model)
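For reference, Batch Normalization fusion conventionally folds the BN statistics and affine parameters into the preceding convolution; a standard formulation (not specific to EQuant's implementation) is

$$W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W, \qquad b' = \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma^2$ are the running statistics and $\gamma$, $\beta$ the affine parameters of the BN layer, applied per output channel.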
- Data-Free Quantization Through Weight Equalization and Bias Correction
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Up or Down? Adaptive Rounding for Post-Training Quantization
- LSQ+: Improving low-bit quantization through learnable offsets and better initialization