Quantization

Quantization is a model compression technique: it reduces the precision of the numerical values in a model’s parameters to decrease memory usage and improve computational efficiency.

Quantizing a model from FP32 (32-bit floating-point) to INT8 (8-bit integer) involves converting the high-precision floating-point weights and activations into lower-precision integer values. Here’s an example to illustrate the process.
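A minimal sketch of asymmetric (affine) per-tensor quantization, assuming a toy weight tensor and the unsigned 0–255 integer range; the values are illustrative and not taken from any particular library:

```python
import numpy as np

# Toy FP32 weights (illustrative values).
w = np.array([-0.52, -0.10, 0.00, 0.31, 0.87], dtype=np.float32)

# Affine (asymmetric) per-tensor quantization to unsigned 8-bit:
# map [min(w), max(w)] onto the integer range [0, 255].
qmin, qmax = 0, 255
scale = (w.max() - w.min()) / (qmax - qmin)        # FP32 step per integer step
zero_point = int(round(qmin - w.min() / scale))    # integer that represents 0.0

q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)

# Dequantize to see the rounding error introduced by INT8 storage.
w_hat = (q.astype(np.float32) - zero_point) * scale
print(q)      # [  0  77  95 152 255]
print(w_hat)  # approximately the original weights
```

The dequantized values differ from the originals only by rounding error, which is the accuracy cost traded for a representation roughly 4× smaller than FP32.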

Model size & Inferencing

Model weights need to be loaded into memory for inferencing. Larger models demand larger amounts of memory (RAM). It is possible to partially load the model into memory as needed, but this increases inference latency, because the weights must then be read from disk into memory during inference. The table below shows the approximate sizes of models that need to be loaded into memory for inferencing.

Model Size Comparison

| Model | Approximate Size (GB) | GPU Memory Requirements (GB) |
|---|---|---|
| Gemma 2B | 4 | 8-16 |
| Llama 7B | 13 | 16-32 |
| Llama 13B | 24 | 32-64 |
| Codestral-22B-v0.1 | 42 | 64-128 |

Note: These are approximate sizes and memory requirements. The actual GPU memory needed may vary depending on the specific hardware and software configuration, as well as the desired inference speed.
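The sizes above follow from simple arithmetic: weight memory ≈ parameter count × bytes per parameter. A minimal sketch, assuming nominal parameter counts and ignoring activations, the KV cache, and framework overhead:

```python
# Rough weight memory: parameter count x bytes per parameter.
# Parameter counts are nominal; real deployments also need memory for
# activations and (for LLM inference) the KV cache.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in [("Llama 7B", 7e9), ("Llama 13B", 13e9)]:
    for dtype in ("fp16", "int8", "int4"):
        print(f"{name:10s} {dtype}: {weight_memory_gb(n_params, dtype):5.1f} GB")

# Llama 7B drops from ~14 GB in FP16 to ~7 GB in INT8 and ~3.5 GB in INT4.
```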

Static vs Dynamic Quantization

Here’s a table summarizing the key differences between static quantization and dynamic quantization in PyTorch:

| Feature | Static Quantization | Dynamic Quantization |
|---|---|---|
| Quantization Type | Weights and activations are quantized beforehand. | Only weights are quantized; activations remain in FP32. |
| Quantization Timing | Quantization happens during model preparation (offline). | Quantization happens at inference time (online). |
| Calibration | Requires calibration data to determine scaling factors for activations. | No calibration data required; scaling is determined at runtime. |
| Model Size Reduction | Generally leads to a significant reduction in model size, as both weights and activations are quantized. | Reduces model size primarily by quantizing weights only. |
| Performance | Can lead to faster inference and reduced memory usage, but has some overhead during calibration. | Faster inference due to quantized weights, but on-the-fly activation handling can slow performance slightly. |
| Use Case | Suitable for deployment scenarios where model size and inference speed are critical. | Useful where immediate model deployment is required without a separate calibration phase. |
| Framework Support | Supported in various frameworks, but requires proper configuration for each layer type. | Simpler to implement, with built-in support in frameworks like PyTorch. |

In summary, static quantization is typically used when you have a calibration dataset and can afford the upfront cost of preparing the model, while dynamic quantization is often used for simpler scenarios where weights need to be quantized without a separate calibration step.
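As a concrete illustration of the dynamic path, the sketch below quantizes the Linear layers of a toy model with PyTorch’s eager-mode API; the architecture and layer sizes are arbitrary, chosen only for the example:

```python
import torch
import torch.nn as nn

# Toy FP32 model (layer sizes are arbitrary, for illustration only).
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
).eval()

# Dynamic quantization: weights of the listed module types are converted to
# INT8 ahead of time; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10])
```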

Post-Training Quantization (PTQ) libraries

Here are some common libraries that support post-training quantization for neural network models:

1. PyTorch

  • Post-Training Quantization (PTQ) is supported natively in PyTorch for both static and dynamic quantization.
  • Modules (a minimal static-quantization flow is sketched after this list):
    • torch.quantization.quantize_dynamic() for dynamic quantization.
    • torch.quantization.prepare() and torch.quantization.convert() for static quantization (the eager-mode API has no quantize_static() helper).
  • Supports both per-tensor and per-channel quantization for weights.
  • Common quantization types include:
    • Dynamic Quantization (for inference-time optimization).
    • Static Quantization (with calibration before deployment).
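A minimal eager-mode static PTQ sketch, assuming the x86 "fbgemm" backend and random tensors standing in for a real calibration set; the TinyNet module is invented for the example:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # FP32 -> INT8 boundary
        self.fc = nn.Linear(16, 4)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # INT8 -> FP32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc(x))
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.quantization.prepare(model)     # insert activation observers

for _ in range(8):                               # calibration: random stand-in data
    prepared(torch.randn(4, 16))

quantized = torch.quantization.convert(prepared) # fold observers into INT8 modules
print(quantized(torch.randn(1, 16)).shape)       # torch.Size([1, 4])
```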

2. TensorFlow Lite (TFLite)

  • TFLite supports various quantization techniques, including post-training quantization.
  • Modules:
    • tf.lite.TFLiteConverter with options for post-training quantization (a minimal conversion sketch follows this list).
  • Types of quantization:
    • Full Integer Quantization
    • Float16 Quantization
    • Dynamic Range Quantization
  • TFLite models are optimized for mobile and edge devices.
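A minimal conversion sketch showing dynamic range quantization, assuming a TF 2.x Keras model; the toy network and output filename are placeholders:

```python
import tensorflow as tf

# Tiny Keras model standing in for a real network (sizes are arbitrary).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(10),
])

# Dynamic range quantization: weights are stored as INT8 in the .tflite file.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:  # output path is illustrative
    f.write(tflite_model)
```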

3. ONNX Runtime

  • ONNX Runtime offers built-in support for post-training quantization.
  • Modules (see the sketch after this list):
    • onnxruntime.quantization.quantize_dynamic()
    • onnxruntime.quantization.quantize_static()
  • Focuses on providing efficient inference on hardware platforms like CPUs, GPUs, and specialized accelerators.
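A minimal sketch of ONNX Runtime’s dynamic PTQ entry point; the input and output paths are placeholders, and the FP32 ONNX file is assumed to already exist:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Paths are placeholders; "model_fp32.onnx" must be an existing FP32 ONNX model.
quantize_dynamic(
    "model_fp32.onnx",            # input model
    "model_int8.onnx",            # quantized output model
    weight_type=QuantType.QInt8,  # store weights as signed INT8
)
```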

4. Hugging Face Transformers

  • Hugging Face Transformers builds on PyTorch and supports quantization methods, including post-training quantization, through that integration.
  • Can be combined with PyTorch’s quantization functionalities; a common 8-bit loading path via bitsandbytes is sketched below.
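One widely used route in practice (an assumption beyond what this section states) is 8-bit loading through bitsandbytes, configured via BitsAndBytesConfig; the model name is a placeholder, and a CUDA GPU with the bitsandbytes and accelerate packages installed is assumed:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires bitsandbytes, accelerate, and a CUDA GPU; model name is illustrative.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto",
)
print(model.get_memory_footprint())  # bytes used by the quantized weights
```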

5. Apache TVM

  • TVM is a machine learning compiler stack that offers post-training quantization for optimizing models across different hardware targets.
  • It provides flexible quantization and code generation for different devices, including CPUs and GPUs.

6. OpenVINO

  • Intel’s OpenVINO toolkit supports post-training quantization to optimize models for inference on Intel hardware.
  • OpenVINO’s Post-training Optimization Tool (POT) can be used to reduce model size and improve inference speed.

7. TensorRT

  • NVIDIA TensorRT supports post-training quantization for deployment on GPUs, especially for reducing precision to INT8 for faster inference.
  • Works with models exported in ONNX format.

8. Augmented Open Neural Network Exchange (AONNX)

  • AONNX is a quantization-aware variant of ONNX designed for post-training quantization optimization.

These libraries are commonly used for model compression and inference optimization on different hardware backends, including mobile devices, edge devices, and specialized hardware like GPUs and TPUs.