Hugging Face blog: Precision & Quantization
Quantization is a model compression technique: it reduces the precision of the numerical values in a model's parameters to decrease memory usage and improve computational efficiency.
Quantizing a model from FP32 (32-bit floating-point) to INT8 (8-bit integer) involves converting the high-precision floating-point weights and activations into lower-precision integer values. Here’s an example to illustrate the process.
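For instance, here is a minimal NumPy sketch of an affine (asymmetric) INT8 quantization of a small weight tensor. The tensor values and the scale/zero-point arithmetic are illustrative assumptions, not code taken from a particular library:

```python
import numpy as np

# Toy FP32 weights (made-up values, purely for illustration).
weights_fp32 = np.array([-1.73, -0.52, 0.0, 0.91, 2.48], dtype=np.float32)

# 1. Pick a scale and zero-point that map the observed FP32 range onto [-128, 127].
qmin, qmax = -128, 127
scale = (weights_fp32.max() - weights_fp32.min()) / (qmax - qmin)
zero_point = int(round(qmin - weights_fp32.min() / scale))

# 2. Quantize: FP32 -> INT8.
weights_int8 = np.clip(
    np.round(weights_fp32 / scale) + zero_point, qmin, qmax
).astype(np.int8)

# 3. Dequantize: INT8 -> approximate FP32, to see the rounding error.
weights_recovered = (weights_int8.astype(np.float32) - zero_point) * scale

print("original :", weights_fp32)
print("quantized:", weights_int8)
print("recovered:", weights_recovered)  # close to, but not exactly, the originals
```

Each FP32 value is replaced by an 8-bit integer plus a shared scale and zero-point, which is where the 4x storage saving over FP32 comes from.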
Model weights need to be loaded into memory for inference, and larger models demand larger amounts of memory (RAM). It is possible to partially load a model into memory as needed, but this increases inference latency, because weights must repeatedly be read from disk into memory during inference. The table below shows the approximate amount of memory several models need when loaded for inference.
Model | Approximate Size (GB) | GPU Memory Requirements (GB) |
---|---|---|
Gemma 2B | 4 | 8-16 |
Llama 7B | 13 | 16-32 |
Llama 13B | 24 | 32-64 |
CodeLlama-22B-v0.1 | 42 | 64-128 |
Note: These are approximate sizes and memory requirements. The actual GPU memory needed may vary depending on the specific hardware and software configuration, as well as the desired inference speed.
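A useful rule of thumb behind these numbers is parameter count × bytes per parameter: roughly 4 bytes in FP32, 2 in FP16/BF16, 1 in INT8, and half a byte in INT4. The sketch below is only a back-of-the-envelope estimate; it ignores activations, the KV cache, and framework overhead:

```python
# Bytes per parameter for common precisions (INT4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def approx_size_gb(num_params: float, dtype: str) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Example: a 7B-parameter model such as Llama 7B.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7B params in {dtype}: ~{approx_size_gb(7e9, dtype):.1f} GB")
# fp32 ~28 GB, fp16 ~14 GB (close to the ~13 GB listed above), int8 ~7 GB, int4 ~3.5 GB
```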
Here’s a table summarizing the key differences between static quantization and dynamic quantization in PyTorch:
Feature | Static Quantization | Dynamic Quantization |
---|---|---|
Quantization Type | Weights and activations are quantized ahead of time. | Only weights are quantized ahead of time; activations are quantized on the fly during inference. |
Quantization Timing | Quantization happens during model preparation (offline). | Quantization happens at inference time (online). |
Calibration | Requires calibration data to determine scaling factors for activations. | No calibration data required; scaling is determined at runtime. |
Model Size Reduction | Significant reduction, as the weights are stored in INT8; quantizing activations mainly reduces runtime memory. | Reduces model size by quantizing the weights only. |
Performance | Faster inference and lower memory usage; the main extra cost is the one-time calibration step. | Faster inference thanks to INT8 weights, though quantizing activations on the fly adds a small runtime overhead. |
Use Case | Suitable for deployment scenarios where model size and inference speed are critical. | Useful for applications where immediate model deployment is required without a separate calibration phase. |
Framework Support | Supported in various frameworks, but requires proper configuration for each layer type. | Simpler to implement with built-in support in frameworks like PyTorch. |
In summary, static quantization is typically used when you have a calibration dataset and can afford the upfront cost of preparing the model, while dynamic quantization suits simpler scenarios where the weights should be quantized without a separate calibration phase.
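To make the dynamic path concrete, here is a minimal PyTorch sketch using torch.quantization.quantize_dynamic(). The toy nn.Sequential model is an assumption for illustration; a real transformer would follow the same pattern on its nn.Linear layers:

```python
import torch
import torch.nn as nn

# A toy FP32 model standing in for a larger network.
model_fp32 = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model_fp32.eval()

# Post-training dynamic quantization: nn.Linear weights become INT8, and
# activations are quantized on the fly at inference, so no calibration data is needed.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # layer types whose weights should be quantized
    dtype=torch.qint8,
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(model_int8(x).shape)  # torch.Size([1, 10])
```

Static quantization adds a preparation step (inserting observers, running calibration batches, then converting the model), which is why it needs representative data up front.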
Here are some common libraries that support post-training quantization for neural network models:
- PyTorch: torch.quantization.quantize_dynamic() for dynamic quantization; torch.quantization.prepare() and torch.quantization.convert() for static quantization.
- TensorFlow Lite: tf.lite.TFLiteConverter, with options for post-training quantization.
- ONNX Runtime: onnxruntime.quantization.quantize_dynamic() and onnxruntime.quantization.quantize_static().
These libraries are commonly used for model compression and inference optimization on different hardware backends, including mobile devices, edge devices, and specialized hardware like GPUs and TPUs.
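As a quick illustration of one of these APIs, here is a minimal sketch of dynamic post-training quantization with ONNX Runtime; the file names are placeholder paths for an ONNX model you have already exported:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the weights of an exported ONNX model to signed INT8.
# "model_fp32.onnx" and "model_int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```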