Quanto is a library for quantizing machine learning models, with a focus on techniques that improve inference efficiency on hardware with limited resources (such as mobile devices, embedded systems, or specialized AI accelerators). It compresses neural networks by reducing the precision of weights and activations, yielding smaller models, faster inference, and lower energy consumption while aiming to preserve model accuracy.
Model Quantization: Quanto provides tools to apply various quantization techniques (such as 8-bit and 4-bit quantization) to neural networks, reducing weights and activations from higher precision (like FP32) to lower-bit representations.
Post-Training Quantization: Like other quantization libraries, Quanto supports post-training quantization (PTQ): the model is quantized after it has been trained, without any change to the training process (see the sketch after this list).
Optimizing Inference: Quantization shrinks a model's memory footprint and memory-bandwidth requirements at inference time, improving performance on resource-constrained devices.
Cross-Hardware Compatibility: Quanto aims to run quantized models on a range of hardware backends, from CPUs to GPUs and other accelerators, taking advantage of quantization-friendly kernels where they are available.
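For example, a minimal post-training quantization flow with Quanto looks roughly like this (imports assume the optimum-quanto distribution; the toy model and random calibration batches are placeholders for a real network and representative data):

```python
import torch
import torch.nn as nn
from optimum.quanto import quantize, freeze, Calibration, qint8

# Placeholder model; in practice this is your trained network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Mark weights and activations for 8-bit quantization; no retraining is needed.
quantize(model, weights=qint8, activations=qint8)

# Record activation ranges on a few representative batches.
with torch.no_grad(), Calibration():
    for _ in range(8):
        model(torch.randn(32, 64))

# Freeze the model: float weights are replaced by their int8 counterparts.
freeze(model)
```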
Yes, like most quantization tools, Quanto produces a quantized version of the model: its weights (and, depending on the settings, its activations) are stored and computed at reduced precision, and this version is what you deploy for efficient inference. The original full-precision checkpoint is left untouched, so it remains available for retraining, fine-tuning, or accuracy comparisons.
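Note that Quanto's quantize() converts the modules of the model object you pass it in place, so if you also want to keep a full-precision copy in memory for comparison, one option is to quantize a deep copy. A minimal sketch with a placeholder toy model:

```python
import copy
import torch
import torch.nn as nn
from optimum.quanto import quantize, freeze, qint8

# Placeholder full-precision model.
fp32_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Quantize a copy so the original stays untouched.
quantized_model = copy.deepcopy(fp32_model)
quantize(quantized_model, weights=qint8)  # weight-only: no calibration pass needed
freeze(quantized_model)

# Compare both versions on the same input.
x = torch.randn(1, 64)
with torch.no_grad():
    print((fp32_model(x) - quantized_model(x)).abs().max())
```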
Quanto is most commonly used for weight-only quantization, where only the model's weights are stored in lower precision (e.g., 8-bit or 4-bit) while activations stay in full precision (activations can also be quantized, but that requires a calibration pass). Weight-only quantization significantly reduces model size and memory-bandwidth usage, which is especially useful for large models like transformers, where the weight matrices dominate storage.
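As a concrete illustration, the following sketch quantizes the weights of a Hugging Face causal language model to 4 bits while leaving activations in full precision (the checkpoint name is only an example, and the transformers package is assumed to be installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

model_id = "facebook/opt-125m"  # example checkpoint; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Weight-only 4-bit quantization: activations stay in FP32, so no calibration pass.
quantize(model, weights=qint4)
freeze(model)

inputs = tokenizer("Quantization reduces", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```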
Quanto performs quantization in memory as part of preparing a model for inference; it does not export the quantized model to interchange formats such as ONNX. To reuse a quantized model, you can either re-apply the quantization step each time the model is loaded, or save the quantized state dict (for example with safetensors) together with Quanto's quantization map and restore it later with requantize, as sketched below.
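A rough sketch of that save-and-reload flow, following the workflow described in the Quanto documentation (the toy model is a placeholder and the file names are arbitrary):

```python
import json
import torch
import torch.nn as nn
from safetensors.torch import save_file, load_file
from optimum.quanto import quantize, freeze, quantization_map, requantize, qint8

def make_model():
    # Placeholder architecture; in practice this is your own modeling code.
    return nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Quantize and freeze a model, then persist its weights and quantization map.
model = make_model().eval()
quantize(model, weights=qint8)
freeze(model)
save_file(model.state_dict(), "model.safetensors")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# Later: rebuild the architecture (here on the meta device to avoid allocating
# full-precision weights) and restore the quantized weights into it.
state_dict = load_file("model.safetensors")
with open("quantization_map.json") as f:
    qmap = json.load(f)
with torch.device("meta"):
    new_model = make_model()
requantize(new_model, state_dict, qmap, device=torch.device("cpu"))
```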