Affine quantization is a widely used technique, especially in deep learning and neural networks, for mapping floating-point values (typically 32-bit, FP32) to lower-precision integer values (e.g., 8-bit integers, INT8). It achieves this by linearly transforming the original floating-point values to integers through scaling and offsetting. This transformation reduces memory usage and computational cost while maintaining reasonable model accuracy.
Scale (S):
This is a factor used to map the range of floating-point values to integer values. It is the step size between adjacent quantized values: the smaller the scale, the more granular the quantized representation becomes.
Zero-point (Z):
The zero-point is an integer value that corresponds to zero in the floating-point space. It helps shift the range of the quantized values to match the range of the original data.
Given a floating-point value \( x \), affine quantization converts it into an integer \( q \) using the following formula:

\[ q = \text{round}\left(\frac{x}{S}\right) + Z \]

Where:
- \( S \) is the scale,
- \( Z \) is the zero-point,
- \( \text{round}(\cdot) \) rounds to the nearest integer.

In practice, the result is also clamped to the target integer range so that it fits in the chosen integer type.
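As a minimal sketch of this mapping in NumPy (the `quantize` helper and its parameter names are illustrative, not from any particular library):

```python
import numpy as np

def quantize(x, scale, zero_point, qmin, qmax):
    """Affine quantization: round(x / scale) + zero_point, clamped to [qmin, qmax].

    Halves are rounded away from zero (plain np.round rounds halves to even,
    so 12.5 would become 12 instead of 13).
    """
    r = np.asarray(x, dtype=np.float64) / scale
    q = np.sign(r) * np.floor(np.abs(r) + 0.5) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)
```

The `qmin`/`qmax` arguments let the same helper cover both a signed range (\(-128\) to \(127\)) and an unsigned one (\(0\) to \(255\)), both of which appear in the examples below.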
To convert a quantized integer back into a floating-point value, the reverse process is applied:

\[ x \approx S \times (q - Z) \]

Let’s say we have a floating-point value \( x = 0.25 \) that we want to quantize to an 8-bit integer representation. Assume the following parameters:
- Scale \( S = 0.01 \)
- Zero-point \( Z = 128 \) (a zero-point of 128 pairs with the unsigned \( 0 \) to \( 255 \) range)
Using the quantization formula:

\[ q = \text{round}\left(\frac{0.25}{0.01}\right) + 128 = \text{round}(25) + 128 = 153 \]

So, the quantized integer value \( q \) is 153.
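As a quick sanity check of the numbers above in plain Python:

```python
scale, zero_point = 0.01, 128

q = round(0.25 / scale) + zero_point  # round(25.0) + 128 = 153
x_hat = scale * (q - zero_point)      # 0.01 * 25 = 0.25
print(q, x_hat)                       # 153 0.25
```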
Converting FP32 to INT8
This example walks through each step in detail. Let’s assume we have a weight matrix in FP32 format, and we want to quantize it to INT8.
Here’s an example of a simple FP32 weight matrix (each element is a 32-bit floating-point number):
\[ W_{FP32} = \begin{pmatrix} 0.5 & -0.25 & 0.125 \\ -1.0 & 1.75 & -0.625 \end{pmatrix} \]

To convert from FP32 to INT8, we need two key parameters:
- Scale: the step size between adjacent quantized values
- Zero-point: the integer that corresponds to \( 0.0 \)
For simplicity, let’s assume the INT8 range is \(-128\) to \(127\), and we have:
- Scale \( = 0.01 \)
- Zero-point \( = 0 \)
The quantization formula is:
\[ W_{INT8} = \text{round}\left(\frac{W_{FP32}}{\text{Scale}}\right) + \text{Zero-point} \]

Where the division and rounding are applied element-wise, and each result is clamped to the INT8 range \([-128, 127]\).
Now, apply the quantization formula to each element of the FP32 weight matrix:

\[ W_{INT8} = \text{round}\left(\frac{W_{FP32}}{0.01}\right) + 0 \]

Let’s compute this for each element:
- \( \text{round}(0.5 / 0.01) = 50 \)
- \( \text{round}(-0.25 / 0.01) = -25 \)
- \( \text{round}(0.125 / 0.01) = \text{round}(12.5) = 13 \)
- \( \text{round}(-1.0 / 0.01) = -100 \)
- \( \text{round}(1.75 / 0.01) = 175 \), which exceeds 127 and is clamped to \( 127 \)
- \( \text{round}(-0.625 / 0.01) = \text{round}(-62.5) = -63 \)

(Halves are rounded away from zero here, which is why \( 12.5 \) becomes \( 13 \) and \( -62.5 \) becomes \( -63 \).)
After applying quantization and clamping, the INT8 weight matrix becomes:

\[ W_{INT8} = \begin{pmatrix} 50 & -25 & 13 \\ -100 & 127 & -63 \end{pmatrix} \]

To recover an approximation of the original FP32 values, we can use the dequantization formula:
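Here is a short NumPy sketch of the same matrix computation (the explicit half-away-from-zero rounding matches the hand computation above; `np.round` alone rounds halves to even and would turn \( 12.5 \) into \( 12 \)):

```python
import numpy as np

W_fp32 = np.array([[0.5, -0.25, 0.125],
                   [-1.0, 1.75, -0.625]], dtype=np.float32)
scale, zero_point = 0.01, 0

# Round halves away from zero, then add the zero-point.
r = W_fp32 / scale
q = np.sign(r) * np.floor(np.abs(r) + 0.5) + zero_point

# Clamp to the signed INT8 range before casting.
W_int8 = np.clip(q, -128, 127).astype(np.int8)
print(W_int8)
# [[  50  -25   13]
#  [-100  127  -63]]
```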
\[ W_{FP32} \approx \text{Scale} \times (W_{INT8} - \text{Zero-point}) \]

For example, converting the INT8 value \( 50 \) back to FP32:

\[ W_{FP32} \approx 0.01 \times (50 - 0) = 0.5 \]

This shows how you can quantize and (optionally) dequantize weights when converting between FP32 and INT8.
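The corresponding dequantization in NumPy (continuing the sketch above; note that the clamped entry does not survive the round trip: \( 1.75 \) was stored as \( 127 \) and dequantizes to \( 1.27 \)):

```python
import numpy as np

W_int8 = np.array([[50, -25, 13],
                   [-100, 127, -63]], dtype=np.int8)
scale, zero_point = 0.01, 0

W_fp32_approx = scale * (W_int8.astype(np.float32) - zero_point)
print(W_fp32_approx)
# [[ 0.5  -0.25  0.13]
#  [-1.    1.27 -0.63]]
# 1.75 -> 1.27 (clamping error); 0.125 -> 0.13 and -0.625 -> -0.63 (rounding error).
```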