Affine Quantization

Affine quantization is a commonly used quantization technique, especially in deep learning and neural networks, for mapping floating-point values (typically 32-bit, FP32) to lower-precision integer values (e.g., 8-bit integers, INT8). It does this by linearly transforming the original floating-point values to integers through scaling and offsetting. This transformation reduces memory usage and computational cost while maintaining reasonable model accuracy.

Key Components of Affine Quantization
  1. Scale (S):
    This is a factor used to map the range of floating-point values onto the integer range. A smaller scale yields a finer-grained (more precise) quantized representation; a larger scale covers a wider value range at coarser granularity.

  2. Zero-point (Z):
    The zero-point is an integer value that corresponds to zero in the floating-point space. It helps shift the range of the quantized values to match the range of the original data.

Formula for Affine Quantization

Given a floating-point value \( x \), affine quantization converts it into an integer \( q \) using the following formula:

\[ q = \text{round}\left(\frac{x}{S}\right) + Z \]

Where:

  • \( q \) is the quantized integer value.
  • \( x \) is the original floating-point value.
  • \( S \) is the scale factor.
  • \( Z \) is the zero-point.

In practice, the result is also clamped (saturated) to the representable integer range, for example \([-128, 127]\) for signed INT8.
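As an illustrative sketch, the formula translates directly into Python. One caveat: the worked examples in this document break rounding ties away from zero, whereas Python's built-in `round()` rounds ties to even, so the tie-breaking is written out explicitly:

```python
import math

def quantize(x, scale, zero_point):
    """Affine-quantize x: q = round(x / S) + Z.

    Ties are rounded away from zero (12.5 -> 13), matching the worked
    examples in this document; Python's built-in round() would give 12.
    Real INT8 implementations additionally clamp q to the integer range.
    """
    q = int(math.copysign(math.floor(abs(x / scale) + 0.5), x))
    return q + zero_point

print(quantize(0.25, 0.01, 128))  # -> 153
```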

Dequantization Formula

To convert a quantized integer back into a floating-point value, the reverse process is applied:

\[ x \approx S \times (q - Z) \]
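A matching dequantization sketch; the values used here (a quantized value of 153 with \(S = 0.01\), \(Z = 128\)) are purely illustrative:

```python
def dequantize(q, scale, zero_point):
    # x ~ S * (q - Z): exact only up to the rounding (and any clamping)
    # that was applied during quantization.
    return scale * (q - zero_point)

print(dequantize(153, 0.01, 128))  # -> 0.25
```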

Example

Let’s say we have a floating-point value \( x = 0.25 \) that we want to quantize to an 8-bit integer representation. Assume the following parameters:

  • Scale (S): 0.01
  • Zero-point (Z): 128 (since unsigned 8-bit integers range from 0 to 255)

Using the quantization formula:

\[ q = \text{round}\left(\frac{0.25}{0.01}\right) + 128 = \text{round}(25) + 128 = 153 \]

So, the quantized integer value \( q \) is 153.

Step-by-Step Example

Converting FP32 to INT8

This example walks through each step in detail. Suppose we have a weight matrix in FP32 format that we want to quantize to INT8.

Step 1: Original FP32 Weight Matrix

Here’s an example of a simple FP32 weight matrix (each element is a 32-bit floating-point number):

\[ W_{FP32} = \begin{pmatrix} 0.5 & -0.25 & 0.125 \\ -1.0 & 1.75 & -0.625 \end{pmatrix} \]

Step 2: Define Quantization Parameters

To convert from FP32 to INT8, we need two key parameters:

  1. Scale: A factor used to map FP32 values to INT8 values.
  2. Zero-point: A value that maps 0 in FP32 to an integer within the INT8 range.

For this example we use the signed INT8 range, \(-128\) to \(127\), and for simplicity assume:

  • Scale: 0.01
  • Zero-point: 0
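The example simply assumes Scale = 0.01 and Zero-point = 0, but in practice both are usually derived from the observed value range (min/max calibration). A minimal sketch of one common approach; the helper name `choose_qparams` is ours, not a library API:

```python
import numpy as np

def choose_qparams(x, qmin=-128, qmax=127):
    # Min/max calibration: stretch the observed value range over the
    # integer range. The range is widened to include 0 so that 0.0 is
    # exactly representable after quantization.
    x_min = min(float(np.min(x)), 0.0)
    x_max = max(float(np.max(x)), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

W = np.array([[0.5, -0.25, 0.125],
              [-1.0, 1.75, -0.625]], dtype=np.float32)
scale, zero_point = choose_qparams(W)
print(round(scale, 4), zero_point)  # -> 0.0108 -35
```

With these calibrated parameters the matrix would span the full \([-128, 127]\) range; the fixed \(S = 0.01\) used below is chosen for arithmetic simplicity instead, at the cost of saturating the largest value.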

Step 3: Quantization Formula

The quantization formula is:

\[ W_{INT8} = \text{round}\left(\frac{W_{FP32}}{\text{Scale}}\right) + \text{Zero-point} \]

Where:

  • \( W_{FP32} \) is the original floating-point weight.
  • \( \text{Scale} \) is the quantization scale.
  • \( \text{Zero-point} \) shifts the values to the INT8 range.

Step 4: Applying Quantization

Now, apply the quantization formula to each element of the FP32 weight matrix:

\[ W_{INT8} = \text{round}\left(\frac{W_{FP32}}{0.01}\right) + 0 \]

Let’s compute this for each element:

  • \( 0.5 \) → \( \text{round}\left(\frac{0.5}{0.01}\right) = \text{round}(50) = 50 \)
  • \( -0.25 \) → \( \text{round}\left(\frac{-0.25}{0.01}\right) = \text{round}(-25) = -25 \)
  • \( 0.125 \) → \( \text{round}\left(\frac{0.125}{0.01}\right) = \text{round}(12.5) = 13 \)
  • \( -1.0 \) → \( \text{round}\left(\frac{-1.0}{0.01}\right) = \text{round}(-100) = -100 \)
  • \( 1.75 \) → \( \text{round}\left(\frac{1.75}{0.01}\right) = \text{round}(175) = 175 \) (but capped at 127 due to INT8 range)
  • \( -0.625 \) → \( \text{round}\left(\frac{-0.625}{0.01}\right) = \text{round}(-62.5) = -63 \)
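The element-wise steps above can be reproduced with NumPy. Note that `np.round` rounds ties to even (12.5 -> 12), so half-away-from-zero rounding is written out explicitly to match the hand computation, and `np.clip` performs the saturation of 175 to 127:

```python
import numpy as np

def round_half_away(x):
    # Round ties away from zero (12.5 -> 13, -62.5 -> -63);
    # np.round would round both to the nearest even integer instead.
    return np.sign(x) * np.floor(np.abs(x) + 0.5)

W_fp32 = np.array([[0.5, -0.25, 0.125],
                   [-1.0, 1.75, -0.625]])
scale, zero_point = 0.01, 0

q = round_half_away(W_fp32 / scale) + zero_point
W_int8 = np.clip(q, -128, 127).astype(np.int8)  # saturate 175 -> 127
print(W_int8.tolist())  # -> [[50, -25, 13], [-100, 127, -63]]
```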

Step 5: Resulting INT8 Weight Matrix

After applying quantization, the INT8 weight matrix becomes:

\[ W_{INT8} = \begin{pmatrix} 50 & -25 & 13 \\ -100 & 127 & -63 \end{pmatrix} \]

Step 6: Dequantization (Optional)

To recover an approximation of the original FP32 values, we can use the dequantization formula:

\[ W_{FP32} \approx \text{Scale} \times (W_{INT8} - \text{Zero-point}) \]

For example, converting the INT8 value \(50\) back to FP32:

\[ W_{FP32} \approx 0.01 \times (50 - 0) = 0.5 \]

This shows how you can quantize and (optionally) dequantize weights when converting between FP32 and INT8.
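Dequantizing the whole matrix makes the information loss concrete: the rounded element 0.125 comes back as roughly 0.13, and the saturated element 1.75 comes back as roughly 1.27. A short sketch:

```python
import numpy as np

W_int8 = np.array([[50, -25, 13],
                   [-100, 127, -63]], dtype=np.int8)
scale, zero_point = 0.01, 0

W_recovered = scale * (W_int8.astype(np.float64) - zero_point)
print(W_recovered.tolist())
# rows are approximately [0.5, -0.25, 0.13] and [-1.0, 1.27, -0.63]:
# 0.125 -> ~0.13 (rounding error), 1.75 -> ~1.27 (saturation error)
```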