Exercise #1: Quantization Math

Objective

Learn the basic mathematics behind quantization. Use the following techniques to convert FP32 values to INT8/INT4:

  • Affine quantization
  • Uniform quantization
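As a rough illustration of the first technique, affine quantization maps a float x to an integer q = clip(round(x / scale) + zero_point, q_min, q_max), and recovers an approximation with x ≈ scale * (q - zero_point). The sketch below is not taken from the exercise notebook. The function names `affine_quantize` and `affine_dequantize`, the per-tensor min/max calibration, and the use of an `int8` container for INT4 values are all assumptions made for illustration.

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Quantize a float32 array to signed integers (hypothetical helper).

    Returns the integer codes plus the scale and zero point used.
    """
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    # Widen the observed range to include 0.0 so zero is exactly representable.
    x_min = min(float(np.min(x)), 0.0)
    x_max = max(float(np.max(x)), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    # Integer offset that aligns x_min with qmin.
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    # Assumption: INT4 codes are stored in an int8 container, not bit-packed.
    return q.astype(np.int8), scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Map integer codes back to approximate float32 values."""
    return (scale * (q.astype(np.float32) - zero_point)).astype(np.float32)

if __name__ == "__main__":
    x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
    for bits in (8, 4):
        q, scale, zp = affine_quantize(x, num_bits=bits)
        x_hat = affine_dequantize(q, scale, zp)
        print(f"INT{bits}: q={q}, scale={scale:.5f}, zero_point={zp}")
        print(f"        max abs error={np.max(np.abs(x - x_hat)):.5f}")
```

Running it for both bit widths shows the trade-off directly: INT4 has only 16 levels, so its scale is larger and the reconstruction error grows accordingly.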

Steps:

  1. Review the idea and mathematics behind the affine technique

Affine quantization

  2. Open the notebook locally or in Google Colab

  3. Run through the code to see affine quantization at work

  4. (Optional) Try out the uniform quantization technique in the notebook
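For the optional step, the following is a minimal sketch of one possible reading of uniform quantization. It assumes the term means a symmetric, scale-only mapping whose zero point is fixed at 0, so q = clip(round(x / scale), -q_max, q_max). The notebook may define the technique differently. The names `symmetric_quantize` and `symmetric_dequantize` are hypothetical.

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """Scale-only quantization with the zero point fixed at 0 (assumed variant)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0.0 else 1.0
    # Use the symmetric range [-qmax, qmax]; the most negative code is unused.
    q = np.clip(np.round(x / scale), -qmax, qmax)
    # Assumption: INT4 codes are stored in an int8 container, not bit-packed.
    return q.astype(np.int8), scale

def symmetric_dequantize(q, scale):
    """Map integer codes back to approximate float32 values."""
    return (scale * q.astype(np.float32)).astype(np.float32)

if __name__ == "__main__":
    x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
    for bits in (8, 4):
        q, scale = symmetric_quantize(x, num_bits=bits)
        x_hat = symmetric_dequantize(q, scale)
        print(f"INT{bits}: q={q}, scale={scale:.5f}")
        print(f"        max abs error={np.max(np.abs(x - x_hat)):.5f}")
```

Compared with the affine mapping, this variant needs no zero point, which simplifies integer arithmetic, but it wastes part of the integer range when the input values are not centered around zero.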

Open notebook locally

ex-1-quantization-math

Google Colab

  • Open the notebook in Google Colab