Quantization: A Deep Dive into Model Compression

An in-depth exploration of quantization techniques for model compression and efficiency.

Introduction

Imagine you want to run Llama 2 70B on your machine. There’s just one problem: in its native FP32 precision, the model weights alone take up roughly 280GB of RAM, plus around 20GB of additional memory for context, which grows with sequence length. That’s more than most high-end GPUs can handle, let alone a laptop.

Now what if you could shrink the model down to 35GB, or even 17GB, without losing much of its quality?

That’s what quantization does. It is the process of reducing the numerical precision of a model’s weights and activations (for example, converting 32-bit floating point numbers to 8-bit integers) so that the model gets smaller and needs less memory and compute to run.

Number Representation & Data Types

Before we can shrink a model, we need to understand what we are shrinking. AI models internally perform mathematical operations on weights and activations, and how those parameters are stored determines both the precision and accuracy of the results and the memory the model consumes.

Weights & activations are stored in floating point format, which has three components:

  • Sign bit: Indicates whether the number is positive or negative.
  • Exponent: Determines the range of the number (how large or small it can be).
  • Mantissa (or significand): Represents the precision of the number (how many decimal places it can have).

More bits means more room for the exponent and mantissa, which allows for a wider range of values and greater precision. For example, FP32 (32-bit floating point) has 1 sign bit, 8 bits for the exponent, and 23 bits for the mantissa, while FP16 (16-bit floating point) has 1 sign bit, 5 bits for the exponent, and 10 bits for the mantissa.

Precision refers to the number of significant figures or decimal places a number can represent. In floating point numbers, precision is determined by the mantissa—more bits in the mantissa mean finer granularity and better accuracy for decimal values.
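Precision is easiest to see by storing the same value at different bit widths. Here is a small sketch (using PyTorch, which this post also uses later) showing how 0.1, which has no exact binary representation, gets rounded more coarsely as the mantissa shrinks; the digit counts in the comments are approximate:

```python
import torch

fp32 = torch.tensor(0.1, dtype=torch.float32)
fp16 = torch.tensor(0.1, dtype=torch.float16)
bf16 = torch.tensor(0.1, dtype=torch.bfloat16)

# Fewer mantissa bits -> larger rounding error around the same value.
print(f"FP32: {fp32.item():.10f}")  # close to 0.1 out to ~7 significant digits
print(f"FP16: {fp16.item():.10f}")  # close to 0.1 out to ~3-4 significant digits
print(f"BF16: {bf16.item():.10f}")  # close to 0.1 out to ~2-3 significant digits
```

BF16 ends up less precise than FP16 here despite having the same total bits, because it spends its bit budget on the exponent instead of the mantissa.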

Bits Spectrum

Here’s how the common data types compare:

| Data Type | Total Bits | Sign Bits | Exponent Bits | Mantissa Bits | Range of Values | Memory Usage |
|-----------|------------|-----------|---------------|---------------|-----------------|--------------|
| FP32 | 32 | 1 | 8 | 23 | ~-3.4e38 to ~3.4e38 | 4 bytes |
| FP16 | 16 | 1 | 5 | 10 | -65504 to 65504 | 2 bytes |
| BF16 | 16 | 1 | 8 | 7 | ~-3.39e38 to ~3.39e38 | 2 bytes |
| INT8 | 8 | 0 | 0 | 8 | -128 to 127 | 1 byte |
| INT4 | 4 | 0 | 0 | 4 | -8 to 7 | 0.5 byte |
| INT2 | 2 | 0 | 0 | 2 | -2 to 1 | 0.25 byte |
| INT1 | 1 | 0 | 0 | 1 | -1 to 0 | 0.125 byte |

A few things to notice:

  • FP16 halves the memory usage of FP32 but sacrifices both range and precision. This can cause overflow issues during training when gradients get very large.
  • BF16 is an interesting compromise: it keeps the same exponent (and therefore range) as FP32 but has even less precision than FP16. This makes it more effective for training because gradient magnitudes are preserved even if the values are slightly less precise.
  • INT8 and lower-bit integer formats (INT4, INT2, INT1) have no exponent and no mantissa, just whole numbers within a fixed range. They’re much cheaper to compute with but can’t represent the same nuance as floating point formats.

As you can see, as we reduce the number of bits, we also reduce the range and precision of the values we can represent. This is the trade-off that quantization makes: by using fewer bits, we can save memory and computational resources, but we may lose some accuracy in the process.
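The range side of this trade-off is easy to demonstrate. As a quick sketch, casting a value above FP16’s maximum of 65504 overflows to infinity, while BF16, which shares FP32’s exponent bits, represents it without trouble:

```python
import torch

big = torch.tensor(100000.0)  # comfortably inside FP32's range

print(big.half())      # exceeds FP16's max of 65504, so it overflows to inf
print(big.bfloat16())  # BF16 keeps FP32's exponent range, so it stays finite
```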

How Memory Is Calculated

Each parameter in a model is stored as a number in a given data type, and each data type uses a fixed number of bits. So, the memory usage of a model can be calculated using the formula:

\(\text{Memory Usage} = \frac{\text{Number of Parameters} \times \text{Bits per Parameter}}{8} \text{ bytes}\)

Where:

  • Number of Parameters is the total number of weights and biases in the model.
  • Bits per Parameter is the number of bits used to represent each parameter (e.g., 32 for FP32, 16 for FP16, 8 for INT8, etc.).
  • We divide by 8 to convert bits to bytes.

Let’s take a 7 billion parameter model as an example:

| Data Type | Bits per Parameter | Calculation | Memory Usage |
|-----------|--------------------|-------------|--------------|
| FP32 | 32 | (7e9 × 32) / 8 | 28 GB |
| FP16 | 16 | (7e9 × 16) / 8 | 14 GB |
| BF16 | 16 | (7e9 × 16) / 8 | 14 GB |
| INT8 | 8 | (7e9 × 8) / 8 | 7 GB |
| INT4 | 4 | (7e9 × 4) / 8 | 3.5 GB |
| INT2 | 2 | (7e9 × 2) / 8 | 1.75 GB |
| INT1 | 1 | (7e9 × 1) / 8 | 0.875 GB |

As you can see, by reducing the precision of the data type, we can significantly reduce the memory usage of the model, especially for large models with billions of parameters, where the memory requirements can quickly become unmanageable.
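The formula is simple enough to script. Here is a minimal sketch (the helper name `model_memory_gb` is mine, purely for illustration) that reproduces the numbers in the table above:

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory usage in GB: (parameters * bits per parameter) / 8 bytes, then / 1e9."""
    return num_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at different precisions:
for dtype, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{dtype}: {model_memory_gb(7e9, bits):.3f} GB")

# Output:
# FP32: 28.000 GB
# FP16: 14.000 GB
# INT8: 7.000 GB
# INT4: 3.500 GB
```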

Seeing It in Code

import torch

model = torch.randn(1000, 1000)  # a simple FP32 tensor with one million elements

print(f"FP32: {model.element_size() * model.nelement() / 1e6:.2f} MB")
print(f"FP16: {model.half().element_size() * model.nelement() / 1e6:.2f} MB")
print(f"INT8: {model.char().element_size() * model.nelement() / 1e6:.2f} MB")

# Output:
# FP32: 4.00 MB
# FP16: 2.00 MB
# INT8: 1.00 MB

In this example, we create a random tensor of shape (1000, 1000), which has one million elements, and calculate its memory usage in FP32, FP16, and INT8 formats. As you can see, the memory usage decreases as we reduce the precision of the data type: same tensor, same shape, but 4x less memory just by changing the data type. If we scale this to a model with billions of parameters, the savings become significant, allowing us to run larger models on hardware with limited resources.

Quantization Techniques

In practice, we do not need to map the entire FP32 range [-3.4e38, 3.4e38] to the smaller INT8 range [-128, 127]. Instead, we need a way to map the range of our actual data (the model’s parameters and activations) to the smaller range of the target data type.

Symmetric & Asymmetric Quantization are two common techniques for doing this mapping and are forms of linear mapping.

Symmetric Quantization

In symmetric quantization, the range of the original values is mapped to a symmetric range around zero in the quantized space. This means that the quantized value for zero in the original data type space is exactly zero in the quantized space.

[Figure: symmetric quantization maps the range of original values to a symmetric range around zero in the quantized space.]

In symmetric quantization, we assume that the data is centered around zero and use the same scale factor for both positive and negative values. The formula for symmetric quantization is:

\(q = \text{round}\left(\frac{x}{s}\right)\)

Where:

  • $q$ is the quantized value (e.g., an INT8 value).
  • $x$ is the original floating point value (e.g., an FP32 value).
  • $s$ is the scale factor, which is calculated as: \(s = \frac{\max(|x_{\text{min}}|, |x_{\text{max}}|)}{2^{b-1} - 1}\)
  • $x_{\text{min}}$ and $x_{\text{max}}$ are the minimum and maximum values in the data (e.g., the weights or activations).
  • $b$ is the number of bits in the target data type (e.g., 8 for INT8).
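Putting the two formulas together, here is a minimal sketch of symmetric quantization to INT8 in PyTorch (the function names are mine, not a library API):

```python
import torch

def symmetric_quantize(x: torch.Tensor, bits: int = 8):
    """Quantize a float tensor to signed integers using symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1        # 127 for INT8
    # The scale maps the largest absolute value onto the top of the integer range.
    scale = x.abs().max() / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

def symmetric_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map the integers back to floats; rounding error is at most scale / 2."""
    return q.float() * scale

w = torch.randn(4, 4)                 # pretend these are FP32 weights
q, s = symmetric_quantize(w)
w_hat = symmetric_dequantize(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```

Note that a zero in the original tensor quantizes exactly to integer zero, which is the defining property of symmetric quantization described above.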
This post is licensed under CC BY 4.0 by the author.