Explore quantization and why it’s used to optimize large machine learning models. Learn how this technique can affect the speed, efficiency, accuracy, and resource requirements of your model.
Quantization is a compression technique that reduces the numerical precision of a machine learning model so it can operate more efficiently without substantial performance loss. As data becomes more readily available and machine learning models grow more complex, techniques that effectively manage and process this information are necessary to support the development of advanced artificial intelligence systems.
Quantization offers a way to retain most of a model's accuracy without requiring specialized computational resources or advanced hardware components. By exploring what quantization is and the techniques available, you can decide whether and how to use this method to enhance your AI algorithms.
Quantization is a technique for reducing the size of machine learning models without significantly sacrificing accuracy or function. Large language models (LLMs) and signal processing models often rely on large volumes of high-precision data. While high-precision data often leads to high accuracy, the computational resources needed to run these types of machine learning models can limit their use in many applications, such as on edge devices or in resource-constrained settings.
Quantization converts high-precision values, such as 32-bit floating-point weights, into lower-precision values, such as 8-bit integers, while keeping information loss to a minimum. By tuning how you apply quantization, you can reduce your model's computational burden while keeping key functions intact. This increases processing speed, efficiency, and compatibility with different types of devices and systems.
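To make the idea concrete, here is a minimal sketch of one common scheme: affine (scale and zero-point) quantization of 32-bit floats to 8-bit integers. The helper functions and the random weight matrix are illustrative placeholders, not part of any particular framework.

```python
import numpy as np

def quantize_uint8(x):
    """Map float32 values onto 8-bit integers with an affine (scale + zero-point) scheme."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0      # guard against a constant input
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float32 values from the 8-bit representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for one layer's weights
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
print("worst-case rounding error:", np.abs(weights - restored).max())
```

Each value now occupies one byte instead of four, and the scale and zero point are all that is needed to map the integers back to approximate floating-point values.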
You might choose to use quantization in many situations. It may be especially relevant in the following scenarios.
You need to reduce your model's memory footprint. Production systems often run on edge devices or older hardware, which typically have less memory than the machines used to train models.
You need to compress images while maintaining key attributes. Quantization can reduce an image's color depth and compress signals so you can store more data in less space.
You want to store multimedia representations efficiently. If you have an influx of video, digital images, and audio, quantization can help you store high volumes of data without losing important information.
You want to increase inference speed. Quantization typically increases inference speed while reducing memory footprint and processing requirements.
You want to increase sustainability and accessibility. Because quantization reduces the computational power needed, it extends the applicability of large models and creates a more sustainable framework for implementation.
If you want to reduce the size of your model, you can choose between two main types of quantization: post-training quantization (PTQ) and quantization-aware training (QAT). Each has its own set of advantages and disadvantages, so it's important to consider your goals and priorities to ensure you pick the right strategy.
Post-training quantization (PTQ) is when you apply quantization to an existing model. In this case, you convert your model to a lower-precision version of itself without retraining your machine learning algorithm. You might choose this method if you want to prioritize speed and efficiency without redesigning your model.
With PTQ, you can decide between dynamic and static quantization. With dynamic quantization, quantization parameters for activations, such as the scale and zero point, are computed on the fly while the model runs. This can improve accuracy but is often slower than static quantization and requires more resources. With static quantization, you calibrate these parameters ahead of time using a representative dataset, which increases speed during execution and reduces computational load. This makes static quantization more compatible with edge devices or real-time applications.
Advantages: The main advantage of PTQ is that it doesn't require as much data as quantization-aware training because training has already taken place. Typically, you only need a small calibration subset of data to adjust the quantized model. Because of this, PTQ is usually a much faster and more efficient way to perform quantization than QAT.
Disadvantages: On the other hand, because you are converting an existing model rather than quantizing during training, you run a higher risk of degrading model performance in the process. If you are prioritizing performance over speed, you might choose QAT instead.
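As an illustration of what dynamic PTQ can look like in practice, the sketch below uses PyTorch's torch.quantization.quantize_dynamic to convert the Linear layers of a small placeholder model to int8. The model itself is a stand-in invented for this example, not one from the article.

```python
import torch
from torch import nn

# A small stand-in for an already-trained float32 model.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

# Replace the Linear layers with int8 versions; activation quantization
# parameters are computed on the fly at inference time (the "dynamic" part).
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)   # same interface, smaller and faster Linear layers
```

Static PTQ follows a similar pattern but adds a calibration pass over representative data before the model is converted.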
In contrast to PTQ, quantization-aware training (QAT) performs the conversion of high-precision data to low-precision data during pre-training or fine-tuning. This lets the model adapt to the lower-precision representation, which can improve the performance of the quantized model. If you have ample training data and are still in the process of designing your algorithm, this might be the right choice for you.
Advantages: Because QAT takes place during training, your model is likely to perform better with lower-precision data than models that you quantize after their development. For applications where you have representative training data and a clear idea of your goals, this approach can help yield the best results.
Disadvantages: Because QAT occurs during model training or fine-tuning, it requires a higher computational load and a large volume of high-quality training data. It also requires that you still be in the design or fine-tuning stage of your model. If you have an existing model or resource constraints, PTQ might offer worthwhile benefits.
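For orientation, here is one possible QAT workflow, sketched with PyTorch's eager-mode quantization API. The TinyNet model, the fbgemm backend choice, and the three-step training loop are all placeholder assumptions used to keep the example small.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Placeholder model; QuantStub/DeQuantStub mark where tensors enter and leave int8."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet()
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)    # insert fake-quantization observers

# Fine-tune as usual; fake-quant ops expose the model to int8 rounding during training.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(3):                                     # stand-in for a real training loop
    x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
int8_model = torch.quantization.convert(model)         # swap in real int8 modules
print(int8_model(torch.randn(1, 128)).shape)
```

Because the model sees the effects of low-precision rounding while it is still learning, it can compensate for them, which is the core reason QAT tends to preserve accuracy better than PTQ.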
Professionals in the machine learning and AI space may use quantization to optimize models for different applications. For example, as a developer who works with neural networks or machine learning models, you may be tasked with designing algorithms that balance performance and resource consumption.
In some cases, you might have resource constraints such as hardware requirements or memory limitations. Understanding quantization helps you determine the best way to design models that meet the requirements of your organization or applications and allows you to choose the techniques that best fit your use case.
Whether to perform quantization on your model, either during or after training, hinges on the requirements of your application and your available resources. Factors that may influence your decision include:
If your application requires high accuracy above all other considerations, then quantization may not be appropriate. Quantization often increases speed while mildly reducing accuracy. For many use cases, this trade-off may be worth it, but you will need to determine this for each individual scenario. If needed, you can experiment with different quantization levels to find the right one, if any, for your application, as the sketch after this list illustrates.
If you have limited memory storage, you may choose quantization to increase the amount of data you can store under a certain limit. This technique can convert and compress images, speech, text, audio, and other signal types to retain their information at a lower precision level.
If you embed your model within certain hardware, you may need to reduce the computational load. This constraint will determine the appropriate type and level of quantization. Some embedded devices only support certain data types, which might require quantization to convert data to an acceptable format.
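To illustrate that kind of experiment, the short sketch below quantizes the same synthetic tensor at several bit widths and reports how the reconstruction error grows as precision drops. The tensor and the chosen bit widths are arbitrary placeholders; in practice you would measure task accuracy on your own model and data.

```python
import numpy as np

x = np.random.randn(10_000).astype(np.float32)   # stand-in for a tensor of weights or activations

for bits in (16, 8, 4, 2):
    levels = 2 ** bits - 1
    scale = (x.max() - x.min()) / levels
    q = np.round((x - x.min()) / scale)           # quantize to the available integer levels
    restored = q * scale + x.min()                # map back to floats
    print(f"{bits}-bit: mean absolute error = {np.abs(x - restored).mean():.6f}")
```

Lower bit widths save more memory but discard more information, so the right level is the coarsest one that still meets your accuracy requirement.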
Quantization can help you optimize your machine learning models for faster inference, more efficient data storage, lower power consumption, and compatibility across devices. To further explore quantization, understanding the basics of artificial intelligence can help you conceptualize each element and how it fits into the bigger picture.
Start by exploring the IBM AI Engineering Professional Certificate or Generative AI with Large Language Models with AWS and DeepLearning.AI for a self-paced, comprehensive introduction to modern concepts.