DeepSpeed Unveils New Model Quantization Technique

Published February 26, 2026 by The Daily Scout

Microsoft's DeepSpeed library has introduced Mixture-of-Quantization (MoQ), a new technique for progressively quantizing models during the training process. MoQ is designed to significantly reduce the memory footprint and inference costs of large language models without a substantial loss in accuracy, making large-scale deployment more economically viable.

Why it matters

- Mixture-of-Quantization (MoQ) is a form of Quantization-Aware Training (QAT), which integrates quantization into the training process to minimize accuracy loss. Post-Training Quantization (PTQ), by contrast, converts a model that has already been fully trained. QAT techniques like MoQ generally achieve higher accuracy than PTQ, especially at lower precision such as INT4, but they require more compute and add complexity during the training phase.
- A key differentiator for MoQ is its progressive quantization schedule: it begins training with higher precision (e.g., 16-bit) and gradually reduces the bit-width according to a predefined schedule. This process can be dynamically adjusted for each layer using second-order information (eigenvalues) to determine a layer's sensitivity to quantization, allowing less sensitive layers to be quantized more aggressively.
- The primary motivation for MoQ is reducing the memory bandwidth bottleneck for large models, where the time spent loading parameters from memory is a dominant factor in inference latency. By quantizing only the model weights to formats like INT8 or INT4, while keeping activations in FP16, MoQ significantly reduces the model's memory footprint.
- For a 17-billion-parameter Turing-NLG model, combining MoQ with DeepSpeed's inference optimizations enabled it to run on a single GPU with a 1.7x latency reduction and a 6.2x cost saving compared to a 4-GPU baseline.
- The shift from 32-bit floating-point (FP32) or 16-bit (FP16) to lower-precision 8-bit or 4-bit integers (INT8/INT4) directly impacts hardware efficiency. It allows AI accelerators and modern CPUs with specialized instruction sets, like Intel's VNNI, to leverage faster integer arithmetic, reducing power consumption and increasing throughput.
- DeepSpeed itself is a broader library of tools from Microsoft Research for large-scale AI, known for innovations like ZeRO (Zero Redundancy Optimizer). MoQ is one component of the DeepSpeed Inference suite, which also includes optimizations for parallelism and high-performance kernels.
- The competitive landscape for advanced quantization includes other techniques like Activation-aware Weight Quantization (AWQ), which focuses on protecting salient weights with large activation magnitudes, and GPTQ, another post-training method designed to achieve near-QAT accuracy.
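
To make the progressive schedule concrete, here is a minimal Python sketch. It is not DeepSpeed's implementation or API. The function names, the one-bit-per-drop rule, the doubling period, and the `sensitivity` score (a stand-in for the eigenvalue-based measure, normalized to the range 0 to 1) are all assumptions chosen for illustration.

```python
def bits_at_step(step, start_bits=16, target_bits=8, base_period=100, sensitivity=0.0):
    """Return the weight bit-width to use at a given training step.

    Hypothetical schedule (an assumption, not DeepSpeed's code): drop one bit
    each time a period elapses, and double the period after every drop so that
    lower precisions get more training steps. A layer with a higher
    `sensitivity` score gets a longer period, so it is quantized more slowly.
    """
    period = int(base_period * (1.0 + sensitivity))
    bits = start_bits
    next_drop = period
    while bits > target_bits and step >= next_drop:
        bits -= 1
        period *= 2
        next_drop += period
    return bits


def fake_quantize(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization.

    Only the weights are touched; activations would stay in higher precision.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax              # one scale for the whole group
    return [round(w / scale) * scale for w in weights]


if __name__ == "__main__":
    layer_weights = [0.5, -0.25, 0.1, 0.03]
    for step in (0, 100, 300, 700, 30000):
        bits = bits_at_step(step)
        print(step, bits, fake_quantize(layer_weights, bits))
```

In this sketch a layer with `sensitivity=1.0` waits twice as long before each precision drop as a layer with `sensitivity=0.0`, which is one simple way to express the idea that less sensitive layers can be quantized more aggressively.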

Key numbers

- 17 billion: parameters in the Turing-NLG model used in the reported benchmark.
- 1 GPU instead of 4: hardware needed to serve that model once MoQ is combined with DeepSpeed's inference optimizations.
- 1.7x: latency reduction compared to the 4-GPU baseline.
- 6.2x: cost saving compared to the 4-GPU baseline.
- INT8 or INT4: target formats for the quantized weights, with activations kept in FP16.
- 16-bit: an example starting precision before the schedule begins lowering the bit-width.
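
A back-of-the-envelope calculation shows why weight precision matters for fitting a 17-billion-parameter model on a single GPU. The Python sketch below counts weight storage only. It ignores quantization scales, activations, and any other runtime buffers, and it uses decimal gigabytes (10^9 bytes). The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 17e9  # 17 billion parameters, as in the Turing-NLG example

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")

# Prints:
# FP32: 68.0 GB
# FP16: 34.0 GB
# INT8: 17.0 GB
# INT4: 8.5 GB
```

Under these assumptions, moving the weights from FP16 to INT8 halves the bytes that must be read from memory, and INT4 halves them again.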
