DeepSpeed Unveils New Model Quantization Technique

Microsoft's DeepSpeed library has introduced Mixture-of-Quantization (MoQ), a new technique for progressively quantizing models during the training process. MoQ is designed to significantly reduce the memory footprint and inference costs of large language models without a substantial loss in accuracy, making large-scale deployment more economically viable.

- Mixture-of-Quantization (MoQ) is a form of Quantization-Aware Training (QAT), which integrates quantization into the training process to minimize accuracy loss, contrasting with Post-Training Quantization (PTQ) that converts a fully trained model. QAT techniques like MoQ generally achieve higher accuracy than PTQ, especially for lower precision like INT4, but require more computational resources and complexity during the training phase.
- A key differentiator for MoQ is its progressive quantization schedule; it begins training with higher precision (e.g., 16-bit) and gradually reduces the bit-width according to a predefined schedule. This process can be dynamically adjusted for each layer using second-order information (eigenvalues) to determine a layer's sensitivity to quantization, allowing less sensitive layers to be quantized more aggressively.
- The primary motivation for MoQ is reducing the memory bandwidth bottleneck for large models where parameter loading time from memory is a dominant factor in inference latency. By quantizing only the model weights to formats like INT8 or INT4, while keeping activations in FP16, MoQ significantly reduces the model's memory footprint.
- For a 17-billion-parameter Turing-NLG model, combining MoQ with DeepSpeed's inference optimizations enabled it to run on a single GPU with a 1.7x latency reduction and a 6.2x cost saving compared to a 4-GPU baseline.
- The shift from 32-bit floating-point (FP32) or 16-bit (FP16) to lower-precision 8-bit or 4-bit integers (INT8/INT4) directly impacts hardware efficiency. It allows AI accelerators and modern CPUs with specialized instruction sets, like Intel's VNNI, to leverage faster integer arithmetic, reducing power consumption and increasing throughput.
- DeepSpeed itself is a broader library of tools from Microsoft Research for large-scale AI, known for innovations like the ZeRO (Zero Redundancy Optimizer). MoQ is one component of the DeepSpeed Inference suite, which also includes optimizations for parallelism and high-performance kernels.
- The competitive landscape for advanced quantization includes other techniques like Activation-aware Weight Quantization (AWQ), which focuses on protecting salient weights with large activation magnitudes, and GPTQ, another post-training method designed to achieve near-QAT accuracy.
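
To make the progressive schedule and the weight-only memory savings described above more concrete, here is a minimal Python sketch. It is not DeepSpeed's API or implementation. The function names (`bits_at_step`, `fake_quantize`, `weight_memory_gb`), the "drop one bit every `period` steps" rule, and the symmetric per-tensor rounding are all assumptions chosen for illustration, and the eigenvalue-based per-layer adjustment is left out.

```python
def bits_at_step(step, start_bits=16, target_bits=8, period=100):
    """Hypothetical schedule (an assumption, not DeepSpeed's rule):
    drop one bit every `period` training steps until target_bits is reached."""
    return max(target_bits, start_bits - step // period)


def fake_quantize(weights, bits):
    """Symmetric per-tensor quantize-then-dequantize of a list of floats.

    Weights are snapped to a signed integer grid with `bits` bits and mapped
    back to floats, which is how QAT typically simulates low precision
    during training. Activations are not touched.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    qmax = 2 ** (bits - 1) - 1      # largest integer level, e.g. 127 for 8 bits
    scale = max_abs / qmax          # float value of one integer step
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def weight_memory_gb(num_params, bits):
    """Weight storage only, in decimal GB (ignores scales and activations)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.8]
    for step in (0, 400, 800, 2000):
        bits = bits_at_step(step)
        print(step, bits, fake_quantize(weights, bits))

    # 17e9 parameters: 34.0 GB at 16 bits, 17.0 GB at 8 bits, 8.5 GB at 4 bits
    for bits in (16, 8, 4):
        print(bits, weight_memory_gb(17e9, bits))
```

The memory figures count weight storage only. Under that simplification, halving the bit-width halves the bytes that must be loaded from memory, which is the bandwidth effect the summary attributes to weight-only quantization.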
