FP8 Quantization Now Mainstream

FP8 for training and inference has become a mainstream technique for performance optimization, with firms like Microsoft, Meta, and Google reporting 30-40% throughput improvements on Hopper and Blackwell GPUs. One case study showed a Llama-2 7B model trained entirely in FP8 matched BF16 accuracy while achieving a 34% throughput gain. This shift makes hardware-aware optimization a standard practice for reducing compute and memory requirements.

- NVIDIA's Hopper architecture introduced two FP8 data types: E4M3 (4 exponent bits, 3 mantissa bits), which offers more precision and is typically used for weights and activations in the forward pass, and E5M2 (5 exponent bits, 2 mantissa bits), which offers a wider dynamic range and is typically used for gradients. The subsequent Blackwell architecture roughly doubles FP8 performance, with the B200 GPU reaching approximately 9 petaFLOPS.
- The key advantage of FP8 over INT8 is its floating-point representation, which provides a greater dynamic range to handle the outlier values common in large models, reducing the risk of overflow and underflow errors. This makes FP8 particularly effective for quantizing transformer activations, which often have heavy-tailed distributions.
- While FP8 halves the memory footprint compared to BF16/FP16, successful implementation requires careful management of numerical stability. This often involves retaining higher precision (FP16/BF16) for gradients and optimizer states, and it may require adjusting hyperparameters such as the learning rate to prevent training divergence.
- NVIDIA's Transformer Engine is a key library that accelerates Transformer models by automatically managing mixed-precision training, dynamically switching between FP8 and FP16/BF16 to optimize performance while maintaining accuracy. It is designed to leverage the 4th and 5th generation Tensor Cores found in Hopper and Blackwell GPUs, respectively.
- The overhead of quantizing and dequantizing tensors can be a performance bottleneck, sometimes consuming up to 30% of the total execution time for matrix multiplication operations on H800 GPUs. To mitigate this, techniques like operator fusion combine quantization steps with preceding operations (e.g., LayerNorm), reducing redundant memory read/write cycles.
- Initial native support for FP8 data types has appeared in PyTorch 2.3, allowing for more direct and flexible implementation of custom FP8 algorithms beyond the higher-level abstractions of libraries like Transformer Engine. TensorFlow has also integrated FP8 GEMM into its XLA backend.
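
To make the precision versus range trade-off concrete, here is a minimal pure-Python sketch that simulates rounding to the two FP8 formats and compares a per-tensor-scaled FP8 round trip with symmetric INT8 on a small tensor containing one outlier. It assumes the finite-only E4M3 encoding (maximum 448) and the IEEE-style E5M2 encoding (maximum 57344). The helper names (`fp8_round`, `quantize_per_tensor`, `int8_roundtrip`) and the sample values are hypothetical illustrations, not the API of Transformer Engine or PyTorch, and real kernels do this work in hardware.

```python
import math

# Assumed format parameters: E4M3 is the finite-only variant (max 448),
# E5M2 follows IEEE-style conventions (max finite value 57344).
FP8_FORMATS = {
    "E4M3": {"man_bits": 3, "bias": 7, "max": 448.0},
    "E5M2": {"man_bits": 2, "bias": 15, "max": 57344.0},
}

def fp8_round(x, fmt):
    """Round x to the nearest value representable in fmt, saturating at the max."""
    spec = FP8_FORMATS[fmt]
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), spec["max"])            # saturate instead of overflowing
    _, e = math.frexp(mag)                    # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, 1 - spec["bias"])        # clamp to the subnormal exponent
    step = 2.0 ** (exp - spec["man_bits"])    # spacing of representable values
    return sign * round(mag / step) * step    # round-half-to-even on the mantissa

def quantize_per_tensor(values, fmt):
    """Scale the tensor so its largest magnitude maps to the format's max."""
    amax = max(abs(v) for v in values)
    scale = FP8_FORMATS[fmt]["max"] / amax if amax > 0 else 1.0
    return [fp8_round(v * scale, fmt) for v in values], scale

def dequantize(quantized, scale):
    return [q / scale for q in quantized]

def int8_roundtrip(values):
    """Symmetric per-tensor INT8 quantize and dequantize, for comparison."""
    amax = max(abs(v) for v in values)
    scale = 127.0 / amax
    return [max(-127, min(127, round(v * scale))) / scale for v in values]

# Hypothetical activations: mostly small values plus one large outlier.
activations = [0.01, -0.02, 0.5, 8.0]

q, scale = quantize_per_tensor(activations, "E4M3")
fp8_values = dequantize(q, scale)
int8_values = int8_roundtrip(activations)

for original, f8, i8 in zip(activations, fp8_values, int8_values):
    print(f"{original:>6}  fp8={f8:.5f}  int8={i8:.5f}")

# Same input, different precision: E4M3 keeps one more mantissa bit than E5M2.
print(fp8_round(1.1, "E4M3"), fp8_round(1.1, "E5M2"))  # 1.125 1.0
```

In this sketch the outlier forces the INT8 grid to be so coarse that the smallest values round to zero, while the FP8 round trip keeps them within a few percent. That is the heavy-tailed-activation behavior described in the bullets above, shown on toy data rather than measured on a real model.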
