FP8 adoption cuts model serving costs
Major technology companies including Microsoft, Meta, and Google are now training and serving frontier models using FP8 precision, which can reduce compute and memory costs by up to 50% compared to BF16. Benchmarks for a Llama-2 7B model running in FP8 on Hopper and Blackwell GPUs showed a 34% throughput gain while maintaining accuracy.
- FP8 comes in two main formats: E4M3 (4 exponent bits, 3 mantissa bits), which offers higher precision, and E5M2 (5 exponent bits, 2 mantissa bits), which offers a wider dynamic range. E4M3 is typically used for weights and activations in the forward pass, while the wider range of E5M2 is better suited to gradients in the backward pass.
- NVIDIA's Transformer Engine library is central to using FP8: it automatically casts operations to FP8 and manages the scaling factors needed to maintain accuracy. The library provides a framework-agnostic C++ API and integrates with major deep learning libraries like PyTorch, JAX, and TensorFlow.
- Hardware support is a key enabler. NVIDIA's Hopper, Ada Lovelace, and Blackwell architectures feature fourth- and fifth-generation Tensor Cores designed for FP8 operations. For instance, the H100 GPU's Transformer Engine with FP8 delivers up to 9 times faster AI training and up to 30 times faster inference on large language models compared to the A100.
- While FP8 offers significant performance gains, its reduced precision and range can introduce numerical instability. To counter this, techniques like delayed scaling, where scaling factors are derived from a history of tensor statistics rather than from the current tensor, are used to maintain model accuracy and training stability.
- Blackwell GPUs build upon Hopper's capabilities with a second-generation Transformer Engine and support for new microscaling formats like MXFP8. MXFP8 assigns a separate scaling factor to each block of values within a tensor, rather than one factor for the whole tensor, which gives finer-grained control and better accuracy for tensors with wide dynamic ranges.
- Historically, new numerical formats like FP16 and BF16 have taken around 3-4 years to become widespread. Following this pattern, FP8 is projected to become the standard for training by approximately 2028.
- The transition to FP8 is not without challenges for enterprise adoption, including the need for modern hardware and potential model sensitivity to reduced precision. Additionally, broader issues like data quality, integration with legacy systems, and a shortage of skilled AI professionals can hinder the adoption of new technologies like FP8.
- FP8 shows a significant advantage over INT8 for quantizing LLMs because of its higher dynamic range, which copes better with the outlier-prone activations common in transformer models. This makes FP8 suitable for quantizing not just weights but also activations and the KV cache, leading to more comprehensive performance improvements.
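The precision versus range trade-off between E4M3 and E5M2 can be checked by decoding the bit patterns directly. The sketch below is illustrative only. It assumes the common FP8 conventions in which E5M2 reserves its all-ones exponent for infinity and NaN, while E4M3 has no infinities and a single NaN pattern per sign. The function names are hypothetical and do not come from any library.

```python
import math

def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FP8 bit pattern (1 sign bit, then exponent, then mantissa)."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_mask = (1 << exp_bits) - 1
    man_mask = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_mask
    man = byte & man_mask
    if exp == exp_mask:
        if exp_bits == 5:
            # Assumption: E5M2 follows IEEE-style inf/NaN encoding.
            return sign * math.inf if man == 0 else math.nan
        if man == man_mask:
            # Assumption: E4M3 uses only all-ones exponent + mantissa as NaN.
            return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def positive_range(exp_bits: int, man_bits: int) -> tuple:
    """Smallest and largest positive finite values of the format."""
    vals = [decode_fp8(b, exp_bits, man_bits) for b in range(128)]  # sign bit 0
    finite = [v for v in vals if math.isfinite(v) and v > 0]
    return min(finite), max(finite)

def steps_between_1_and_2(exp_bits: int, man_bits: int) -> int:
    """How many representable values fall in [1, 2): a proxy for precision."""
    vals = [decode_fp8(b, exp_bits, man_bits) for b in range(128)]
    return sum(1 for v in vals if 1.0 <= v < 2.0)

if __name__ == "__main__":
    for name, (e, m) in {"E4M3": (4, 3), "E5M2": (5, 2)}.items():
        print(name, positive_range(e, m), steps_between_1_and_2(e, m))
```

Under these assumptions E4M3 tops out at 448 with 8 steps between 1 and 2, while E5M2 reaches 57344 with only 4 steps, which is the trade-off the first bullet describes.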
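Delayed scaling, mentioned above, can be sketched in a few lines. This is a simplified illustration, not Transformer Engine's implementation: the class name `DelayedScaler`, the history length, and the choice of taking the maximum over the history are assumptions made for the example, and the code only scales and saturates values without rounding them to the FP8 grid.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

class DelayedScaler:
    """Hypothetical helper: picks the scale from past amax values only."""

    def __init__(self, history_len: int = 16, fp8_max: float = E4M3_MAX):
        self.history = deque(maxlen=history_len)  # recent absolute maxima
        self.fp8_max = fp8_max

    def scale(self) -> float:
        # No statistics yet (or all zeros): fall back to a neutral scale.
        if not self.history or max(self.history) == 0.0:
            return 1.0
        # Map the largest recently seen magnitude onto the FP8 maximum.
        return self.fp8_max / max(self.history)

    def step(self, tensor: list) -> tuple:
        s = self.scale()  # derived from previous steps, not from `tensor`
        # Scale and saturate; rounding to the FP8 grid is omitted here.
        scaled = [max(-self.fp8_max, min(self.fp8_max, x * s)) for x in tensor]
        # Record this step's amax so later steps can use it.
        self.history.append(max(abs(x) for x in tensor))
        return scaled, s

if __name__ == "__main__":
    scaler = DelayedScaler(history_len=2)
    for t in ([1.0, -2.0], [0.5, 4.0], [1.0]):
        print(scaler.step(t))
```

Because the scale lags the data by one step, a sudden jump in magnitude (the 4.0 in the second tensor) saturates once before the history catches up. That is the cost of not needing a separate pass over the current tensor before casting it.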
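The per-block scaling idea behind MXFP8 can be illustrated as follows. This sketch assumes a block size of 32 and a power-of-two shared scale chosen from the block's largest magnitude, which is how the OCP Microscaling formats are commonly described. The function names are hypothetical, and rounding of the elements to E4M3 is left out.

```python
import math

BLOCK = 32      # assumed block size
E4M3_EMAX = 8   # exponent of the largest E4M3 binade (2**8 = 256 <= 448)

def block_scales(values: list, block: int = BLOCK) -> list:
    """One power-of-two scale per block, derived from the block's amax."""
    scales = []
    for i in range(0, len(values), block):
        amax = max(abs(v) for v in values[i:i + block])
        if amax == 0.0:
            scales.append(1.0)  # assumption: neutral scale for all-zero blocks
            continue
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
        _, e = math.frexp(amax)
        scales.append(2.0 ** ((e - 1) - E4M3_EMAX))
    return scales

def scaled_block_amax(values: list, block: int = BLOCK) -> list:
    """Largest magnitude in each block after dividing by its own scale."""
    scales = block_scales(values, block)
    return [
        max(abs(v) for v in values[i:i + block]) / s
        for i, s in zip(range(0, len(values), block), scales)
    ]

if __name__ == "__main__":
    # Two blocks whose magnitudes differ by five orders of magnitude.
    tensor = [0.001] * 32 + [100.0] * 32
    print(block_scales(tensor))
    print(scaled_block_amax(tensor))
```

With one scale per block, both the tiny block and the large block land in the top binade of E4M3. A single per-tensor scale sized for the value 100 would instead push the 0.001 entries down into the subnormal range, where little precision is left.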
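The claim that FP8 handles outliers better than INT8 can be made concrete with a toy comparison. The sketch below quantizes a small, made-up activation vector containing one large outlier, using a single per-tensor scale for each format. The helper names and the sample data are assumptions for illustration, and the E4M3 rounding is a brute-force nearest-value search rather than anything a real kernel would do.

```python
def e4m3_grid() -> list:
    """All non-negative finite E4M3 magnitudes (bias 7, max 448)."""
    grid = [m / 8 * 2.0 ** -6 for m in range(8)]  # zero and subnormals
    for e in range(1, 16):
        for m in range(8):
            if e == 15 and m == 7:  # assumed NaN pattern, not a number
                continue
            grid.append((1 + m / 8) * 2.0 ** (e - 7))
    return grid

GRID = e4m3_grid()

def round_e4m3(x: float) -> float:
    """Round to the nearest E4M3 value, saturating at +/-448."""
    mag = min(GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag

def fake_quant_fp8(values: list) -> list:
    scale = max(abs(v) for v in values) / 448.0
    return [round_e4m3(v / scale) * scale for v in values]

def fake_quant_int8(values: list) -> list:
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_rel_error(values: list, approx: list) -> float:
    return sum(abs(a - v) / abs(v) for v, a in zip(values, approx)) / len(values)

if __name__ == "__main__":
    acts = [0.02, -0.05, 0.1, 0.3, -0.7, 100.0]  # one outlier dominates the scale
    print("FP8 :", mean_rel_error(acts, fake_quant_fp8(acts)))
    print("INT8:", mean_rel_error(acts, fake_quant_int8(acts)))
```

In this toy case the outlier forces a coarse INT8 step, so most of the small activations round to zero, while the floating-point spacing of E4M3 keeps their relative error to a few percent. Real deployments typically use finer-grained scaling for both formats, so the gap here should be read as illustrative rather than representative.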