FP8 Becomes New Standard for LLM Training

Major AI labs including Microsoft, Meta, and Google are now training their largest models using FP8 precision, achieving 30–40% throughput improvements over BF16 while maintaining accuracy. The Llama-2 7B model was trained entirely in FP8, resulting in a 34% throughput gain. This shift is enabled by NVIDIA's Hopper and Blackwell GPU architectures, which feature a Transformer Engine optimized for FP8 operations.

- The FP8 format comes in two main variants: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). E4M3 offers more precision and is typically used for weights and activations in the forward pass, while E5M2 provides a wider dynamic range, making it better suited for gradients during the backward pass. - NVIDIA's Transformer Engine dynamically selects between FP8 and higher precision formats like FP16, applying scaling factors automatically to maintain model accuracy. This process is handled by the architecture, simplifying the adoption of mixed-precision training for developers. - For inference, which is often memory-bound, FP8's primary benefit is reducing the memory footprint of model weights and the KV cache by 50% compared to BF16. One benchmark using Mistral 7B on an H100 showed FP8 delivering a 33% improvement in output tokens per second and a 24% cost reduction per million tokens versus FP16. - While training in FP8, critical operations and accumulations are often kept in BF16 or FP32 to maintain numerical stability and prevent convergence issues. This mixed-precision approach is crucial for preserving model accuracy. - The adoption of FP8 was a cross-industry effort, with Arm, Intel, and NVIDIA jointly publishing a whitepaper to create a common standard for 8-bit floating-point formats. This collaboration aimed to ensure model interoperability across different hardware platforms. - FP8 has a higher dynamic range than INT8, making it more suitable for quantizing the full range of components in an LLM, including weights, activations, and the KV cache. This allows FP8 to retain more information, which is especially important for smaller models where every parameter is critical. - The latest NVIDIA Blackwell architecture introduces MXFP8, a new format that assigns different scaling factors to blocks of values within a single tensor. This more granular scaling allows for the consistent use of the higher-precision E4M3 format, further mitigating quantization errors.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.