FP8 Training Becomes New Industry Standard

Leading AI companies including Microsoft, Meta, and Google have adopted FP8 (8-bit floating point) for training frontier models, achieving 30-40% throughput improvements over the previous BF16 standard. The Llama-2 7B model was trained entirely in FP8, matching BF16 accuracy with a 34% throughput gain. This shift requires engineers to design training pipelines that leverage hardware like NVIDIA's Hopper and Blackwell GPUs with Transformer Engine support.

- The FP8 standard is composed of two distinct formats: E4M3 (4-bit exponent, 3-bit mantissa) is typically used for weights and activations to prioritize precision, while E5M2 (5-bit exponent, 2-bit mantissa) is used for gradients to accommodate their wider dynamic range. - NVIDIA's Transformer Engine is a key software and hardware component that automates the use of FP8, dynamically switching between 8-bit and 16-bit precision during training to maximize speed while maintaining model accuracy and stability. - The push for an 8-bit floating-point format was a collaborative effort, with NVIDIA, Arm, and Intel jointly authoring a 2022 whitepaper that proposed the FP8 standard for more efficient deep learning. - Adopting FP8 is not a simple drop-in replacement for BF16; it introduces challenges like numerical instability and requires more careful hyperparameter tuning to manage activation outliers and prevent gradient overflow or underflow. - To overcome instability, engineers have developed novel techniques, such as the "Smooth-SwiGLU" activation function, specifically designed to prevent the outlier amplification that can cause training divergence in full FP8 models. - The successor to the Hopper GPU architecture, NVIDIA Blackwell, pushes precision even lower by introducing support for FP4 and FP6 formats, indicating a continuing industry trend toward smaller data types for AI workloads. - While FP8 excels in general NLP tasks, the more stable BF16 format is often still preferred for training models on numerically sensitive tasks like arithmetic reasoning or code generation, where FP8 can lead to accuracy degradation. - Beyond throughput gains, FP8 offers significant memory savings; for instance, NVIDIA's COAT framework demonstrated a 1.54x reduction in memory usage compared to BF16 by also compressing optimizer states and activations into the 8-bit format.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.