Hyperscalers Adopt FP8 as New Standard for Model Training

Microsoft, Meta, and Google are now training large models entirely in the FP8 numerical format, cutting compute and memory requirements by approximately 50% compared to BF16. Training a Llama-2 7B model in FP8 reportedly achieved a 34% throughput gain while matching BF16 accuracy. This efficiency requires Nvidia Hopper or Blackwell generation GPUs and the Transformer Engine library.

- FP8 is not a single standard but a family of two main 8-bit floating-point formats: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). E4M3 offers more precision for values near zero, making it suitable for forward passes with weights and activations, while E5M2 provides a wider dynamic range, which is often better for representing gradients during the backward pass. - The move to FP8 is a continuation of the trend of using lower-precision formats to accelerate deep learning, which started with the shift from 32-bit floating-point (FP32) to mixed-precision training using 16-bit formats like FP16 and BFloat16 (BF16). While BF16 has been a standard for efficient training, FP8 offers a further reduction in memory and computational needs. - NVIDIA's Transformer Engine is a key library that simplifies the use of FP8 by managing the complexities of mixed-precision training. It automatically handles the casting of operations to FP8 within a specific context, updates scaling factors, and provides optimized building blocks for Transformer models, which is crucial as major deep learning frameworks do not yet natively support FP8. - The latest NVIDIA Blackwell architecture enhances FP8 support with new "micro-tensor" formats like FP4 and FP6, and introduces a more granular, hardware-handled scaling mechanism. This allows for even greater efficiency and is designed to accelerate trillion-parameter models, with Blackwell's FP8 performance rated at approximately 9 petaFLOPS compared to Hopper's 4 petaFLOPS. - While training is often compute-bound and sees a throughput gain of 1.3x to 1.5x with FP8, inference is typically memory-bound. For inference, the primary benefits of FP8 are a reduced memory footprint for model weights and the key-value cache, leading to lower latency. - The adoption of FP8 is not limited to NVIDIA GPUs; AMD is also enabling FP8 support for its MI300 GPUs through the ROCm Transformer Engine library. This indicates a broader industry trend towards leveraging lower-precision formats for accelerating Transformer models. - Achieving stable training in FP8 requires careful management of scaling factors to prevent the loss of small gradient values, a problem that becomes more pronounced in extended training runs. Research has shown that instabilities can arise from outlier amplification by certain activation functions, like SwiGLU, over long training periods. - The practical impact of FP8 has been demonstrated in training large models, such as a 7B parameter Llama model, on datasets with trillions of tokens. In one case, DeepL was able to build significantly larger and higher-quality translation models with FP8 training, achieving up to double the throughput at the same latency during inference compared to BF16.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.