FP8 emerges as new standard for training

Training large language models with FP8 (8-bit floating point) precision is reportedly cutting compute and memory needs by approximately 50% compared to BF16, while maintaining production-level accuracy. Major technology firms including Microsoft, Meta, and Google are now using FP8 for frontier model training, achieving throughput gains of 30-40%. The format is currently supported on NVIDIA's Hopper and Blackwell series GPUs.

- The FP8 specification, jointly published by NVIDIA, Arm, and Intel, defines two main variants: E4M3 (4 exponent bits, 3 mantissa bits) for higher precision and E5M2 (5 exponent bits, 2 mantissa bits) for a wider dynamic range. This allows for flexibility, with E4M3 often used for forward passes and E5M2 for backward passes, where gradients may require a larger range.
- NVIDIA's Transformer Engine, available in Hopper and subsequent architectures, is designed to leverage FP8. It uses heuristics to dynamically select between FP8 and FP16 precision for different neural network layers to maintain accuracy while maximizing throughput. Benchmarks on GPT-3 style models have shown that the Transformer Engine can boost FP8 performance by 60% on H100 GPUs.
- BF16 has a much larger dynamic range because of its 8 exponent bits, but FP8's two formats, combined with scaling factors that map each tensor into the representable range, enable more efficient hardware use. In terms of memory, FP8 uses 1 byte per value, whereas BF16 uses 2 bytes. This can lead to FP8 being up to twice as fast as BF16 in ideal hardware scenarios.
- The adoption of FP8 is not limited to proprietary solutions. The Open Compute Project (OCP) has established a specification for an 8-bit floating-point format (OFP8) to ensure interoperability across different hardware and software. Additionally, open-source libraries like vLLM have incorporated FP8 support, with contributions from companies like Neural Magic and Anyscale.
- For practical implementation, using FP8 with NVIDIA's Transformer Engine requires tensor dimensions to be divisible by 16. Operations deemed safe for lower precision are wrapped in Transformer Engine's `fp8_autocast` context manager for PyTorch, which handles the casting and scaling automatically.
- Looking ahead, the NVIDIA Blackwell architecture enhances low-precision computing by introducing support for even smaller formats like FP4 and FP6, alongside improved FP8 tensor cores. The Blackwell B200 GPU is projected to achieve approximately 9 petaFLOPS in FP8, a significant increase from the Hopper H100's ~4 petaFLOPS.
- Research has demonstrated the viability of training large models entirely in FP8. One study successfully trained a 7B Llama 2 model, matching the accuracy of a BF16 baseline while increasing training throughput by about 34%. Another effort, training a 175B parameter model with an FP8 mixed-precision framework, saw a 64% speed increase and a 42% reduction in memory usage compared to a BF16 baseline.
- Challenges in FP8 training at scale have been identified, particularly with certain activation functions like SwiGLU causing instability over long training runs. To address this, researchers have proposed modified activation functions, such as Smooth-SwiGLU, and techniques for quantizing Adam optimizer moments to FP8, further improving memory efficiency and training stability.
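
To make the E4M3/E5M2 trade-off and the role of scaling factors more concrete, here is a minimal pure-Python sketch. It is a simulation for illustration only, not how Transformer Engine or any GPU implements FP8: the function names (`fp8_max_finite`, `round_to_fp8`, `quantize_with_scale`) are hypothetical, and the scaling rule shown (scale the tensor so its largest magnitude lands on the format's largest finite value) is one simple per-tensor scheme assumed for this example.

```python
import math

def fp8_max_finite(exp_bits, man_bits, ieee_like):
    """Largest finite value of an 8-bit float format.

    ieee_like=True  (E5M2): the top exponent code is reserved for inf/NaN.
    ieee_like=False (E4M3): only the all-ones bit pattern is NaN, so the
    top exponent code is usable with every mantissa except all ones.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2 ** (-man_bits)
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 ** (1 - man_bits)
    return max_frac * 2 ** max_exp

def round_to_fp8(x, exp_bits, man_bits, max_finite):
    """Simulate rounding x to the nearest representable value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_finite)          # saturate instead of overflowing
    bias = 2 ** (exp_bits - 1) - 1
    _, ex = math.frexp(a)                # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, 1 - bias)            # clamp to the subnormal exponent
    step = 2.0 ** (e - man_bits)         # spacing of values in this binade
    return sign * min(round(a / step) * step, max_finite)

def quantize_with_scale(values, exp_bits, man_bits, ieee_like):
    """Assumed per-tensor scheme: map the largest |value| to the format max."""
    fmax = fp8_max_finite(exp_bits, man_bits, ieee_like)
    amax = max(abs(v) for v in values)
    scale = fmax / amax if amax > 0 else 1.0
    quantized = [round_to_fp8(v * scale, exp_bits, man_bits, fmax) for v in values]
    return quantized, scale

E4M3_MAX = fp8_max_finite(4, 3, ieee_like=False)   # 448.0
E5M2_MAX = fp8_max_finite(5, 2, ieee_like=True)    # 57344.0

# Small gradient-like values underflow to zero in E4M3 without scaling...
grads = [1e-5, -3e-5, 2e-5]
unscaled = [round_to_fp8(g, 4, 3, E4M3_MAX) for g in grads]

# ...but survive once a scaling factor is applied and later divided out.
q, scale = quantize_with_scale(grads, 4, 3, ieee_like=False)
recovered = [v / scale for v in q]

print(E4M3_MAX, E5M2_MAX)
print(unscaled)
print(recovered)
```

The sketch shows why the two variants coexist: E5M2 reaches a far larger maximum, while E4M3 spends its extra mantissa bit on finer spacing. It also shows why scaling factors matter, since values that would round to zero unscaled keep their relative precision once shifted into the representable range.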
