FP8 Training Matches BF16 Accuracy
Technical reports show that large language models trained in FP8 (8-bit floating point) precision can now match the accuracy of models trained in BF16. The approach can cut compute and memory requirements by nearly half and deliver 30-40% throughput gains, but it requires NVIDIA's Hopper or Blackwell series GPUs.
- The key software enabler is NVIDIA's Transformer Engine, a library that automatically manages mixed precision by deciding which layers to run in FP8 and which to keep in a higher-precision format to maintain numerical stability.
- FP8 comprises two distinct 8-bit floating-point formats: E4M3 (4 exponent bits, 3 mantissa bits), whose higher precision suits the forward pass, and E5M2 (5 exponent bits, 2 mantissa bits), whose wider dynamic range is better suited to gradients in the backward pass.
- Hardware support arrived with the fourth-generation Tensor Cores in the Hopper GPU architecture, which execute FP8 operations and can offer double the computational throughput of 16-bit operations.
- For inference, quantizing a model like Mistral 7B from FP16 to FP8 has shown a 33% increase in output tokens per second and a 24% reduction in cost per million tokens.
- Achieving stable training required overcoming the limited dynamic range of an 8-bit format; this is handled by techniques like delayed scaling, where scaling factors are computed from the maximum absolute values observed in earlier iterations so that gradient information is not lost.
- Beyond training, FP8's reduced memory footprint for weights, activations, and the KV cache is highly beneficial for inference on Multi-Instance GPU (MIG) partitions, which may have as little as 10 GB of VRAM.
- PyTorch's built-in FP8 support is limited (recent releases expose FP8 dtypes but few operations on them), so the Transformer Engine library integrates with it, as well as with JAX and TensorFlow, by providing specialized modules and an `fp8_autocast` context manager.
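
To make the delayed scaling idea concrete, here is a minimal pure-Python sketch. It is not Transformer Engine's implementation: the `DelayedScaler` class, the `scale_and_clip` helper, the history length of 4, and the sample gradient values are all hypothetical and chosen for illustration. The two constants are the largest finite magnitudes of E4M3 and E5M2 as commonly specified. The sketch models only range scaling and saturation, not mantissa rounding.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats.
E4M3_MAX = 448.0      # 2**8 * 1.75
E5M2_MAX = 57344.0    # 2**15 * 1.75


class DelayedScaler:
    """Hypothetical helper: derives a scale from amax values seen in earlier steps."""

    def __init__(self, fp8_max, history_len=4):
        self.fp8_max = fp8_max
        self.amax_history = deque(maxlen=history_len)

    def scale(self):
        # With no history yet, fall back to a neutral scale of 1.0.
        if not self.amax_history:
            return 1.0
        amax = max(self.amax_history)
        return self.fp8_max / amax if amax > 0 else 1.0

    def update(self, values):
        # Record this step's largest magnitude for use in later steps.
        self.amax_history.append(max(abs(v) for v in values))


def scale_and_clip(values, scale, fp8_max):
    """Scale values into the FP8 range, saturating anything that overflows."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in values]


scaler = DelayedScaler(E5M2_MAX)
steps = [[1e-3, -4e-3], [2e-3, 8e-3], [1e-3, -2e-3]]  # made-up gradients
scales_used = []
scaled_steps = []
for grads in steps:
    s = scaler.scale()  # depends only on previous steps, hence "delayed"
    scaled_steps.append(scale_and_clip(grads, s, E5M2_MAX))
    scales_used.append(s)
    scaler.update(grads)
```

In the second step the scale is still based on the first step's maximum, so the larger gradient overflows and is saturated at the format's limit. Keeping a history window and taking its maximum is one way to reduce how often that happens.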