FP8 Precision Gains Traction in LLM Training

Major AI labs including Microsoft, Meta, and Google are increasingly training frontier models using FP8 numerical precision, reporting throughput gains of 30-40%. The InternLM2 technical report states that FP8 can cut compute and memory needs by up to 50% compared to BF16 without degrading model quality. So far these gains have been reported mainly on NVIDIA's Hopper and Blackwell-class GPUs, although other vendors now ship FP8 hardware as well.

- The FP8 specification, jointly released by NVIDIA, Intel, and Arm, defines two main formats: E4M3 (4 exponent bits, 3 mantissa bits), which offers higher precision for forward passes, and E5M2 (5 exponent bits, 2 mantissa bits), which offers a wider dynamic range suited to gradients in the backward pass.
- NVIDIA's Transformer Engine, introduced with the Hopper architecture, is key to FP8 adoption: it dynamically selects between FP8 and FP16 precision to accelerate training and inference while maintaining accuracy. The newer [Blackwell architecture](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHpAnOzhY6zIuG6UiUVDRhFClnS5e-vAgwkfuYh1RDstlszwx9KGFcw9LKDK1mwgObnWRw15ZP-JH4a1Fhte3Vz72xEePLEwdtyG0jMS-Q00cfCBQckHS0asDg_u-afENSC626IyhTClejJrnW3qave5guiguEWU_VzKC_2HDHjGcJ1PKLWjGYc5eKU24TUmBqMvRenyD74JbhrIjxx6OaCwNQRgHT3ow==) features a second-generation Transformer Engine with support for even lower 4-bit precision (FP4).
- The performance jump from the previous GPU generation is substantial: the NVIDIA H100 shows up to a 4.5x speedup in model inference over the A100, with roughly 2x coming from the architecture and roughly another 2x from switching to FP8. For training, the H100 can be up to 2.4x faster than the A100 using mixed precision.
- While FP8 boosts throughput, it can introduce training instability, leading to loss spikes and requiring techniques like dynamic scaling factors to prevent numerical overflow or underflow. On tasks sensitive to numerical precision, such as code generation or mathematical reasoning, models trained with FP8 have sometimes shown performance degradation compared to BF16.
- The upcoming Blackwell B200 GPUs are expected to roughly double the FP8 performance of the Hopper H100, reaching approximately 9 petaFLOPS compared to Hopper's 4 petaFLOPS. The Blackwell architecture also introduces a dedicated decompression engine to reduce data transfer bottlenecks.
- Support for FP8 is expanding beyond NVIDIA. Intel's Gaudi 2 and AMD's Instinct MI300 series accelerators also include hardware support for FP8 matrix operations, indicating broad industry movement towards the format for AI workloads.
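
To make the E4M3/E5M2 trade-off and the role of dynamic scaling factors more concrete, here is a minimal pure-Python sketch that simulates rounding to each format and applies a per-tensor scale derived from the largest absolute value. The bit splits come from the bullets above; the exponent biases (7 and 15) and maximum finite values (448 and 57344) are the commonly cited ones for these formats and are assumptions here, as are the helper names `round_to_fp8` and `quantize_with_scale`. It is an illustration of the idea, not how Transformer Engine or any particular hardware implements it.

```python
import math

# Bit splits are from the article; biases and max finite values are assumed
# from the commonly cited FP8 definitions.
FORMATS = {
    "E4M3": {"mant_bits": 3, "bias": 7, "max": 448.0},
    "E5M2": {"mant_bits": 2, "bias": 15, "max": 57344.0},
}

def round_to_fp8(x, fmt):
    """Round a float to the nearest value on the simulated FP8 grid.

    Out-of-range magnitudes saturate at the format maximum; tiny
    magnitudes fall onto the subnormal grid and may flush to zero.
    """
    p = FORMATS[fmt]
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), p["max"])
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    _, e = math.frexp(mag)
    min_exp = 1 - p["bias"]              # exponent of the smallest normal value
    exp = max(e - 1, min_exp)            # clamp into the subnormal range
    step = 2.0 ** (exp - p["mant_bits"]) # spacing between neighbouring values
    return sign * min(round(mag / step) * step, p["max"])

def quantize_with_scale(values, fmt):
    """Scale so the largest magnitude maps to the format max, round, then unscale."""
    amax = max(abs(v) for v in values)
    scale = FORMATS[fmt]["max"] / amax if amax > 0 else 1.0
    dequantized = [round_to_fp8(v * scale, fmt) / scale for v in values]
    return dequantized, scale

if __name__ == "__main__":
    grads = [1e-5, -3e-6, 2.5e-4]  # made-up small gradient values
    print("unscaled E5M2:", [round_to_fp8(g, "E5M2") for g in grads])
    print("scaled   E5M2:", quantize_with_scale(grads, "E5M2")[0])
    print("0.27 in E4M3 vs E5M2:", round_to_fp8(0.27, "E4M3"), round_to_fp8(0.27, "E5M2"))
```

In this sketch the smallest gradient underflows to zero when rounded directly, but survives once the tensor is rescaled to use the format's full range, which is the problem scaling factors are meant to address. Production systems typically track the maximum over a history of steps rather than recomputing it for every tensor, a detail omitted here.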
