FP8 Becomes New Standard for LLM Training

FP8 (8-bit floating point) precision is now the baseline for frontier model training and inference, reportedly cutting compute and memory costs by up to 50% compared to BF16. Microsoft, Meta, and Google are using FP8 for their largest models, with Meta reporting a 34% throughput gain on Llama-2 7B training while matching BF16 accuracy. In NVIDIA's data-center lineup, hardware FP8 support is available on the Hopper and Blackwell GPU architectures.

- The FP8 standard specifies two main formats: E4M3 (4 exponent, 3 mantissa bits) and E5M2 (5 exponent, 2 mantissa bits). E4M3 offers greater precision and is typically used for weights and activations, while E5M2 provides a wider dynamic range, making it more suitable for gradients.
- The specification was jointly developed by NVIDIA, Arm, and Intel to create a common standard for hardware and software interoperability, aligning with IEEE 754 conventions.
- Hardware acceleration for FP8 is enabled by the Transformer Engine found in NVIDIA's H100 and subsequent GPUs, which dynamically selects between FP8 and 16-bit floating-point formats to maximize speed while preserving model accuracy.
- To maintain accuracy with the limited dynamic range of FP8, techniques like delayed scaling are used, where a scaling factor is calculated based on the maximum absolute values observed over previous training iterations to map tensors into the FP8 range without losing significant information.
- On NVIDIA's H100 GPUs, using FP8 can lead to significant performance gains, with benchmarks showing up to 9x faster training and 30x faster inference on large language models compared to the previous generation A100 GPU.
- Framework support is becoming more common, with libraries like NVIDIA's Transformer Engine and experimental native support in PyTorch enabling FP8. For inference, vLLM can utilize FP8 E4M3 quantization for the KV cache to reduce its memory footprint and increase throughput.
- The evolution of low-precision formats continues with NVIDIA's Blackwell architecture, which introduces even smaller formats like FP4 and FP6, as well as MXFP8, a variant that uses block-level scaling for more granular control over precision.
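
The delayed scaling idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Transformer Engine implementation: the `DelayedScaler` class, its `history_len` and `margin` parameters, and the `scale_and_record` method are hypothetical names chosen for this example. The format maxima (448 for E4M3, 57344 for E5M2) are the largest finite values of the two formats. The sketch only scales and clamps values; a real implementation would also cast the result to an FP8 data type.

```python
from collections import deque

# Largest finite magnitude representable in each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


class DelayedScaler:
    """Hypothetical helper: derives the scale from amax values of past steps."""

    def __init__(self, fmt="E4M3", history_len=16, margin=0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # rolling window of amax values
        self.margin = margin                      # extra headroom, in powers of 2
        self.scale = 1.0                          # used until history exists

    def scale_and_record(self, values):
        # "Delayed": apply the scale computed from PREVIOUS iterations.
        used_scale = self.scale
        scaled = [
            max(-self.fp8_max, min(self.fp8_max, v * used_scale))  # clamp to range
            for v in values
        ]
        # Record this step's amax and update the scale for the NEXT step.
        self.history.append(max(abs(v) for v in values))
        amax = max(self.history)
        if amax > 0.0:
            self.scale = self.fp8_max / amax / (2.0 ** self.margin)
        return scaled, used_scale


# Usage: the first step runs unscaled; the second uses the first step's amax.
scaler = DelayedScaler("E4M3", history_len=2)
step1, scale1 = scaler.scale_and_record([0.5, -2.0])
step2, scale2 = scaler.scale_and_record([1.0, -4.0])
print(scale1, scale2, step2)
```

In the second step the value -4.0 exceeds the amax seen so far, so it saturates at the format limit. That is the trade-off of delayed scaling: it avoids an extra pass over the tensor to find the current amax, at the cost of occasional clipping when values grow quickly.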
