FP8 Training Matches BF16 Accuracy

Published February 16, 2026 by The Daily Scout

Technical reports show that training large language models with FP8 (8-bit floating point) precision can now match the accuracy of BF16. This advance can cut compute and memory requirements by nearly half, delivering 30-40% throughput gains, but requires deployment on NVIDIA's Hopper or Blackwell series GPUs.

Why it matters

- The key software enabler is NVIDIA's Transformer Engine, a library that automatically manages mixed precision by deciding which layers to run in FP8 and which to keep in a higher-precision format to maintain numerical stability.
- FP8 uses two distinct 8-bit floating-point formats: E4M3 (4 exponent bits, 3 mantissa bits), whose higher precision suits the forward pass, and E5M2 (5 exponent bits, 2 mantissa bits), whose wider dynamic range is better suited to gradients in the backward pass.
- Hardware support arrived with the fourth-generation Tensor Cores in the Hopper GPU architecture, which execute FP8 operations and can offer double the computational throughput of 16-bit operations.
- For inference, quantizing a model like Mistral 7B from FP16 to FP8 has shown a 33% increase in output tokens per second and a 24% reduction in cost per million tokens.
- Stable training required overcoming the limited dynamic range of an 8-bit format; this is handled by techniques like delayed scaling, where per-tensor scaling factors are derived from the largest absolute values seen in recent iterations so that small gradient values are not lost.
- Beyond training, FP8's reduced memory footprint for weights, activations, and the KV cache is highly beneficial for inference on Multi-Instance GPU (MIG) partitions, which may have as little as 10GB of VRAM.
- Although FP8 support in core PyTorch is limited, the Transformer Engine library integrates with it (as well as with JAX and TensorFlow) by providing specialized modules and an `fp8_autocast` context manager.
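
To make the format and scaling points above concrete, here is a minimal pure-Python sketch. The maximum values follow from the bit layouts of the two formats; the `DelayedScaler` class, its method names, and the history length are illustrative assumptions, not Transformer Engine's implementation. The sketch models only the scaling and clipping step and does not round values to the FP8 grid.

```python
from collections import deque

# Largest finite magnitudes of the two FP8 formats.
# E4M3: exponent field 15 with bias 7 gives 2**8, largest usable mantissa is 1.75.
E4M3_MAX = 1.75 * 2.0 ** 8    # 448.0
# E5M2: exponent field 30 with bias 15 gives 2**15, largest mantissa is 1.75.
E5M2_MAX = 1.75 * 2.0 ** 15   # 57344.0


class DelayedScaler:
    """Hypothetical helper: picks a scale from amax values of earlier steps."""

    def __init__(self, fp8_max, history_len=16):
        self.fp8_max = fp8_max
        # Rolling window of the largest absolute value seen per step.
        self.amax_history = deque(maxlen=history_len)

    def scale(self):
        """Scale that maps the largest recently seen value onto fp8_max."""
        if not self.amax_history:
            return 1.0  # no history yet, so fall back to no scaling
        amax = max(self.amax_history)
        return self.fp8_max / amax if amax > 0.0 else 1.0

    def scale_and_clip(self, values):
        """Scale with the *delayed* factor, clip to range, then record amax."""
        s = self.scale()
        clipped = [max(-self.fp8_max, min(self.fp8_max, v * s)) for v in values]
        # The current step's amax only influences later steps.
        self.amax_history.append(max(abs(v) for v in values))
        return clipped, s
```

Because the scale comes from earlier steps, a value that suddenly exceeds the recorded maximum is clipped until the history catches up; that lag is the trade-off delayed scaling makes to avoid an extra pass over each tensor.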
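
For readers who want to see what the `fp8_autocast` integration looks like in practice, the following sketch wraps one Transformer Engine layer in the context manager. It assumes Transformer Engine and PyTorch are installed and that a Hopper-class or newer GPU is available; the function name, tensor sizes, and recipe settings are illustrative choices, and API names may differ between Transformer Engine releases.

```python
def fp8_linear_forward(batch=16, in_features=768, out_features=768):
    """Run one forward/backward pass through a Transformer Engine Linear layer in FP8."""
    # Imports are deferred so this file can be loaded on machines that
    # lack Transformer Engine or an FP8-capable GPU.
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # HYBRID uses E4M3 for forward-pass tensors and E5M2 for gradients.
    recipe = DelayedScaling(
        fp8_format=Format.HYBRID,
        amax_history_len=16,
        amax_compute_algo="max",
    )

    # Dimensions are kept as multiples of 16, which FP8 GEMMs expect.
    layer = te.Linear(in_features, out_features, bias=True)
    x = torch.randn(batch, in_features, device="cuda")

    # Only the forward pass needs to sit inside the context manager.
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        y = layer(x)

    y.sum().backward()
    return tuple(y.shape)
```

Calling `fp8_linear_forward()` on suitable hardware should return the output shape `(16, 768)`; on other machines the deferred imports or the CUDA allocation will raise an error.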

What happens next

Beyond training, FP8's reduced memory footprint for weights, activations, and the KV cache is highly beneficial for inference on Multi-Instance GPU (MIG) partitions, which may have as little as 10GB of VRAM.
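
A back-of-the-envelope calculation shows why the footprint matters on a 10GB partition. The sketch below is illustrative only: the 7-billion-parameter count and the layer, head, and sequence-length figures are assumed round numbers for a 7B-class model, activations and runtime overhead are ignored, and the helper names are hypothetical.

```python
GB = 10 ** 9  # decimal gigabytes, for simple arithmetic


def weight_bytes(num_params, bytes_per_value):
    """Memory needed to store the model weights."""
    return num_params * bytes_per_value


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Memory for cached keys and values (the factor 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


def fits(total_bytes, budget_gb=10):
    """True if the footprint fits in a partition of budget_gb gigabytes."""
    return total_bytes <= budget_gb * GB


# Assumed configuration for a 7B-class model (illustrative values).
PARAMS = 7_000_000_000
CONFIG = dict(layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=1)

fp16_total = weight_bytes(PARAMS, 2) + kv_cache_bytes(bytes_per_value=2, **CONFIG)
fp8_total = weight_bytes(PARAMS, 1) + kv_cache_bytes(bytes_per_value=1, **CONFIG)

print(f"FP16: {fp16_total / GB:.2f} GB, fits in 10GB: {fits(fp16_total)}")
print(f"FP8:  {fp8_total / GB:.2f} GB, fits in 10GB: {fits(fp8_total)}")
```

Under these assumptions the FP16 footprint is about 14.5GB and the FP8 footprint about 7.3GB, so only the FP8 version fits within a 10GB partition.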
