48% FP8 speedup (NVIDIA)
- NVIDIA says its NeMo RL framework achieved a 48% training speedup using end-to-end FP8 precision while matching BF16 accuracy.
- The benchmark covers reinforcement-learning training workloads, not just inference gains, per NVIDIA's technical post. (developer.nvidia.com)
- NVIDIA frames the result as an end-to-end efficiency improvement that reduces wall-clock training time in its tests. (developer.nvidia.com)

Reinforcement learning for language models runs in a loop: the model generates answers, then trains on the results. NVIDIA said on April 20 that running more of that loop in 8-bit floating point, or FP8, sped up end-to-end training by 48% in NeMo RL while matching BF16 accuracy in its tests. (developer.nvidia.com)

The company’s post focuses on training, not just inference. NVIDIA said modern reinforcement-learning pipelines split into a generation phase with tight latency demands and a training phase that needs high throughput, and that NeMo RL is its open-source library for those post-training workloads. (developer.nvidia.com, docs.nvidia.com)

FP8 is a smaller number format than bfloat16, or BF16, so it moves less data and lets supported graphics processors do more math per cycle. NVIDIA’s Transformer Engine documentation says FP8 support is built for Hopper, Ada, and Blackwell GPUs to raise performance and lower memory use in both training and inference. (docs.nvidia.com, docs.nvidia.com)

The catch is that reinforcement learning is sensitive to tiny numerical mismatches. NVIDIA said rollouts often run in vLLM while training runs in Megatron Core, and those separate engines use different CUDA kernels, which can amplify errors when models are pushed into lower precision. (developer.nvidia.com)

NVIDIA’s answer was to keep linear layers in an FP8 path across both rollout and training instead of mixing precisions between stages. In the company’s description, weights, input activations, and output gradients use block-wise FP8 with FP32 scaling factors, while attention, normalization, nonlinear functions, and output projections stay in BF16. (developer.nvidia.com, docs.nvidia.com)

That design lines up with how NVIDIA’s FP8 software stack already works.
Transformer Engine says it manages the scaling factors needed for FP8 training, and NeMo RL’s FP8 documentation says the library supports FP8 generation and FP8 training, with training implemented through Transformer Engine linear layers. (docs.nvidia.com, docs.nvidia.com)

NVIDIA’s own documentation also shows the limits of the current setup. The NeMo RL FP8 guide says the module is “under active development” and recommends FP8 for both generation and training on Hopper GPUs. It also says Blackwell does not yet support the same DeepSeek-style FP8 recipe for training; there, NVIDIA currently recommends FP8 for generation and BF16 for training. (docs.nvidia.com, github.com)

The broader push is to make reinforcement learning cheaper to run as reasoning models spend more time generating and then learning from those generations. NVIDIA framed the FP8 result as a wall-clock gain for the whole loop, not just a faster kernel, which is the metric labs care about when they are paying for clusters by the hour. (developer.nvidia.com)

For now, one day after the post went live on April 20, 2026, the 48% figure remains NVIDIA’s benchmark on its own stack. The immediate question is whether other labs can reproduce the same speed and accuracy trade-off on their own models, hardware mix, and reinforcement-learning recipes. (developer.nvidia.com, docs.nvidia.com)
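
For readers unfamiliar with the block-wise FP8 scheme NVIDIA describes for linear layers, the following is a minimal sketch of the general idea, not NVIDIA’s code. It assumes the E4M3 variant of FP8 (largest finite value 448) and a toy block size of 4; the article does not state which FP8 variant or block shape NeMo RL uses. All function names are hypothetical, and FP8 rounding is simulated in plain Python rather than run on GPU hardware.

```python
import math
import struct

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def round_to_e4m3(x):
    """Simulate rounding x to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)        # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # subnormals share the minimum exponent -6
    step = 2.0 ** (exp - 3)     # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_blockwise(values, block_size=4):
    """Quantize a flat list in blocks, keeping one FP32 scale per block."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the FP8 maximum.
        scale = to_fp32(amax / E4M3_MAX) if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=4):
    """Recover approximate values by multiplying each block by its scale."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


# One block of small values and one block of large values.
values = [0.01, -0.02, 0.015, 0.005, 100.0, -250.0, 30.0, 1.0]
quantized, scales = quantize_blockwise(values)
restored = dequantize_blockwise(quantized, scales)
for original, approx in zip(values, restored):
    print(f"{original:>8} -> {approx:.6g}")
```

Because each block carries its own scale, the small values in the first block are not crushed toward zero by the large values in the second, which is the usual motivation for scaling per block rather than per tensor.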