NVIDIA demonstrates FP8 RL training

- NVIDIA published a NeMo RL recipe on April 20, 2026, showing end-to-end FP8 reinforcement learning for reasoning models, using GRPO across rollout and training.
- The core trick is alignment: FP8 math on linear layers in both vLLM generation and Megatron training, where FP8 offers 2x peak throughput versus BF16.
- If the recipe holds up broadly, RL for reasoning agents gets cheaper, and the rollout bottleneck stops dominating quite so hard.

Reinforcement learning for reasoning models is turning into a systems problem. The model math matters, but the plumbing matters just as much, especially when rollout generation eats most of the time and money. That is the gap NVIDIA is trying to close with a new NeMo RL recipe published on April 20, 2026: run the whole RL loop in FP8 where it counts, not just inference, and keep the training stable enough to be useful.

### What actually changed?

NVIDIA didn’t announce a new base model here. It published an implementation recipe inside NeMo RL, its open-source post-training stack, for high-throughput reinforcement learning with end-to-end FP8 precision. The target workload is reasoning-style post-training with GRPO, where a model generates candidate answers, gets scored, and updates itself from that feedback loop.

### Why is RL the expensive part?

Because this kind of RL is really two jobs glued together. First comes generation: the model has to produce lots of rollouts fast, with low latency. Then comes training: the system has to crunch those samples at high throughput. NVIDIA’s point is that reasoning RL is not one monolithic workload, so optimizing only one side leaves a lot of performance on the table.

### Why does FP8 matter here?

FP8 is an 8-bit floating-point format. Basically, it cuts precision to gain speed and reduce memory traffic. That trade has been attractive for pretraining and inference for a while, but RL is nastier because small numeric mismatches can compound across the loop. NVIDIA says FP8 math on linear layers has 2x peak throughput versus BF16 math, which is the raw performance carrot behind this whole effort.

### So what was the hard part?

The hard part was not just quantizing a model. It was making two different engines agree closely enough. In NeMo RL, rollout generation commonly runs through vLLM, while training runs through Megatron Core. Those stacks use different CUDA kernels and different numerical paths.
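
The disagreement between the two engines can be measured per token. Here is a minimal sketch, assuming each engine reports a log-probability for the same sampled tokens and that the error is aggregated as the mean of the per-token probability ratio. The article does not give the formula, so the function name `token_mult_prob_error` and the aggregation are illustrative assumptions, not NeMo RL's API.

```python
import math
from typing import Sequence


def token_mult_prob_error(
    gen_logprobs: Sequence[float], train_logprobs: Sequence[float]
) -> float:
    """Mean multiplicative gap between two engines' token probabilities.

    Assumption: the gap for one token is exp(|logp_gen - logp_train|),
    i.e. the ratio of the larger probability to the smaller one.
    """
    if len(gen_logprobs) != len(train_logprobs):
        raise ValueError("both engines must score the same tokens")
    if not gen_logprobs:
        raise ValueError("need at least one token")
    ratios = [math.exp(abs(g - t)) for g, t in zip(gen_logprobs, train_logprobs)]
    return sum(ratios) / len(ratios)


# Hypothetical log-probs for two tokens; the second differs by a factor of 1.1.
gen = [-1.0, -2.0]
train = [-1.0, -2.0 + math.log(1.1)]
print(token_mult_prob_error(gen, train))
```

Under this definition, a value of exactly 1.0 means the two engines assign identical probabilities to every sampled token, and larger values mean the training engine is learning from samples it would have scored differently.
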
In low precision, those differences matter more. NVIDIA tracks the mismatch with a token multiplicative probability error, and says “acceptable” values are usually under about 1.03 to 1.05.

### What is the recipe?

The recipe is narrower than “everything in FP8.” NVIDIA uses block-wise FP8 for linear layers, with E4M3 data and FP32 scaling factors, while leaving attention, normalization, nonlinear ops, and output projections in BF16. So this is a selective end-to-end recipe with custom scaling behavior, not a reckless all-FP8-everywhere approach.

### Does this run everywhere?

Not really. NVIDIA’s own docs say the recommended full FP8 recipe is for Hopper GPUs. For Blackwell, the current recommendation is FP8 for generation but BF16 for training, because the DeepSeek-style FP8 training path with FP32 scaling factors is not yet supported there in the same way.

### Why should anyone care?

Because rollout-heavy RL has become the tax you pay for better reasoning and better agents. If you can shrink that tax without wrecking convergence, you make experimentation cheaper and faster. That matters for anyone training specialized reasoning models, tool-using agents, or post-trained open models inside the Nemotron and NeMo ecosystem.

### What’s the bottom line?

This is not “FP8 solved RL,” but it is a concrete sign that NVIDIA thinks reasoning-model RL is mature enough to optimize like a production systems stack. The real news is not just lower precision. It’s that the company is trying to make rollout and training behave like one coordinated pipeline instead of two mismatched halves. If that works beyond NVIDIA’s own recipe, cheaper large-scale RL gets a lot more plausible.
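
For readers who want a feel for what “block-wise FP8 with E4M3 data and FP32 scaling factors” means, here is a rough pure-Python simulation. It is a sketch under stated assumptions, not NVIDIA's kernels: the block size, the 1-D layout, and the helper names are all hypothetical, and Python floats stand in for the FP32 scales.

```python
import math
from typing import List, Sequence, Tuple

E4M3_MAX = 448.0    # largest finite E4M3 value
E4M3_MIN_EXP = -6   # smallest normal exponent; subnormals reuse it
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    exp = max(math.frexp(a)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between representable values
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_blockwise(
    values: Sequence[float], block_size: int = 4
) -> Tuple[List[float], List[float]]:
    """Return E4M3-rounded codes plus one scale per block (hypothetical layout)."""
    codes: List[float] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX  # maps the block's largest value onto 448
        if scale == 0.0:         # all-zero (or vanishingly small) block
            scale = 1.0
        scales.append(scale)     # a real kernel would keep this in FP32
        codes.extend(round_to_e4m3(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(
    codes: Sequence[float], scales: Sequence[float], block_size: int = 4
) -> List[float]:
    """Multiply each code by the scale of the block it belongs to."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


values = [0.1, -2.0, 0.03, 7.5]
codes, scales = quantize_blockwise(values, block_size=2)
restored = dequantize_blockwise(codes, scales, block_size=2)
print(restored)
```

The point of the per-block scale is that each block uses the full E4M3 range no matter how large or small its values are, so one outlier only costs precision inside its own block rather than across the whole tensor.
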
