NVIDIA releases Nemotron diffusion report
- NVIDIA published the Nemotron-Labs-Diffusion technical report on May 19, detailing a tri-mode language model that combines autoregressive, diffusion and self-speculation decoding. (research.nvidia.com)
- The report says Nemotron-Labs-Diffusion-8B decodes 5.9 times more tokens per forward pass than Qwen3-8B and delivered 4 times higher throughput on SPEED-Bench. (research.nvidia.com)
- The report, model collection and related papers are available through NVIDIA Research, Hugging Face and arXiv pages shared May 20. (research.nvidia.com)

NVIDIA used its research channels on May 20 to circulate a new technical report for Nemotron-Labs-Diffusion, a language model family that combines autoregressive generation, diffusion decoding and self-speculation in one architecture. The report itself is dated May 19 on NVIDIA Research and describes the system as a “tri-mode” model that can switch decoding methods depending on deployment conditions. (research.nvidia.com)

The paper is part of a broader burst of AI research sharing around model efficiency and safety. (research.nvidia.com) Alongside the diffusion report, community posts and paper feeds highlighted work on on-policy self-distillation for safety alignment and multimodal safety failures, both posted to arXiv in recent days.

For people trying to place this in NVIDIA’s larger model lineup, the company’s developer site says Nemotron is its family of open models with open weights, training data and recipes, aimed at agentic AI and multimodal workloads. (research.nvidia.com) NVIDIA says technical reports for recreating the models are freely available.

### What exactly did NVIDIA release?

NVIDIA Research published “Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding” on Tuesday, May 19, 2026. (research.nvidia.com) The publication page lists NVIDIA researchers alongside University of Chicago co-author Jingyu Liu and links to a technical report and Hugging Face collection.

The report says the model is trained with a joint autoregressive-diffusion objective. (arxiv.org) NVIDIA says that lets the same model operate in standard left-to-right mode, in diffusion mode, or in a self-speculation setup where diffusion drafts and autoregressive decoding verifies. (developer.nvidia.com)

### Why is “tri-mode” the main technical claim?

NVIDIA says the three decoding modes are meant to address different inference trade-offs rather than force one generation method to win outright.
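
To make the self-speculation mode concrete, here is a minimal Python sketch of a generic greedy draft-and-verify loop. It is not taken from the report: `draft_fn` and `verify_fn` are hypothetical stand-ins for the model's diffusion and autoregressive modes, the toy functions work on integers rather than real tokens, and a real system would check the whole drafted block in one forward pass instead of one call per position.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces for illustration only (not an NVIDIA API).
DraftFn = Callable[[List[Token], int], List[Token]]  # (context, k) -> k drafted tokens
VerifyFn = Callable[[List[Token]], Token]            # context -> greedy next token


def speculative_step(context: List[Token], draft_fn: DraftFn,
                     verify_fn: VerifyFn, block_size: int) -> Tuple[List[Token], int]:
    """Run one draft-and-verify step.

    Returns the tokens emitted this step and how many drafted tokens were accepted.
    """
    draft = draft_fn(context, block_size)
    emitted: List[Token] = []
    for tok in draft:
        # A real verifier would score every drafted position in one forward
        # pass; calling it per position keeps the sketch simple.
        target = verify_fn(context + emitted)
        if tok != target:
            emitted.append(target)  # first mismatch: keep the verifier's token
            return emitted, len(emitted) - 1
        emitted.append(tok)
    # Every drafted token matched, so the verifier adds one more token.
    emitted.append(verify_fn(context + emitted))
    return emitted, len(draft)


def generate(prompt: List[Token], draft_fn: DraftFn, verify_fn: VerifyFn,
             max_new_tokens: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Generate tokens and count how many draft-and-verify steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        emitted, _ = speculative_step(out, draft_fn, verify_fn, block_size)
        out.extend(emitted)
        steps += 1
    return out[len(prompt):len(prompt) + max_new_tokens], steps


def toy_verify(context: List[Token]) -> Token:
    """Toy 'autoregressive' model: always counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token], k: int) -> List[Token]:
    """Toy 'diffusion' drafter: counts upward but gets the third proposal wrong."""
    tokens: List[Token] = []
    last = context[-1]
    for i in range(k):
        nxt = (last + 1) % 10
        if i == 2:
            nxt = (nxt + 5) % 10  # deliberate drafting error
        tokens.append(nxt)
        last = nxt
    return tokens


tokens, steps = generate([0], toy_draft, toy_verify, max_new_tokens=9)
print(tokens, steps)  # [1, 2, 3, 4, 5, 6, 7, 8, 9] 3
```

In this toy run the drafter gets two of every four proposals right, so each step emits three tokens (two accepted plus the verifier's correction), and the output matches what the verifier alone would produce one token at a time. The acceptance rate the report refers to corresponds to how many drafted tokens survive verification per step.
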
The report says autoregressive and diffusion objectives are “complementary,” with diffusion improving lookahead planning and autoregressive training supplying left-to-right linguistic priors. (research.nvidia.com) The same page says self-speculation mode uses diffusion for drafting and autoregressive decoding for verification, and that this setup outperformed multi-token prediction methods on both acceptance rate and device efficiency.

NVIDIA also says a “speed-of-light analysis” showed diffusion had the potential to produce up to 76.5% more tokens per forward pass than self-speculation under an optimal sampler. (research.nvidia.com)

### What numbers did NVIDIA put forward?

NVIDIA says the Nemotron-Labs-Diffusion family scales to 3 billion, 8 billion and 14 billion parameters, with base, instruct and vision-language variants. The report says those models outperformed state-of-the-art open-source autoregressive and diffusion language models in both speed and accuracy. (research.nvidia.com)

The clearest benchmark claim centers on the 8B model. NVIDIA says Nemotron-Labs-Diffusion-8B decoded 5.9 times more tokens per forward pass than Qwen3-8B while delivering better accuracy, which the company says translated into 4 times higher throughput on SPEED-Bench using SGLang on a GB200 GPU. (research.nvidia.com)

### How does this fit with NVIDIA’s current Nemotron push?

NVIDIA’s developer page says the newer Nemotron 3 family is positioned around efficient multimodal and agentic AI systems, including a 1 million-token context window for some models and deployment through frameworks such as vLLM, SGLang, Ollama and llama.cpp. The company’s GitHub repository describes Nemotron as a hub for training recipes, deployment guides and end-to-end examples. (research.nvidia.com)

That context matters because the diffusion report extends NVIDIA’s recent emphasis on open technical artifacts.
The Nemotron developer page says weights, training data and recipes are available for community evaluation, and the diffusion publication page links directly to a Hugging Face collection under NVIDIA’s account. (research.nvidia.com)

### Why were safety papers circulating at the same time?

arXiv listings from May 14 and May 18 show active discussion around safety alignment and multimodal safety gaps. One paper, “Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation,” studies whether models can preserve more reasoning ability while improving safety by training on their own rollouts with a frozen teacher copy.

A separate paper, “Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction,” argues that multimodal inputs can weaken refusal behavior by compressing the separation the model uses to identify harmful inputs. (developer.nvidia.com) The authors propose an inference-time correction method called ReGap.

NVIDIA’s diffusion report is already posted on NVIDIA Research and in a Hugging Face collection, while the related safety papers remain available through their arXiv entries. (developer.nvidia.com) Those pages are the next places to watch for revisions, code releases or follow-on benchmarks from NVIDIA researchers and outside authors. (research.nvidia.com) (arxiv.org 1) (arxiv.org 2)