NVIDIA releases Nemotron diffusion LMs
- NVIDIA published Nemotron-Labs-Diffusion, a family of diffusion-based language models in 3B, 8B and 14B sizes, including vision-language variants.
- The diffusion approach generates multiple tokens in parallel and can revise them during generation, aiming to improve inference speed and GPU utilization over autoregressive decoding.
- Community forks quickly added support (buun-llama.cpp forks, Q8 Hugging Face quants), making the models easier to experiment with and deploy. (x.com)

NVIDIA published Nemotron-Labs-Diffusion on May 19, adding a new open-weight bet to the growing push beyond standard left-to-right text generation. The release is a family of 3B, 8B and 14B models, with base, instruct and vision-language variants, and NVIDIA describes them as “tri-mode” systems that can run in autoregressive, diffusion, or self-speculation decoding modes. (research.nvidia.com)

What makes this notable is the decoding method. In a standard autoregressive model, tokens are generated one after another. NVIDIA says diffusion decoding can generate multiple tokens in parallel and revise them during generation, while self-speculation uses diffusion to draft and autoregressive decoding to verify. The company says that setup is meant to keep throughput high across different deployment and concurrency conditions. (research.nvidia.com)

NVIDIA’s headline claim is that the approach is not just architectural novelty but a serving advantage. In its research writeup, the company says Nemotron-Labs-Diffusion-8B decodes 5.9 times more tokens per forward pass than Qwen3-8B with better accuracy, translating to 4 times higher throughput on SPEED-Bench with SGLang on a GB200 GPU. It also says a “speed-of-light” analysis shows diffusion could deliver up to 76.5% more tokens per forward pass than self-speculation under an optimal sampler. Those are NVIDIA’s benchmarks, but they explain why the company is framing diffusion as an inference-efficiency story as much as a model story. (research.nvidia.com)

The release is broader than a single text model. NVIDIA’s Hugging Face collection shows text models at 3B, 8B and 14B, plus base variants and an 8B vision-language model. The VLM model card says it accepts interleaved image and text input and produces text output, carrying the same tri-mode backbone into multimodal use. (huggingface.co)

The practical angle matters here. NVIDIA published the models on Hugging Face with deployment instructions, and the VLM card includes examples for Transformers, vLLM and SGLang. That lowers the barrier for developers who want to test the models in existing serving stacks rather than wait for a bespoke NVIDIA-only runtime. (huggingface.co)

There is also evidence the surrounding tooling moved quickly. A buun-llama-cpp fork has a dedicated conversion script for Nemotron-Labs-Diffusion, and the broader llama.cpp ecosystem already includes diffusion text generation examples and CLI parameters for diffusion-specific sampling and scheduling. That does not mean mainline llama.cpp fully supports NVIDIA’s models out of the box, but it does show the open-source inference community already had diffusion-oriented plumbing in place and began adapting it. (github.com)

So the story is less “NVIDIA released another open model” than “NVIDIA is testing whether diffusion can become a practical serving path for language models.” If the company’s throughput claims hold up outside its own benchmarks, Nemotron-Labs-Diffusion gives developers a concrete open-weight package to compare against conventional autoregressive stacks. (research.nvidia.com)
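
For readers who want a feel for the draft-and-verify idea behind the self-speculation mode described above, here is a minimal toy sketch in Python. It is a generic greedy speculative-decoding loop, not NVIDIA's implementation: the "drafter" and "verifier" are hypothetical stand-in functions over integer tokens, and verification is done token by token for clarity, whereas a real system would check a whole drafted block in one forward pass.

```python
from typing import Callable, List, Tuple

Token = int
Drafter = Callable[[List[Token], int], List[Token]]   # proposes a block of tokens
NextToken = Callable[[List[Token]], Token]            # greedy next-token function


def speculative_step(prefix: List[Token], draft: Drafter,
                     verify_next: NextToken, block_size: int) -> List[Token]:
    """Draft a block, keep the longest prefix the verifier agrees with,
    and replace the first disagreement with the verifier's own token."""
    accepted: List[Token] = []
    for tok in draft(prefix, block_size):
        expected = verify_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction from the verifier
            return accepted
        accepted.append(tok)
    return accepted


def generate(prompt: List[Token], draft: Drafter, verify_next: NextToken,
             block_size: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Run draft-and-verify steps; return the tokens and the number of steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, verify_next, block_size))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


# Hypothetical toy models: the verifier counts upward modulo 10; the drafter
# does the same but always gets the token 7 wrong, emitting 0 instead.
def toy_verify_next(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    block, cur = [], prefix[-1]
    for _ in range(k):
        cur = (cur + 1) % 10
        if cur == 7:
            cur = 0  # deliberate drafting error
        block.append(cur)
    return block


tokens, steps = generate([0], toy_draft, toy_verify_next,
                         block_size=4, max_new_tokens=8)
print(tokens, steps)
```

In this toy run the output matches what the verifier alone would produce one token at a time, but it takes 3 draft-and-verify steps instead of 8 sequential ones. That gap between tokens produced and steps taken is the kind of "tokens per forward pass" gain the article's figures refer to.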
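
The two headline ratios NVIDIA cites, 5.9 times more tokens per forward pass and 4 times higher throughput, can be related with a little arithmetic. The sketch below assumes (this is an assumption, not something NVIDIA states) that throughput is simply tokens per forward pass multiplied by forward passes per second, and that both ratios describe the same workload. The function name is illustrative only.

```python
def implied_pass_rate_ratio(tokens_per_pass_ratio: float,
                            throughput_ratio: float) -> float:
    """Relative forward passes per second implied by the two ratios,
    assuming throughput = tokens_per_pass * passes_per_second."""
    return throughput_ratio / tokens_per_pass_ratio


# Figures quoted from NVIDIA's writeup (Nemotron-Labs-Diffusion-8B vs Qwen3-8B).
tokens_per_pass_ratio = 5.9
throughput_ratio = 4.0

pass_rate = implied_pass_rate_ratio(tokens_per_pass_ratio, throughput_ratio)
pass_cost = tokens_per_pass_ratio / throughput_ratio  # relative time per pass

print(f"relative passes per second: {pass_rate:.3f}")
print(f"relative time per pass:     {pass_cost:.3f}")
```

Under those assumptions, the diffusion model would be completing roughly 0.68 times as many forward passes per second, meaning each pass costs about 1.5 times as much on average. Read that way, the 4x figure reflects more tokens per pass outweighing heavier individual passes. This is a back-of-the-envelope reading of NVIDIA's numbers, not an independent measurement.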