NVIDIA research & Nemotron 3

NVIDIA surfaced Nemotron 3 Super, a 120B-parameter model with a 12B-active mixture-of-experts path and a 1M-token context, claiming ~2.2× throughput versus an open GPT-120B model via a hybrid Mamba-Attention MoE design and NVFP4 training on ~25T tokens. (x.com, x.com)

NVIDIA has released Nemotron 3 Super, an open 120 billion-parameter language model built to run faster by activating only about 12 billion parameters per token. (research.nvidia.com) The model page says Nemotron 3 Super was published March 10, 2026, with a 1 million-token context window and open weights, datasets, recipes, and technical reports. The accompanying paper appeared on arXiv in April 2026. (research.nvidia.com, arxiv.org)

A parameter is a learned setting inside a model, and a mixture-of-experts design works like a team where only a few specialists answer each request. NVIDIA says Nemotron 3 Super has 120.6 billion total parameters but uses 12.7 billion per forward pass, which cuts the amount of computation needed at inference time. (research.nvidia.com)

The architecture mixes Mamba blocks, which are designed to handle long sequences efficiently, with attention layers, which help a model focus on the most relevant tokens. NVIDIA says that hybrid design is meant to keep long-context performance while raising throughput compared with standard Transformer-only systems. (research.nvidia.com, arxiv.org)

NVIDIA says Nemotron 3 Super was pre-trained on about 25 trillion tokens using NVFP4, a 4-bit floating-point format, and adds LatentMoE plus multi-token prediction layers for native speculative decoding. On its model page, the company says those changes target both training efficiency and faster generation. (research.nvidia.com, research.nvidia.com)

In its technical report, NVIDIA says the model reaches about 2.2 times the throughput of an open GPT-120B baseline at comparable quality. In a separate March 2026 blog post, the company framed the same release as delivering up to 5 times higher throughput for agentic artificial intelligence workloads on Blackwell systems, a larger claim tied to specific deployment settings. (research.nvidia.com, blogs.nvidia.com)

The release fits NVIDIA’s broader Nemotron 3 push, which the company introduced in December 2025 as a family of open models in Nano, Super, and Ultra sizes. NVIDIA has pitched that line as infrastructure for “agentic” systems that carry out multi-step tasks, not just single-turn chat. (nvidianews.nvidia.com, arxiv.org)

NVIDIA is also using the release to argue that open models can compete on long-context and reasoning-heavy workloads without matching the full active size of the biggest dense systems. The company’s developer page says Nemotron 3 models are designed for deployment through frameworks including vLLM, SGLang, Ollama, and llama.cpp on NVIDIA hardware from edge devices to data centers. (developer.nvidia.com)

The immediate test is whether outside developers reproduce NVIDIA’s speed and quality numbers on their own hardware. For now, the release gives NVIDIA a new open model to pair with its Blackwell chips and its software stack. (research.nvidia.com, blogs.nvidia.com)
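
To make the "only a few specialists answer each request" idea concrete, here is a minimal Python sketch. It first works out what share of the model is active per forward pass from the two figures NVIDIA quotes (120.6 billion total, 12.7 billion active), then shows a generic top-k router picking a handful of experts for one token. The router is an illustrative assumption, not NVIDIA's LatentMoE. The names `top_k_route`, `NUM_EXPERTS`, and `TOP_K`, along with the expert count and the random scores, are hypothetical and do not reflect the real model's configuration.

```python
import math
import random

# Figures quoted in the article (billions of parameters).
TOTAL_PARAMS_B = 120.6
ACTIVE_PARAMS_B = 12.7

# Share of the model that participates in a single forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k routing for illustration only; not NVIDIA's LatentMoE.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Hypothetical toy sizes, NOT Nemotron 3 Super's real expert counts.
NUM_EXPERTS = 8
TOP_K = 2

rng = random.Random(0)
router_scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = top_k_route(router_scores, TOP_K)

print(f"active fraction per forward pass: {active_fraction:.1%}")
print(f"experts used for this token: {experts} of {NUM_EXPERTS}")
print(f"mixing weights: {[round(w, 3) for w in weights]}")
```

On the quoted figures, roughly a tenth of the parameters are used per token. This is the reason a sparse model can run faster than a dense model of the same total size, though the actual speedup depends on hardware and serving setup.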
