NVIDIA's Nemotron specs

- NVIDIA announced Nemotron 3 Super, a 120‑billion‑parameter model that activates about 12 billion parameters during inference for efficiency.
- It uses LatentMoE with 512 experts, was trained on about 25 trillion tokens in 4‑bit precision, and, by NVIDIA's account, matches or beats comparable models on several benchmarks.
- The model highlights the vendor push toward model‑hardware co‑design and sparse‑expert techniques for large‑model efficiency. (x.com)

Large language models usually do the same amount of work for every token; NVIDIA’s new Nemotron 3 Super is built so only a small slice of the model wakes up each time. (research.nvidia.com)

NVIDIA disclosed Nemotron 3 Super in March 2026 as an open-weight model with 120 billion total parameters, of which 12 billion are active during inference, plus a context window of up to 1 million tokens. (developer.nvidia.com)

The design uses a mixture-of-experts setup, which works like a switchboard that routes each token to a subset of specialized submodels instead of the whole network. NVIDIA said Nemotron 3 Super uses LatentMoE with 512 experts and was pre-trained on 25 trillion tokens. (arxiv.org)

NVIDIA also trained the model in NVFP4, a 4-bit floating-point format that cuts memory and compute costs during pretraining. The company’s technical report says Nemotron 3 Super is the first model in the Nemotron 3 family to use that recipe from the start. (arxiv.org)

The architecture mixes Mamba layers, which are designed for fast sequence processing, with standard Transformer attention layers, which are better known for handling token relationships. NVIDIA said that hybrid layout is aimed at long-context workloads where throughput and memory pressure both matter. (developer.nvidia.com)

NVIDIA’s product page says the model reaches up to 2.2 times the inference throughput of GPT-OSS-120B and up to 7.5 times that of Qwen3.5-122B in a setting with 8,000 input tokens and 64,000 output tokens. The same page says it posts higher or comparable accuracy across several benchmarks and outperforms both models on the RULER long-context test at 1 million tokens. (research.nvidia.com)

Those numbers fit a broader shift in artificial intelligence model design: companies are no longer only scaling parameter counts, but also tuning models around the hardware and kernels that will run them. NVIDIA tied Nemotron 3 Super to its open Transformer Engine kernels, cuBLAS-backed training stack, and deployment support across vLLM, SGLang, Ollama, and llama.cpp. (research.nvidia.com) (developer.nvidia.com)

NVIDIA also bundled more of the stack than many model launches do. Its developer materials say weights, datasets, technical reports, and training recipes are available, which lets outside developers inspect or reproduce more of the system than a closed application programming interface release would. (developer.nvidia.com) (docs.nvidia.com)

Nemotron 3 Super is the middle model in a three-part family that NVIDIA outlined in late 2025, between the smaller Nano model and a larger Ultra model that the company has said is forthcoming. The release turns a familiar artificial intelligence race into a more specific one: how much of a model is used per token, per watt, and per second. (arxiv.org) (deeplearning.ai)
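
For readers who want to see the "switchboard" idea in miniature, the sketch below shows generic top-k mixture-of-experts routing in plain Python. It is an illustration only, not NVIDIA's LatentMoE implementation: the function names, the toy sizes (8 experts, 2 active per token, 4-dimensional tokens), and the stand-in "experts" that simply scale their input are all assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Convert raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, router_weights, top_k):
    """Pick the top_k experts for one token; return (expert_index, weight) pairs."""
    # One router score per expert: dot product of the token with that expert's row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    # Keep only the top_k most probable experts; the others stay idle for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_layer(token, router_weights, experts, top_k):
    """Run a token through only its selected experts and mix their outputs."""
    output = [0.0] * len(token)
    for index, weight in route_token(token, router_weights, top_k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_expert(scale):
    """Hypothetical stand-in for an expert network: it just scales its input."""
    return lambda token: [scale * x for x in token]


# Hypothetical toy sizes, far smaller than any production model.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
routes = route_token(token, router_weights, TOP_K)
mixed = moe_layer(token, router_weights, experts, TOP_K)
print("experts used:", [i for i, _ in routes], "of", NUM_EXPERTS)
```

In this toy setup only 2 of 8 experts run for each token. The same principle, at a very different scale and with NVIDIA's own routing design, is what lets a model with 120 billion total parameters use about 12 billion of them, roughly a tenth, on any given token.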
