Open model throughput jumps
Open-source Nemotron 3 Super is being touted as topping deep‑research benchmarks and boosting tokens/sec by about 7× through software optimization. (x.com)
NVIDIA’s Nemotron 3 Super is published as a 120‑billion‑parameter hybrid Mamba‑Transformer mixture‑of‑experts (MoE) model that activates roughly 12 billion parameters per forward pass and supports a 1,000,000‑token context window. (unsloth.ai) NVIDIA’s technical report benchmarks Nemotron 3 Super at up to 2.2× the inference throughput of GPT‑OSS‑120B and up to 7.5× that of Qwen3.5‑122B on an 8k‑token input / 16k‑token output workload. (research.nvidia.com) Third‑party coverage and lab tests put real‑world output rates in the hundreds of tokens per second, with one review noting roughly 450 output tokens/sec and claims of as much as a 5× throughput increase over earlier Nemotron Super releases. (allclaw.org) Commercial benchmarking and cost analyses indicate the model’s efficiency can materially lower serving costs; pricing and benchmark posts cite rates of approximately $0.10 per million input tokens on some inference platforms. (tokencost.app) The architectural gains come from Nemotron’s MoE routing and hybrid layers, which reduce FLOPs per token by activating only a subset of experts, combined with inference‑engine improvements (vLLM and similar stacks have documented multi‑fold throughput gains); together these explain the published 7×‑range speedups in comparative tests. (llmgarage.ai) NVIDIA has published weights, recipes and deployment cookbooks on its Nemotron developer hub and GitHub, and major hosts including Amazon Bedrock, Hugging Face and Perplexity have added or documented Super‑class availability following NVIDIA’s GTC announcement in March 2026. (github.com)
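To put the quoted figures side by side, here is a minimal back‑of‑envelope sketch in Python. It uses only numbers cited above; the function names are hypothetical, and pairing the roughly 450 tokens/sec review figure with the 8k‑in / 16k‑out workload is an assumption made for illustration, since the review does not say it was measured on that workload.

```python
# Back-of-envelope arithmetic on the figures quoted in the paragraph above.
# All helper names are illustrative; none come from NVIDIA or the cited posts.

TOTAL_PARAMS = 120e9             # published total parameter count
ACTIVE_PARAMS = 12e9             # approximate active parameters per forward pass
OUTPUT_TOKENS_PER_SEC = 450      # one review's rough output rate (assumed to apply here)
PRICE_PER_M_INPUT_USD = 0.10     # approximate price cited for some platforms
INPUT_TOKENS = 8_000             # input side of the benchmark workload
OUTPUT_TOKENS = 16_000           # output side of the benchmark workload


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for each token."""
    return active / total


def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to emit `tokens` at a steady output rate (ignores prefill time)."""
    return tokens / tokens_per_sec


def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    secs = generation_seconds(OUTPUT_TOKENS, OUTPUT_TOKENS_PER_SEC)
    cost = input_cost_usd(INPUT_TOKENS, PRICE_PER_M_INPUT_USD)
    print(f"active share of parameters: {frac:.1%}")
    print(f"time for {OUTPUT_TOKENS} output tokens: {secs:.1f} s")
    print(f"input cost for {INPUT_TOKENS} tokens: ${cost:.4f}")
```

Under these assumptions, about a tenth of the parameters are active per token, a 16k‑token response takes a little over half a minute of decode time, and the 8k‑token prompt costs a small fraction of a cent. Output‑token pricing is not quoted above, so it is left out.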