NVIDIA Blackwell Lead
NVIDIA's Blackwell Ultra GPUs topped MLPerf Inference v6.0, setting new throughput and cost-per-token records, a result the company credits to hardware-software co-design. NVIDIA is also pushing Gemma 4 for local, agentic multimodal workflows on RTX and DGX Spark, signaling more emphasis on on-prem and edge LLM deployments. (wccftech.com, developer.nvidia.com, blogs.nvidia.com)
MLPerf Inference v6.0 introduced a new GPT-OSS‑120B benchmark and a latency‑constrained DeepSeek‑R1 interactive scenario, and NVIDIA published its MLPerf v6.0 submission on April 1, 2026. (mlcommons.org)
NVIDIA reported up to 2.7× throughput gains and more than 60% reduction in cost‑per‑token driven by software advances in TensorRT‑LLM and the Dynamo framework plus techniques like kernel fusion, optimized attention data‑parallelism, Wide Expert Parallel, Multi‑Token Prediction, and KV‑aware routing. (developer.nvidia.com)
Scale‑out tests used Quantum‑X800 InfiniBand to link four GB300 NVL72 racks with 288 Blackwell Ultra GPUs, and NVIDIA cited system‑level peaks of roughly 2,494,310 tokens/sec (DeepSeek‑R1 offline) and 1,555,110 tokens/sec (DeepSeek‑R1 server) in the v6.0 run. (developer.nvidia.com)
NVIDIA said per‑GPU server throughput on DeepSeek‑R1 rose from about 2,907 tok/sec in the prior round to roughly 8,064 tok/sec in v6.0 (≈2.77×). The company reports 291 cumulative MLPerf training+inference wins since 2018, about nine times the total of all other submitters combined, with 14 partner vendors participating. (cloudnews.tech)
Partner submissions show similar Blackwell gains at smaller scales: Lambda published a GB300 4‑GPU run for GPT‑OSS‑120B registering 60,220 tok/sec (Offline) and 53,463 tok/sec (Server), which Lambda framed as ~1.29× and ~1.22× improvements versus an HGX B200 baseline. (lambda.ai)
Google’s new Gemma 4 family ships in E2B, E4B, 26B (MoE) and 31B sizes and supports multimodal, agentic workflows and over 140 languages. NVIDIA says Gemma 4 has been optimized to run efficiently on RTX PCs, DGX Spark, and Jetson edge modules with tooling like NeMo Automodel and NVIDIA NIM for local and on‑prem deployments. (blog.google)
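The MLPerf throughput figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: every number is copied from the text, the variable and function names are made up for this example, and the cost-per-token estimate assumes the cost per GPU-hour is unchanged between rounds, which the sources do not state.

```python
# Sanity-check arithmetic on the MLPerf v6.0 figures quoted in the text.
# All inputs are the reported numbers; nothing here is independently measured.

RACKS = 4
GPUS_PER_RACK = 72  # a GB300 NVL72 rack holds 72 GPUs


def speedup(new: float, old: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new / old


def per_gpu(system_tokens_per_sec: float, gpus: int) -> float:
    """Average throughput per GPU for a system-level result."""
    return system_tokens_per_sec / gpus


total_gpus = RACKS * GPUS_PER_RACK  # 288, matching the reported GPU count

# Per-GPU DeepSeek-R1 server throughput, prior round vs. v6.0.
server_speedup = speedup(8_064, 2_907)  # about 2.77x

# ASSUMPTION: same cost per GPU-hour in both rounds, so cost per token
# scales as 1 / throughput. The sources do not give the cost model.
implied_cost_reduction = 1 - 1 / server_speedup  # about 0.64

# Average per-GPU throughput across the four-rack scale-out run.
offline_per_gpu = per_gpu(2_494_310, total_gpus)
server_per_gpu = per_gpu(1_555_110, total_gpus)

# HGX B200 baselines implied by Lambda's stated ~1.29x and ~1.22x ratios.
lambda_offline_baseline = 60_220 / 1.29
lambda_server_baseline = 53_463 / 1.22

if __name__ == "__main__":
    print(f"total GPUs: {total_gpus}")
    print(f"server speedup: {server_speedup:.2f}x")
    print(f"implied cost-per-token reduction: {implied_cost_reduction:.1%}")
    print(f"scale-out offline tok/sec per GPU: {offline_per_gpu:.0f}")
    print(f"scale-out server tok/sec per GPU: {server_per_gpu:.0f}")
    print(f"implied B200 offline baseline: {lambda_offline_baseline:.0f}")
    print(f"implied B200 server baseline: {lambda_server_baseline:.0f}")
```

Under that fixed-cost assumption, the 2.77× per-GPU gain works out to a cost-per-token reduction of about 64%, which is in line with the "more than 60%" figure. The scale-out per-GPU averages differ from the 8,064 tok/sec per-GPU server figure; the sources do not say whether the two come from the same configuration, so they should not be compared directly.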