NVIDIA + TensorRT-LLM: MLPerf gains

NVIDIA’s Blackwell Ultra systems and TensorRT-LLM co-optimizations smashed MLPerf Inference records: GB300 NVL72 hit 8,064 tokens/sec per GPU in a server scenario, a ~2.77x software-driven throughput gain over six months ago. The jump underscores that software (batching, quantization, memory optimizations) now delivers gains comparable to hardware upgrades for real-world LLM serving. (developer.nvidia.com) (manilatimes.net)

NVIDIA reported a system-level throughput record of about 2.5 million tokens per second using four GB300 NVL72 systems with 288 Blackwell Ultra GPUs interconnected by NVIDIA Quantum-X800 InfiniBand. (developer.nvidia.com) NVIDIA said it was the only platform to submit results across all newly added MLPerf Inference v6.0 models and scenarios, including DeepSeek-R1 (interactive), Qwen3-VL-235B-A22B, GPT-OSS-120B, WAN-2.2-T2V-A14B and DLRMv3. (developer.nvidia.com)

The company credited a suite of software techniques (kernel fusion, optimized attention data parallelism, disaggregated serving, Wide Expert Parallel, multi-token prediction and KV-aware routing) with delivering throughput improvements of up to roughly 2.7x and cost-per-token reductions of more than 60% on the same hardware. (developer.nvidia.com) NVIDIA’s TensorRT-LLM package provides a benchmarking CLI (trtllm-bench) and the production runtimes used in the optimizations submitted to MLPerf, with documentation describing parallelism strategies, quantization and KV cache tradeoffs. (nvidia.github.io) TensorRT-LLM is published as an open-source project on GitHub and includes model recipes, LoRA/QLoRA support and disaggregated serving examples that mirror the stack NVIDIA used for its v6.0 submissions. (github.com)

MLCommons said Inference v6.0 is the most significant revision yet: five of eleven datacenter tests were new or updated, adding an open-weight GPT-OSS-120B benchmark, an expanded DeepSeek-R1 interactive test, the suite’s first text-to-video benchmark and an upgraded recommender benchmark (DLRMv3). (mlcommons.org)

NVIDIA noted that 14 partners submitted on the Blackwell Ultra platform in this round, listing ASUS, Cisco, CoreWeave, Google Cloud, HPE, Lenovo, Supermicro, Lambda and others as contributors to the submissions. (developer.nvidia.com)
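
The throughput and cost figures quoted above are linked by simple arithmetic, sketched below in Python. The sketch rests on one assumption that is not stated in the source: the hourly cost of the hardware stays fixed, so cost per token falls in inverse proportion to throughput. The function name `cost_per_token_reduction` is illustrative only. The per-GPU average computed here is a plain division of the system-level figure and need not match the 8,064 tokens/sec per-GPU result, which is reported for a specific server scenario.

```python
def cost_per_token_reduction(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`.

    Assumption (not from the article): hardware cost per hour is unchanged,
    so cost per token scales as 1 / throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Figures quoted in the article.
gpus = 4 * 72                    # four GB300 NVL72 systems, 72 GPUs each
system_tokens_per_sec = 2.5e6    # "about 2.5 million tokens per second"
per_gpu = system_tokens_per_sec / gpus

reduction = cost_per_token_reduction(2.7)

print(f"GPUs: {gpus}")                                        # GPUs: 288
print(f"Average tokens/sec per GPU: {per_gpu:,.0f}")          # 8,681
print(f"Cost-per-token reduction at 2.7x: {reduction:.1%}")   # 63.0%
```

Under that assumption, a 2.7x throughput gain on the same hardware works out to a cost-per-token reduction of about 63%, which is consistent with the "over 60%" figure NVIDIA cites.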

