NVIDIA’s inference trick shows gains

NVIDIA’s new DWDP inference technique improved throughput in benchmarks on the GB200 NVL72 rack by about 8.8% per GPU, a measurable lift for inference-heavy deployments. (x.com) Small percentage gains like this matter at hyperscaler scale because they translate to fewer racks or lower power costs for the same workload, and they help explain persistent demand for the latest accelerator silicon. (x.com)

Running an artificial intelligence model has two jobs: first it reads your prompt, then it writes the answer one token at a time. Those two jobs stress hardware in different ways, because reading a long prompt is a burst of heavy work and writing each next token is a steady drip of tiny work. (developer.nvidia.com) Chip companies call the whole process inference. The first stage is known as prefill and the second as decode, and decode is the part users feel as the speed at which the answer streams out.

In the April 2, 2025 MLPerf Inference v5.0 round, NVIDIA used its GB200 NVL72 rack for one of its first big public Blackwell inference submissions. (mlcommons.org) (developer.nvidia.com) The GB200 NVL72 is not a single card but a full liquid-cooled rack with 72 Blackwell graphics processors and 36 Grace central processors. NVIDIA links those 72 graphics processors into one NVLink domain, which is its term for a giant high-speed pool that lets the chips share work more like one machine than 72 separate boxes. (nvidia.com)

That matters because large language models are now too big and too chatty for one chip. MLPerf’s Llama 3.1 405-billion-parameter test was added in v5.0 specifically to reflect that shift, and it uses a model with a 128,000-token context window. (mlcommons.org) The benchmark also measures two delays that people notice immediately in a chatbot. For Llama 3.1 405B in the server test, MLPerf set limits of 6 seconds for time to first token and 175 milliseconds for time per output token. (developer.nvidia.com)

NVIDIA’s software answer to this is Dynamo, an open-source inference framework announced at GTC 2025. Dynamo splits the prompt-reading job from the token-writing job, routes requests across many graphics processors, and tries to keep memory and data movement from becoming the bottleneck. (developer.nvidia.com 1) (developer.nvidia.com 2) The new trick in this story is a scheduling pattern NVIDIA calls disaggregated prefill and decode. In plain English, it means one pool of chips handles the heavy first read of the prompt while another pool handles the long tail of token generation, instead of forcing the same chip to do both jobs back to back. (developer.nvidia.com)

On a rack as large as GB200 NVL72, a single-digit gain is not cosmetic. NVIDIA says the rack’s 72-chip NVLink fabric delivers 130 terabytes per second of low-latency graphics-processor communication, so a software change that uses that fabric better can turn into more requests served without adding more racks. (nvidia.com) That is why an 8.8% per-graphics-processor throughput lift gets attention even though it sounds small next to the usual “2x” marketing numbers. At hyperscale, 8.8% can mean fewer servers bought for the same traffic load, or the same servers drawing less power per unit of work, which is exactly the math cloud operators care about. (blogs.nvidia.com) (nvidia.com)

It also helps explain why demand keeps clustering around the newest accelerator racks instead of leveling off after launch. When model sizes rise, latency targets tighten, and software keeps finding another few percent on top of new silicon, the newest rack stops looking like a luxury and starts looking like the cheaper way to serve the same number of users. (blogs.nvidia.com) (developer.nvidia.com)
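
To make the "fewer servers for the same traffic" math above concrete, here is a rough back-of-the-envelope sketch. Only the 8.8% lift and the 72 graphics processors per rack come from the reporting; the baseline per-GPU throughput and the fleet load are hypothetical numbers picked purely for illustration, not NVIDIA or MLPerf figures.

```python
import math

GAIN = 0.088          # per-GPU throughput lift quoted in the article
GPUS_PER_RACK = 72    # GPUs in one GB200 NVL72 rack

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
BASELINE_TOKENS_PER_GPU = 1_000.0   # assumed tokens/s per GPU before the change
FLEET_LOAD = 7_200_000.0            # assumed tokens/s the fleet must serve


def gpus_needed(load: float, per_gpu: float) -> int:
    """Smallest whole number of GPUs that can serve the load."""
    return math.ceil(load / per_gpu)


def racks_needed(gpus: int) -> int:
    """Whole racks required to house that many GPUs."""
    return math.ceil(gpus / GPUS_PER_RACK)


gpus_before = gpus_needed(FLEET_LOAD, BASELINE_TOKENS_PER_GPU)
gpus_after = gpus_needed(FLEET_LOAD, BASELINE_TOKENS_PER_GPU * (1 + GAIN))

racks_before = racks_needed(gpus_before)
racks_after = racks_needed(gpus_after)

# Fraction of GPUs saved for a fixed load: 1 - 1 / (1 + gain)
gpu_saving = 1 - 1 / (1 + GAIN)

print(f"GPUs:  {gpus_before} -> {gpus_after}")
print(f"Racks: {racks_before} -> {racks_after}")
print(f"GPU saving: {gpu_saving:.1%}")
```

Under these assumed inputs, a fleet of 100 racks shrinks to 92 for the same load. Note that an 8.8% throughput gain works out to roughly 8.1% fewer GPUs rather than 8.8% fewer, because the saving is 1 - 1/1.088.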
