NVIDIA boosts inference efficiency

NVIDIA quietly shipped a software optimization that raises inference throughput per GPU, which is useful if you run lots of inference racks. The company released DWDP (Distributed Weight Data Parallelism), an inference optimization for GB200 NVL72 racks that improved tokens per second per GPU (TPS/GPU) by about 8.8% on the DeepSeek-R1 benchmark, with the tradeoff of slower first-token latency. (x.com) That’s the kind of incremental performance win that can change rack-level economics without new silicon, and it is worth watching if you manage cluster cost or model serving SLAs.

Serving a large language model is a two-part job: first the system reads your prompt, then it generates one token at a time. The first part is measured by time to first token, and the second part is where operators count tokens per second because that is what fills the rack. (developer.nvidia.com) The expensive part is that many modern models no longer fit on one graphics processor, so one answer has to be split across many chips. Every time those chips stop and wait for each other, the whole job slows down like a 72-lane toll booth forced to move at the speed of the slowest car. (arxiv.org)

DeepSeek-R1 is one of the models that makes this hard because it uses a mixture-of-experts design. That means each token only wakes up a small subset of the model’s specialists instead of all 671 billion parameters at once. (developer.nvidia.com) That design saves compute, but it creates a routing problem. Different tokens get sent to different experts, so some graphics processors end up busier than others, and the usual parallel methods force the fast chips to wait at every layer for the slow chips to catch up. (arxiv.org)

NVIDIA’s new trick is called Distributed Weight Data Parallelism. Instead of making every chip hold the same expert weights or stop for collective synchronization, it lets a chip fetch missing expert weights from peer chips on demand and keep moving independently. (arxiv.org) The company says it implemented this in TensorRT-LLM, its inference software stack, and tested it on a GB200 NVL72 rack with DeepSeek-R1. In an 8,000-token input and 1,000-token output setup, NVIDIA reported about an 8.8% gain in output tokens per second per graphics processor at comparable user throughput in the 20 to 100 tokens-per-second-per-user range. (arxiv.org)

The hardware matters here because GB200 NVL72 is not a normal server. NVIDIA sells it as a liquid-cooled rack with 36 Grace central processors, 72 Blackwell graphics processors, and a single NVLink domain with 130 terabytes per second of rack-scale GPU communication. (nvidia.com) That huge internal fabric is what makes remote weight fetching practical. If one chip has to borrow an expert from another chip, fifth-generation NVLink gives each graphics processor 1,800 gigabytes per second of bidirectional bandwidth, which is much faster than pushing the same traffic across a looser cluster network. (developer.nvidia.com)

The tradeoff is speed at the very start of an answer. NVIDIA’s own discussion of modern inference setups separates time to first token from steady decode speed, and this optimization is useful when operators care more about rack throughput than shaving every millisecond off the first word. (developer.nvidia.com)

That is why a single-digit software gain can change economics without any new silicon. If a fleet already runs Blackwell racks, an 8.8% throughput lift means more tokens sold per graphics processor, lower cost per million tokens, and a little more life squeezed out of the same hardware budget. (arxiv.org)
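To make the economics concrete, here is a rough back-of-the-envelope sketch in Python. Only the 8.8% gain and the 72 graphics processors per rack come from the reporting above. The hourly cost and the baseline tokens per second are made-up placeholders (the names GPU_HOUR_COST and BASELINE_TPS are hypothetical), so treat the dollar figures as illustration, not as NVIDIA's numbers.

```python
def cost_per_million_tokens(gpu_hour_cost: float, tokens_per_sec_per_gpu: float) -> float:
    """Dollars spent on one GPU to generate one million output tokens."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000


GPUS_PER_RACK = 72      # from the article: GB200 NVL72 has 72 Blackwell GPUs
GAIN = 0.088            # from the article: ~8.8% more output tokens/sec per GPU

GPU_HOUR_COST = 3.00    # ASSUMPTION: placeholder all-in cost per GPU-hour, in dollars
BASELINE_TPS = 1000.0   # ASSUMPTION: placeholder baseline output tokens/sec per GPU

improved_tps = BASELINE_TPS * (1 + GAIN)

baseline_cost = cost_per_million_tokens(GPU_HOUR_COST, BASELINE_TPS)
improved_cost = cost_per_million_tokens(GPU_HOUR_COST, improved_tps)

# Cost scales with 1 / throughput, so the saving is 1 - 1/(1 + GAIN), not GAIN itself.
saving = 1 - improved_cost / baseline_cost

# Extra tokens per second across one full rack at the same hardware cost.
extra_rack_tps = (improved_tps - BASELINE_TPS) * GPUS_PER_RACK

print(f"baseline cost per 1M tokens: ${baseline_cost:.4f}")
print(f"improved cost per 1M tokens: ${improved_cost:.4f}")
print(f"cost reduction: {saving:.2%}")
print(f"extra tokens/sec per rack: {extra_rack_tps:.0f}")
```

Whatever placeholder inputs you choose, the cost reduction works out to roughly 8.1% rather than 8.8%, because cost per token falls with the reciprocal of throughput. The absolute dollar amounts depend entirely on the assumed hourly cost and baseline speed.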
