NVIDIA GB200 NVL72 excels in MoE

- NVIDIA’s GB200 NVL72 story is less about a single new benchmark today and more about a growing pile of Blackwell MoE results becoming concrete.
- The number that keeps showing up is 10x for MoE on NVL72, plus newer software gains such as up to 2.8x more throughput per GPU in three months.
- That matters because frontier open models are increasingly MoE, so inference buyers now care as much about interconnect and routing as raw FLOPs.

The hardware story here is rack-scale AI inference: not just faster chips, but faster communication between chips. That matters because the hottest open models now use mixture-of-experts, or MoE, which is a clever way to make giant models cheaper per token. The catch is that MoE shifts the bottleneck. Compute still matters, but traffic between GPUs starts to matter just as much. That is why NVIDIA keeps framing GB200 NVL72 as the Blackwell system built for this exact problem.

### What is GB200 NVL72, exactly?

GB200 NVL72 is NVIDIA’s full-rack system with 72 Blackwell GPUs and 36 Grace CPUs tied together as one giant NVLink domain. NVIDIA’s pitch is simple: instead of treating a rack like lots of separate boxes, treat it like one massive accelerated computer with very high internal bandwidth. On NVIDIA’s product page, the company says that layout delivers 10x greater performance for MoE architectures and 30x faster real-time trillion-parameter LLM inference versus H100-based setups in the cited scenarios. (developer.nvidia.com)

### Why are MoE models the hard case?

An MoE model does not use its whole parameter set for every token. It routes each token to a smaller subset of “experts.” That saves compute, which is why models like DeepSeek lean on the design. But every routing decision creates traffic: tokens and activations have to move to the right experts, often across many GPUs. So the problem stops being “how many tensor cores do you have?” and becomes “how fast can the rack shuffle data without stalling?” Google’s A4X writeup says the main constraint for these giant MoE systems has shifted from raw compute density to communication latency and memory bandwidth. (nvidia.com)

### Why does NVL72 help so much?

Because NVL72 is built around keeping that expert traffic on a very fat internal fabric.
NVIDIA says the rack provides a 72-GPU NVLink domain with 1,800 GB/s bidirectional bandwidth between chips, and 130 TB/s of low-latency GPU communications across the NVLink switch system. Basically, MoE rewards short distances and wide pipes. If expert routing can stay inside one tightly coupled rack instead of hopping across a looser cluster, throughput rises and latency falls. (cloud.google.com)

### Is this just hardware, or software too?

Very much software too. NVIDIA’s January Blackwell MoE post says TensorRT-LLM updates alone boosted reasoning inference throughput by up to 2.8x per Blackwell GPU in three months. The optimizations it highlights are exactly the ones you would expect for MoE serving: NVFP4, multi-token prediction, disaggregated serving, and better all-to-all communication primitives. So when people say “GB200 is winning MoE,” they usually mean the rack plus the software stack plus quantization tricks, not the silicon in isolation. (developer.nvidia.com)

### What do the real deployments look like?

Google Cloud’s January A4X recipe is a good clue. It pairs GB200 NVL72 with NVIDIA Dynamo and publishes deployment patterns for throughput-optimized and latency-optimized MoE serving. In one 8K input / 1K output setup, Google says it reached over 6K total tokens per second per GPU in throughput mode and 10 ms inter-token latency in latency mode. That is not a lab curiosity; it is a blueprint for operators trying to serve frontier MoE models at scale. (developer.nvidia.com)

### Why is this landing now?

Because the benchmark world is getting more public. SemiAnalysis’ InferenceX has become a live comparison layer for modern inference hardware, with continuous results across NVIDIA and AMD systems. At the same time, NVIDIA’s newer DeepSeek-V4 materials make the commercial angle explicit: MoE and long-context models are no longer edge cases. They are becoming the default shape of high-end inference demand.
(cloud.google.com)

### So what should buyers take from this?

The lesson is that MoE inference is exposing a different kind of winner. The best system is not just the GPU with the biggest spec sheet. It is the one that can move expert traffic cheaply, keep latency predictable, and absorb software improvements fast. Right now, that is the lane GB200 NVL72 seems built to dominate. (developer.nvidia.com) (inferencex.semianalysis.com)
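
For readers who want to see why expert routing turns into an interconnect problem, here is a rough Python sketch. It is a toy model, not NVIDIA's or any framework's routing code: the router picks experts uniformly at random (real routers are learned and uneven), experts are assumed to sit in equal contiguous blocks across GPUs, and the expert count, `top_k`, and all function names are hypothetical. The last helper simply multiplies the 72 GPUs by the 1,800 GB/s figure quoted above, assuming that figure applies per GPU, to show how it lines up with the roughly 130 TB/s aggregate number.

```python
import random


def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Pick top_k distinct experts per token.

    Uniform random choice is a stand-in for a learned router (assumption).
    """
    rng = random.Random(seed)
    return [rng.sample(range(num_experts), top_k) for _ in range(num_tokens)]


def expert_to_gpu(expert_id, num_experts, num_gpus):
    """Map an expert to a GPU, assuming equal contiguous blocks per GPU."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu


def cross_gpu_fraction(assignments, num_experts, num_gpus):
    """Fraction of token-to-expert dispatches that leave the token's home GPU.

    Tokens are assumed to be spread round-robin across GPUs.
    """
    remote = 0
    total = 0
    for token_idx, experts in enumerate(assignments):
        home_gpu = token_idx % num_gpus
        for expert_id in experts:
            total += 1
            if expert_to_gpu(expert_id, num_experts, num_gpus) != home_gpu:
                remote += 1
    return remote / total if total else 0.0


def aggregate_nvlink_tb_per_s(num_gpus=72, per_gpu_gb_per_s=1800):
    """Back-of-envelope aggregate bandwidth: GPUs times per-GPU bandwidth."""
    return num_gpus * per_gpu_gb_per_s / 1000


if __name__ == "__main__":
    # Hypothetical shape: 144 experts over 72 GPUs, 8 experts per token.
    assignments = route_tokens(num_tokens=1000, num_experts=144, top_k=8)
    frac = cross_gpu_fraction(assignments, num_experts=144, num_gpus=72)
    print(f"dispatches leaving the home GPU: {frac:.1%}")
    print(f"aggregate NVLink bandwidth: {aggregate_nvlink_tb_per_s()} TB/s")
```

Under these assumptions, each dispatch lands on the token's home GPU with probability 1/72, so nearly every expert call crosses the fabric. That is the intuition behind treating the whole rack, rather than a single GPU, as the unit that matters for MoE serving.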
