NVIDIA’s Blackwell Wins
NVIDIA showcased how inference providers using its Inference Reference Architecture on Blackwell are seeing large performance and efficiency gains, with models like Kimi K2.5 topping leaderboards. (x.com) The demos underline that real‑world inference gains are coming from optimized stacks and reference architectures, not just raw model scale. (x.com)
NVIDIA’s latest Blackwell pitch is not really about a faster chip. It is about a faster system. In recent demos and blog posts, the company has been showing that inference providers running open models on Blackwell are getting higher throughput and lower cost, especially when they use NVIDIA’s full software stack instead of treating the GPU like a raw commodity. That is the real story behind the company’s “Inference Reference Architecture,” a blueprint for how to wire together hardware, runtimes, memory layers, orchestration, and serving software so modern models actually move quickly in production (docs.nvidia.com, blogs.nvidia.com).
That matters because inference is where AI becomes expensive. Training a model is a one-time shock. Serving it to millions of users is the daily bill. NVIDIA says providers including Baseten, DeepInfra, Fireworks AI, and Together AI are cutting cost per token by as much as 10x on Blackwell versus Hopper when they combine Blackwell hardware with TensorRT-LLM, low-precision formats like NVFP4, and newer orchestration software such as Dynamo (blogs.nvidia.com, investor.nvidia.com). NVIDIA’s own customer material gives a more concrete version of the same claim: Baseten says Blackwell, TensorRT-LLM, and Dynamo let it serve five times as many requests on busy endpoints, cut latency by up to 38%, and improve price-performance on frontier reasoning models by up to 225% (nvidia.com).
The reason the full stack pays off is that frontier open models have changed shape. The best ones are no longer just dense giants that light up every parameter for every token. They are usually mixture-of-experts (MoE) models, which route each token to a small set of specialized subnetworks. NVIDIA has been blunt about this shift. In a December 2025 post, it said the top 10 most intelligent open-source models on the Artificial Analysis leaderboard all used MoE designs, including DeepSeek-R1, Mistral Large 3, OpenAI’s gpt-oss-120B, and Moonshot AI’s Kimi K2 Thinking (blogs.nvidia.com). That architecture is more efficient in theory, but it is also harder to serve well because the experts are spread across GPUs and need to exchange data constantly.
This is where Blackwell’s rack-scale design starts to matter. NVIDIA’s GB200 NVL72 links 72 Blackwell GPUs with fifth-generation NVLink and NVLink Switch chips, giving each GPU 1,800 GB/s of bidirectional bandwidth to the rest of the rack, specifically to keep those expert-to-expert transfers from turning into a traffic jam (developer.nvidia.com, blogs.nvidia.com). NVIDIA says recent TensorRT-LLM updates alone raised throughput per Blackwell GPU by up to 2.8x in three months on DeepSeek-R1, and Dynamo 1.0 can boost Blackwell inference performance by up to 7x in some benchmarked workloads by routing requests and memory more intelligently across the cluster (developer.nvidia.com, investor.nvidia.com).
Kimi is a useful example because it shows both the model trend and the serving trend at once. NVIDIA’s February 4, 2026 technical post describes Kimi K2.5 as a 1 trillion-parameter multimodal MoE model with 384 experts, about 32.86 billion active parameters per token, and a 262K context window (developer.nvidia.com). On the provider side, Artificial Analysis shows Kimi K2.5 already spread across 15 API providers, with big variation in speed and latency. Clarifai led the listed providers at 404.7 output tokens per second, while Fireworks was among the lowest-latency options at 5.91 seconds to first token, and DeepInfra was the cheapest on blended price at $0.90 per million tokens (artificialanalysis.ai). That spread is the point. The model alone does not determine the user experience. The stack does.
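To make the mixture-of-experts idea concrete, here is a minimal Python sketch of top-k expert routing, the general technique of sending each token to only a few experts. The 384-expert count and the parameter figures come from the NVIDIA post cited above. The top-k value of 8, the random gate scores, and the function names are illustrative assumptions, not details of Kimi K2.5's actual router or of any NVIDIA API.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # Rank expert indices by gate score, highest first, and keep the top k.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Subtract the max score before exponentiating for numerical stability.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(active_params, total_params):
    """Share of the model's parameters that run for a single token."""
    return active_params / total_params


if __name__ == "__main__":
    NUM_EXPERTS = 384   # expert count reported for Kimi K2.5
    TOP_K = 8           # assumption for illustration only

    rng = random.Random(0)
    # Stand-in gate scores; a real router computes these from the token.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

    for expert, weight in top_k_route(logits, TOP_K):
        print(f"expert {expert:3d}  weight {weight:.3f}")

    # 32.86 billion active parameters out of 1 trillion total.
    print(f"active share per token: {active_fraction(32.86e9, 1e12):.2%}")
```

With the reported figures, only about 3.3% of the model's parameters run for any given token. The selected experts can sit on different GPUs, though, which is why the interconnect and the scheduling software matter as much as raw compute.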
So when NVIDIA says Blackwell is winning, it is really saying something narrower and more important. The company has turned inference into a systems contest that rewards integration. The GPU still matters. But the bigger advantage now comes from getting the routing, precision, interconnect, memory, and software layers to act like one machine. That is why NVIDIA’s reference architecture keeps showing up in the same sentence as the leaderboard results, and why a trillion-parameter Kimi model can look very different depending on whose endpoint you hit first (docs.nvidia.com, artificialanalysis.ai).