Blackwell cuts inference costs up to 35x

- NVIDIA and SemiAnalysis said on February 16 that GB300 NVL72 Blackwell Ultra systems can cut AI inference token costs 35x versus Hopper.
- The eye-catching detail is 50x more throughput per megawatt, a power-and-economics jump aimed at low-latency, long-context agent workloads.
- Cheaper tokens are not shrinking AI buildouts. They’re accelerating hyperscaler spending on racks, networking, memory, power, and whole data centers.

NVIDIA’s Blackwell story is not really “GPUs got faster.” It’s “serving AI got dramatically cheaper,” at least for the kinds of workloads everyone suddenly cares about, like coding agents and long-context assistants. On February 16, NVIDIA highlighted new SemiAnalysis benchmark data showing GB300 NVL72 Blackwell Ultra systems delivering up to 35x lower cost per token than Hopper, plus 50x higher throughput per megawatt on some inference workloads. That matters because the bottleneck in AI has shifted. Training still matters, but the bill that keeps showing up every day is inference: the cost of answering user requests at scale. (blogs.nvidia.com)

### What changed here?

The headline change is economic, not just technical. NVIDIA is saying Blackwell Ultra turns the same data-center power budget into far more tokens, which means more user queries, more agent steps, and more revenue per rack. The company’s public pitch is very explicit now: GB300 NVL72 is about “lowest token cost,” not just peak performance. That is a subtle but important shift in how AI infrastructure gets sold. (nvidia.com)

### Why does “cost per token” matter so much?

Because inference is the meter running after the model is built. Every chatbot reply, every coding suggestion, every multistep agent workflow burns tokens and ties up hardware. If the cost per token falls hard enough, products that looked marginal suddenly look viable. A service can offer longer context windows, faster responses, or l[…] [c]heaper inference changes what product teams are willing to ship. (nvidia.com)

### Why is the gain so large?

Part of it is the chip, but the bigger story is the whole stack. NVIDIA keeps stressing “hardware and software codesign”: new low-precision math, faster interconnect, rack-scale systems, and software like TensorRT-LLM and Dynamo that keeps improving after the hardware ships.
In the February data, NVIDIA also said TensorRT-LLM improvements alone boos[…] [S]o the 35x number is not a simple one-chip-versus-one-chip apples-to-apples story. It is the result of a tuned system. (blogs.nvidia.com)

### Why are agents the key use case?

Agents are expensive in a different way from plain chat. They often need low latency, long context, and multiple back-and-forth reasoning steps. NVIDIA tied this directly to coding assistants and agentic workflows, noting that software-programming-related AI queries rose from 11% to about 50% last year in[…]g through a lot more tokens. (blogs.nvidia.com)

### Does cheaper inference mean less infrastructure spending?

Turns out, probably the opposite. When the unit cost drops, demand usually expands to fill the gap. Better economics make it easier to justify rolling AI into more products, serving more users, and letting each request consume more compute. That is one reason the infrastructure bu[…]an to invest $500 billion over four years in U.S. AI infrastructure. (openai.com)

### So where does the money go now?

Not just into the GPU die. Once inference gets cheaper and demand rises, the pressure shifts toward everything around the chip: networking, memory bandwidth, power delivery, cooling, and physical data-center capacity. Futurum estimated in February that Microsoft, Alphabet, Amazon, Meta, and Oracle together were on track for $660 billion t[…]centers, and networking. The constraint is increasingly system scale. (futurumgroup.com)

### What’s the catch?

The catch is that “up to 35x” is a best-case benchmark framing, not a universal law. Real savings depend on workload shape, latency targets, software maturity, and whether buyers can actually deploy full rack-scale Blackwell systems efficiently. But even if the real-world gain is materially lower, the direction is clear: inference economics are improving fast enough to change buying behavior.
(blogs.nvidia.com)

### Bottom line?

Blackwell is making AI serving cheaper by a margin large enough to matter. But cheaper tokens do not mean a smaller AI buildout; they mean more reasons to build. The center of gravity is moving from “who has the best model demo” to “who can deliver the most intelligence per watt, per rack, and per dollar.” (nvidia.com)
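
To make the “per watt” and “per dollar” framing concrete, here is a rough back-of-the-envelope sketch in Python. Only the 50x and 35x multipliers come from the figures quoted above, and both are “up to” numbers. The `Fleet` class, the `apply_best_case` helper, and every baseline value (power budget, tokens per second per megawatt, dollars per million tokens) are hypothetical placeholders chosen for illustration, not measured data from NVIDIA or SemiAnalysis.

```python
from dataclasses import dataclass

# "Up to" multipliers quoted in the article; real workloads may see less.
THROUGHPUT_PER_MW_GAIN = 50.0
COST_PER_TOKEN_REDUCTION = 35.0


@dataclass(frozen=True)
class Fleet:
    """Hypothetical description of an inference deployment (illustrative only)."""
    power_budget_mw: float          # fixed data-center power budget, in megawatts
    tokens_per_sec_per_mw: float    # serving throughput per megawatt
    cost_per_million_tokens: float  # serving cost in dollars per million tokens

    def tokens_per_second(self) -> float:
        # Total throughput scales linearly with the power budget in this toy model.
        return self.power_budget_mw * self.tokens_per_sec_per_mw


def apply_best_case(baseline: Fleet) -> Fleet:
    """Scale a baseline fleet by the best-case multipliers, holding power fixed."""
    return Fleet(
        power_budget_mw=baseline.power_budget_mw,
        tokens_per_sec_per_mw=baseline.tokens_per_sec_per_mw * THROUGHPUT_PER_MW_GAIN,
        cost_per_million_tokens=baseline.cost_per_million_tokens / COST_PER_TOKEN_REDUCTION,
    )


if __name__ == "__main__":
    # All baseline numbers below are made up for the example.
    baseline = Fleet(power_budget_mw=10.0,
                     tokens_per_sec_per_mw=1_000.0,
                     cost_per_million_tokens=7.0)
    best_case = apply_best_case(baseline)
    print("baseline tokens/s: ", baseline.tokens_per_second())
    print("best-case tokens/s:", best_case.tokens_per_second())
    print("best-case $/M tokens:", best_case.cost_per_million_tokens)
```

The point of the sketch is the shape of the argument rather than the numbers: with the power budget held constant, any gain in throughput per megawatt shows up directly as more tokens served, which is why the pitch centers on tokens per watt and per dollar. Real deployments would likely land below these best-case multipliers.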

Shared from Scout - Be the smartest in the room.