Vera Rubin promises 90% inference cut

- NVIDIA’s Vera Rubin story hardened on May 1 into an economics claim: the next platform is being sold on cheaper inference, not just faster chips.
- The number driving that shift is up to 10x lower inference token cost versus Blackwell, which investors restated as roughly a 90% cut. (investor.nvidia.com)
- That matters because AI demand is moving from training into inference, where token cost, power use, and rack efficiency decide who wins. (nvidia.com)

NVIDIA’s next chip pitch is getting more specific. Vera Rubin is not just “the next GPU.” It is being framed as a machine for making AI answers cheaper to produce. That is the real story. On May 1, the investor chatter around Rubin zeroed in on a simple claim: inference could get about 90% cheaper than on Blackwell GB200 systems, which is another way of saying up to 10x lower token cost. (investor.nvidia.com)

### What is the actual claim here?

NVIDIA [...] deliver “up to 10x” lower inference token cost than the Blackwell platform, alongside a 4x reduction in the number of GPUs needed to train mixture-of-experts models. (nvidia.com) That is the source of the “90% cheaper” line now circulating in stock commentary: a cost that is 10x lower is one-tenth of the original, which is a 90% reduction at the top of the claimed range. (investor.nvidia.com)

[...] Training a model is the giant upfront expense. Inference is every answer after that: every chatbot reply, coding suggestion, search summary, and agent step. If those answers get dramatically cheaper, providers can cut prices, widen margins, or serve much more demand on the same power budget. NVIDIA’s own inference pages now lean hard into “cost per token,” “tokens per watt,” and profitability, which tells you where customer conversations are going. (nvidia.com)

### Why now, not six months ago?

Because the workload mix is changing. NVIDIA has been arguing for months that the market is moving from a training-heavy phase into an inference-heavy one, especially for agentic AI and coding assistants. Its February post on Blackwell Ultra made the point directly: these workloads need low latency, long context, and much better throughput per megawatt. In other words, raw peak performance still matters, but only if it lowers the bill for useful output. (blogs.nvidia.com)

### What makes Rubin different?

The pitch [...] (nvidia.com) NVIDIA is bundling the Vera CPU, Rubin GPU, NVLink 6, ConnectX-9, BlueField-4, and Spectrum-6 into a rack-scale system, then pairing that hardware with software tuned for the inference path. In short, the company is saying the savings come from the whole stack working together, not from a single benchmark hero number. (investor.nvidia.com)

[...] NVIDIA is still publishing fresh Blackwell gains, including claims of up to 35x lower cost per token for GB300 NVL72 versus Hopper and major throughput gains from software updates on GB200. That matters because Rubin’s economics story only lands if customers already believe NVIDIA can keep squeezing more output from each generation through software as well as silicon. (nvidia.com)

[...] revenue fits inside my power cap.” That is a more durable sales argument, because enterprise AI deployments live or die on operating economics, not bragging rights. NVIDIA is trying to make Rubin the default answer to that spreadsheet. (nvidia.com)

### What is the catch?

The phrase is “up to 10x,” not a universal guarantee. Real savings depend on model type, latency target, software stack, and [...]. (nvidia.com) But even with that caveat, the direction is clear: NVIDIA wants the market to judge Vera Rubin less like a faster engine and more like a cheaper cost structure for AI. (investor.nvidia.com)

### Bottom line

The important change is not that Rubin sounds [...]. (nvidia.com) [...] Rubin is being sold as an inference economics machine: a way to cut token cost enough that more AI workloads become worth running at scale. (investor.nvidia.com)
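
For readers who want to check the headline arithmetic, here is a minimal sketch of how an “Nx lower cost” claim converts into a percentage cut, and what it does to the volume a fixed budget can buy. The function names, the baseline price, and the budget are hypothetical and purely illustrative; none of them are NVIDIA figures.

```python
def percent_cut(reduction_factor: float) -> float:
    """Percent cost reduction implied by an 'Nx lower cost' claim."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    # New cost is 1/N of the old cost, so the cut is 1 - 1/N.
    return (1.0 - 1.0 / reduction_factor) * 100.0


def tokens_for_budget(budget: float, cost_per_token_unit: float) -> float:
    """Token units a fixed budget buys at a given cost per token unit."""
    return budget / cost_per_token_unit


if __name__ == "__main__":
    # Hypothetical baseline: 1.0 cost unit per million tokens (illustrative only).
    baseline_cost = 1.0
    budget = 100.0
    for factor in (2, 4, 10):
        new_cost = baseline_cost / factor
        print(
            f"{factor}x lower cost: {percent_cut(factor):.0f}% cut, "
            f"{tokens_for_budget(budget, new_cost):.0f}M tokens per {budget:.0f} units"
        )
```

The formula also shows why percentages compress large multipliers: a 4x claim is a 75% cut, 10x is 90%, and doubling again to 20x would move the cut only to 95%. Because the claim is “up to 10x,” 90% is the ceiling of the range, not a typical result.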
