Nvidia promises cheaper inference
Public reporting says Nvidia’s next architecture, Vera Rubin, is expected to cut inference token cost by up to 10x compared with Blackwell, shifting the competitive focus from raw benchmark speed toward inference economics. Vendors are already selling on token economics and cloud providers are preparing for Rubin deployments, which could change which inference-heavy features are economically viable. (techi.com)
Inference is the part of artificial intelligence that turns a trained model into answers, and Nvidia says its next Rubin systems will make that step far cheaper than Blackwell. (nvidia.com) Nvidia said on January 5, 2026 that the Rubin platform would cut inference token cost by up to 10x versus Blackwell, and on March 16 it said Vera Rubin had entered full production. (nvidianews.nvidia.com 1) (nvidianews.nvidia.com 2)
A token is a small chunk of text a model reads or writes, and cloud bills often scale with how many of those chunks a service generates. Nvidia’s developer blog says Vera Rubin NVL72 can deliver up to 10x lower cost per million tokens than Blackwell NVL72 on long-context, reasoning-heavy inference. (developer.nvidia.com)
Rubin is not just a new graphics chip. Nvidia says the platform combines a Vera central processor, Rubin graphics processors, NVLink 6 networking, ConnectX-9 SuperNICs, BlueField-4 data processing units, Spectrum-6 Ethernet, and a Groq 3 LPX inference accelerator into one rack-scale system. (investor.nvidia.com) (nvidianews.nvidia.com)
That matters because the economics of serving answers are becoming as important as training benchmark scores. Nvidia’s inference page says Rubin pairs GPUs for “prefill,” the first heavy read of a prompt, with LPX hardware for “decode,” the fast token-by-token generation that follows. (nvidia.com)
Vendors are already selling the current generation on token economics, not just raw speed. Nvidia said on February 12 that Baseten, DeepInfra, Fireworks AI and Together AI were cutting cost per token by up to 10x on Blackwell with optimized inference stacks. (blogs.nvidia.com)
Cloud providers are also laying the groundwork for the next cycle. Microsoft said in January that Azure data centers were being designed for Vera Rubin NVL72 deployments, while Amazon has already put Grace Blackwell GB200 UltraServers into general availability for training and inference. (azure.microsoft.com) (aws.amazon.com)
Nvidia is framing the next contest around cost per token, tokens per watt, and rack efficiency because those numbers decide whether always-on agents, long-context assistants, and reasoning models can be sold at a profit. Its March technical post said Rubin POD racks were built for high-throughput, low-latency inference and dense context memory at data-center scale. (developer.nvidia.com)
Rivals are pushing the same argument from the other side. Cerebras said in late 2025 that customers were asking not only for speed but for “price-performance,” and published token-per-second and cost-per-million-token comparisons against Blackwell-based services. (cerebras.ai)
If Rubin delivers the 10x token-cost cut Nvidia is advertising, the next fight in artificial intelligence infrastructure will be over how cheaply clouds can sell answers, not just how fast chips can post scores. (nvidia.com)
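For readers who want to see how a cost-per-million-token figure is typically worked out, the short Python sketch below does the arithmetic. Every number in it is a made-up placeholder, not a figure from Nvidia, any cloud provider, or the reporting cited above, and the function names are hypothetical. It assumes a simple model in which a system costs a fixed amount per hour and generates tokens at a steady rate, then applies the advertised "up to 10x" reduction as a best-case divisor.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars spent per one million generated tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def monthly_bill(tokens_per_month: float, usd_per_million: float) -> float:
    """Total monthly spend for a given token volume and unit price."""
    return tokens_per_month / 1_000_000 * usd_per_million


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
HOURLY_COST_USD = 100.0          # assumed all-in hourly cost of a serving system
TOKENS_PER_SECOND = 10_000.0     # assumed sustained output across all users
TOKENS_PER_MONTH = 5_000_000_000 # assumed monthly volume for one product
CLAIMED_REDUCTION = 10.0         # "up to 10x" taken as a best case, not a measurement

baseline = cost_per_million_tokens(HOURLY_COST_USD, TOKENS_PER_SECOND)
best_case = baseline / CLAIMED_REDUCTION

print(f"baseline:  ${baseline:.2f} per million tokens, "
      f"${monthly_bill(TOKENS_PER_MONTH, baseline):,.2f} per month")
print(f"best case: ${best_case:.2f} per million tokens, "
      f"${monthly_bill(TOKENS_PER_MONTH, best_case):,.2f} per month")
```

The point of the exercise is that the unit price depends on both hourly cost and throughput, so a faster system that costs proportionally more to run does not lower the bill. That is why vendors quote cost per token rather than speed alone.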