Nvidia pushes 'cost-per-token'

Nvidia is reframing data-center procurement around a single total-cost metric it calls 'cost per token', arguing that buyers should assess hardware, software and utilisation together. The messaging is meant to change procurement conversations, and it emphasises software optimisation, ecosystem support and real-world utilisation as part of total cost of ownership (TCO). (blogs.nvidia.com/blog/lowest-token-cost-ai-factories) (datacenterknowledge.com/infrastructure/nvidia-pushes-cost-per-token-as-defining-metric-for-ai-data-centers)

Nvidia is trying to change how companies buy artificial intelligence (AI) infrastructure by pushing one number above all others: cost per token. (blogs.nvidia.com) (datacenterknowledge.com) A token is a small unit of generated text, and Nvidia said on April 15 that data centers running generative and agentic AI should be judged by the all-in cost to produce those units, usually measured as cost per million tokens. (blogs.nvidia.com) The company is drawing a line between older purchasing yardsticks, such as raw compute cost and floating point operations per second per dollar, and a newer output metric tied to inference, the stage when a trained model answers prompts. (blogs.nvidia.com) (datacenterknowledge.com)

Nvidia’s argument rests on simple arithmetic: buyers often fixate on the hourly price of a graphics processing unit (GPU), but the bigger swing factor is how many tokens a system actually delivers for that price. (blogs.nvidia.com) (datacenterknowledge.com) That token output depends on more than the chip. Nvidia pointed to interconnects, lower-precision formats such as four-bit floating point, speculative decoding, key-value cache management, and overall system utilization as the pieces that decide real-world throughput. (datacenterknowledge.com)

The message lands as inference takes a larger share of AI spending, shifting attention from training giant models once to serving answers continuously at scale. Data Center Knowledge said that change is pushing operators toward throughput, latency, and efficiency metrics instead of peak compute alone. (datacenterknowledge.com) Nvidia backed the pitch with its own Blackwell-versus-Hopper comparison using DeepSeek-R1, saying Blackwell systems cost about twice as much per compute hour but can deliver up to 65 times more tokens per second per GPU. (datacenterknowledge.com)

The company has been building that case for months. In a February 12 post, Nvidia said Baseten, DeepInfra, Fireworks AI, and Together AI were cutting cost per token by up to 10 times on Blackwell with optimized inference software and open-source models. (blogs.nvidia.com) It tied the same idea to benchmark results on April 1, when Nvidia said Blackwell Ultra posted the highest throughput in MLPerf Inference version 6.0 and that software changes on the same infrastructure produced up to 2.7 times more throughput and more than 60% lower cost per token. (developer.nvidia.com) (mlcommons.org)

Not everyone is ready to accept cost per token as the single buying metric. Data Center Knowledge reported that analysts see the framework as better suited to hyperscale environments and still early for many enterprise IT teams, which often buy for mixed workloads, limited utilization, and longer replacement cycles. (datacenterknowledge.com) The immediate effect is less about a new formula than a new sales argument: Nvidia wants buyers to compare complete systems, software stacks, and utilization rates, not just the hourly price of a chip. (blogs.nvidia.com) (datacenterknowledge.com)
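
The arithmetic behind the Blackwell-versus-Hopper comparison can be sketched in a few lines of Python. This is a minimal illustration, not Nvidia's methodology: the function name, the baseline hourly price, and the baseline throughput are hypothetical placeholders, and only the two ratios (about 2 times the hourly cost, up to 65 times the tokens per second) come from the reporting above. Because 65 times is an upper bound, the result is a best case.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second, utilization=1.0):
    """Cost in USD to produce one million tokens on one GPU.

    utilization is the fraction of each paid hour spent generating tokens.
    """
    if hourly_cost_usd < 0 or tokens_per_second <= 0:
        raise ValueError("cost must be non-negative and throughput positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical baseline figures, chosen only to make the arithmetic concrete.
baseline_hourly_usd = 2.00
baseline_tokens_per_second = 100.0

# Ratios taken from the comparison reported in the article.
newer_hourly_usd = baseline_hourly_usd * 2                # about twice the hourly cost
newer_tokens_per_second = baseline_tokens_per_second * 65  # up to 65x tokens per second

baseline_cost = cost_per_million_tokens(baseline_hourly_usd, baseline_tokens_per_second)
newer_cost = cost_per_million_tokens(newer_hourly_usd, newer_tokens_per_second)

# How many times cheaper each token is on the newer system (best case).
improvement = baseline_cost / newer_cost

# The same baseline hardware at half utilization costs twice as much per token.
half_utilized_cost = cost_per_million_tokens(
    baseline_hourly_usd, baseline_tokens_per_second, utilization=0.5
)

print(f"baseline: ${baseline_cost:.2f} per million tokens")
print(f"newer:    ${newer_cost:.2f} per million tokens")
print(f"best-case improvement: {improvement:.1f}x")
```

The placeholder prices cancel out of the ratio, so the best-case improvement depends only on the two reported multipliers: 65 divided by 2, or 32.5 times lower cost per token. The utilization parameter shows why analysts flag limited utilization as a caveat, since an idle system raises cost per token in direct proportion.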
