Inference costs plunged — $150 per 1M tokens claim

Social media reports from GTC claimed that centralized inference costs fell roughly 10x, to about $150 per 1M tokens. The same posts cited approvals of NVIDIA H200 GPUs for China (400k+ units) as boosting supply and as a major driver of current pricing. If the figures hold, they change the economics of running large inference fleets. (x.com)

NVIDIA’s GTC keynote formally framed a layered “token” pricing architecture for inference services with multiple paid tiers tied to latency, context length and throughput, and the company published full session materials and a recording of the keynote. (nvidia.com) (youtube.com) Independent market analysis and public price trackers show API and self‑hosted inference rates for many mid‑tier and budget models have already fallen to fractions of a dollar per million tokens, with one industry estimate putting GPT‑4‑level equivalence near $0.40/1M tokens. (introl.com)

Reuters reporting and multiple trade outlets say Beijing has conditionally approved purchases of NVIDIA’s newest data‑center GPUs for several of China’s biggest internet companies, signalling regulatory permission for targeted commercial imports. (marketscreener.com) Those approvals, sources told Reuters, include export and domestic‑policy conditions that some buyers describe as restrictive, and parts of the approval process have delayed conversion of authorizations into firm purchase orders. (datacenterdynamics.com) Earlier reporting showed Chinese customers had placed orders totaling more than two million units while visible H200 supply sat below one million units and U.S. export rules cap shipments to approved recipients at under half the volume of U.S. domestic sales. (datacenterfrontier.com) (datacenterdynamics.com)

NVIDIA’s Vera Rubin platform and the announced Groq partnership were presented with token‑throughput and tokens‑per‑watt uplift claims used to justify higher‑performance pricing tiers, with company materials and technical briefings citing multi‑dozen‑fold efficiency gains versus the prior generation. (newegg.com)

Market observers warn that even if conditional approvals turn into shipments, production capacity, third‑party verification requirements, export caps and local procurement rules will make any downward pressure on cloud‑rental and spot GPU prices patchy and gradual across regions and providers. (networkworld.com)
