Inference Cost Paradox

- Operator discussions show inference, not training, is becoming the durable economic bottleneck as usage scales.
- Creators report inference costs fell ~1000x from 2022–2025, yet agentic workflows multiply tokens per task and can spike spend ~320%.
- That divergence makes model calls, state management, and batching central to budgeting and FinOps conversations (x.com).

Training built the current wave of artificial intelligence, but serving each answer is becoming the bill that sticks. Microsoft said on October 30, 2024 that demand on its cloud was “all inference,” not training. (datacenterdynamics.com)

Inference is the work a model does after it is trained: every prompt, tool call, search, and response. On that October 30, 2024 earnings call, Satya Nadella said Microsoft was even turning away some training demand because “we have so much demand on inference.” (datacenterdynamics.com)

The price of a single model call has dropped fast. Stanford’s 2025 AI Index said the inference cost for a system performing at GPT-3.5 level fell more than 280-fold between November 2022 and October 2024. (hai.stanford.edu)

Model vendors now advertise those lower unit prices in public rate cards. OpenAI lists GPT-5.4 at $2.50 per 1 million input tokens and $15 per 1 million output tokens, while Anthropic lists Claude Opus 4.7 at $5 and $25, and Google lists Gemini 3.1 Pro at $2 and $12 for prompts up to 200,000 tokens. (openai.com) (anthropic.com) (ai.google.dev)

The catch is that newer products do not make one call and stop. Anthropic says Opus 4.7 is built for “complex agentic workflows,” and Google markets Gemini 3.1 Pro for “agentic capabilities,” which means software that plans, searches, calls tools, and loops until it finishes a task. (anthropic.com) (ai.google.dev)

Each extra step adds tokens, storage, and latency. Google charges separately for context caching storage and for search grounding after 5,000 prompts a month, and OpenAI charges $10 per 1,000 web-search calls on top of token prices. (ai.google.dev) (openai.com)

That is why operators now talk less about the one-time cost of training and more about recurring usage costs. OpenAI says its Batch API cuts input and output costs by 50%, Anthropic says prompt caching can cut costs by up to 90% and batch processing by 50%, and Google offers a 50% Batch API reduction on paid tiers. (openai.com) (anthropic.com) (ai.google.dev)

Caching only works when requests repeat enough to hit the same prefix. OpenAI says cache effectiveness can fall when matching requests exceed about 15 per minute and notes in-memory caches generally stay active for 5 to 10 minutes of inactivity, up to one hour. (developers.openai.com)

The spending pressure sits on top of a much larger infrastructure buildout. Sequoia wrote on June 20, 2024 that the industry had become “AI’s $600B question,” using Nvidia revenue, data-center costs, and target margins to estimate how much annual revenue the ecosystem would need to justify the capital spending. (sequoiacap.com)

So the paradox is simple: the cost of one token keeps falling, but the number of tokens, calls, and supporting services per task keeps rising. In 2026, the budget fight is no longer just over training runs; it is over how often products ask models to think. (hai.stanford.edu) (openai.com)
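
To make the arithmetic behind the paradox concrete, here is a rough sketch in Python. It uses only the GPT-5.4 token prices, the $10 per 1,000 web-search fee, and the 50% batch reduction quoted in this article. Everything else is an assumption for illustration: the names `RateCard` and `task_cost` are hypothetical, the token and step counts are invented, and the sketch assumes the batch discount applies to token charges but not to search fees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Per-unit prices in US dollars (figures as quoted in the article)."""
    input_per_million: float    # dollars per 1M input tokens
    output_per_million: float   # dollars per 1M output tokens
    search_per_thousand: float  # dollars per 1,000 web-search calls


def task_cost(card, steps, input_tokens_per_step, output_tokens_per_step,
              searches_per_step=0, batch_discount=0.0):
    """Estimate the dollar cost of one task made of `steps` model calls.

    batch_discount is the fraction taken off token charges (0.5 for 50%).
    Assumption: the discount does not apply to search fees.
    """
    input_cost = steps * input_tokens_per_step * card.input_per_million / 1_000_000
    output_cost = steps * output_tokens_per_step * card.output_per_million / 1_000_000
    search_cost = steps * searches_per_step * card.search_per_thousand / 1_000
    return (input_cost + output_cost) * (1.0 - batch_discount) + search_cost


card = RateCard(input_per_million=2.50, output_per_million=15.00,
                search_per_thousand=10.00)

# Hypothetical workloads: step and token counts are made up for illustration.
single_call = task_cost(card, steps=1, input_tokens_per_step=2_000,
                        output_tokens_per_step=500)
agent_loop = task_cost(card, steps=12, input_tokens_per_step=8_000,
                       output_tokens_per_step=700, searches_per_step=1)
agent_batched = task_cost(card, steps=12, input_tokens_per_step=8_000,
                          output_tokens_per_step=700, searches_per_step=1,
                          batch_discount=0.5)

print(f"single call:        ${single_call:.4f}")
print(f"12-step agent loop: ${agent_loop:.4f}")
print(f"same loop, batched: ${agent_batched:.4f}")
```

Under these made-up inputs the 12-step loop costs roughly 39 times the single call even though the per-token price is identical, and the batch discount trims only the token portion of the bill. Real workloads will differ, and whether a given agent loop can use a batch endpoint at all depends on the product.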
