Throughput and token economics

A model training job was reported at about $5 million this week, underscoring that modern LLM training budgets can be large but finite. (x.com) Inference pricing examples also surfaced, showing around $5 per million input tokens and $25 per million output tokens, with caching saving roughly 90% and batching cutting costs by about 50% in tested setups. (x.com)

Training a large language model can cost millions of dollars, but serving it is metered token by token. Anthropic this week put one flagship example at $5 per million input tokens and $25 per million output tokens. (anthropic.com) A token is a chunk of text, not a whole word, and application developers are billed separately for what they send in and what the model sends back. OpenAI’s pricing docs say tokens are billed at each model’s input and output rates, and Anthropic’s Opus 4.7 page lists the $5 and $25 per million-token figures now circulating. (developers.openai.com) (anthropic.com)

Those prices drop when the same prompt is reused or the work is delayed. Anthropic says prompt caching can cut costs by up to 90%, and its batch processing option cuts prices by 50% for jobs that can wait. (platform.claude.com) (anthropic.com) Caching works like saving a repeated preface so the system does not reread it at full price every time. Anthropic’s docs say the default cache lasts 5 minutes, can be extended to 1 hour at a higher write cost, and applies to the full prompt prefix up to the marked cache point. (platform.claude.com)

Batching is the cheaper lane for work that is not interactive, such as evaluations, labeling, or overnight processing. OpenAI says its Batch API offers 50% lower costs with a 24-hour turnaround target, and Google says its Gemini Batch API is priced at 50% of standard cost with a similar target window. (developers.openai.com) (ai.google.dev)

That split between training cost and inference cost shapes how AI companies build products. A one-time training run can cost millions, but the ongoing business depends on how many tokens users consume and how much of that traffic can be cached, batched, or shifted to cheaper models. (arxiv.org) (openai.com) Public estimates for frontier training runs are already far above single-digit millions for the biggest systems. Epoch AI’s 2024 paper estimated GPT-4 and Gemini 1.0 Ultra in roughly the $30 million to $40 million range and projected that the largest runs could exceed $1 billion by 2027 if recent trends continue. (arxiv.org) (epoch.ai)

Inference pricing, though, is visible to every developer with an API bill. OpenAI’s current pricing page shows separate rates for input, cached input, and output, while Google’s Gemini docs and Anthropic’s Claude docs now market caching and batch discounts as standard cost controls rather than edge-case features. (openai.com) (ai.google.dev) (anthropic.com)

The practical result is that two apps using the same model can have very different margins. A chatbot that repeats a long system prompt across thousands of sessions can lean on caching, while a real-time tool that generates long answers pays much more for output tokens, the costliest line item in Anthropic’s published Opus pricing. (platform.claude.com) (anthropic.com)

That is why a reported $5 million training job and a $5-per-million-token inference price can both be true at the same time. One is the cost to build the model once; the other is the meter that keeps running every time someone uses it. (arxiv.org) (anthropic.com)
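
To make the per-token meter concrete, the arithmetic can be sketched in a few lines of Python. This is a rough illustration, not any vendor's billing logic: the $5 and $25 rates are the figures quoted above, while the `Pricing` class, the `request_cost` function, and the way the discounts are applied are assumptions made for the example. In particular, it treats cached input as billed at 10% of the normal input rate (the "up to 90%" best case), ignores any extra charge for writing to the cache, and assumes the 50% batch discount applies to the whole request and stacks with caching.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Pricing:
    """Illustrative rates in USD per million tokens (hypothetical container)."""
    input_per_mtok: float = 5.0         # input rate quoted in the article
    output_per_mtok: float = 25.0       # output rate quoted in the article
    cached_read_discount: float = 0.90  # assumption: best-case 90% off cached input
    batch_discount: float = 0.50        # assumption: flat 50% off the whole request


def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 batch=False, pricing=Pricing()):
    """Estimate the cost in USD of one request.

    cached_tokens is the portion of input_tokens served from the cache.
    Cache write charges are deliberately left out of this sketch.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")

    fresh_tokens = input_tokens - cached_tokens
    cached_rate = pricing.input_per_mtok * (1 - pricing.cached_read_discount)

    cost = (
        fresh_tokens * pricing.input_per_mtok
        + cached_tokens * cached_rate
        + output_tokens * pricing.output_per_mtok
    ) / TOKENS_PER_MILLION

    if batch:
        cost *= 1 - pricing.batch_discount
    return cost


if __name__ == "__main__":
    # A request with a 10,000-token prompt and a 1,000-token answer.
    print(f"standard: ${request_cost(10_000, 1_000):.4f}")
    print(f"cached:   ${request_cost(10_000, 1_000, cached_tokens=8_000):.4f}")
    print(f"batched:  ${request_cost(10_000, 1_000, batch=True):.4f}")
```

Under these assumptions, the same request costs about $0.075 at standard rates, about $0.039 when 8,000 of the prompt tokens come from the cache, and about $0.0375 when sent through the batch lane. Multiplied across thousands of sessions, that gap is the margin difference described above.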
