Inference costs plunged

- Community discussions report inference costs fell roughly 1000x since 2022, shifting bottlenecks toward architecture and context.
- Reported figures indicate a drop from about $20 to roughly $0.40 per million tokens for inference workloads.
- Practitioners note selective memory techniques can cut agent runtime costs by about 90%, materially changing deployment economics.

Running a language model has gotten cheap fast: developers now talk about GPT-4-level inference at roughly $0.40 per million tokens, down from about $20 in late 2022. (introl.com) A token is a chunk of text, and inference is the part users actually pay for when a model reads a prompt and writes back an answer. Epoch AI reported in March 2025 that the price to hit specific performance milestones had been falling by 9x to 900x a year, depending on the task. (epoch.ai)

The new floor is visible in public price sheets. OpenAI lists GPT-4o mini at $0.15 per million input tokens and $0.60 per million output tokens, while Google lists Gemini 3.1 Flash-Lite Preview as its “most cost-efficient” model and prices Gemini 3 tiers by the million-token block. (developers.openai.com, ai.google.dev)

Cheaper tokens have changed what engineers optimize for. Instead of asking only which model is smartest, teams now spend more time deciding how much context to send, what to cache, and what an agent should remember between calls. (docs.langchain.com, mem0.ai)

That shift shows up in vendor tooling. Google says context caching can cut repeated-input costs by 90%, and Anthropic says Claude Haiku 4.5 can deliver “up to 90%” savings with prompt caching and 50% savings with batch processing. (ai.google.dev, anthropic.com)

Memory, in plain terms, is a way to avoid stuffing the whole conversation back into the model every time. LangChain’s docs describe long-term memory as information that persists across conversations, with agents loading only the files or “skills” they need when they need them. (docs.langchain.com)

Benchmarks now try to measure that trade-off directly. A 2026 survey from Mem0 says the LOCOMO benchmark compares memory systems on recall, token use, and latency, and an Atlan review cites one setup that cut token cost by about 90% while also reducing p95 latency from 17.12 seconds to 1.44 seconds, with lower accuracy. (mem0.ai, atlan.com)

The price drop has not been uniform across the market. Google charges a premium for priority inference, discounts Flex and Batch by 50%, and bills separately for caching storage; Anthropic and OpenAI also keep higher prices on larger or more capable models than on their smallest high-volume offerings. (ai.google.dev, platform.claude.com, developers.openai.com)

The result is a different kind of bottleneck. Tokens are cheaper than they were three years ago, but architecture choices (what to send, what to store, and what to skip) now decide much more of the bill. (epoch.ai, docs.langchain.com)
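
To make the "what to send, what to store, and what to skip" point concrete, here is a rough Python sketch of the billing arithmetic. It uses the GPT-4o mini list prices quoted above ($0.15 input, $0.60 output per million tokens). Everything else is an assumption for illustration: the 90% cached-input discount is borrowed from the "up to 90%" vendor claims rather than from any specific model's rate card, the turn sizes and memory budget are invented, cache storage fees are ignored, and the names `Pricing`, `call_cost`, and `conversation_cost` are hypothetical, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens."""
    input_per_m: float
    output_per_m: float
    cached_discount: float = 0.0  # fraction taken off cached input tokens


def call_cost(p: Pricing, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Dollar cost of one call; cached_tokens is the discounted part of the input."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_tokens
    cached_rate = p.input_per_m * (1.0 - p.cached_discount)
    total = (fresh * p.input_per_m
             + cached_tokens * cached_rate
             + output_tokens * p.output_per_m)
    return total / 1_000_000


def conversation_cost(p: Pricing, turns: int, user_tokens: int,
                      reply_tokens: int, strategy: str,
                      memory_budget: int = 0) -> float:
    """Total cost of a multi-turn chat under one context strategy.

    "full":      resend the entire history at the normal input price
    "cached":    resend the entire history, billed at the cached rate
    "selective": send at most memory_budget tokens of stored history
    """
    total = 0.0
    for turn in range(turns):
        # Tokens accumulated by all earlier user messages and replies.
        history = turn * (user_tokens + reply_tokens)
        if strategy == "full":
            sent, cached = history, 0
        elif strategy == "cached":
            sent, cached = history, history
        elif strategy == "selective":
            sent, cached = min(history, memory_budget), 0
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        total += call_cost(p, sent + user_tokens, reply_tokens, cached)
    return total


# List prices quoted in the article; the 90% discount is an assumption.
PRICES = Pricing(input_per_m=0.15, output_per_m=0.60, cached_discount=0.90)

for name in ("full", "cached", "selective"):
    cost = conversation_cost(PRICES, turns=20, user_tokens=200,
                             reply_tokens=300, strategy=name,
                             memory_budget=400)
    print(f"{name:<9} ${cost:.6f}")
```

In this toy model the input bill for the "full" strategy grows with the square of the number of turns, because every call resends everything before it. Caching and selective memory both attack that term, which is why the context strategy can matter more than the per-token price. The sketch does not model the accuracy loss that the Atlan review associates with aggressive memory trimming.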
