Stanford flags unpredictable agent costs

- Stanford’s Digital Economy Lab highlighted a new paper on May 5 showing AI coding agents burn far more tokens than chat or reasoning systems.
- In tests on SWE-bench Verified, identical agent-task runs swung by as much as 30x, and some models used 1.5 million more tokens than GPT-5.
- That makes agent pricing messy: more spend does not reliably buy better results, and even models misjudge their own costs.

AI agent cost is turning into its own product problem. Not because tokens are expensive in the abstract, but because agent runs behave less like a meter and more like a slot machine. That is the real news in Stanford Digital Economy Lab’s May 5 write-up of a new paper from Stanford, MIT, Michigan, Google DeepMind, Microsoft AI, and All Hands AI. The team looked at coding agents, and the punchline is simple: the same kind of task can trigger wildly different token bills, with weak links between spending more and succeeding more. (digitaleconomy.stanford.edu)

### Why are agents so much pricier than normal chat?

A normal chat exchange is mostly one prompt and one answer. An agent is different. The model reads the task, takes an action, gets feedback, then rereads the growing pile of prompt, tool output, logs, and prior steps before it acts again. That creates a context s[…]t. Most of that cost came from input tokens, not output. (digitaleconomy.stanford.edu)

### What did the researchers actually test?

They analyzed trajectories from eight frontier models on SWE-bench Verified, a benchmark built around real software bug-fixing tasks. So this was not a toy “write a haiku” setup. It was a closer look at the kind of agent loop people actually want to pay for: inspect code, try fixes, read errors, try again. (arxiv.org)

### Where does the unpredictability show up?

In repeat runs. The paper says the same agent on the same task could vary by up to 30x in total token use. That means cost is not just model-dependent or task-dependent. It is also path-dependent: the exact sequence of actions, retries, and context growth changes the bill. Stanford’s write-up makes the same point bluntly: you often on[…]r. (arxiv.org) (digitaleconomy.stanford.edu)

### Does spending more at least buy better results?

Not reliably. One of the uglier findings is that higher token usage did not map cleanly to higher accuracy. Performance often peaked at intermediate cost and then flattened out at higher spend. Basically, some agents are not “thinking harder” when they burn more tokens. They are just wandering longer. (arxiv.org)

### Which models looked more efficient?

The paper says model choice mattered a lot. On the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consumed, on average, more than 1.5 million additional tokens versus GPT-5. That does not mean one model is universally better at everything. But it does mean “pick a strong model” is not enough guidance if your workflow economics matter. (arxiv.org)

### Can humans estimate which tasks will be expensive?

Not very well. Human ratings of task difficulty only weakly matched actual token cost. That fits a broader pattern in agent systems: something easy for a person can be awkward for an agent, while something tedious for a person may be straightforward for the model. Stanford’s lab framed this as a Moravec’s Paradox-style mismatch. (arxiv.org) (digitaleconomy.stanford.edu)

### Can the models predict their own cost?

Also not well. The paper says frontier models showed only weak-to-moderate ability to predict their own token use, with correlations up to 0.39, and they systematically underestimated the real total. That is a big deal for anyone trying to offer fixed-price agent workflows or “pay per completed task” products. If the agent cannot forecast its own burn, the vendor is guessing too. (arxiv.org)

### So what matters now?

The bottom line is not “agents are too expensive.” It is that agent cost is unstable, opaque, and tightly tied to system design. Startups can’t treat token spend as a back-office detail anymore. They need budgets, caps, observability, and pricing that assumes variance, because the paper suggests variance is not a bug around the edges. It is part of how current agents work. (digitaleconomy.stanford.edu)
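
To make the mechanics concrete, here is a toy sketch of the loop described above, where every step rereads the whole accumulated context, plus a hard token cap of the kind the closing section argues for. It is an illustration only: the function name `simulate_run` and all token counts are made up for the example and do not come from the paper or from any vendor's API.

```python
def simulate_run(prompt_tokens, output_per_step, tool_tokens_per_step,
                 max_steps, token_budget=None):
    """Toy model of an agent loop (hypothetical numbers, not paper data).

    Each step re-reads the entire context as input, emits some output,
    and then appends that output plus tool feedback to the context.
    """
    context = prompt_tokens
    total_input = 0
    total_output = 0
    steps = 0
    for _ in range(max_steps):
        step_cost = context + output_per_step
        # Hard cap: refuse to start a step that would exceed the budget.
        if token_budget is not None and \
                total_input + total_output + step_cost > token_budget:
            break
        total_input += context            # the whole context is billed as input
        total_output += output_per_step
        context += output_per_step + tool_tokens_per_step
        steps += 1
    return {"steps": steps,
            "input_tokens": total_input,
            "output_tokens": total_output}


# Same task setup, two different paths: a 10-step run and a 40-step run.
short_run = simulate_run(2_000, 200, 800, max_steps=10)
long_run = simulate_run(2_000, 200, 800, max_steps=40)
capped_run = simulate_run(2_000, 200, 800, max_steps=40, token_budget=100_000)

for name, run in [("short", short_run), ("long", long_run), ("capped", capped_run)]:
    print(name, run)
```

In this toy setup, four times as many steps costs roughly thirteen times as many input tokens, because input grows with the square of the step count while output grows only linearly. That is one plausible reading of why input tokens dominate and why a few extra retries can move the bill so much; the cap simply stops the run before the next step would cross the budget.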
