LLM bills leaking cash
Recent tooling audits show substantial waste in LLM spend: one profiler flagged $1,240 of a projected $2,847 monthly bill, about 43%, as avoidable. (x.com) Analysts warn that unconstrained agents can cost $5–$8 per task and that the API versus self-host breakeven sits near $10,000/month, so enforcing spend limits and routing decisions is becoming essential to avoid runaway costs. (x.com) (x.com)
The new AI budget problem is not that large language models are expensive. It is that most teams still do not know what they are paying for. A recent cost profiler made that painfully concrete. In one project, it projected a monthly LLM bill of $2,847 and flagged $1,240 of that as avoidable waste, or 43.5 percent. The waste was not exotic. It came from duplicate calls that should have been cached, retry loops that kept burning tokens after failures, bloated context windows, and expensive models doing work that cheaper ones could handle. (github.com)
That is the part many teams miss. LLM costs do not usually explode in one dramatic moment. They seep out through dozens of small design choices. A classifier that always returns a few words does not need a frontier model. A summarizer that sees the same document again and again should not pay full price every time. An agent that keeps its whole conversation history on every turn gets more expensive simply by continuing to exist. The meter runs on tokens, and tokens pile up faster than most product dashboards reveal. (github.com)
Agents make this worse because they turn one request into a chain of requests. The first call can be cheap. The fifth call is not. Stevens’ summary of agent economics describes the trap clearly: with multi-turn loops, context grows each round, so later steps carry the cost of earlier ones too. The result is quadratic token growth. The same piece also notes that a single LLM call might take less than a second, while an orchestrated agent flow with reflection can stretch to 10 to 30 seconds. More time often means more calls, more context, and more money. (online.stevens.edu)
That is how “helpful” turns into “wasteful.” The usual leaks are now familiar. Retry storms after malformed outputs. Prompt bloat from ever-longer system instructions. Redundant chains that ask two or three models to confirm what one model already knew. Routing failures that send simple tasks to premium models by default.
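The quadratic growth described above for multi-turn agents can be made concrete with a small sketch. It assumes a hypothetical agent that resends its entire history on every turn and adds a fixed number of tokens per turn. The function name and the 500-token figure are illustrative assumptions, not numbers from the cited sources.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total input tokens billed over `turns` when each turn resends all prior context."""
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # the history grows by one turn's worth of tokens
        total += context            # the whole history is billed again on this turn
    return total


# Billed input grows with turns * (turns + 1) / 2, not linearly with turns.
for n in (1, 5, 10, 20):
    print(n, cumulative_input_tokens(n))
```

Under these assumptions, going from 10 to 20 turns raises billed input from 27,500 to 105,000 tokens, nearly four times as much for twice the conversation length.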
Cost observability vendors now pitch this in the language finance teams understand: not tokens, but dollars at risk. One benchmark example from ZenLLM shows $18,400 in flagged monthly waste on a $62,400 AI spend, with model overkill, retry waste, prompt bloat, and duplicate chains driving the losses. (zenllm.io)
The fix is not mysterious either. It starts with measuring every request by feature, team, and model instead of staring at a provider invoice after the month is over. From there, the highest-return moves are usually dull engineering work: cache deterministic outputs, shorten prompts, cap retries, restrict access to costly models, and route simple jobs to smaller systems. Helicone says caching alone typically cuts costs by 15 to 30 percent for applications with repeated queries. TrueFoundry’s guidance is even more blunt: quotas, rate limits, and budget caps should be enforced before spend spikes, not explained afterward. (helicone.ai)
The providers themselves are quietly telling the same story through pricing. OpenAI now lists cached input tokens for GPT-5.4 at one-tenth the cost of standard input tokens. Anthropic says prompt caching can cut costs by up to 90 percent on Claude. Google’s Gemini API now includes both implicit and explicit context caching, with explicit caching priced separately to make repeated context cheaper than resending it every time. These are not edge features anymore. They are admissions that repeated context is one of the easiest ways to waste money in production. (openai.com)
This is also why the old argument about API versus self-hosting is getting sharper. The breakeven point matters only after a team has stopped lighting money on fire at the application layer. If your stack is riddled with retries, overpowered routing, and swollen prompts, moving the same bad behavior onto your own GPUs does not solve the problem. It just changes the bill.
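The measurement-first fix described above, attributing every request to a feature, team, and model and enforcing a cap before spend spikes, might look roughly like the sketch below. Everything in it is an assumption for illustration: the model names, the per-token prices (with cached input set at one-tenth of standard input), and the `CostLedger` class are hypothetical and do not come from any provider's SDK.

```python
from collections import defaultdict

# Hypothetical USD prices per million tokens; replace with your provider's real rates.
PRICE_PER_MTOK = {
    "small-model": {"input": 0.25, "cached_input": 0.025, "output": 1.00},
    "large-model": {"input": 2.50, "cached_input": 0.25, "output": 10.00},
}


class BudgetExceeded(RuntimeError):
    """Raised before a request when the spend cap is already used up."""


class CostLedger:
    def __init__(self, monthly_cap_usd: float) -> None:
        self.monthly_cap_usd = monthly_cap_usd
        self.spend = defaultdict(float)  # (feature, team, model) -> USD spent

    def total(self) -> float:
        return sum(self.spend.values())

    def ensure_budget(self) -> None:
        """Call before each request so the cap blocks spend rather than explaining it later."""
        if self.total() >= self.monthly_cap_usd:
            raise BudgetExceeded(f"cap of ${self.monthly_cap_usd:.2f} reached")

    def charge(self, feature: str, team: str, model: str,
               input_tokens: int, output_tokens: int,
               cached_input_tokens: int = 0) -> float:
        """Record one request's cost under its (feature, team, model) key and return it."""
        price = PRICE_PER_MTOK[model]
        fresh_input = input_tokens - cached_input_tokens
        cost = (
            fresh_input * price["input"]
            + cached_input_tokens * price["cached_input"]
            + output_tokens * price["output"]
        ) / 1_000_000
        self.spend[(feature, team, model)] += cost
        return cost


ledger = CostLedger(monthly_cap_usd=0.15)
ledger.ensure_budget()  # passes: nothing has been spent yet
cost = ledger.charge("classifier", "search", "large-model",
                     input_tokens=100_000, output_tokens=10_000,
                     cached_input_tokens=80_000)
print(f"{cost:.2f}")
```

Keying spend by feature, team, and model is what lets a dashboard answer "which feature is burning money" instead of only "how big is the invoice." In this toy run a second call to `ledger.ensure_budget()` would raise `BudgetExceeded`, because the single request already exceeded the deliberately tiny cap.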
The profiler example makes that visible in one line item: 723 of 847 weekly classifier calls were exact duplicates, and the suggested fix was a cache decorator worth about $310 a month. (github.com)
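A cache decorator of the kind the profiler suggested could be sketched as follows. This is not the profiler's own code: the decorator name, the in-memory dictionary store, and the stand-in `classify` function are assumptions for illustration, and the approach only makes sense for calls whose output is deterministic for a given input.

```python
import functools
import hashlib
import json


def cache_llm_call(fn):
    """Memoize a deterministic LLM call on its exact arguments (in-memory, per process)."""
    store = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Hash the full argument set so only exact duplicates share a cache entry.
        raw = json.dumps([args, kwargs], sort_keys=True, default=str)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in store:
            store[key] = fn(*args, **kwargs)  # only cache misses reach the provider
        return store[key]

    return wrapper


stats = {"provider_calls": 0}


@cache_llm_call
def classify(text: str, model: str = "small-model") -> str:
    # Hypothetical stand-in for a real provider call; counts how often it runs.
    stats["provider_calls"] += 1
    return "billing" if "invoice" in text.lower() else "other"


for _ in range(3):
    label = classify("Invoice overdue for March")
print(label, stats["provider_calls"])
```

Applied to the profiler's figures, and assuming every exact duplicate hits the cache, only 124 of the 847 weekly classifier calls (847 minus 723) would still reach the provider. A production version would likely need a shared store and an expiry policy, which this sketch leaves out.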