Hidden 43% LLM spend leak
- John Medina published a DEV Community post on April 24 arguing many teams waste about 43% of large language model API budgets because provider billing dashboards show totals, not operational cost drivers.
- Medina says the biggest leaks are retry storms, duplicate calls, context bloat and oversized model choices, and argues teams need per-user, per-model and per-feature attribution to find them.
- The claim lands as OpenAI and Anthropic push prompt caching and token-level pricing, making cache hits, prompt structure and model routing central to controlling spend. (developers.openai.com)

A DEV Community post published April 24 argues many teams are wasting roughly 43% of their large language model API budgets because they only see aggregate provider bills. (dev.to)

John Medina, the post’s author, says he analyzed usage across teams and found four recurring leaks: retry storms, duplicate calls, context bloat and wrong-model selection. He frames the problem as missing attribution by user, model and feature, not just high token prices. (dev.to)

His example of a retry storm is an agent that keeps re-asking for valid JSON after a failed response, turning one interaction into dozens of paid calls. He also points to wrappers that resend long chat histories or repeated prompts that should have been cached. (dev.to)

That framing matches how model vendors bill. OpenAI says prompt caching works automatically on recent models, can cut latency by up to 80% and reduce input-token costs by up to 90% when requests share an exact prefix. (developers.openai.com)

OpenAI’s documentation says caching applies to prompts of 1,024 tokens or more, and cached prefixes generally stay active for 5 to 10 minutes of inactivity and at most one hour. It also says that sending more than roughly 15 requests per minute for the same prefix can cause overflow that reduces cache effectiveness. (developers.openai.com)

Anthropic’s pricing page breaks this out even more explicitly. For Claude Sonnet 4.6, base input tokens are priced at $3 per million, cache hits at $0.30 per million, and output tokens at $15 per million. (platform.claude.com)

That means the same workflow can have very different costs depending on whether prompts are repeated cleanly enough to hit cache, whether the app retries, and whether a cheaper model could handle the task. A single monthly bill will not show which of those choices drove the spend. (dev.to) (developers.openai.com) (platform.claude.com)

Medina’s prescription is to measure cost per successful workflow, not just cost per token.
In practice, that means logging which request belonged to which feature, team, user and model, then comparing cost against completed outcomes instead of raw usage. (dev.to)

He also uses the post to promote LLMeter, an open-source tool he says tracks costs per model, per user and per day across providers including OpenAI, Anthropic, DeepSeek and OpenRouter. The article does not publish the underlying dataset behind the 43% figure. (dev.to)

The narrower takeaway is less about one universal 43% number than about where hidden spend accumulates: repeated prompts, long context windows, retries and premium models used for cheap tasks. Those are all line items the providers already expose in token accounting, but not in a way that maps cleanly to business workflows. (dev.to) (openai.com)

So the “hidden leak” is not a new fee from OpenAI or Anthropic. It is the gap between what the invoice totals say and what the application actually did to generate them. (dev.to)
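
To make the cost-per-successful-workflow idea concrete, here is a minimal Python sketch. It is not Medina's method or LLMeter's code. The record fields, the function names and the "claude-sonnet-4.6" price-table key are hypothetical labels chosen for illustration; the only figures taken from the article are the Claude Sonnet 4.6 prices quoted above ($3, $0.30 and $15 per million tokens). It assumes each logged call is tagged with the feature and workflow it served, and with whether that workflow eventually completed.

```python
from collections import defaultdict
from dataclasses import dataclass

# Prices per million tokens as quoted in the article for Claude Sonnet 4.6.
# The dictionary key is an illustrative label, not an official model ID.
PRICES_PER_MTOK = {
    "claude-sonnet-4.6": {"input": 3.00, "cache_hit": 0.30, "output": 15.00},
}


@dataclass
class CallRecord:
    """One logged API call (hypothetical schema)."""
    workflow_id: str
    feature: str
    user: str
    model: str
    input_tokens: int        # input tokens billed at the base rate
    cached_tokens: int       # input tokens billed at the cache-hit rate
    output_tokens: int
    workflow_succeeded: bool  # did the workflow this call served complete?


def call_cost(rec: CallRecord) -> float:
    """Dollar cost of a single call under the price table above."""
    p = PRICES_PER_MTOK[rec.model]
    total = (
        rec.input_tokens * p["input"]
        + rec.cached_tokens * p["cache_hit"]
        + rec.output_tokens * p["output"]
    )
    return total / 1_000_000


def cost_per_successful_workflow(records):
    """Total spend per feature divided by its distinct completed workflows.

    Retries and failed attempts still count toward spend, so a retry storm
    raises the figure. Features with no completed workflow map to None.
    """
    spend = defaultdict(float)
    succeeded = defaultdict(set)
    for rec in records:
        spend[rec.feature] += call_cost(rec)
        if rec.workflow_succeeded:
            succeeded[rec.feature].add(rec.workflow_id)
    return {
        feature: (spend[feature] / len(succeeded[feature])
                  if succeeded[feature] else None)
        for feature in spend
    }


if __name__ == "__main__":
    # One clean call versus one workflow that needed three attempts.
    records = [
        CallRecord("w1", "summarize", "u1", "claude-sonnet-4.6",
                   10_000, 0, 1_000, True),
    ] + [
        CallRecord("w2", "extract", "u2", "claude-sonnet-4.6",
                   2_000, 8_000, 500, True)
        for _ in range(3)
    ]
    for feature, cost in cost_per_successful_workflow(records).items():
        print(feature, cost)
```

Dividing by completed workflows, not by calls, is the design choice that matters here: three attempts at one extraction show up as one outcome that cost three times as much, which an invoice total alone would not reveal.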