Routing can cut model bills
Developers report big savings by routing easy requests to small models and reserving heavy models for complex prompts—one thread claims routing 70% of workload to Flash/Haiku/Mini trims an estimated $400/month GPT‑4o bill to about $25/month (roughly a 16x saving). (x.com) Others show tiered approaches—Arch‑Router‑1.5B reportedly cut costs about 50% by escalating only the prompts that need larger models—proof that prompt complexity tiering is a practical lever. (x.com)
The cheapest way to use a powerful language model is often not to use it. That is the point behind routing, the now-common trick of sending simple prompts to a small, cheap model and escalating only the hard ones to something larger. The idea sounds obvious. The surprise is how much money it can save in practice, because the price gap between model tiers is now enormous. OpenAI’s GPT‑4o mini is priced at $0.15 per million input tokens and $0.60 per million output tokens, while GPT‑4o is listed at $2.50 and $10.00 for the same units. That is a 16.7× gap on input and output before any other optimizations kick in (openai.com, openai.com). That price spread is what makes the viral routing examples plausible. One developer thread claimed that sending roughly 70 percent of traffic to small models such as Flash, Haiku, or Mini could cut an estimated monthly GPT‑4o bill from about $400 to about $25. The exact number depends on prompt length, output length, and which “Flash” or “Haiku” model is doing the work, so the thread is an anecdote, not an audit. But the underlying math is real. Anthropic says Claude Haiku 4.5 starts at $1 per million input tokens and $5 per million output tokens, still far below a flagship model, and Google’s Gemini pricing page similarly separates low-cost Flash models from pricier top-tier ones (anthropic.com, platform.claude.com, ai.google.dev). Once those price tiers exist, the next question is who decides what counts as “hard.” Some teams do it with simple rules. A short classification request goes to a mini model. A long prompt with code, tools, or multiple steps gets promoted. Others use a dedicated router model that reads the prompt first and picks the cheapest model likely to do the job well. Amazon now sells this idea directly through Bedrock Intelligent Prompt Routing, which uses one endpoint to choose between models in the same family based on predicted response quality for each request, with the stated goal of optimizing both quality and cost (docs.aws.amazon.com). That matters because routing is no longer just a hack from indie builders on X. It is becoming infrastructure. OpenRouter offers provider routing that can prioritize price or throughput across model providers, turning cost-aware selection into a default platform feature instead of a hand-built sidecar (openrouter.ai). The same shift is happening inside research. A 2025 paper from Katanemo introduced Arch‑Router, a 1.5 billion-parameter routing model designed to map prompts to preferred model choices by domain and action. The paper’s claim is not merely that routing saves money. It argues that a small router can outperform larger proprietary models at the routing task itself, because choosing the right model is a different problem from answering the prompt (arxiv.org, huggingface.co). That distinction is easy to miss. Developers tend to think about model quality as one ladder, from weak to strong. Routing treats the stack more like a toolbox. A cheap model can classify intent, rewrite text, extract fields, or answer routine questions well enough that using a flagship model is wasteful. OpenAI’s own model docs describe GPT‑4o mini as a “fast, affordable small model for focused tasks,” and recommend large models for broader or more complex work (openai.com, openai.com). And routing is only one lever. Prompt caching cuts repeated input costs on OpenAI models by 50 percent when the prefix has been seen recently, and Batch API processing cuts both inputs and outputs by 50 percent for asynchronous jobs. Those discounts stack with model selection. A system that routes easy prompts to a mini model, reuses long shared prefixes, and batches background work is not shaving pennies. It is changing the economics of building with LLMs from “every request hits the expensive brain” to “most requests never needed it” (openai.com, openai.com).