Route 80% of queries to cheap models

- Teams building language model apps are increasingly routing routine prompts to cheaper models first, then escalating only harder requests to premium systems.
- OpenAI’s pricing shows why: GPT-5.4 mini input costs $0.75 per million tokens versus $2.50 for GPT-5.4, a roughly 70% gap.
- The pattern mirrors academic router research and vendor guidance favoring mini models for simpler work. (lmsys.org)

Companies trying to rein in language-model bills are increasingly sending most prompts to cheaper “mini” models and reserving premium models for harder work. (openai.com) (developers.openai.com)

The basic idea is triage: classify a request as simple or complex, answer the simple ones with a low-cost model, and escalate the rest. Researchers at LMSYS call that “LLM routing” and frame it as a way to balance cost against response quality. (lmsys.org) (arxiv.org)

The cost gap between model tiers is now large enough to make routing attractive on its own. OpenAI lists GPT-5.4 at $2.50 per million input tokens and $15.00 per million output tokens, while GPT-5.4 mini is $0.75 and $4.50, and GPT-5.4 nano is $0.20 and $1.25. (openai.com)

Anthropic’s pricing points in the same direction. Claude Sonnet 4.6 is listed at $3 per million input tokens and $15 per million output tokens, while Claude Haiku 4.5 is $1 and $5. (platform.claude.com)

Vendors are also spelling out which jobs belong on smaller models. OpenAI says its large GPT models perform better broadly, while mini models are “fast and inexpensive for simpler tasks,” and its GPT-4o mini page pitches the model for focused work like classification, keyword extraction, translation, and tag generation. (openai.com) (developers.openai.com)

That means a support bot, document tagger, or first-pass coding assistant does not need the same model that handles a multi-step legal analysis or a difficult debugging session. The router’s job is to decide which requests are routine before the expensive model ever gets called. (lmsys.org) (developers.openai.com)

Academic results suggest the savings can be real, though they depend on the benchmark and the model pair. LMSYS reported cost reductions of more than 85% on MT Bench, 45% on MMLU, and 35% on GSM8K while maintaining about 95% of GPT-4 performance in its tests. (lmsys.org)

The underlying research paper describes routers as lightweight systems trained to choose between a stronger and weaker model during inference. In its evaluations, the authors said routing cut costs by more than a factor of two in some settings without reducing response quality. (arxiv.org)

Routing is often paired with a second cost lever: prompt caching. OpenAI says prompt caching can cut time-to-first-token latency by up to 80% and input token costs by up to 90% when requests share the same long prefix. (developers.openai.com 1) (developers.openai.com 2)

Caching and routing solve different problems. Caching makes repeated prompts cheaper; routing decides whether a prompt needs an expensive model in the first place. (developers.openai.com) (lmsys.org)

The result is a more selective stack than the old default of sending every request to the best available model. As providers widen the price spread between flagship, mini, and nano systems, that selectivity is becoming part of normal application design. (openai.com) (platform.claude.com)
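
As a rough illustration of the triage idea and the pricing math, the sketch below routes a prompt with a crude keyword-and-length check and estimates the average cost per request when a given share of traffic goes to the mini tier. It is a sketch under stated assumptions, not how LMSYS or any vendor implements routing: the published routers are trained models, whereas the hint list, the word limit, and the function names `route`, `request_cost`, and `blended_cost` are all hypothetical. The dictionary keys are labels for this example only, and the prices are the per-million-token figures quoted above.

```python
# USD per million tokens as (input, output), taken from the figures quoted above.
# The keys are labels for this sketch, not confirmed API model identifiers.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-mini": (0.75, 4.50),
}

# Hypothetical signals that a request is "hard". A real router would be a
# trained classifier, not a substring list.
HARD_HINTS = ("debug", "legal", "contract", "stack trace", "multi-step")


def route(prompt: str, max_simple_words: int = 60) -> str:
    """Pick a tier: premium if the prompt is long or contains a hard hint."""
    text = prompt.lower()
    too_long = len(text.split()) > max_simple_words
    has_hint = any(hint in text for hint in HARD_HINTS)
    return "gpt-5.4" if (too_long or has_hint) else "gpt-5.4-mini"


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(cheap_share: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when cheap_share of traffic uses the mini tier.

    Ignores the cost of the router itself and of any request that is
    answered by the mini tier and then retried on the premium tier.
    """
    cheap = request_cost("gpt-5.4-mini", input_tokens, output_tokens)
    premium = request_cost("gpt-5.4", input_tokens, output_tokens)
    return cheap_share * cheap + (1 - cheap_share) * premium


if __name__ == "__main__":
    print(route("Tag this ticket: printer offline"))
    print(route("Help me debug this race condition"))
    print(f"{blended_cost(0.8, 1000, 500):.4f}")
```

Under these assumptions, a request with 1,000 input tokens and 500 output tokens costs $0.0100 on GPT-5.4 and $0.0030 on GPT-5.4 mini. Sending 80% of such requests to the mini tier gives an average of about $0.0044 per request, roughly 56% less than sending everything to the premium model. Misrouted requests that have to be retried on the premium tier would shrink that figure.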
