Routing papers: cheaper multi‑LLM stacks

Researchers and practitioners are sharpening routing strategies that send each request to the cheapest model that can do the job. Threads this week pointed to papers such as FrugalGPT and RouteLLM, and to frameworks such as LangGraph that implement cascades and multi‑LLM routing. (Muratcan Koylan listed routing papers including FrugalGPT and RouteLLM; another thread flagged LangGraph as an example framework.) (x.com) (x.com)

Most companies still do the simplest thing in artificial intelligence today: every prompt goes to one big model, even when half the prompts are as easy as “summarize this email.” A routing system changes that by acting like a call-center triage desk, sending easy jobs to a cheap model and only escalating hard ones to an expensive one. (arxiv.org)

That idea showed up in research before it became a product pattern. The 2023 paper FrugalGPT from Stanford laid out three levers (prompt adaptation, model approximation, and model cascades) for cutting large language model costs without taking a blanket quality hit. (arxiv.org) A model cascade is the easiest piece to picture. One model answers first, a checker estimates whether that answer is good enough, and only uncertain cases get passed upward to a stronger model. (ar5iv.labs.arxiv.org) FrugalGPT’s headline result is why people still cite it. The paper says its cascade could match the best single model with up to a 98 percent cost reduction, or beat GPT-4 accuracy by 4 percent at the same cost on the tasks it tested. (arxiv.org)

The newer paper RouteLLM pushes the same basic goal in a different direction. Instead of asking several models in sequence, it trains a router to choose the likely winner up front, using preference data that compares strong and weak models on the same prompt. (arxiv.org) That difference matters because a cascade can spend money twice on one request. A learned router tries to make one decision at the door, which the RouteLLM authors pitch as a way to keep quality close to a strong model while avoiding the repeated calls that make cascades expensive. (arxiv.org)

The economics underneath this have only gotten sharper as model menus have widened. OpenAI’s current pricing page shows GPT-5.4 at $2.50 per 1 million input tokens and $15.00 per 1 million output tokens, while GPT-5.4 nano is listed at $0.20 input and $1.25 output, a gap big enough to make routing worth engineering. (openai.com) Vendors are also making small models less embarrassing. OpenAI’s GPT-4.1 launch post said GPT-4.1 mini beat GPT-4o on many benchmarks while cutting cost by 83 percent, which means the “cheap first, expensive only when needed” playbook now has better cheap options than it did a year ago. (openai.com)

That is why this moved from papers into frameworks. LangGraph’s documentation centers on nodes and conditional edges, where a function decides what runs next based on the current state, which is exactly the plumbing you need for “send this prompt to model A, and only route to model B if confidence is low.” (docs.langchain.com) LangGraph’s workflow guide makes the split explicit: fixed workflows for predictable paths, and agents for dynamic paths. In practice, teams use that to build supervisors, reviewers, and fallback branches so a cheap model can draft, a stronger model can rescue failures, and the whole path stays inspectable in code. (docs.langchain.com)
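To make the cascade idea concrete, here is a minimal Python sketch of "cheap first, escalate only when the checker is unsure." Everything in it is a stand-in: the `Tier` class, the stub models, the `toy_score` checker, and the whitespace token counter are hypothetical, and the per-token prices are simply the figures quoted above. It does not use the FrugalGPT scorer, the RouteLLM router, or any LangGraph API; a real system would swap in actual model calls and a learned quality estimate.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One model in the cascade (hypothetical structure, for illustration)."""
    name: str
    input_price: float   # dollars per 1M input tokens
    output_price: float  # dollars per 1M output tokens
    generate: Callable[[str], str]


def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def call_cost(tier: Tier, prompt: str, answer: str) -> float:
    """Dollar cost of a single call at the tier's per-million-token prices."""
    return (count_tokens(prompt) * tier.input_price
            + count_tokens(answer) * tier.output_price) / 1_000_000


def cascade(prompt: str, tiers: List[Tier],
            score: Callable[[str, str], float],
            threshold: float) -> Tuple[str, str, float]:
    """Try tiers from cheapest to strongest; stop at the first good-enough answer.

    Returns (answer, name of the tier that answered, total dollars spent).
    """
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    total = 0.0
    for i, tier in enumerate(tiers):
        answer = tier.generate(prompt)
        total += call_cost(tier, prompt, answer)  # every attempt is paid for
        is_last = i == len(tiers) - 1
        if is_last or score(prompt, answer) >= threshold:
            return answer, tier.name, total
    raise AssertionError("unreachable: the last tier always returns")


# --- Hypothetical stand-ins for real model calls and a real checker ---
def small_model(prompt: str) -> str:
    return "short summary" if prompt.startswith("summarize") else "UNSURE"


def large_model(prompt: str) -> str:
    return "detailed answer"


def toy_score(prompt: str, answer: str) -> float:
    # A real checker would estimate answer quality; this one only spots refusals.
    return 0.0 if answer == "UNSURE" else 1.0


TIERS = [
    Tier("small", 0.20, 1.25, small_model),   # prices as quoted in the text
    Tier("large", 2.50, 15.00, large_model),
]

if __name__ == "__main__":
    for p in ["summarize this email", "prove this lemma"]:
        answer, used, cost = cascade(p, TIERS, toy_score, threshold=0.5)
        print(f"{p!r} -> {used}: {answer!r} (${cost:.8f})")
```

Note that the escalated request pays for both calls, which is the "spend money twice" cost that a learned up-front router tries to avoid.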
