Routing models for throughput
Technical notes circulating this week claim that sparse Mixture‑of‑Experts setups (a 26B MoE with ~4B active parameters) can deliver roughly 7.9× the throughput of a comparable dense model, and that edge clearing or price‑weighted routing can cut latency and cost by directing traffic to the best endpoint. Those performance and routing numbers are being used to argue for hybrid fleets rather than single‑model stacks. (x.com) (x.com)
A language model is a prediction engine, and every extra parameter usually makes each token slower and more expensive to generate. A sparse design called Mixture of Experts changes that by activating only part of the model for each token instead of the whole network. (ai.google.dev)

Google DeepMind’s Gemma 4 line, released on April 2, 2026, shows that split clearly: the 26B A4B model has 26 billion total parameters, but only about 3.8 billion are active at a time, while the 31B model is dense and uses all of its parameters on every token. Google lists both as 256K-context models aimed at workstations and consumer graphics processors. (deepmind.google)

A recent benchmark repository using a DGX Spark system with NVIDIA GB10 and 121 gigabytes of video memory reported 12.52 tokens per second for Gemma 4 26B A4B text generation versus 3.40 for Gemma 4 31B dense. That works out to about 3.7 times the throughput in that setup, not 7.9 times. (github.com)

The basic idea is specialization. In a Mixture-of-Experts model, a router sends each token to a small subset of “experts,” so the system keeps the capacity of a large model without paying the full compute bill on every step. (ai.google.dev)

That same routing logic is now being applied above the model layer. Instead of sending every request to one application programming interface endpoint, teams are adding a gateway that can choose among providers, regions, or model types based on cost, latency, privacy, or safety rules. (arxiv.org)

The vLLM Semantic Router paper, posted to arXiv on February 23, 2026 and revised on March 6, says model routing has become a “critical systems challenge” as model choices multiply. Its framework routes requests across backends including vLLM, OpenAI, Anthropic, Azure, Bedrock, Gemini, and Vertex AI using request signals like context length, language, modality, and authorization level. (arxiv.org)

Production tools are already exposing simpler versions of that idea.
LiteLLM’s router documentation says it can load-balance across deployments and route by latency, cost, rate limits, retries, and fallbacks across providers such as Azure and OpenAI. (docs.litellm.ai)

That leaves operators with two layers of choice: sparse routing inside a model, and traffic routing across models or endpoints. The result is a hybrid fleet, where a cheap fast model handles easy requests, a larger dense model handles harder ones, and the gateway shifts traffic when one region gets slow or expensive. (arxiv.org)

There is still a tradeoff. Mixture-of-Experts systems can create communication and balancing overhead, and NVIDIA said in a February 2026 engineering post that expert-parallel communication in DeepSeek-V3 can account for more than 50 percent of training time without optimization. (developer.nvidia.com)

So the current argument is less “one model wins” than “one router decides.” Google’s April 2026 Gemma 4 release and the routing paper revised in March 2026 point to the same operating model: keep multiple engines available, then send each token or request to the cheapest capable path. (deepmind.google)
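
To make the "cheapest capable path" idea concrete, here is a minimal Python sketch of a request-level gateway. It is an illustration only, not the API of LiteLLM, the vLLM Semantic Router, or any other tool mentioned above. The endpoint names, prices, latencies, and the integer capability tier are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str                  # hypothetical label, not a real deployment
    cost_per_1k_tokens: float  # hypothetical price in USD
    latency_ms: float          # hypothetical recent median latency
    max_context: int           # largest prompt accepted, in tokens
    tier: int                  # hypothetical capability score; higher = harder requests


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    min_tier: int              # lowest capability tier that can handle the request
    latency_budget_ms: float


def pick_endpoint(request: Request, fleet: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the cheapest capable endpoint, or None if nothing qualifies."""
    # Step 1: keep only endpoints that can handle the request at all.
    capable = [
        e for e in fleet
        if e.tier >= request.min_tier and e.max_context >= request.prompt_tokens
    ]
    if not capable:
        return None

    # Step 2: among those inside the latency budget, take the cheapest
    # (ties broken by lower latency).
    within_budget = [e for e in capable if e.latency_ms <= request.latency_budget_ms]
    if within_budget:
        return min(within_budget, key=lambda e: (e.cost_per_1k_tokens, e.latency_ms))

    # Step 3: nothing meets the budget, so fall back to the fastest capable endpoint.
    return min(capable, key=lambda e: e.latency_ms)


# A made-up hybrid fleet: a cheap sparse model, a dense model, and a hosted model.
FLEET = [
    Endpoint("sparse-moe-local", 0.10, 300.0, 256_000, tier=1),
    Endpoint("dense-local", 0.40, 900.0, 256_000, tier=2),
    Endpoint("hosted-large", 2.00, 600.0, 128_000, tier=3),
]

easy = Request(prompt_tokens=2_000, min_tier=1, latency_budget_ms=1_000.0)
print(pick_endpoint(easy, FLEET).name)  # prints "sparse-moe-local"
```

In this sketch an easy request lands on the cheap sparse model, a harder one moves to the dense model, and a tight latency budget pushes traffic to whichever capable endpoint is currently fastest. A real gateway would refresh the latency and price fields from live measurements rather than hard-coding them.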