Routing trick doubles throughput

- Cache-aware prefix routing for KV caches raised serving throughput by about 108% compared with round-robin.
- Benchmarks also show decoding is memory-bound at low batch sizes, limiting naive scale-ups.
- These results push teams toward smarter request routing and memory budgeting instead of just increasing batch size.

Large language model servers got a simple lesson from new routing tests: sending each request to the machine that already holds its conversation history can roughly double output on the same hardware. (llm-d.ai) (docs.dynamo.nvidia.com)

The underlying trick is the key-value cache, a running memory of earlier tokens that lets a model avoid recalculating the same prompt over and over. NVIDIA’s Dynamo router says its key-value mode scores workers by cache overlap and current decode load, instead of cycling requests with round-robin. (baseten.co) (docs.dynamo.nvidia.com)

In September 2025, llm-d, a Kubernetes-based serving project backed by contributors from IBM, Red Hat, Google and Alibaba Cloud, published benchmarks showing “double the throughput” from precise prefix-cache-aware scheduling on identical hardware. Its guide says the setup tracks live vLLM cache events from serving instances rather than guessing from traffic patterns. (llm-d.ai) (github.com)

Those gains show up in workloads with repeated prefixes, like chatbots that reuse the same system prompt or agents that revisit long context windows. Baseten said on March 16, 2026 that it often sees about 2x faster time to first token in production after adopting key-value-cache-aware routing with NVIDIA Dynamo. (baseten.co) (arxiv.org)

The second result is less flattering for brute-force scaling: generation is often limited by memory traffic, not raw math. The vLLM docs describe decode as memory-bound and prefill as compute-bound, and Stanford’s Hydragen paper says decode spends its time reading large key-value caches from memory during generation. (docs.vllm.ai) (arxiv.org)

That means bigger batches do not automatically fix slow serving, especially at low concurrency. DigitalOcean’s April 2026 benchmarking guide says decode happens one token at a time and is “strictly memory-bound,” while vLLM warns that pushing concurrency too far can trigger cache preemption and recomputation when GPU memory fills up. (digitalocean.com) (docs.vllm.ai)

Serving stacks are starting to adapt around that split. vLLM enables chunked prefill to mix compute-heavy prompt ingestion with memory-heavy decoding, and Dynamo supports separate prefill and decode worker pools for larger deployments. (docs.vllm.ai) (docs.dynamo.nvidia.com)

The practical change is in what operators tune first. Instead of treating the load balancer as a neutral traffic cop, newer systems treat it as part of the inference engine, with routing, cache locality and memory budgets deciding how many useful tokens a cluster can actually serve. (github.com) (llm-d.ai)
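
For readers who want to see the routing idea in concrete terms, here is a minimal Python sketch of scoring workers by cache overlap and current decode load, as the article describes. It is an illustration under stated assumptions, not the actual scoring used by Dynamo or llm-d: the `Worker` class, the `pick_worker` function, the tiny `BLOCK_SIZE`, and the `load_weight` parameter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical block size; real engines typically use larger cache blocks.
BLOCK_SIZE = 4


@dataclass
class Worker:
    name: str
    # Hashes of the prefix blocks this worker is assumed to hold in its KV cache.
    cached_blocks: set = field(default_factory=set)
    # Number of requests currently decoding on this worker.
    active_decodes: int = 0


def prefix_block_hashes(tokens):
    """Hash every block-aligned prefix, so equal prefixes give equal hashes."""
    return [
        hash(tuple(tokens[:end]))
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)
    ]


def overlap_blocks(worker, hashes):
    """Count how many leading blocks of the request the worker already caches."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=1.0):
    """Pick the worker with the lowest assumed cost.

    Cost = blocks that must still be prefilled + load_weight * active decodes.
    This weighting is an assumption for illustration only.
    """
    hashes = prefix_block_hashes(tokens)

    def cost(worker):
        new_blocks = len(hashes) - overlap_blocks(worker, hashes)
        return new_blocks + load_weight * worker.active_decodes

    best = min(workers, key=cost)
    # Record that the chosen worker now holds these blocks and one more decode.
    best.cached_blocks.update(hashes)
    best.active_decodes += 1
    return best
```

With two empty workers, two requests that share the same eight-token system prompt both land on the first worker, because the second request finds two of its three blocks already cached there. An unrelated request then goes to the idle second worker, since the first one offers no overlap and carries more load. Round-robin would have ignored both signals.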

Shared from Scout - Be the smartest in the room.