vLLM+Mooncake demo shards KV cache across hosts to scale agent serving

- vLLM published a May 6 demo showing Mooncake’s distributed KV cache plugged into vLLM, aimed squarely at long, tool-heavy agent sessions.
- On agent traces from Codex and GPT-5.4 over SWE-bench Pro, the setup hit 3.8x throughput, 46x lower TTFT, and 8.6x lower latency.
- The point is simple: shard KV state across hosts, stop recomputing giant shared prefixes, and make multi-turn agents much cheaper to serve.

KV cache is the part of LLM serving that quietly gets expensive. Every long prompt leaves behind attention state the model wants to reuse, but on normal setups that state is trapped on one machine or recomputed over and over. That is tolerable for chat. It gets ugly for agents, especially coding agents, because they keep looping through long histories, tool outputs, and tiny incremental updates. On May 6, the vLLM project showed a demo integration with Mooncake that spreads that KV cache across hosts instead of treating each GPU as an island. (github.com)

### What is the actual news here?

The news is not “KV cache exists.” The news is that vLLM wired Mooncake’s distributed KV store into its serving stack and showed it on realistic agent traces, not toy synthetic prompts. The writeup says the combined system scaled nearly linearly to 60 NVIDIA GB200 GPUs and delivered 3.8x higher throughput, 46x lower time-to-first-token, and 8.6x lower end-to-end latency. (github.com)

### Why are agents the hard case?

Agent workloads reuse huge prefixes. A coding agent keeps the system prompt, prior turns, memory, and earlier tool results, then adds only a small delta each turn. vLLM’s post says that in its Codex and GPT-5.4 traces on SWE-bench Pro, context length reached roughly 80K tokens by […] average input-to-output ratio was about 131:1. Basically, the model keeps rereading a book to answer one new sentence. (github.com)

### So what does Mooncake change?

Mooncake was built around the idea that KV cache should be a first-class distributed resource, not a side effect living only in one GPU’s memory. Its earlier architecture split prefill and decode work and used pooled CPU, DRAM, and SSD resources to hold KV state more flexibly. In […]generating prefixes from scratch. (github.com) (arxiv.org)

### Why does sharding KV across hosts matter so much?

Because prefill is the expensive part for long-context agents. If 95% of a request is old context, recomputing that old context on every turn is waste. A distributed KV cache turns that repeated prefix into something closer to a lookup. That is why time-to-first-token moves so dramatically in the demo: the system is no longer s[…] it can start answering. (arxiv.org) (github.com)

### Is this just a benchmark trick?

Not really, and that is the interesting part. The vLLM team says it used traces collected from Codex and GPT-5.4 on 610 SWE-bench Pro sessions, with a median of 33 turns per trace. That matters because agent serving has a very different shape from single-shot chat benchmarks. Th[…] badly matched to real agents. (github.com)

### What was broken before this?

Local prefix caching already helped, but it hit a hard boundary: the cache lived where it was created. Once requests moved across workers or hosts, the system either had to transfer state awkwardly or rebuild it. vLLM already had Mooncake support for prefill-decode disaggregation […] persist and be shared across a larger pool. (github.com) (docs.vllm.ai)

### What is the catch?

The catch is operational complexity. Distributed cache systems need scheduling, eviction, consistency rules, and fast transport, or they become a bottleneck themselves. Mooncake’s whole research story is that KV-centric scheduling is the hard part, especially under load. So the demo is promising, but the real test is how smoothly this lands in production clusters with mixed workloads and failure cases. (arxiv.org)

### Bottom line?

This is a pretty clear signal about where LLM serving is heading. As agents get longer-lived and more tool-heavy, the scarce resource is not just FLOPs; it is reusable context. vLLM and Mooncake are betting that the winning serving stack will treat KV cache like shared infrastructure, not disposable scratch space. (github.com)
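
### What does prefix reuse look like in code?

For readers who want the "prefix becomes a lookup" idea in concrete form, here is a minimal Python sketch. It is a toy model under stated assumptions, not vLLM's or Mooncake's API: `SharedKVStore`, `block_keys`, `serve_turn`, the 16-token block size, and the token counts are all hypothetical names and values chosen for illustration, and a plain dict stands in for a store that every host can reach. It only counts how many tokens a turn could reuse versus how many it would still have to prefill.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per cached block (illustrative value, not a real default)


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys: List[str] = []
    parent = ""
    full_len = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


class SharedKVStore:
    """Hypothetical stand-in for a KV store shared by all hosts (just a dict here)."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}  # block key -> host that produced it

    def contains(self, key: str) -> bool:
        return key in self._blocks

    def put(self, key: str, owner: str) -> None:
        self._blocks.setdefault(key, owner)


def serve_turn(store: SharedKVStore, host: str, tokens: List[int]) -> Tuple[int, int]:
    """Return (reused_tokens, prefilled_tokens) for one request handled by `host`."""
    keys = block_keys(tokens)
    hits = 0
    for key in keys:  # reuse stops at the first missing block in the chain
        if not store.contains(key):
            break
        hits += 1
    for key in keys[hits:]:  # publish newly computed blocks for later turns
        store.put(key, host)
    reused = hits * BLOCK_SIZE
    return reused, len(tokens) - reused


if __name__ == "__main__":
    shared = SharedKVStore()
    history = list(range(8000))            # long shared prefix (made-up token ids)
    next_turn = history + list(range(100))  # small delta appended by the agent

    print(serve_turn(shared, "host-a", history))    # first turn: nothing to reuse
    print(serve_turn(shared, "host-b", next_turn))  # different host, shared store

    local_only = SharedKVStore()  # models a cache that never leaves host-b
    print(serve_turn(local_only, "host-b", next_turn))
```

With a shared store, the second turn reuses the whole 8,000-token history even though it lands on a different host, and only the 100-token delta needs prefill. With a host-local store, the same turn starts from zero. Real systems also have to handle eviction, transport, and scheduling, which this sketch deliberately leaves out.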
