KV‑cache for LLMs
- Researchers proposed Prefill-as-a-Service, a cross-datacenter KV cache to speed large-model inference.
- The idea is to share prefilled token context across servers, reducing repeated prefill compute and token costs.
- Distributed KV caching could cut latency and operational expenses for production LLMs serving many concurrent requests. (x.com)
Large language models spend one burst of work reading your prompt and a slower stream of work writing the answer. A paper posted April 16 says those two jobs no longer need to stay in one data center. (arxiv.org)

The first step is called prefill: the model digests every input token and builds a running memory called the key-value cache, or KV cache. The second step is decode: the model uses that cache to generate output one token at a time. (arxiv.org)

In today’s large-model serving stacks, prefill is mostly limited by raw compute, while decode is mostly limited by memory bandwidth. The paper says that mismatch is why prefill-decode disaggregation has become the dominant deployment pattern for large-scale serving. (arxiv.org)

The sticking point is the cache itself. In dense-attention models, prefill creates so much KV-cache traffic that prefill and decode usually have to remain inside one low-latency, high-bandwidth network domain. (arxiv.org)

Ruoyu Qin, Weiran He and six co-authors from Moonshot AI and Tsinghua University argue that newer hybrid-attention models shrink that cache enough to move it farther away. Their system, Prefill-as-a-Service, selectively sends long-context prefill to separate compute-dense clusters and ships the resulting KV cache back over commodity Ethernet for decode. (arxiv.org)

The paper says smaller caches alone are not enough because real traffic is bursty, prompt lengths are skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth changes over time. So the design adds bandwidth-aware scheduling and cache-aware request placement on top of the offload scheme. (arxiv.org)

That setup lets operators scale prefill capacity and decode capacity separately instead of buying one uniform cluster for both jobs. The paper says it also removes the requirement that both phases share the same remote direct memory access (RDMA) fabric inside one site. (arxiv.org)

In the team’s case study, built around an internal 1-trillion-parameter hybrid model, the heterogeneous deployment with Prefill-as-a-Service delivered 54% higher serving throughput than a homogeneous prefill-decode baseline and 32% higher throughput than a naive heterogeneous baseline. The authors say those gains came with only modest cross-datacenter bandwidth use. (arxiv.org)

The paper does not claim a universal win for every model or every network. It is a systems result tied to hybrid-attention architectures, real-world traffic assumptions, and the economics of moving cache data instead of repeating prefill work. (arxiv.org)

If that trade holds up outside one internal model, the expensive part of reading long prompts could become a shared service instead of a repeated cost on every server. That would turn KV cache from a local byproduct into something operators route across clusters on purpose. (arxiv.org)
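
To see why cache size decides whether a KV cache can leave the building, a rough back-of-envelope sketch helps. The Python below is illustrative only and is not taken from the paper: the model shape (64 layers, 8 KV heads, head dimension 128, 16-bit values), the 128,000-token prompt, the 100 Gbps link, and the assumption that a hybrid model keeps a full KV cache in only 16 of its 64 layers are all hypothetical. It also ignores the small fixed-size state that non-attention layers would carry, as well as protocol overhead on the wire.

```python
def kv_cache_bytes(tokens, kv_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for layers that store full attention state.

    The factor of 2 covers one key vector and one value vector
    per token, per KV head, per layer.
    """
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes, link_gbps):
    """Idealized time to move num_bytes over a link of link_gbps gigabits/s."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical numbers, chosen only to illustrate the scaling.
TOKENS = 128_000
LINK_GBPS = 100

# Dense attention: every one of the 64 layers keeps a full KV cache.
dense_bytes = kv_cache_bytes(TOKENS, kv_layers=64, kv_heads=8, head_dim=128)

# Hybrid attention (assumed): only 16 of 64 layers keep a full KV cache.
hybrid_bytes = kv_cache_bytes(TOKENS, kv_layers=16, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, size in [("dense", dense_bytes), ("hybrid", hybrid_bytes)]:
        secs = transfer_seconds(size, LINK_GBPS)
        print(f"{name}: {size / 1e9:.2f} GB, {secs:.2f} s at {LINK_GBPS} Gbps")
```

Under these assumed numbers the hybrid cache is a quarter the size of the dense one, so the idealized transfer time drops by the same factor. Whether that is small enough to justify shipping the cache depends on how long the prefill itself takes and how much link capacity is free at that moment, which is the kind of trade-off the paper's bandwidth-aware scheduling is described as handling.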