PrFaaS cross‑DC gains

- Kimi’s PrFaaS decouples LLM prefill and decode across datacenters, cutting key-value cache bandwidth by up to 36×. (x.com)
- The authors report throughput increased 54% and latency fell 64% from the decoupling design. (x.com)
- Those cross-datacenter techniques offer bandwidth and latency lessons for hybrid trading architectures separating hot and cold paths. (x.com)

Moonshot AI researchers say they can move a key part of large language model serving across datacenters instead of keeping it inside one tightly linked cluster. (arxiv.org)

Large language model inference has two steps: prefill, when the model reads the prompt, and decode, when it generates tokens one by one. The new paper, posted to arXiv on April 16, 2026, says those steps can be split across separate sites if only selected long-context requests are offloaded. (arxiv.org)

The bottleneck is the key-value cache, a running memory of the prompt that usually has to move with the request. The authors write that dense-attention models create so much of that traffic that prefill and decode stay “tightly coupled” inside a single high-bandwidth network domain. (arxiv.org)

PrFaaS, short for Prefill-as-a-Service, sends long prompts to standalone prefill clusters and then ships the resulting cache over commodity Ethernet to local prefill-decode clusters for generation. The system also adds bandwidth-aware scheduling and cache-aware request placement instead of offloading every request. (arxiv.org)

The paper says newer hybrid-attention models shrink that cache enough to make cross-cluster transport practical. In the authors’ case study, the setup used an internal 1-trillion-parameter hybrid model rather than a public benchmark model. (arxiv.org)

On the reported results, the authors say the PrFaaS version delivered 54% higher serving throughput than a homogeneous prefill-decode baseline and 32% higher throughput than a naive heterogeneous baseline. They also say the design removed the need for all accelerators to share the same low-latency remote direct memory access fabric. (arxiv.org)

The work extends Mooncake, the Kimi serving system Moonshot AI and Tsinghua University researchers first posted in June 2024. That earlier paper separated prefill and decode clusters inside the serving stack and said Kimi handled 75% more real-workload requests under its architecture. (arxiv.org)

The new paper does not make the cross-datacenter design sound automatic. It says bursty workloads, skewed request lengths, uneven prefix-cache distribution, and fluctuating inter-cluster bandwidth can still cause congestion, unstable queues, and poor utilization if prefill is externalized naively. (arxiv.org)

What comes next is likely less about whether prefill and decode can be split, and more about where the split belongs. Moonshot’s answer in April 2026 is that some of that boundary can move beyond a single datacenter without moving every request with it. (arxiv.org)
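
To make the idea of selective, bandwidth-aware and cache-aware offloading concrete, here is a minimal Python sketch of one possible routing rule. It is not the paper's scheduler: the `Request` type, the `should_offload` function, and every threshold and size below (minimum prompt length, cache bytes per token, link speed, transfer deadline) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length in tokens
    cached_prefix_tokens: int  # tokens already covered by a local prefix cache


def should_offload(
    req: Request,
    *,
    min_tokens: int = 8192,             # hypothetical: only long prompts are candidates
    kv_bytes_per_token: int = 20_000,   # hypothetical cache size per token
    link_bytes_per_s: float = 1.0e9,    # hypothetical inter-cluster bandwidth
    link_backlog_bytes: int = 0,        # bytes already queued on the link
    max_transfer_s: float = 1.0,        # hypothetical transfer-time budget
) -> bool:
    """Return True if the request should go to a remote prefill cluster."""
    # Cache-aware: tokens served by a local prefix cache need no remote prefill.
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    if uncached < min_tokens:
        return False

    # Bandwidth-aware: estimate how long the resulting cache would take to
    # ship back, including whatever is already queued on the link.
    kv_bytes = uncached * kv_bytes_per_token
    transfer_s = (link_backlog_bytes + kv_bytes) / link_bytes_per_s
    return transfer_s <= max_transfer_s


if __name__ == "__main__":
    print(should_offload(Request(32_000, 0)))       # long prompt, idle link
    print(should_offload(Request(2_000, 0)))        # short prompt stays local
    print(should_offload(Request(32_000, 30_000)))  # mostly cached locally
    print(should_offload(Request(32_000, 0), link_backlog_bytes=500_000_000))
```

Under these assumed numbers, only the long, uncached prompt on an idle link is offloaded; short prompts, prompts mostly covered by a local prefix cache, and requests arriving while the link is backed up all stay on the local cluster. A real system would presumably also weigh queue depth and cluster load, which this sketch leaves out.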
