Split Inference Boosts Throughput

A disaggregated inference approach reportedly raised LLM throughput by ~70% by running the prefill and decode phases on separate workers, a practical lever for lowering cloud inference costs and reducing latency. The pattern matters if you're architecting scalable inference pipelines in the cloud. (x.com)

Research systems report a wide range of gains from phase separation: the DistServe prototype measured up to 4.48× improvement in goodput on mixed workloads, while POD‑Attention reported up to 22% higher end‑to‑end serving throughput and up to ~59% speedup in attention kernels in microbenchmarks. (haoailab.com)

AWS published a formal integration of disaggregated inference with the llm‑d project on March 16, 2026, shipping a ghcr.io/llm-d/llm-d-aws container and NIXL support intended to enable multi‑node prefill/decode deployments on SageMaker and EKS. (aws.amazon.com)

Engineering notes and handbooks flag operational costs: PD disaggregation requires moving KV cache state between prefill and decode workers and can increase data‑transfer and orchestration complexity, with threshold behaviors that sometimes make disaggregation slower for small workloads. (bentoml.com)

Benchmarks tying disaggregation to hardware mix show concrete TCO and throughput outcomes: a Gimlet Labs study claimed ~1.7× TCO improvement using heterogeneous accelerators, and LMSYS/SGLang published per‑node throughput figures of ~52.3k input tokens/sec and ~22.3k output tokens/sec for a multi‑H100 setup. (gimletlabs.ai)

Not all stacks treat disaggregation the same: vLLM’s experimental “disaggregated prefill” feature is explicit about being experimental and notes that it does not by itself guarantee higher raw throughput, while DistServe and follow‑ups emphasize improvements in goodput and SLO adherence rather than only tokens/sec. (docs.vllm.ai)

Split deployments over wide‑area links add new trade‑offs: a February 2026 arXiv study of privacy‑aware split inference measured ~8–9 tok/s on Mistral‑7B over an ~80ms RTT and projected 15–19 tok/s at 20ms RTT, while also quantifying privacy leakage where attacker token recovery fell from ~59% at a 2‑layer split to ~35% at an 8‑layer split. (arxiv.org)
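
To make the KV cache transfer cost noted above concrete, here is a rough back-of-envelope sketch in Python. It is not taken from any of the cited sources: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 50 GB/s link are illustrative assumptions, and the function names are hypothetical. Real systems may compress, page, or overlap the transfer, so treat the result as an order-of-magnitude estimate only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size for one request.

    The leading factor of 2 counts keys and values, stored per layer,
    per KV head, and per prompt token. bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, link_bytes_per_sec):
    """Idealized time to ship the cache from a prefill worker to a decode worker.

    Ignores latency, protocol overhead, and any overlap with compute.
    """
    return num_bytes / link_bytes_per_sec


if __name__ == "__main__":
    # Assumed (hypothetical) model shape and link speed, chosen only for illustration.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
    seconds = transfer_seconds(size, link_bytes_per_sec=50e9)
    print(f"KV cache: {size / 2**20:.0f} MiB, transfer: {seconds * 1e3:.1f} ms")
```

Under these assumptions a 4,096-token prompt carries about 512 MiB of KV state, which takes roughly 10 ms to move over the assumed link. For short prompts or slow interconnects that fixed cost can outweigh the benefit of separating the phases, which is one plausible reading of the threshold behavior the handbooks describe.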
