AR LLM serving deep dive
An expert thread laid out serving patterns for autoregressive (AR) reasoning LLMs: model compression (pruning and quantization), prefill and KV cache management, speculative decoding, kernel tuning, and disaggregated serving. It also argued through the tradeoffs among tensor, data, and pipeline parallelism (TP/DP/PP). The checklist highlights concrete levers for cutting latency and cost across real‑time generative pipelines.
PrefillShare (arxiv.org) introduced a frozen shared‑prefill module that lets heterogeneous decode heads reuse one KV cache, and reported 4.5× lower p95 latency and 3.9× higher throughput in multi‑model agent workloads. Disaggregating prefill and decode, first formalized by DistServe (huggingface.co), is now available as an experimental feature in vLLM to isolate TTFT from inter‑token latency, and in TensorRT‑LLM for production routing and resource isolation (docs.vllm.ai).
PagedAttention (vLLM) reduces KV cache fragmentation to near zero and reports 2–3× throughput gains by storing the KV cache in fixed‑size page/slot blocks (arxiv.org). Learning‑based KV compression work such as AttentionPredictor claims up to 13× KV compression and 5.6× speedups on long‑context workloads (arxiv.org), and a separate MIT method publicized this year advertises context compaction of up to 50× in seconds for some workloads (venturebeat.com).
Speculative decoding research (Leviathan et al.) established the theoretical speedup and showed ~2–3× improvements in practical experiments (proceedings.mlr.press). Google reported similar product gains of 2–3× in translation and summarization (research.google), and vendors claim larger hardware‑specific boosts (NVIDIA cited 3.6× on H200), with production writeups stressing careful draft‑model selection and verification to avoid quality regressions (introl.com).
Activation‑aware quantization (AWQ) drew attention at MLSys for enabling low‑bit LLMs with minimal quality loss, and its authors reported >3× speedups versus FP16 in some edge scenarios (proceedings.mlsys.org). GPTQ toolchains and optimized CUDA kernels such as Marlin are widely used in practice to run 4‑bit models faster on A100/H100 GPU stacks (huggingface.co).
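The verification step that the speculative decoding writeups above stress can be illustrated on a toy vocabulary. The sketch below follows the accept/reject rule described by Leviathan et al.: the draft model proposes a token from `q`, the target model accepts it with probability `min(1, p[x] / q[x])`, and on rejection a token is resampled from the normalized residual `max(0, p - q)`. The function names and the example distributions are assumptions made for illustration, not any library's API, and a real server would apply this per position over a multi‑token draft.

```python
import random


def speculative_step(p, q, rng):
    """One draft-then-verify step over a toy vocabulary.

    p: target-model probabilities, q: draft-model probabilities (same length).
    Returns (token, accepted).
    """
    # The draft model proposes a token by sampling from q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # The target model accepts it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def acceptance_rate(p, q):
    """Probability that a drafted token is accepted: sum of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def output_distribution(p, q):
    """Exact distribution of the token emitted by speculative_step."""
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p equals q, so a draft is never rejected
        return accept_mass
    return [a + reject_prob * r / total for a, r in zip(accept_mass, residual)]


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]  # hypothetical target-model distribution
    draft = [0.2, 0.5, 0.3]   # hypothetical draft-model distribution
    print(acceptance_rate(target, draft))
    print(output_distribution(target, draft))
    print(speculative_step(target, draft, random.Random(0)))
```

Under this rule the emitted token follows the target distribution `p` exactly, so output quality is preserved in principle and the draft model's quality shows up only in the acceptance rate, which is why draft‑model selection drives the realized speedup.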
Empirical parallelism guidance shows that Tensor Parallelism (TP) tends to lower single‑request latency, Pipeline Parallelism (PP) drives throughput at the cost of pipeline bubbles, and hybrid TP+PP settings let operators trade latency for goodput. As concrete examples of these tradeoffs, papers report that interleaved PP gives a ~10% throughput uplift and that megascale Megatron runs reach 502 PFLOP/s across 3,072 GPUs (arxiv.org). Kernel‑level tuning remains decisive: FlashInfer reported up to 31× faster shared‑prefix decoding on very long prompts with paged KV kernels (flashinfer.ai), and production stacks combine those kernels with KV prefetching/prefill strategies (PRESERVE) to make better use of HBM bandwidth, achieving ~1.6× end‑to‑end speedups in reported experiments (arxiv.org).
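The pipeline‑bubble cost mentioned in the parallelism discussion above can be estimated with the standard schedule analysis from the Megatron‑LM work: with `p` pipeline stages and `m` microbatches per batch, bubble time is `(p - 1) / m` of the ideal compute time, and an interleaved schedule with `v` model chunks per device reduces that by a factor of `v`. The sketch below is a back‑of‑the‑envelope calculator under those assumptions; the function names and the example sizes are hypothetical, and the formula ignores the extra communication that interleaving introduces.

```python
def bubble_fraction(p, m, v=1):
    """Pipeline bubble time as a fraction of ideal compute time.

    p: number of pipeline stages
    m: number of microbatches per batch
    v: interleaved model chunks per device (v=1 means no interleaving)
    """
    if p < 1 or m < 1 or v < 1:
        raise ValueError("p, m, and v must all be >= 1")
    return (p - 1) / (v * m)


def pipeline_efficiency(p, m, v=1):
    """Share of wall-clock time spent on useful compute: 1 / (1 + bubble)."""
    return 1.0 / (1.0 + bubble_fraction(p, m, v))


if __name__ == "__main__":
    # Hypothetical configuration: 8 stages, 32 microbatches.
    for chunks in (1, 2, 4):
        print(chunks, bubble_fraction(8, 32, chunks), pipeline_efficiency(8, 32, chunks))
```

Raising `m` or `v` shrinks the bubble, which is the throughput side of the tradeoff. For a single latency‑sensitive request there are few microbatches to fill the pipeline, which is one reason TP is usually preferred for low latency.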