LLM serving playbook

NVIDIA and community threads pushed a clear production playbook this week: split the prefiller and decoder stages on Kubernetes, tune async and chunked prefill and the KV cache, and match the framework to the hardware for maximum throughput, with vLLM for dynamic RAG and TensorRT‑LLM for raw tokens/s on NVIDIA. The guidance includes concrete configs (FP8 KV cache, Triton attention backend, PagedAttention, continuous batching) that teams are using to gain 2–3x throughput in production and avoid common post‑fine‑tuning cost surprises. (x.com; x.com; marktechpost.com)
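To make the FP8 KV cache item concrete, here is a back-of-the-envelope sizing sketch. The model shape and the 20 GiB cache budget are hypothetical values chosen for illustration, not figures from the cited sources, and the arithmetic ignores the small overhead of FP8 scaling factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # Hypothetical shape, loosely resembling an 8B-class model with
    # grouped-query attention. Not taken from the cited sources.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128


def kv_bytes_per_token(shape: ModelShape, bytes_per_element: int) -> int:
    """Bytes of KV cache needed to store one token across all layers."""
    # Factor of 2: each layer stores one key vector and one value vector
    # per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_element


def max_cached_tokens(budget_bytes: int, shape: ModelShape, bytes_per_element: int) -> int:
    """How many tokens fit in a fixed KV cache budget."""
    return budget_bytes // kv_bytes_per_token(shape, bytes_per_element)


shape = ModelShape()
budget = 20 * 1024**3  # assumed 20 GiB of GPU memory set aside for KV cache

fp16_tokens = max_cached_tokens(budget, shape, bytes_per_element=2)
fp8_tokens = max_cached_tokens(budget, shape, bytes_per_element=1)

print(f"FP16 KV cache: {fp16_tokens:,} tokens")
print(f"FP8 KV cache:  {fp8_tokens:,} tokens")
```

Under these assumptions, halving the bytes per element doubles the number of tokens that fit in the same budget, which is the capacity argument behind the FP8 setting. Whether that capacity turns into throughput depends on the attention kernel and on accuracy tolerances for the workload.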

NVIDIA published a detailed technical blog on March 23, 2026, authored by Anish Maddipoti, Sanjay Chatterjee, Rohan Varma and Ekin Karabulut, that documents a disaggregated LLM inference architecture with independent prefill, decode and router services on Kubernetes. (developer.nvidia.com)

vLLM’s production-stack includes a step‑by‑step “Disaggregated Prefill” tutorial and Helm examples that show how teams deploy separate prefiller and decoder pods and wire up KV‑cache transfer in Kubernetes. (github.com)

The original PagedAttention paper (arXiv:2309.06180) reported that paged KV caching cuts KV‑cache memory waste from fragmentation and over‑reservation from roughly 60–80% to under ~4%, and demonstrated ~2×–4× throughput gains in serving experiments. (arxiv.org)

vLLM’s docs now list an FP8 Quantized KV Cache feature that shrinks the KV cache to increase context capacity, while vLLM maintainers have warned that an FP8 KV cache needs a supporting attention kernel (paged attention) to avoid dequantization overhead. (docs.vllm.ai)

TensorRT‑LLM and its Triton backend expose in‑flight (continuous) batching and a Triton attention backend that implement paged attention, with the kernel plumbing in inflight_batcher_llm, to reduce per‑token kernel overhead on NVIDIA stacks. (docs.nvidia.com)

A current vLLM GitHub feature issue (#25373) and the NVIDIA guide both call out redundant tokenization and first‑token generation between prefiller and decoder as a production bottleneck. Contributors are proposing a tokenization handoff and first‑token optimizations to remove that duplicate work. (github.com)
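The memory argument behind PagedAttention can be illustrated with a toy calculation. This is a minimal sketch, not vLLM's allocator: it compares reserving a maximum-length contiguous slot per request against allocating fixed-size blocks on demand. The block size, maximum length and sequence lengths are all assumed values, and the resulting percentages are properties of this toy input, not a reproduction of the paper's measurements.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


def contiguous_waste(seq_lens: list[int], max_len: int) -> float:
    """Fraction of reserved KV slots left unused when every request
    reserves max_len slots up front."""
    reserved = len(seq_lens) * max_len
    used = sum(seq_lens)
    return 1 - used / reserved


def paged_waste(seq_lens: list[int], block_size: int = BLOCK_SIZE) -> float:
    """Fraction of reserved KV slots left unused when slots are allocated
    on demand in fixed-size blocks. Only the last block of each sequence
    can be partially empty."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    used = sum(seq_lens)
    return 1 - used / reserved


# Hypothetical mix of actual sequence lengths (prompt plus generated tokens).
seq_lens = [137, 512, 48, 901, 260, 75]
max_len = 2048  # assumed per-request reservation in the contiguous scheme

print(f"contiguous reservation waste: {contiguous_waste(seq_lens, max_len):.1%}")
print(f"paged allocation waste:       {paged_waste(seq_lens):.1%}")
```

With paging, waste per sequence is bounded by one partially filled block, so it shrinks as sequences grow relative to the block size. The contiguous scheme instead wastes whatever each request leaves unused of its up-front reservation.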
