72 serving optimizations mapped
A public mapping cataloged 72 LLM serving optimizations across nine layers, including PagedAttention, KV-cache eviction, and prompt caching, and cited claims of up to 90% cost reduction from caching strategies. The framework emphasizes system-level efficiency gains beyond model changes for production LLM workloads. (x.com)
A new public map of large language model serving work counts 72 optimizations across nine layers, shifting attention from model training to the machinery that runs models in production. (arxiv.org) (docs.vllm.ai)

Serving is what happens after a model is built. It runs in two phases: prefill, which reads the full prompt, and decode, which generates output tokens one at a time. Researchers say those two phases create distinct compute and memory bottlenecks that make deployment expensive even after model quality is fixed. (arxiv.org) (usenix.org)

The map highlighted techniques such as PagedAttention, prompt caching, and key-value cache eviction. PagedAttention stores the attention key-value cache in fixed-size blocks so systems can avoid fragmentation and pack more requests onto the same hardware. (docs.vllm.ai 1) (docs.vllm.ai 2)

Prompt caching targets repeated prefixes, such as the same system prompt or shared document context across requests. OpenAI says prompt caching can cut input token costs by up to 90% and latency by up to 80%, while Anthropic says its prompt caching can reduce costs by up to 90% and latency by up to 85% for long prompts. (developers.openai.com) (anthropic.com)

The key-value cache is the model’s short-term working memory during generation, and it grows with prompt length, model depth, and concurrent users. vLLM’s prefix caching design hashes prompt blocks and reuses matching blocks across requests, while evicting unused blocks with a least-recently-used policy when memory fills up. (docs.vllm.ai)

That system view has become more important as LLM serving research has expanded quickly since 2023. A July 17, 2024 survey from Northeastern University and the Massachusetts Institute of Technology said the field had already produced enough work across major systems and machine learning venues to make it hard for practitioners to track the most practical ideas. (arxiv.org)

Some of the biggest gains now come from scheduling, not just memory tricks.
Sarathi-Serve, presented at OSDI 2024, reported that chunked prefills and stall-free scheduling delivered 2.6 times higher serving capacity for Mistral-7B on one A100 and up to 3.7 times for Yi-34B on two A100s compared with vLLM. (usenix.org)

Cluster-level schedulers push the same idea further by moving requests between model instances while they are running. Llumnix, also presented at OSDI 2024, reported an order-of-magnitude tail-latency improvement, up to 1.5 times faster handling for high-priority requests, and up to 36% cost savings at similar tail latency. (usenix.org)

The practical message from the 72-item map is that LLM cost no longer depends only on which model a company picks. It also depends on how aggressively the serving stack reuses prompts, packs batches, manages memory blocks, and routes work across GPUs. (docs.vllm.ai) (arxiv.org)
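
To make the prompt-reuse idea concrete, here is a minimal Python sketch of block-level prefix caching with least-recently-used eviction, in the spirit of the vLLM design described above. It is not vLLM's code or API. The names `PrefixCache`, `admit`, `block_hashes`, and `BLOCK_SIZE` are hypothetical, the SHA-256 chaining is an assumed hashing scheme, a string stands in for real key-value tensors, and the sketch assumes every earlier request has finished, so any cached block may be evicted.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per block; tiny, hypothetical value for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with its parent's hash, so a block's
    hash identifies the whole prefix up to and including that block.
    Trailing tokens that do not fill a block are ignored."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy prefix cache: maps block hashes to placeholder KV data."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        # Insertion order doubles as recency: front = least recently used.
        self.blocks = OrderedDict()

    def admit(self, tokens):
        """Cache a prompt's blocks; return how many prompt tokens were reused."""
        hashes = block_hashes(tokens)
        hits = 0
        for h in hashes:  # reuse stops at the first missing block
            if h not in self.blocks:
                break
            hits += 1
        # Touch blocks in reverse so shared leading blocks end up most
        # recent and request-specific tail blocks are evicted first.
        for h in reversed(hashes):
            self.blocks[h] = "kv-placeholder"
            self.blocks.move_to_end(h)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return hits * BLOCK_SIZE


system_prompt = list(range(8))  # two blocks shared by every request
cache = PrefixCache(capacity_blocks=4)
reused = [cache.admit(system_prompt + [n] * 4) for n in (100, 200, 300, 100)]
print(reused)  # [0, 8, 8, 8]
```

In this run the shared eight-token system prompt is reused by every request after the first. The first request's own tail block is evicted once the four-block capacity is exceeded, so repeating that request later reuses only the shared prefix. A production cache would also track which blocks are still in use by running requests and never evict those.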