72 LLM Serving Optimisations
- A public deep dive catalogued 72 LLM serving optimisations across nine layers, from attention tweaks to prompt caching.
- The writeup highlights app‑edge efficiencies like prompt caching and KV eviction that can cut inference costs significantly.
- The analysis argues most production cost gains stem from application‑level caching and routing, not just model internals. (x.com)
Running a large language model is usually a two-part job: read the prompt, then generate tokens one by one. A new public explainer from Avi Chawla says the biggest savings often come from avoiding repeated work, not from changing the model itself. (blog.dailydoseofds.com)

Chawla’s post, listed in the Daily Dose of Data Science archive on April 18, 2026, is titled “72 Techniques to Optimize LLMs in Production.” An earlier companion post, dated March 23, 2026, was titled “A Practical Deep Dive on LLM Inference and Optimization!” (blog.dailydoseofds.com)

The basic bottleneck is straightforward: long prompts force the model to recompute attention over thousands of tokens before it can answer. The 2024 survey “LLM Inference Serving” describes that first pass as the “prefill” stage, followed by a slower “decode” stage that emits output token by token. (arxiv.org)

That split helps explain why caching shows up so often in production playbooks. If a system can reuse the work from a repeated system prompt or shared conversation prefix, it can skip part of prefill and cut both latency and compute. (arxiv.org)

The serving literature has focused heavily on engine-level fixes such as batching requests together, managing memory for the key-value cache, and separating prompt processing from token generation. The 2024 survey says the field accelerated after 2023 as deployments spread and low-latency serving became a central systems problem. (arxiv.org)

More recent industry writeups make the same point in plainer cost terms. Morph’s March 27, 2026 guide says wasted context, redundant computation, and idle hardware are three major sources of inference overhead, and it lists prompt reuse, continuous batching, and key-value cache compression among the main fixes. (morphllm.com)

In plain English, the key-value cache is the model’s scratchpad for tokens it has already read.
Eviction policies decide what to throw out when memory fills up, much like a browser discarding old tabs to keep the machine responsive. (arxiv.org)

That is why application-layer choices keep surfacing alongside model tricks like quantization or faster attention. A team that routes easy requests to smaller models, trims stale context, and reuses common prefixes can reduce costs before a graphics processor does any extra math. (morphllm.com)

The thread running through the recent explainers is less about one silver bullet than about stack discipline. In 2026, the cheapest token is often the one a serving system never has to process twice. (blog.dailydoseofds.com)
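
The prefix reuse and eviction ideas described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method from any of the cited writeups or the API of any real serving engine: the `PrefixCache` class, the `reuse` method, and the block size of 4 tokens are all hypothetical, and the cache stores only a marker where a real system would hold key-value tensors. It tracks which block-aligned prompt prefixes have already been through prefill, reports how many leading tokens a new request could skip, and drops the least recently used block when the cache is full.

```python
from collections import OrderedDict
from typing import Sequence

BLOCK = 4  # tokens per cached block (assumed value for illustration)


class PrefixCache:
    """Toy block-level prefix cache with LRU eviction (hypothetical)."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        # Each key is a token prefix that ends on a block boundary.
        # The value is a placeholder for the cached key-value state.
        self._blocks = OrderedDict()

    def _keys(self, tokens: Sequence[str]) -> list:
        """Block-aligned prefixes of `tokens`, shortest first."""
        full_blocks = len(tokens) // BLOCK
        return [tuple(tokens[: (i + 1) * BLOCK]) for i in range(full_blocks)]

    def reuse(self, tokens: Sequence[str]) -> int:
        """Return how many leading tokens are already cached, then cache
        this request's blocks, evicting old ones if over capacity."""
        keys = self._keys(tokens)
        hits = 0
        for key in keys:
            if key not in self._blocks:
                break  # a miss ends the reusable prefix
            hits += 1
        # Touch the longest prefixes first so the shortest, most widely
        # shared prefix ends up as the most recently used entry.
        for key in reversed(keys):
            self._blocks[key] = True
            self._blocks.move_to_end(key)
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return hits * BLOCK


# Two requests share an 8-token system prompt but differ afterwards.
system = [f"sys{i}" for i in range(8)]
cache = PrefixCache(capacity_blocks=2)
first = cache.reuse(system + list("abcd"))    # nothing cached yet
second = cache.reuse(system + list("efghi"))  # system prompt is reused
print(first, second)  # prints "0 8"
```

In this sketch the second request has 13 tokens but only 5 would need prefill, because the two blocks covering the shared system prompt survive eviction while the request-specific block is thrown out. Production engines typically make the same trade at a finer grain and with real tensors, so the numbers here show the shape of the saving rather than its size.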