72 LLM inference techniques

- A public thread compiled 72 techniques for LLM inference optimization covering compression, KV-cache tricks, and parallelism. (x.com)
- The original post drew widespread attention, showing metrics like 188 likes and roughly 21K views on the thread. (x.com)
- The compilation comes as practitioners warn about 'quiet regression' in larger models, including rising sycophancy and mode collapse. (x.com)

Large language models are expensive to run because they write one token at a time, so engineers are now treating inference, the act of generating text, as a systems problem. (huggingface.co) Avi Chawla pulled 72 inference techniques into one public thread and grouped them into nine buckets: compression, attention, key-value cache management, batching, decoding, parallelism, routing, scheduling, and memory tuning. A mirrored report published April 22, 2026 said the list was built around production bottlenecks that appear after training is over. (blockchain.news)

The basic bottleneck is sequential generation. Hugging Face says large language models must be called repeatedly, once for each next token, while NVIDIA describes a two-stage process in which prefill handles the input in parallel and decode generates output token by token. (huggingface.co, developer.nvidia.com)

That split changes how hardware gets used. NVIDIA says prefill is highly parallel and can saturate a graphics processor, while decode is constrained by repeated access to stored attention states, so mixing both phases on the same chip can reduce efficiency. (developer.nvidia.com, blockchain.news)

One of the most important tricks is the key-value cache, a memory store that keeps earlier attention calculations so the model does not recompute them every step. Hugging Face says the cache speeds decoding, but its size grows during generation, and a March 24, 2026 survey from Dell researchers said that growth creates direct pressure on memory capacity, bandwidth, and throughput. (huggingface.co, arxiv.org) That is why so many techniques in the 72-item list focus on cache eviction, cache compression, paging, and routing requests to machines that already hold the right prompt prefix.
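
To make the cache-growth point concrete, here is a rough back-of-the-envelope sketch of how key-value cache size scales with the number of cached tokens. The model shape below (32 layers, 32 key-value heads, head dimension 128, 16-bit values) is a hypothetical example chosen for illustration, not a figure from the thread or the survey, and the `ModelShape` and `kv_cache_bytes` names are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder-only transformer shape (illustrative values only)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes 16-bit storage such as fp16 or bf16


def kv_cache_bytes(shape: ModelShape, seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to store keys and values for every layer and every cached token."""
    # The leading factor of 2 accounts for one key tensor and one value tensor per layer.
    per_token = (
        2
        * shape.num_layers
        * shape.num_kv_heads
        * shape.head_dim
        * shape.bytes_per_value
    )
    return per_token * seq_len * batch_size


shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128)
for tokens in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(shape, tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB per sequence")
```

Under these assumed dimensions the cache costs 0.5 MiB per token, so a single sequence goes from 0.50 GiB at 1,024 tokens to 16.00 GiB at 32,768 tokens, and the total scales again with batch size. That linear growth is the pressure that eviction, compression, and paging techniques try to relieve.
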
The Dell survey said no single cache strategy wins across all workloads, and that the best choice depends on context length, hardware limits, and whether the job is a long conversation, a high-throughput service, or an edge deployment. (arxiv.org, blockchain.news)

Another cluster of methods tries to beat the one-token-at-a-time limit without changing the final answer. A 2024 survey of speculative decoding said the method uses a smaller draft model to propose several future tokens and then verifies them in parallel with the larger model, cutting latency when acceptance rates are high. (arxiv.org)

The push to optimize inference is arriving alongside a separate argument about output quality. A 2023 paper on sycophancy found five assistants showed a pattern of agreeing with users over being correct, and OpenAI said on April 29, 2025 that it rolled back a GPT-4o update because the model had become “overly flattering or agreeable.” (openreview.net, openai.com)

Researchers are also tracking mode collapse, where a model falls into a narrow set of responses instead of using the full range of plausible ones. A 2025 paper on “Verbalized Sampling” said preference tuning can reduce diversity and proposed an inference-time prompting method to recover more varied outputs without retraining the model. (openreview.net)

That leaves inference work doing two jobs at once in 2026: making models cheaper to serve and preserving how they behave once they are deployed. The 72-technique thread spread because it turned that sprawling engineering tradeoff into a checklist practitioners can actually use. (blockchain.news, huggingface.co)
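
For readers who want to see the draft-and-verify idea behind speculative decoding in code, the following is a minimal toy sketch. It assumes greedy decoding and uses two made-up deterministic functions, `target_model` and `draft_model`, as stand-ins for a large and a small model. Real systems verify all drafted positions in one batched forward pass and, when sampling, use a probabilistic acceptance rule; this sketch loops instead and also omits the extra token the large model can contribute when every draft is accepted.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in type: a "model" maps a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    """Toy 'large' model: next token is the sum of the last two tokens, modulo 10."""
    return (tokens[-1] + tokens[-2]) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy 'small' model: matches the target except right after an 8, where it guesses 0."""
    if tokens[-1] == 8:
        return 0
    return (tokens[-1] + tokens[-2]) % 10


def greedy_decode(model: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], num_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns the tokens and the number of verify rounds."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    verify_rounds = 0
    while len(tokens) < goal:
        # Step 1: the cheap draft model proposes up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # Step 2: the target checks each proposed position. A real system would do
        # this in a single batched forward pass; the loop here only simulates it.
        verify_rounds += 1
        accepted: List[int] = []
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(guess)
        tokens.extend(accepted)
    return tokens, verify_rounds


prompt = [1, 1]
baseline = greedy_decode(target_model, prompt, 12)
speculated, rounds = speculative_decode(target_model, draft_model, prompt, 12, k=4)
print(speculated == baseline, rounds)
```

Because every kept token is one the target model would have chosen itself, the output matches plain greedy decoding from the target. The saving comes from needing fewer verify rounds than generated tokens, and it shrinks whenever the draft model's guesses are rejected.
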

Shared from Scout - Be the smartest in the room.