72 LLM optimizations mapped

- A recent mapping lists 72 LLM optimizations across the serving pipeline, emphasizing app-edge techniques like prompt caching. (x.com)
- Prompt caching and stacked tiering are cited as yielding up to ~90% cost reductions versus naive model use. (x.com)
- The checklist highlights token-count inconsistencies and IO/batching levers engineers should audit when estimating costs. (x.com)

Large language model costs are no longer just about which model you pick. A new 72-point map argues the bigger savings now sit across the whole serving stack, from app logic to GPU scheduling. (developers.openai.com)

One of the clearest levers is prompt caching, which reuses work on repeated prompt prefixes instead of recomputing them. OpenAI says it can cut latency by up to 80% and input-token costs by up to 90% on supported requests. (developers.openai.com)

OpenAI’s cache starts on prompts that are 1,024 tokens or longer, and exact prefix matches matter. The company says static instructions and examples should go first, while variable user data should go last. (developers.openai.com)

The basic idea is simple: the expensive part of a long request is often reading the prompt, not writing the answer. OpenAI’s February 18, 2026 cookbook says caching skips that “prefill” work by reusing stored key-value data from earlier runs. (developers.openai.com)

That shifts cost planning away from headline model prices and toward request design. OpenAI says cache hits require an exact repeated prefix, can weaken when shared traffic rises above roughly 15 requests per minute on one prefix, and can be steered with a `prompt_cache_key`. (developers.openai.com)

The same pattern is showing up across vendors. Anthropic’s current model pages for Claude Sonnet 4.6 and Claude Opus 4.7 both advertise up to 90% savings from prompt caching and 50% savings from batch processing. (anthropic.com)

Engineers are also tuning the layer below the application. The open-source serving engine vLLM lists high-throughput serving, multiple decoding algorithms, and tensor, pipeline, data, expert, and context parallelism among its core features. (github.com)

Recent research describes why those knobs are multiplying. A November 25, 2025 paper from Georgia Tech, Google, Intel, Intel Labs, and Google DeepMind says modern serving now spans retrieval, key-value cache lookups, model routing, staged decoding, and multi-step reasoning rather than a single prefill-decode path. (arxiv.org)

That is why a checklist with 72 optimizations lands now: teams are being pushed to audit token counting, batching, routing, and cache behavior before they trust a cost estimate. The map’s core message is that “use a cheaper model” is no longer the whole answer. (developers.openai.com)
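
To make the prefix-ordering point concrete, here is a rough back-of-the-envelope sketch in Python. It is not any vendor's billing logic. It assumes a flat per-token price, treats the "up to 90%" figure as a best-case discount, counts only the immediately preceding prompt as cached, and uses made-up prompt sizes. The function names and token lists are hypothetical.

```python
from typing import Sequence

MIN_CACHEABLE_TOKENS = 1024  # threshold the article attributes to OpenAI's cache
CACHED_DISCOUNT = 0.90       # "up to 90%" figure, treated here as a best case


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the leading tokens that are identical in both prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimate_input_cost(prompts, price_per_token, discount=CACHED_DISCOUNT):
    """Sum input cost, discounting tokens shared with the previous prompt's prefix.

    Simplification (an assumption, not a vendor rule): only the immediately
    preceding prompt counts as cached, and the first request is always a miss.
    """
    total = 0.0
    previous = None
    for prompt in prompts:
        cached = 0
        if previous is not None and len(prompt) >= MIN_CACHEABLE_TOKENS:
            cached = shared_prefix_len(previous, prompt)
        uncached = len(prompt) - cached
        total += price_per_token * (uncached + cached * (1.0 - discount))
        previous = prompt
    return total


# Hypothetical workload: 2,000 static tokens plus 200 per-request tokens.
static = ["sys"] * 2000
users = [[f"user{i}"] * 200 for i in range(10)]

static_first = [static + u for u in users]   # recommended ordering
dynamic_first = [u + static for u in users]  # variable data breaks the prefix

naive = estimate_input_cost(static_first, 1.0, discount=0.0)
good = estimate_input_cost(static_first, 1.0)
bad = estimate_input_cost(dynamic_first, 1.0)

print(f"static-first saves {1 - good / naive:.0%}")
print(f"dynamic-first saves {1 - bad / naive:.0%}")
```

With these invented numbers, the static-first ordering trims the estimated input bill by roughly three quarters, while putting user data first saves nothing because no two prompts share a leading token. Real savings depend on actual pricing, cache eviction, and traffic patterns that this sketch does not model.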
