RadixAttention course released
DeepLearning.AI published a free 80‑minute course on RadixAttention, a technique for caching KV tensors to cut LLM inference costs by reusing intermediate state across requests. (x.com) The material also shows how the approach extends to diffusion models, making it relevant for production teams trying to lower latency and billable compute. (x.com)
LLM serving spends a surprising amount of money recomputing the same first few pages of a prompt. If 10 users all start with the same policy manual, the model usually rereads that manual 10 separate times before it answers anything. (lmsys.org)
The work being repeated is the construction of the key-value (KV) cache, which is the model’s running scratchpad for tokens it has already processed. The cache stores intermediate state so the model can predict the next token without rebuilding all prior attention math from zero. (docs.sglang.io, lmsys.org) Most serving systems only reuse that scratchpad within one request. When the request ends, they often throw away useful state even if the next request begins with the exact same token sequence. (lmsys.org)
RadixAttention keeps that old work and organizes it like a family tree of shared prefixes. If two prompts begin with the same 2,000 tokens, they can share one cached branch for those 2,000 tokens and split only where the text diverges. (arxiv.org, mintlify.com) The “radix” part comes from the radix tree data structure, which stores sequences by common beginnings instead of duplicating them. In practice, that means one long system prompt, one document header, or one chat history can be reused across users and across requests. (mintlify.com, arxiv.org)
That is why the new DeepLearning.AI course is aimed at production inference, not just model theory. Its course page says learners will implement SGLang’s RadixAttention, extend caching across users and requests, and measure the speedups directly. (learn.deeplearning.ai) The course is called “Efficient Inference with SGLang: Text and Image Generation,” and DeepLearning.AI lists it as 80 minutes long. It was built with LMSys and RadixArk, and it is taught by Richard Chen, a member of technical staff at RadixArk. (learn.deeplearning.ai) The image-generation part is the twist.
DeepLearning.AI says the same course applies SGLang’s caching and parallelism ideas to diffusion models, which are the systems behind tools that generate images by gradually denoising random noise into a picture. (learn.deeplearning.ai)
SGLang itself started as a serving system for structured language model programs, and its original paper highlighted RadixAttention as one of the runtime optimizations that made those programs faster. The current SGLang documentation now also includes a diffusion section, which shows how far the system has moved from text-only serving into broader generative workloads. (arxiv.org, docs.sglang.io)
For teams paying GPU bills every hour, this is less about a new model and more about not paying twice for the same tokens. If your app has repeated prompts, repeated documents, or repeated conversation prefixes, shared cache reuse can cut latency and memory pressure without changing the underlying model at all. (lmsys.org, mintlify.com)
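To make the shared-prefix idea concrete, here is a minimal Python sketch of the bookkeeping only. It is not SGLang’s implementation: the names `Node`, `ToyPrefixCache`, and `match_and_insert` are hypothetical, the tree is a plain token-level trie (a real radix tree would merge single-child chains into one edge), and no KV tensors, eviction, or scheduling are modeled. It simply counts how many tokens a request could reuse versus how many it would have to compute, using the 10-users-and-one-manual scenario described earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token ID to the child node that extends this prefix.
    children: dict[int, "Node"] = field(default_factory=dict)


class ToyPrefixCache:
    """Hypothetical token-level trie standing in for a radix tree of prefixes."""

    def __init__(self) -> None:
        self.root = Node()

    def match_and_insert(self, tokens: list[int]) -> tuple[int, int]:
        """Return (reused, computed) token counts for one request."""
        node = self.root
        reused = 0
        # Walk down the tree while the prompt matches an already-seen prefix.
        while reused < len(tokens) and tokens[reused] in node.children:
            node = node.children[tokens[reused]]
            reused += 1
        # Everything past the match is new work; record it for later requests.
        for tok in tokens[reused:]:
            child = Node()
            node.children[tok] = child
            node = child
        return reused, len(tokens) - reused


# Assumed workload: 10 users share a 2,000-token manual, then each adds
# 2 tokens that are unique to that user.
manual = list(range(2000))
prompts = [manual + [10_000 + user, 20_000 + user] for user in range(10)]

cache = ToyPrefixCache()
results = [cache.match_and_insert(p) for p in prompts]
total_computed = sum(computed for _, computed in results)
total_without_cache = sum(len(p) for p in prompts)

print(results[0], results[1], total_computed, total_without_cache)
```

In this toy workload the first request computes all 2,002 tokens, and each of the other nine reuses the 2,000-token branch and computes only its 2 unique tokens, so 2,020 tokens are processed instead of 20,020. Real savings depend on how much of a prompt is actually shared and on how long cached entries survive in memory.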