LMCache Cuts RAG Latency by 3-10x

Priyanka Vergadia spotlighted LMCache, an open-source project that persists the KV cache across storage tiers, from GPU VRAM down to S3, cutting latency by 3-10x in RAG and multi-turn QA workloads while saving GPU cycles. It integrates with vLLM and SGLang and has been adopted by Google Cloud and CoreWeave.

LMCache, developed by a team at the University of Chicago, is an open-source key-value (KV) caching layer for LLM serving engines such as vLLM and SGLang. It extracts KV caches and shares them across queries and engines, functioning as a Knowledge Delivery Network (KDN) that accelerates LLM applications, potentially running up to 8x faster at 8x lower cost. LMCache is licensed under the Apache License 2.0 and has become a PyTorch Ecosystem project.

LMCache reduces latency and saves GPU cycles by letting an LLM prefill a given text only once: it stores the KV caches of reusable texts and reuses them across serving engine instances. This lowers time to first token (TTFT) and is particularly effective in multi-round question answering and Retrieval-Augmented Generation (RAG) scenarios, where it reduces delay by 3-10x. The use cases that benefit most are repeated user queries, enterprise Q&A systems, long prompts with repeated structures, multi-turn chats, and RAG applications built on repeated templates.

With vLLM, the integration works as follows. When a prompt arrives, LMCache computes identifiers for it and queries for matching KV cache chunks. On a cache hit, LMCache retrieves the KV chunk and returns it to vLLM, which injects the KV tensors into the model's attention cache and so avoids recomputation. Newly generated KV cache chunks are then handed off to LMCache for asynchronous storage on CPU, disk, or other backends. LMCache also supports integration with SGLang for KV cache offloading.

LMCache's architecture supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cache transfer across engines and GPUs). Its key features include optimized KV cache data movement, a modular KV cache connector, and a control API for cache orchestration across different storage layers.
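The lookup flow described above (compute identifiers for a prompt, check for matching KV chunks, reuse hits, store new chunks) can be illustrated with a toy sketch. This is not LMCache's code or API: `ToyKVStore`, `chunk_keys`, `serve`, and the chunk size are hypothetical names chosen for illustration, the "KV payloads" are placeholder strings rather than tensors, and hashing each chunk together with its whole preceding prefix is an assumption about how such identifiers could be built. Storage is synchronous here, whereas the text describes it as asynchronous.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tokens per chunk; hypothetical and tiny, for illustration only


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key hashes the entire prefix up to
    and including that chunk, so a chunk only matches after the same prefix."""
    keys: List[str] = []
    hasher = hashlib.sha256()
    full = len(tokens) - len(tokens) % chunk_size  # ignore the trailing partial chunk
    for start in range(0, full, chunk_size):
        chunk = tokens[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.copy().hexdigest())
    return keys


class ToyKVStore:
    """Stand-in for a KV cache backend (CPU, disk, ...): maps chunk key -> payload."""

    def __init__(self) -> None:
        self._chunks: Dict[str, str] = {}

    def lookup_prefix(self, tokens: List[int]) -> Tuple[int, List[str]]:
        """Return (tokens covered by consecutive cached chunks, their payloads)."""
        hits: List[str] = []
        for key in chunk_keys(tokens):
            payload = self._chunks.get(key)
            if payload is None:
                break  # the first miss ends the reusable prefix
            hits.append(payload)
        return len(hits) * CHUNK_SIZE, hits

    def store(self, tokens: List[int]) -> None:
        """Record a placeholder payload for every full chunk of the prompt."""
        for i, key in enumerate(chunk_keys(tokens)):
            lo, hi = i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE
            self._chunks.setdefault(key, f"kv[{lo}:{hi}]")


def serve(store: ToyKVStore, tokens: List[int]) -> int:
    """Return how many tokens still need prefill after reusing cached chunks."""
    reused, _payloads = store.lookup_prefix(tokens)
    store.store(tokens)  # a real system would do this off the critical path
    return len(tokens) - reused


if __name__ == "__main__":
    store = ToyKVStore()
    document = list(range(12))            # shared retrieved context (12 tokens)
    first = document + [100, 101]         # question 1 over the same context
    second = document + [200, 201, 202]   # question 2 over the same context
    print("tokens to prefill, first request:", serve(store, first))
    print("tokens to prefill, second request:", serve(store, second))
```

In this sketch the second request only needs to prefill its question tokens, because the three chunks covering the shared context are already stored. That is the mechanism behind the TTFT reduction for RAG and multi-turn chats: the longer the shared prefix relative to the new text, the larger the share of prefill that is skipped.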
