LMCache Tool Slashes vLLM Latency by up to 10x
A new open-source tool called LMCache is promising 3-10x latency reductions for vLLM by persisting and sharing KV caches across different instances. The tool, already adopted by Google Cloud and CoreWeave, uses tiered storage from GPU to S3 and is particularly effective for RAG and multi-turn QA workloads.
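To make the GPU-to-S3 tiering concrete, here is a minimal Python sketch of a tiered lookup. It is an illustration under stated assumptions, not LMCache's actual implementation or API: the `Tier` and `TieredKVCache` classes, the per-tier entry capacities, and the LRU eviction policy are all hypothetical. Lookups try the fastest tier first, a hit in a slower tier is promoted back to the fastest one, and entries evicted from a full tier spill down to the next.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class Tier:
    """One storage level (hypothetical) with a fixed entry capacity and LRU eviction."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key: str, value: bytes) -> Optional[Tuple[str, bytes]]:
        """Store an entry; return the evicted (key, value) if the tier overflowed."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # least recently used
        return None


class TieredKVCache:
    """Illustrative hierarchy, fastest tier first (e.g. GPU, CPU, S3)."""

    def __init__(self, tiers: List[Tier]) -> None:
        self.tiers = tiers

    def put(self, key: str, value: bytes) -> None:
        # Insert into the fastest tier; each eviction cascades one level down.
        level = 0
        while level < len(self.tiers):
            evicted = self.tiers[level].put(key, value)
            if evicted is None:
                return
            key, value = evicted
            level += 1
        # Anything evicted from the slowest tier is dropped.

    def get(self, key: str) -> Optional[Tuple[bytes, str]]:
        """Return (value, name of the tier that served it), or None on a miss."""
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:
                    # Promote: move the entry back up to the fastest tier.
                    del tier.entries[key]
                    self.put(key, value)
                return value, tier.name
        return None
```

In this sketch, a cache built as `TieredKVCache([Tier("gpu", 1), Tier("cpu", 1), Tier("s3", 10)])` pushes older entries down toward the `s3` tier as new ones arrive, and a later `get` on an old key pulls it back into the `gpu` tier. A real system would key entries by token-chunk hashes and move tensors rather than byte strings.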
LMCache originated as a research prototype at the University of Chicago before evolving into an open-source project now formally part of the PyTorch Ecosystem. The project was initiated by the team that would later form Tensormesh, a company that now provides official support for the tool.
The tiered storage architecture is key to its performance. LMCache creates a hierarchy that extends from fast GPU HBM for the active cache, to CPU DRAM for a "hot" cache, and finally to local SSDs or remote object storage like S3 for persistent, larger-scale storage. This allows it to manage KV caches that are far larger than what can fit in a single GPU's memory.
A critical feature for RAG systems is "CacheBlend," which moves beyond simple prefix caching. Since RAG workloads often combine dynamically retrieved documents, traditional prefix caching is ineffective. CacheBlend can fuse the individually pre-computed KV caches of multiple, non-contiguous text chunks, dramatically cutting down the prefill computation.
While vLLM has its own prefix caching, its benefits are often lost in distributed deployments where standard load balancers scatter related requests across different nodes, destroying cache locality. LMCache is a core component of the "vLLM production-stack," which adds a prefix-aware routing layer to ensure requests are sent to the specific instance already holding the relevant KV cache.
Performance benchmarks highlight a significant reduction in Time-To-First-Token (TTFT). In one test with long contexts, LMCache achieved a 58.8% reduction in TTFT and a 355.3% increase in input tokens per second. However, there is a trade-off: cache-miss scenarios can incur a performance degradation of 3% to 15%, making it most effective for workloads with recurring patterns.
The tool's ecosystem extends beyond vLLM, with integrations for SGLang and support for various storage backends like Redis. It's also being integrated with other major infrastructure tools; NVIDIA's Dynamo