KV cache is the inference bottleneck
Analysis from the field argued that KV cache I/O, not raw compute, is the dominant bottleneck for LLM serving, and that shared-storage solutions like LMCache could reshape the economics of large-scale inference by reducing redundant cache replication. That changes where engineering effort should go: toward fast, scalable KV storage and cache sharding rather than raw GPU operations.
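The I/O-versus-compute argument can be made concrete with a rough size estimate. The sketch below assumes a standard attention layout in which each token stores one key and one value vector per KV head per layer. The model shape, storage bandwidth, and prefill rate are hypothetical placeholders chosen for illustration, not measurements from any source cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical attention shape; not taken from any model cited here."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value


def kv_bytes(m: ModelShape, n_tokens: int) -> int:
    return kv_bytes_per_token(m) * n_tokens


def load_seconds(n_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to stream a cached prefix at a sustained bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


def prefill_seconds(n_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to recompute the prefix, treating prefill rate as constant.

    This is a simplification: real prefill cost grows faster than
    linearly with context length because of attention.
    """
    return n_tokens / prefill_tokens_per_s


shape = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
per_token = kv_bytes_per_token(shape)   # 131072 bytes (128 KiB) per token
prefix = kv_bytes(shape, 100_000)       # 13_107_200_000 bytes for 100k tokens

# Assumed figures for illustration only.
t_load = load_seconds(prefix, 10e9)             # 10 GB/s storage path
t_compute = prefill_seconds(100_000, 20_000)    # 20k tokens/s prefill
```

Under these assumed numbers, loading the cached prefix is faster than recomputing it. The comparison flips as storage bandwidth drops or prefill speed rises, which is why the storage path matters as much as the GPU.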
An arXiv study, "Compute Or Load KV Cache? Why Not Both?", measured a 2.6× average reduction in Time-to-First-Token for its hybrid "Cake" scheduler versus compute-only or I/O-only baselines. (arxiv.org)
LMCache's open-source repo shows active development and ~7.6k stars on GitHub, indicating rapid community adoption. (github.com)
An LMCache arXiv paper (v2, revised Dec 5, 2025) reports up to 15× throughput improvement when paired with vLLM on multi-round QA and document-analysis workloads. (arxiv.org)
Google Cloud published a GKE blog (Nov 7, 2025) that benchmarked tiered KV-cache setups on 8× NVIDIA H100 Mega 80GB machines across 1k–100k token contexts and found that HBM+RAM+local-SSD tiers improved TTFT and throughput. (cloud.google.com)
vLLM's production-stack tutorial documents remote shared KV cache using LMCache to enable cross-instance cache reuse in deployments. (github.com)
An IBM blog post (Feb 6, 2026) analyzed combining llm-d, LMCache, and IBM Storage Scale and concluded that shared KV caches materially change inference economics by increasing reuse rates. (community.ibm.com)
NVIDIA's Dynamo technical blog (Sep 18, 2025) describes KV-cache offloading to CPU RAM and SSDs to reduce GPU memory pressure and raise concurrency in long-context inference. (developer.nvidia.com)
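The tiering idea behind HBM, RAM, and SSD offloading can be sketched as a chain of LRU caches: lookups check the fastest tier first, hits are promoted, and evictions are demoted one tier down. This is a minimal illustration only. The `Tier` and `TieredKVCache` classes, the string keys, and the capacities are all hypothetical, and the code does not reflect the APIs or eviction policies of LMCache, vLLM, Dynamo, or GKE.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Optional


class Tier:
    """One storage level holding at most `capacity` entries in LRU order."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()


class TieredKVCache:
    """Hypothetical multi-tier cache; `tiers` is ordered fastest first."""

    def __init__(self, tiers: list[Tier]) -> None:
        self.tiers = tiers

    def _insert(self, key: str, value: bytes, level: int) -> None:
        if level >= len(self.tiers):
            return  # evicted from the slowest tier: entry is dropped
        tier = self.tiers[level]
        tier.entries[key] = value
        tier.entries.move_to_end(key)  # mark as most recently used
        while len(tier.entries) > tier.capacity:
            # Demote the least recently used entry to the next tier.
            old_key, old_value = tier.entries.popitem(last=False)
            self._insert(old_key, old_value, level + 1)

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.entries.pop(key, None)  # avoid stale copies in slower tiers
        self._insert(key, value, 0)

    def get(self, key: str) -> Optional[tuple[str, bytes]]:
        """Return (tier name where found, value), or None on a miss."""
        for tier in self.tiers:
            if key in tier.entries:
                value = tier.entries.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return tier.name, value
        return None


cache = TieredKVCache([Tier("hbm", 1), Tier("ram", 1), Tier("ssd", 1)])
for key in ("a", "b", "c"):
    cache.put(key, key.encode())
first = cache.get("a")    # found in the slowest tier, then promoted
second = cache.get("a")   # now served from the fastest tier
```

In a real deployment the key would likely be derived from the token prefix and the value would be a block of KV tensors; a miss in every tier means the prefix must be recomputed.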