IndexCache speeds long‑context LLMs

A technical deep dive reports that IndexCache reuses top‑k token indices across transformer layers, where the selected indices overlap by 70–100%, to cut prefill time substantially at 200K-token context lengths with minimal code changes and no extra memory. The approach is pitched as a low‑friction optimization for engineers working on LLM infrastructure and inference pipelines.
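
To make the overlap figure concrete, here is a minimal sketch of how top‑k overlap between two layers could be measured. The function names and the toy scores are illustrative assumptions, not code or data from the paper or the repository.

```python
import heapq

def top_k_indices(scores, k):
    """Return the set of token positions holding the k largest scores."""
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

def overlap_ratio(scores_a, scores_b, k):
    """Fraction of the top-k positions that two layers have in common."""
    shared = top_k_indices(scores_a, k) & top_k_indices(scores_b, k)
    return len(shared) / k

# Hypothetical indexer scores for six tokens in two layers (made-up numbers).
layer_a = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
layer_b = [0.8, 0.2, 0.9, 0.6, 0.1, 0.3]

ratio = overlap_ratio(layer_a, layer_b, k=3)
print(f"top-3 overlap: {ratio:.2f}")
```

When this ratio stays high between layers, a later layer can plausibly reuse an earlier layer's selection rather than recompute its own.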

The paper, submitted to arXiv on March 12, 2026, lists Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang and Juanzi Li of Tsinghua University and Z.ai as authors (arxiv.org). IndexCache partitions transformer layers into a small set of Full layers that run their own indexers and a larger set of Shared layers that reuse indices from a nearby Full layer. To select and optimize the retained indexers, it offers both a training-free greedy search and a training-aware multi-layer distillation loss (arxiv.org). On a 30B DSA model, the authors report removing roughly 75% of indexer computations and measuring up to 1.82× prefill and 1.48× decode speedups at very long contexts, with evaluations spanning nine long‑context and reasoning benchmarks (arxiv.org).

The THUDM/IndexCache repository publishes an indexcache.patch that targets SGLang and lists explicit support for DeepSeek‑V3.2 and GLM‑5. According to the repo README, the patch was tested against SGLang commit b638b25b (github.com). Repository benchmarks state that the indexer accounted for 81% of prefill time at 200K context length and show the 30B DSA prefill time falling from a 19.5 s baseline to 10.7 s on NVIDIA H100 hardware, while GLM‑5 (744B) validation runs reported ≈1.2× end‑to‑end speedup (github.com). To reproduce the reported prefill/decode numbers and GLM‑5 validation results, the README specifies cloning SGLang, checking out commit b638b25b, applying indexcache.patch, and running the supplied 30B DSA H100 benchmarks (github.com).
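
The Full/Shared layer split described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the fixed "every fourth layer is Full" pattern stands in for the paper's greedy search, each Shared layer is assumed to reuse the nearest preceding Full layer, and `plan_layers`, `run_with_index_cache` and `toy_indexer` are hypothetical names.

```python
def plan_layers(num_layers, full_every):
    """Label each layer 'F' (runs its own indexer) or 'S' (reuses indices).

    Assumption: a fixed stride, so layer 0 is always a Full layer.
    """
    return ["F" if i % full_every == 0 else "S" for i in range(num_layers)]

def run_with_index_cache(pattern, indexer):
    """Walk the layers, calling the indexer only on Full layers."""
    cached = None      # indices from the most recent Full layer
    per_layer = []     # indices each layer ends up attending with
    calls = 0          # number of indexer invocations actually made
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            cached = indexer(layer_id)
            calls += 1
        per_layer.append(cached)
    return per_layer, calls

def toy_indexer(layer_id):
    """Stand-in for a real top-k indexer; returns fake token positions."""
    return [layer_id, layer_id + 1]

pattern = plan_layers(num_layers=8, full_every=4)
indices, calls = run_with_index_cache(pattern, toy_indexer)
saved = 1 - calls / len(pattern)
print("".join(pattern), f"indexer calls: {calls}, saved: {saved:.0%}")
```

With one Full layer in every four, this toy schedule skips three quarters of the indexer calls, the same proportion as the roughly 75% reduction the authors report for the 30B DSA model.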

