Indexing becomes LLM bottleneck

Grigory Sapunov highlighted a paradox: sparse attention reduces compute, but metadata search across tokens still scales quadratically, making indexing the new O(L^2) bottleneck for large models. That shifts attention from pure compute to data-structure and retrieval designs as models scale. (x.com)

A large language model can read 100,000 words only if it can decide, for each new word, which old words are worth looking back at. The original Transformer design compares every token with every other token, so the work grows with the square of the sequence length. (arxiv.org) Engineers spent the last few years attacking that square-law cost with faster kernels like FlashAttention. FlashAttention keeps exact attention but cuts memory traffic between high-bandwidth memory and on-chip memory, which is why long prompts got cheaper to run. (arxiv.org) Another fix is sparse attention, which works like skimming a book by jumping to the few pages that matter instead of rereading every page. DeepSeek Sparse Attention does that by using a small scorer, called an indexer, to pick the top-k past tokens for each query token. (github.com) (arxiv.org) That sounds like the quadratic problem is gone, but the catch sits inside the picker. The HISA paper spells it out: the sparse attention step scales well, yet the indexer still scans the full prefix for every query, which leaves an O(L²) bottleneck at each layer. (arxiv.org) That is the paradox Grigory Sapunov pointed to this week: once you make attention cheaper, the metadata search for which tokens to keep becomes the expensive part. In other words, the model stops being bottlenecked by math on selected tokens and starts being bottlenecked by finding those tokens in the first place. (x.com) (arxiv.org) You can picture it like a warehouse that sped up packing boxes but still sends a worker down every aisle to decide which shelf to visit. The packing line got faster, but the aisle search still touches almost everything, so total time barely falls at very long context lengths. (arxiv.org) Researchers are already redesigning that search step instead of the attention step. HISA replaces a flat token-by-token scan with a two-stage hierarchy: first score blocks, then inspect tokens only inside the surviving blocks. (arxiv.org) On its reported benchmarks, HISA delivered about 2× indexer speedup at 32,000 tokens and 4× at 128,000 tokens while keeping more than 99% overlap with the original selected-token sets. That is a data-structure result as much as a model result: the same answers come from searching less of the space. (arxiv.org) A second paper, IndexCache, attacks a different waste: repeated search across layers. Its authors measured 70% to 100% overlap in selected tokens between adjacent layers in a 47-layer DeepSeek Sparse Attention stack, then reused those indices instead of recomputing them every time. (github.com) (arxiv.org) That reuse cut up to 75% of indexer computations, and the project reports 1.82× prefill speedup at 200,000-token context on a 30 billion parameter model. The same README says the indexer had grown to 81% of prefill time at 200,000 tokens, which is exactly the kind of profile you see when the old bottleneck has moved. (github.com) The practical shift is simple: long-context models are starting to look less like pure matrix-multiply machines and more like search systems. If that trend holds, the next gains will come from better indexes, block layouts, cache reuse, and retrieval logic, not just faster graphics processors or denser math kernels. (arxiv.org 1) (arxiv.org 2)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.