TurboQuant cuts AI memory needs
Google DeepMind posted TurboQuant, a technique that reportedly reduces the memory a model needs for its key-value cache by about 6x and attention computation by 8x without accuracy loss. The announcement highlights new efficiency techniques aimed at lowering the resource cost of running large models. (x.com)
Artificial intelligence models store past words in a running memory, and that memory often becomes the bottleneck as prompts get longer. Google researchers said TurboQuant can shrink that memory by about 6 times while preserving model quality. (arxiv.org)

The method comes from a paper posted on arXiv on April 28, 2025 by Amir Zandieh, Majid Daliri, Majid Hadian and Vahab Mirrokni, with affiliations including Google Research and Google DeepMind. The paper describes TurboQuant as an “online vector quantization” system, which means it compresses data on the fly instead of building a large codebook ahead of time. (arxiv.org)

In plain terms, vector quantization turns long lists of numbers into smaller labels, like replacing detailed coordinates with entries from a compact lookup table. TurboQuant’s paper says that for key-value cache quantization in large models, it reached “absolute quality neutrality” at 3.5 bits per channel and only marginal quality loss at 2.5 bits per channel. (arxiv.org)

The key-value cache is the saved working memory a transformer uses so it does not recompute every earlier token from scratch. Reducing that cache lowers memory pressure at inference time, which is the stage when a model is answering a user rather than being trained. (arxiv.org)

Attention is the part of a transformer that compares each new token with earlier tokens, and its cost rises quickly as context grows. The post circulating online said TurboQuant cuts attention computation by 8 times, a claim consistent with the broader push to make long-context systems cheaper to run. (x.com)

Google DeepMind has spent the past two years publishing work on long-context efficiency, including Infini-attention in 2024 and Titans in late 2024. Both lines of research target the same problem: standard attention captures dependencies well, but its compute and memory costs scale poorly on long sequences.
(arxiv.org 1) (arxiv.org 2)

TurboQuant is not framed in the paper as a new chatbot or a new foundation model. It is infrastructure research: a way to store and retrieve model memory more compactly, with experiments that also extend to high-dimensional nearest-neighbor search used in vector databases. (arxiv.org)

Google DeepMind’s public research pages list April 2026 updates on Gemini and other model releases, but TurboQuant itself appears in the research literature rather than in a major product announcement. That leaves the immediate question less about whether the technique exists and more about when it gets folded into deployed systems. (deepmind.google 1) (deepmind.google 2)
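
For readers who want to check the arithmetic behind the bits-per-channel figures, the rough Python sketch below may help. It makes two assumptions that are not in the paper as quoted here: an uncompressed cache that stores 16 bits per channel, and a made-up model shape. The `quantize` and `dequantize` functions show only the generic lookup-table idea of vector quantization using a nearest-codeword search; they are not TurboQuant's algorithm, and every function name here is hypothetical.

```python
import numpy as np


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_channel):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    channels = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return channels * bits_per_channel / 8


def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the cache gets; ignores any side data a real scheme stores."""
    return baseline_bits / quantized_bits


def quantize(vectors, codebook):
    """Replace each vector with the index of its nearest codebook row (generic VQ)."""
    diffs = vectors[:, None, :] - codebook[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1).astype(np.uint8)


def dequantize(codes, codebook):
    """Look the stored indices back up in the table to get approximate vectors."""
    return codebook[codes]


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
    baseline = kv_cache_bytes(**shape, bits_per_channel=16)  # assumed 16-bit baseline
    print(f"16-bit cache: {baseline / 2**30:.2f} GiB")
    for bits in (3.5, 2.5):
        size = kv_cache_bytes(**shape, bits_per_channel=bits)
        print(f"{bits} bits: {size / 2**30:.2f} GiB, "
              f"{compression_ratio(16, bits):.1f}x smaller")

    # Toy lookup table: 16 entries, so each stored index fits in 4 bits.
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(16, 2))
    vectors = rng.normal(size=(5, 2))
    codes = quantize(vectors, codebook)
    print(codes, dequantize(codes, codebook).shape)
```

Under that assumed 16-bit baseline, 2.5 bits per channel works out to 16 / 2.5 = 6.4 times less memory, and 3.5 bits per channel to roughly 4.6 times less. How the reported 6x figure was measured is not stated in the material quoted here.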