TurboQuant cuts AI memory 6x
Google unveiled TurboQuant, claiming it cuts key-value cache memory by about 6x and speeds up attention computation by up to 8x without hurting accuracy, which could lower the cost of serving large models. The announcement on social channels frames TurboQuant as an efficiency layer whose savings could compound into cheaper and better AI model deployments. (Peter Diamandis post about TurboQuant) (NBIS AI deal analysis)
Artificial intelligence systems spend much of their memory on a running notebook of past tokens, and Google Research said TurboQuant can shrink that notebook by about six times. (research.google)
That notebook is the key-value cache, which stores earlier attention states so a model does not recompute them for every new word. Google said the cache becomes a bottleneck as model size and context length grow, especially on long prompts. (research.google)
Google published TurboQuant on March 24, 2026, and the underlying paper was first posted to arXiv on April 28, 2025 by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. The company said the method will be presented at the International Conference on Learning Representations, or ICLR, 2026. (research.google) (arxiv.org)
The basic trick is compression with less bookkeeping. Google said older vector-quantization methods often need extra constants stored in full precision, while TurboQuant is designed to cut that overhead and keep the compressed vectors useful for attention math. (research.google) (arxiv.org)
In the paper, the authors said TurboQuant first randomly rotates vectors and then applies scalar quantizers to each coordinate. They then add a one-bit Quantized Johnson-Lindenstrauss, or QJL, transform on the residual to remove bias in inner-product estimates. (arxiv.org)
Google said 4-bit TurboQuant delivered up to an 8x speedup in computing attention logits against 32-bit unquantized keys on Nvidia H100 graphics processors. The paper said KV-cache tests were quality-neutral at 3.5 bits per channel and showed only marginal degradation at 2.5 bits per channel. (research.google) (arxiv.org)
The claim targets inference, the stage when a trained model serves users, not the earlier stage when engineers train the model. Inference costs have become a larger focus as companies push longer context windows and higher user concurrency on the same hardware. (research.google) (deepmind.google)
Google tied the work to both large language models and vector search engines, which use high-dimensional vectors to compare meaning or similarity quickly. The paper said TurboQuant also outperformed existing product-quantization methods in nearest-neighbor search recall while reducing indexing time to nearly zero. (research.google) (arxiv.org)
The immediate test is whether model builders adopt TurboQuant in production systems and report end-to-end savings beyond the attention kernel itself. For now, Google’s public numbers are specific: about 6x less KV-cache memory, up to 8x faster attention-logit computation, and no measured quality loss at 3.5 bits per channel in its reported benchmarks. (research.google) (arxiv.org)
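For readers who want to see the rotate, quantize, and correct sequence in concrete terms, the following is a minimal Python sketch of the idea as the paper describes it, not Google's implementation. Several details are assumptions made for illustration: the uniform quantizer (standing in for whatever codebook the paper actually uses), the clipping range `clip`, the choice of 64 dimensions and 3 bits, the decision to keep one residual norm per vector, and every function name.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(m, x):
    return [dot(row, x) for row in m]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def random_rotation(d, rng):
    """Orthogonal d x d matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:  # strip components along rows already accepted
            p = dot(v, q)
            v = [a - p * b for a, b in zip(v, q)]
        n = math.sqrt(dot(v, v))
        if n > 1e-9:
            rows.append([a / n for a in v])
    return rows

def quantize(x, bits, clip):
    """Assumed uniform scalar quantizer on [-clip, clip], one code per coordinate."""
    levels = 1 << bits
    step = 2.0 * clip / levels
    return [min(levels - 1, max(0, math.floor((v + clip) / step))) for v in x]

def dequantize(codes, bits, clip):
    step = 2.0 * clip / (1 << bits)
    return [-clip + (c + 0.5) * step for c in codes]

def encode(x, rot, proj, bits, clip):
    """Rotate, quantize each coordinate, then keep one sign bit per projection of the residual."""
    xr = matvec(rot, x)
    codes = quantize(xr, bits, clip)
    resid = [a - b for a, b in zip(xr, dequantize(codes, bits, clip))]
    signs = [1 if s >= 0 else -1 for s in matvec(proj, resid)]
    return codes, signs, math.sqrt(dot(resid, resid))

def estimate_inner_product(query, encoded, rot, proj, bits, clip):
    """Coarse term from the codes plus a sign-based correction for the residual."""
    codes, signs, resid_norm = encoded
    qr = matvec(rot, query)
    coarse = dot(qr, dequantize(codes, bits, clip))
    # For Gaussian s: E[sign(s.r) * (s.q)] = sqrt(2/pi) * (r.q) / |r|, so rescale by sqrt(pi/2) * |r|.
    scale = math.sqrt(math.pi / 2.0) * resid_norm / len(proj)
    return coarse + scale * dot(matvec(proj, qr), signs)

rng = random.Random(0)
d, bits = 64, 3                      # illustrative sizes, not taken from the paper
clip = 3.0 / math.sqrt(d)            # assumed range: about 3 standard deviations of a rotated unit vector
rot = random_rotation(d, rng)
proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

key = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
query = unit([rng.gauss(0.0, 1.0) for _ in range(d)])

encoded = encode(key, rot, proj, bits, clip)
exact = dot(query, key)
approx = estimate_inner_product(query, encoded, rot, proj, bits, clip)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The rotation spreads a vector's energy evenly across coordinates, which is why a single fixed range can serve every coordinate without per-block scale constants. The sign bits carry just enough information about the residual to correct the systematic error that the coarse codes alone would leave in the inner product.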