Google's TurboQuant
Google Research published TurboQuant, a compression method that it says reduces key-value cache memory use by about 6× and cuts attention computation by about 8× while preserving accuracy. (x.com) The announcement circulated on X as a potential route to lower compute and memory costs for serving models. (x.com)
Large language models keep a running memory of earlier words so they can produce the next one quickly. Google Research said on March 24 that TurboQuant shrinks that memory by about six times without reducing benchmark accuracy. (research.google)

That running memory is called the key-value cache, a store of vectors (long lists of numbers) saved for every prior token in every attention layer. Google said the cache has become a bottleneck because those vectors consume large amounts of high-bandwidth memory on accelerators as context windows grow longer. (research.google)

Quantization is the compression step: it stores each floating-point number in fewer bits, like rounding a long decimal into a shorter code. The TurboQuant paper, first posted to arXiv on April 28, 2025, said its method reaches “absolute quality neutrality” for key-value cache quantization at 3.5 bits per channel and only marginal degradation at 2.5 bits per channel. (arxiv.org)

The paper said TurboQuant first rotates vectors into a form that is easier to compress, then applies a scalar quantizer to each coordinate. To correct bias in inner-product estimates (the similarity scores that attention relies on), the authors add a one-bit Quantized Johnson-Lindenstrauss transform on the residual error. (arxiv.org)

Google said the method is scheduled for presentation at the International Conference on Learning Representations in 2026, while its related PolarQuant method is slated for the International Conference on Artificial Intelligence and Statistics in 2026. The company also said TurboQuant is aimed at two applications at once: key-value cache compression for model serving and vector search for retrieval systems. (research.google)

The company’s blog said traditional vector quantization often loses some of its savings because systems must also store extra constants for each block of data.
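As a rough illustration of where that overhead comes from, the sketch below assumes a conventional block-wise min-max quantizer that stores one 16-bit scale and one 16-bit offset per block. The block sizes, the constant widths, and every function name here are illustrative assumptions, not details from Google's post or the paper, and the code is not TurboQuant itself.

```python
import random

SCALE_BITS = 16   # assumed: one 16-bit scale stored per block
OFFSET_BITS = 16  # assumed: one 16-bit offset (block minimum) stored per block


def quantize_block(values, bits):
    """Uniform min-max quantization of one block to `bits`-bit integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant block
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def effective_bits_per_number(bits, block_size):
    """Code bits plus the per-block constants, amortized over the block."""
    return bits + (SCALE_BITS + OFFSET_BITS) / block_size


if __name__ == "__main__":
    rng = random.Random(0)
    block = [rng.gauss(0.0, 1.0) for _ in range(32)]
    codes, scale, lo = quantize_block(block, bits=3)
    restored = dequantize_block(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"worst error {worst:.4f}, half a quantization step {scale / 2:.4f}")

    print(effective_bits_per_number(3, 32))  # 4.0: one extra bit per number
    print(effective_bits_per_number(3, 16))  # 5.0: two extra bits per number
```

Under these assumptions a nominal 3-bit code costs 4 or 5 bits per number once the stored constants are counted, and smaller blocks make the penalty larger.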
Google said that overhead can add one or two extra bits per number, and framed TurboQuant as a way to reduce that penalty. (research.google)

TurboQuant enters a field that already has working cache-compression baselines. A 2024 paper called KIVI said its tuning-free two-bit method let Llama, Falcon, and Mistral models keep nearly the same quality while using 2.6 times less peak memory and raising throughput by about 2.35 to 3.47 times on real inference workloads. (arxiv.org)

Google makes a narrower speed claim than some posts on X suggested. The company said TurboQuant cuts attention computation by about eight times, not total end-to-end model latency, and that the gain comes from making the similarity search over cached vectors cheaper. (research.google)

The authors also cast the work as a theory result, not only an engineering tweak. The arXiv paper said TurboQuant comes within a small constant factor of the information-theoretic lower bound on distortion rate, a formal way of saying the compression is close to the best possible tradeoff between size and error. (arxiv.org)

If those numbers hold up in production systems, the immediate effect is simple: longer prompts or larger batches can fit into the same accelerator memory. That is why a paper about compressing vectors, not training a new model, spread so quickly through the model-serving world this month. (research.google)
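A back-of-the-envelope sketch makes the memory effect concrete. The model shape (32 layers, 8 key-value heads, head dimension 128), the 16-bit baseline, and the 16 GiB cache budget are hypothetical values picked for round numbers, not figures from Google or the paper. Only the roughly sixfold reduction comes from the announcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape, chosen only to make the arithmetic concrete."""
    layers: int = 32
    kv_heads: int = 8
    head_dim: int = 128


def kv_bytes_per_token(shape, bytes_per_value=2):
    """Uncompressed cache bytes for one token: keys plus values in every layer."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value


def tokens_that_fit(budget_bytes, shape, compression=1):
    """Cached tokens that fit in a fixed budget at a given compression ratio."""
    return int(budget_bytes * compression // kv_bytes_per_token(shape))


if __name__ == "__main__":
    shape = ModelShape()
    budget = 16 * 2**30  # assumed: 16 GiB of accelerator memory set aside for the cache

    print(kv_bytes_per_token(shape))                      # 131072 bytes per token
    print(tokens_that_fit(budget, shape))                 # 131072 tokens at 16 bits
    print(tokens_that_fit(budget, shape, compression=6))  # 786432 tokens at about 6x
```

The same budget could instead be spent on more concurrent requests at the original context length, which is the larger-batches case.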