TurboQuant: 6× Memory, 8× Speed
Google launched TurboQuant, claiming up to 6× lower LLM memory use and 8× speedups, gains that could sharply cut research compute costs and shorten model experimentation cycles. The improvement is being framed as a major efficiency lever for labs and hardware-constrained projects. (x.com)
The arXiv preprint lists Amir Zandieh, Majid Daliri, Majid Hadian and Vahab Mirrokni as authors and was first posted April 28, 2025. (arxiv.org) Google Research republished the work on its blog on March 24, 2026, and flagged the paper for formal presentation at ICLR 2026. (research.google)
TurboQuant implements a two‑stage, data‑oblivious pipeline: a random rotation plus coordinate‑wise scalar quantizers (PolarQuant), followed by a 1‑bit Quantized Johnson‑Lindenstrauss (QJL) correction that removes inner‑product bias. (arxiv.org) The paper reports that KV‑cache quantization is effectively lossless at about 3.5 bits per channel and shows only marginal quality degradation down to about 2.5 bits per channel. (arxiv.org) The authors validated TurboQuant on long‑context benchmarks such as LongBench and Needle‑in‑a‑Haystack, using contemporary models named in early reports, including Gemma and Mistral. (kucoin.com)
Independent developers published multiple open‑source implementations and a pip package within days of the blog post, even though Google has not yet posted an official code release. (github.com) (pypi.org) (starkinsider.com) The announcement provoked immediate market moves in memory suppliers, and analysts cautioned that a research‑stage algorithm without production deployment does not automatically change long‑term demand forecasts. (scmp.com) ICLR 2026 runs April 23–27 in Rio de Janeiro, where the community will have an opportunity to scrutinize poster materials and accompanying proofs for TurboQuant and its companion methods. (iclr.cc)
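For readers who want a feel for the two‑stage pipeline described above, the following is a minimal, illustrative sketch in plain Python, not Google's implementation. It makes several assumptions that do not come from the paper or the blog post: input vectors are unit‑norm, the scalar quantizer is a simple uniform grid clipped at three standard deviations (the paper's quantizers are presumably tuned more carefully), and all function names (`random_rotation`, `scalar_quantize`, `encode`, `estimate_inner_product`) are hypothetical. A real implementation would also store quantizer indices and packed sign bits rather than floats.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def random_rotation(d, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            c = dot(v, u)
            v = [a - c * b for a, b in zip(v, u)]
        rows.append(normalize(v))
    return rows

def scalar_quantize(y, bits, clip):
    """Uniform per-coordinate quantizer on [-clip, clip] (an assumed, simplified
    stand-in for the paper's scalar quantizers). Returns reconstruction values."""
    levels = 2 ** bits
    step = 2.0 * clip / levels
    out = []
    for x in y:
        idx = min(levels - 1, max(0, int(math.floor((x + clip) / step))))
        out.append(-clip + (idx + 0.5) * step)
    return out

def encode(x, R, S, bits):
    """Stage 1: rotate and scalar-quantize. Stage 2: keep one sign bit per
    random projection of the residual, plus the residual norm."""
    d = len(x)
    y = matvec(R, x)
    # After rotation, coordinates of a unit vector have std of about 1/sqrt(d),
    # so clip at three standard deviations (assumes x is unit-norm).
    y_hat = scalar_quantize(y, bits, 3.0 / math.sqrt(d))
    r = [a - b for a, b in zip(y, y_hat)]
    signs = [1.0 if t >= 0 else -1.0 for t in matvec(S, r)]
    return y_hat, signs, math.sqrt(dot(r, r))

def estimate_inner_product(code, q, R, S):
    """Estimate <x, q> from the compressed code of x and an unquantized query q."""
    y_hat, signs, r_norm = code
    q_rot = matvec(R, q)
    coarse = dot(y_hat, q_rot)
    # For a Gaussian row s: E[(s.q) * sign(s.r)] = sqrt(2/pi) * <q, r> / ||r||,
    # so rescaling by sqrt(pi/2) * ||r|| gives an unbiased estimate of <q, r>.
    scale = math.sqrt(math.pi / 2.0) * r_norm / len(S)
    return coarse + scale * dot(matvec(S, q_rot), signs)

rng = random.Random(0)
d = 64
R = random_rotation(d, rng)
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
x = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
q = normalize([a + 0.1 * rng.gauss(0.0, 1.0) for a in x])
code = encode(x, R, S, bits=3)
print("true:", dot(x, q), "estimate:", estimate_inner_product(code, q, R, S))
```

In this sketch each coordinate costs 3 bits for the scalar quantizer plus 1 sign bit for the correction. The coarse term alone would systematically miss whatever the quantizer dropped; adding the sign‑based term restores an unbiased estimate of the inner product at the cost of some extra variance.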