TurboQuant: 6× Memory, 8× Speed
What happened
Google launched TurboQuant, a quantization method that it claims cuts LLM memory use by up to 6× and delivers speedups of up to 8×. If the claims hold up, the method could substantially lower research compute costs and shorten model experimentation cycles. It is being framed as a major efficiency lever for labs and hardware-constrained projects. (x.com)
Why it matters
The arXiv preprint lists Amir Zandieh, Majid Daliri, Majid Hadian and Vahab Mirrokni as authors and was first posted April 28, 2025. (arxiv.org) Google Research republished the work on its blog on March 24, 2026 and flagged the paper for formal presentation at ICLR 2026. (research.google)
TurboQuant implements a two‑stage, data‑oblivious pipeline: a random rotation followed by coordinate‑wise scalar quantizers (PolarQuant), then a 1‑bit Quantized Johnson‑Lindenstrauss (QJL) correction that removes inner‑product bias. (arxiv.org) The paper reports that KV‑cache quantization is effectively lossless at about 3.5 bits per channel and shows only marginal quality degradation down to about 2.5 bits per channel. (arxiv.org) The authors evaluated TurboQuant on long‑context benchmarks such as LongBench and Needle‑in‑a‑Haystack; early reports name Gemma and Mistral among the models tested. (kucoin.com)
Independent developers published multiple open‑source implementations and a pip package within days of the blog post, even though Google has not yet posted an official code release. (github.com) (pypi.org) (starkinsider.com) The announcement moved the share prices of memory suppliers immediately, but analysts cautioned that a research‑stage algorithm without production deployment does not automatically change long‑term demand forecasts. (scmp.com)
ICLR 2026 runs April 23–27 in Rio de Janeiro, where the community will have an opportunity to scrutinize poster materials and accompanying proofs for TurboQuant and its companion methods. (iclr.cc)
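
As a rough illustration of the two-stage pipeline described above, the following NumPy sketch rotates a vector, quantizes each coordinate, and then stores one sign bit per coordinate of a random projection of the leftover error. It is not the authors' code. Several choices are assumptions made for the example: all function names are hypothetical, a plain uniform grid stands in for the paper's scalar quantizers, dense Gaussian matrices are used for both the rotation and the projection, and the 1-bit correction is assumed to act on the stage-one residual with one bit of the per-coordinate budget reserved for it. Input vectors are assumed to be nonzero.

```python
import numpy as np


def random_rotation(d, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the rotation uniformly distributed


def stage1_quantize(x, rot, bits, clip=3.0):
    """Rotate x, then snap each coordinate to a uniform grid of 2**bits levels."""
    norm = np.linalg.norm(x)
    y = rot @ (x / norm)                      # coordinates are roughly N(0, 1/d)
    half_range = clip / np.sqrt(x.shape[0])   # clip at `clip` standard deviations
    step = 2 * half_range / 2 ** bits
    idx = np.clip(np.floor((y + half_range) / step), 0, 2 ** bits - 1)
    return idx.astype(np.int64), norm


def stage1_dequantize(idx, norm, rot, bits, clip=3.0):
    """Map grid indices back to cell centres, then undo rotation and scaling."""
    half_range = clip / np.sqrt(idx.shape[0])
    step = 2 * half_range / 2 ** bits
    y_hat = -half_range + (idx + 0.5) * step
    return norm * (rot.T @ y_hat)


def qjl_encode(residual, proj):
    """Keep only the sign of each random projection, plus the residual norm."""
    return np.sign(proj @ residual), np.linalg.norm(residual)


def qjl_decode(signs, res_norm, proj):
    """Unbiased residual estimate, using E[sign(s.r) s] = sqrt(2/pi) r / |r|."""
    m = proj.shape[0]
    return np.sqrt(np.pi / 2) * res_norm / m * (proj.T @ signs)


def compress(x, rot, proj, bits):
    idx, norm = stage1_quantize(x, rot, bits - 1)   # bits - 1 for stage one
    x1 = stage1_dequantize(idx, norm, rot, bits - 1)
    signs, res_norm = qjl_encode(x - x1, proj)      # one sign per projection row
    return idx, norm, signs, res_norm


def decompress(code, rot, proj, bits):
    idx, norm, signs, res_norm = code
    x1 = stage1_dequantize(idx, norm, rot, bits - 1)
    return x1 + qjl_decode(signs, res_norm, proj)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, bits = 64, 3
    rot = random_rotation(d, rng)
    x, q = rng.standard_normal(d), rng.standard_normal(d)
    estimates = []
    for _ in range(500):
        proj = rng.standard_normal((d, d))  # fresh Gaussian projection per trial
        code = compress(x, rot, proj, bits)
        estimates.append(q @ decompress(code, rot, proj, bits))
    print("true inner product:", q @ x)
    print("mean estimate:     ", np.mean(estimates))
```

Averaged over many random projections, the estimated inner product should settle near the true value, which is the sense in which the second stage removes bias. Any single estimate is still noisy, and the sketch makes no attempt to reproduce the accuracy or speed figures reported for TurboQuant.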
Key numbers
- Google claims up to 6× lower LLM memory use and up to 8× speedups for TurboQuant. (x.com)
- The arXiv preprint lists Amir Zandieh, Majid Daliri, Majid Hadian and Vahab Mirrokni as authors and was first posted April 28, 2025. (arxiv.org)
- Google Research republished the work on its blog on March 24, 2026 and flagged the paper for formal presentation at ICLR 2026. (research.google)
- TurboQuant implements a two‑stage, data‑oblivious pipeline: a random rotation followed by coordinate‑wise scalar quantizers (PolarQuant), then a 1‑bit Quantized Johnson‑Lindenstrauss correction that removes inner‑product bias. (arxiv.org)
- The paper reports effectively lossless KV‑cache quantization at about 3.5 bits per channel and only marginal quality degradation at about 2.5 bits per channel. (arxiv.org)
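
To put the bit widths reported in the paper in context, the short calculation below estimates KV-cache size at different bits per channel. The model shape (32 layers, 8 KV heads, head dimension 128, 128,000 tokens) is a hypothetical configuration chosen only for illustration, the 16-bit baseline is an assumption, and small per-vector overheads such as stored norms are ignored. The function name `kv_cache_bytes` is likewise made up for this example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_channel):
    """Approximate KV-cache size in bytes; the factor 2 covers keys and values."""
    channels = 2 * layers * kv_heads * head_dim * tokens
    return channels * bits_per_channel / 8


# Hypothetical model shape, not taken from the paper or the blog post.
cfg = dict(layers=32, kv_heads=8, head_dim=128, tokens=128_000)

baseline = kv_cache_bytes(**cfg, bits_per_channel=16)  # assumed 16-bit baseline
for bits in (3.5, 2.5):
    quantized = kv_cache_bytes(**cfg, bits_per_channel=bits)
    print(
        f"{bits} bits/channel: {quantized / 2**30:.2f} GiB "
        f"vs {baseline / 2**30:.2f} GiB, ratio {baseline / quantized:.2f}x"
    )
```

Against a 16-bit baseline the ratio is simply 16 divided by the bit width, so roughly 4.6× at 3.5 bits and 6.4× at 2.5 bits. The sources quoted here do not spell out how the headline 6× figure was measured, so this is a sanity check on the order of magnitude rather than a reconstruction of it.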
What happens next
- ICLR 2026 runs April 23–27 in Rio de Janeiro, where the community will have an opportunity to scrutinize poster materials and accompanying proofs for TurboQuant and its companion methods. (iclr.cc)