TurboQuant: KV‑cache Squeeze
Google/DeepMind’s TurboQuant method reportedly compresses the LLM KV cache by 6× and yields an 8× speedup with no accuracy loss, a notable efficiency gain for long‑context models (x.com). Compression on that scale could materially cut the memory footprint of long‑context generation and make large multimodal workflows cheaper to run in production (x.com).
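To make the memory claim concrete, here is a back-of-the-envelope sketch of how KV-cache size scales with bits per stored value. The model shape is hypothetical and chosen only for illustration, `kv_cache_bytes` is a helper invented for this example, and the calculation ignores any per-block overhead (scales, norms, codebooks) that a real quantizer would add. It compares a 16-bit baseline with the 3.5- and 2.5-bit settings discussed in this item.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed for keys plus values across all layers, for one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bits_per_value / 8

# Hypothetical model shape (an assumption, not taken from the paper).
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)

baseline = kv_cache_bytes(**shape, bits_per_value=16)
for bits in (16, 3.5, 2.5):
    size = kv_cache_bytes(**shape, bits_per_value=bits)
    print(f"{bits:>4} bits: {size / 2**30:6.2f} GiB ({baseline / size:.2f}x smaller)")
```

The ratio depends only on the baseline precision and the target bit width, so the saving in absolute terms grows linearly with context length.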
Google Research published a TurboQuant blog post on March 24, 2026 under the byline of Amir Zandieh and Vahab Mirrokni (research.google). The underlying paper first appeared on arXiv on April 28, 2025 and lists Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni as authors (arxiv.org).
The method is a two‑stage pipeline that pairs PolarQuant (a random orthogonal rotation followed by per‑coordinate quantization) with a Quantized Johnson‑Lindenstrauss (QJL) residual step (research.google). The authors report that the technique attains what they call “quality neutrality” at about 3.5 bits per channel and shows only marginal degradation at roughly 2.5 bits per channel (arxiv.org). Google’s experiments evaluated TurboQuant across standard long‑context suites (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER and L‑Eval) using open models such as Gemma, Mistral and Llama‑3.1‑8B‑Instruct (research.google). The team ran hardware tests on NVIDIA H100 accelerators and notes that the reported performance gains refer specifically to attention‑logit computation rather than full end‑to‑end inference throughput (research.google).
Open‑source implementations and community ports appeared within days, including GitHub projects that provide PyTorch and Triton paths plus experimental ports for MLX and local inference stacks (github.com). Google has slated TurboQuant for formal presentation at ICLR 2026 (April 23–27, 2026), and industry analysts are already parsing the paper for implications across memory demand and inference infrastructure (iclr.cc).
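For readers who want a feel for the two-stage idea (rotate and quantize per coordinate, then sketch the residual with sign bits), here is a toy NumPy version. It is a sketch under stated assumptions, not the authors' implementation: a plain uniform grid stands in for whatever codebook the paper uses, the sizes `d`, `m` and `bits` are arbitrary, and the names `quantize_stage1`, `dequantize_stage1`, `encode` and `estimate_dot` are invented for this example. The `sqrt(pi/2)` factor is the standard scaling that makes a Gaussian sign sketch an unbiased inner-product estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, bits = 64, 256, 3  # head dim, sketch rows, bits per coordinate (illustrative)

# Stage 1: random orthogonal rotation, then per-coordinate uniform quantization.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # Q is an orthogonal matrix

def quantize_stage1(k):
    r = Q @ k
    scale = max(float(np.abs(r).max()), 1e-12)  # guard against an all-zero vector
    step = 2 * scale / 2**bits
    codes = np.clip(np.floor((r + scale) / step), 0, 2**bits - 1).astype(np.int64)
    return codes, scale

def dequantize_stage1(codes, scale):
    step = 2 * scale / 2**bits
    r_hat = (codes + 0.5) * step - scale  # midpoint of each quantization cell
    return Q.T @ r_hat                    # undo the rotation

# Stage 2: keep only the signs of a Gaussian projection of the residual.
S = rng.standard_normal((m, d))  # projection shared by keys and queries

def encode(k):
    codes, scale = quantize_stage1(k)
    residual = k - dequantize_stage1(codes, scale)
    return codes, scale, np.sign(S @ residual), float(np.linalg.norm(residual))

def estimate_dot(q, enc):
    codes, scale, signs, res_norm = enc
    coarse = q @ dequantize_stage1(codes, scale)
    # Sign-sketch estimate of <q, residual>, unbiased for Gaussian S.
    correction = np.sqrt(np.pi / 2) / m * res_norm * ((S @ q) @ signs)
    return coarse + correction

k, q = rng.standard_normal(d), rng.standard_normal(d)
print("true dot:     ", q @ k)
print("estimated dot:", estimate_dot(q, encode(k)))
```

In this toy version the residual step mainly removes the systematic bias that the coarse quantizer alone would leave in the attention logit; the estimate is still noisy for any single key, and the noise shrinks as `m` grows.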