TurboQuant shrinks KV cache

TurboQuant reportedly delivers 6× compression of the LLM KV cache and up to 8× faster inference with no accuracy loss, a potential memory and latency win for deployment. (x.com) Multiple posts amplified the claim on March 27–28, making this a fast-moving systems story for on-prem and cloud LLM operators. (x.com)

Google Research published a technical blog post introducing TurboQuant on March 25, 2026, and said the work will be presented at ICLR 2026. (research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/) The underlying arXiv paper (submitted April 28, 2025) formalizes TurboQuant as a two-stage scheme: a random-rotation/PolarQuant preconditioning step, followed by a Quantized Johnson–Lindenstrauss (QJL) stage applied to the residual. The paper says the scheme reaches near-optimal distortion bounds and reports “absolute quality neutrality” at about 3.5 bits per channel. (arxiv.org/abs/2504.19874)

Google’s reported benchmarks and independent coverage note that a 4-bit variant, tested on Nvidia H100 GPUs, produced the largest attention-compute speedups in those runs. (tomshardware.com/tech-industry/artificial-intelligence/googles-turboquant-compresses-llm-kv-caches-to-3-bits-with-no-accuracy-loss) The Google post and supporting materials list experiments on multiple models and benchmarks, including Llama-family forks, Mistral, and Meta/Gemini-class checkpoints. Independent community tests applied the method to Qwen2.5-3B on an RTX 3060. (research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/, github.com/tonbistudio/turboquant-pytorch)

Open-source ports and “drop-in” tools arrived within days: public repositories such as hackimov/turboquant-kv, back2matching/turboquant, and tonbistudio/turboquant-pytorch provide PyTorch/HuggingFace integrations and KV-cache wrappers for local inference workflows. (github.com/hackimov/turboquant-kv, github.com/back2matching/turboquant, github.com/tonbistudio/turboquant-pytorch)

Financial coverage documented immediate market moves: shares of memory and storage firms including Micron, Western Digital and SanDisk fell in the sessions following Google’s announcement, according to Yahoo Finance reporting on sector reactions. (finance.yahoo.com/sectors/technology/articles/mu-wdc-sndk-fall-why-141945272.html)
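
As a rough illustration of the two-stage idea attributed to the arXiv paper above, the NumPy sketch below rotates a key vector with a random orthogonal matrix, quantizes it coarsely, and keeps a 1-bit sign sketch of the residual to correct inner-product estimates. It is not the authors' implementation: a plain uniform scalar quantizer stands in for the paper's quantizer, and every function name (`random_rotation`, `quantize_uniform`, `encode`, `estimate_dot`) and parameter here is hypothetical.

```python
import numpy as np


def random_rotation(d, rng):
    """Random orthogonal d x d matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # flip column signs so the factorization is unique


def quantize_uniform(x, bits):
    """Assumed stand-in quantizer: uniform levels over [-max|x|, max|x|]."""
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    step = 2 * scale / levels
    idx = np.clip(np.floor((x + scale) / step), 0, levels - 1)
    return -scale + (idx + 0.5) * step  # reconstruct at the bin centers


def encode(key, rot, sketch, bits):
    """Stage 1: rotate and quantize. Stage 2: 1-bit sign sketch of the residual."""
    rotated = rot @ key
    coarse = quantize_uniform(rotated, bits)
    resid = rotated - coarse
    return coarse, np.sign(sketch @ resid), np.linalg.norm(resid)


def estimate_dot(query, code, rot, sketch):
    """Estimate <query, key> from the compressed code."""
    coarse, signs, resid_norm = code
    q_rot = rot @ query  # rotations preserve inner products
    m = sketch.shape[0]
    # Sign-sketch estimate of <q_rot, resid>, scaled by sqrt(pi/2) * ||resid|| / m.
    correction = np.sqrt(np.pi / 2) * resid_norm / m * ((sketch @ q_rot) @ signs)
    return q_rot @ coarse + correction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m, bits = 64, 2048, 2  # illustrative sizes, not taken from the paper
    rot = random_rotation(d, rng)
    sketch = rng.standard_normal((m, d))
    key, query = rng.standard_normal(d), rng.standard_normal(d)
    code = encode(key, rot, sketch, bits)
    print("true dot product:   ", query @ key)
    print("coarse-only estimate:", (rot @ query) @ code[0])
    print("with residual sketch:", estimate_dot(query, code, rot, sketch))
```

The residual correction relies on a standard property of Gaussian projections: for a Gaussian row `s`, the expectation of `(s·q) * sign(s·r)` equals `sqrt(2/pi) * <q, r> / ||r||`, so rescaling by `sqrt(pi/2) * ||r||` gives an unbiased estimate of the residual's contribution. Only the sign bits and one norm need to be stored per key for that stage.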
