Google’s TurboQuant revealed
Google’s new TurboQuant quantization system reportedly cuts memory use by roughly 6x and speeds up inference by as much as 8x, and the same report says it is already running in production on models such as Llama 3.1 and Mistral 7B (youtube.com). Media analysis argues that this shifts the economics toward small language models (SLMs) for routine enterprise tasks and claims that SLMs now handle roughly 75% of enterprise AI workloads (youtube.com).
Google Research announced TurboQuant in a dedicated blog post on March 24, 2026, naming Amir Zandieh and Vahab Mirrokni as lead authors; the work is slated for presentation at ICLR 2026, with related methods scheduled for AISTATS 2026 (research.google). The underlying paper, “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate,” appeared on arXiv on April 28, 2025. It reports “absolute quality neutrality” for KV-cache quantization at about 3.5 bits per channel and only marginal quality loss at 2.5 bits per channel (arxiv.org). The method is explicitly two-stage: a PolarQuant scalar quantizer applied after a random rotation, followed by a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform on the residual, which together yield an unbiased inner-product estimator with near-optimal distortion rates (arxiv.org).
Open-source reimplementations appeared within days. They include a PyTorch port that validated TurboQuant on models such as Qwen2.5 on consumer GPUs, and developer blogs and GitHub repositories documented community efforts to port the algorithm to Triton kernels for Gemma models (github.com) (dejan.ai). Infrastructure reporters modelled the downstream economics: VentureBeat estimated material inference-cost reductions, in the neighborhood of 50% in some setups, while Forbes warned that the efficiency gains could paradoxically raise total DRAM demand as more workloads become feasible to run at scale (venturebeat.com) (forbes.com). Google’s post frames TurboQuant as targeted at KV-cache and vector search problems, and community write-ups say production-ready code is expected to follow in the coming quarter; several independent implementations already demonstrate the technique outside Google labs (research.google) (renovateqr.com).
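To make the two-stage idea in the paper summary above concrete, here is a minimal pure-Python sketch of "rotate, scalar-quantize, then keep one sign bit per projection of the residual". It is an illustration under stated assumptions, not Google's implementation: a plain uniform quantizer stands in for the first-stage scalar quantizer, and the function names, the dimension `d = 32`, the 2-bit width, and the clipping range are all hypothetical choices made for the example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def matvec(m, v):
    return [dot(row, v) for row in m]


def unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = norm(v)
    return [vi / n for vi in v]


def random_rotation(d, rng):
    """Random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along earlier rows
            c = dot(v, u)
            v = [vi - c * ui for vi, ui in zip(v, u)]
        n = norm(v)
        rows.append([vi / n for vi in v])
    return rows


def gaussian_matrix(m, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def scalar_quantize(y, bits, clip):
    """Stage 1 (assumed stand-in): uniform quantizer on [-clip, clip]."""
    levels = 2 ** bits
    step = 2.0 * clip / levels
    codes = []
    for v in y:
        idx = int(math.floor((v + clip) / step))
        codes.append(min(levels - 1, max(0, idx)))  # clamp out-of-range values
    return codes


def scalar_dequantize(codes, bits, clip):
    step = 2.0 * clip / (2 ** bits)
    return [-clip + (c + 0.5) * step for c in codes]  # bin midpoints


def encode(x, rot, proj, bits, clip):
    """Rotate, quantize, then keep sign bits and the norm of the residual."""
    y = matvec(rot, x)
    codes = scalar_quantize(y, bits, clip)
    y_hat = scalar_dequantize(codes, bits, clip)
    residual = [a - b for a, b in zip(y, y_hat)]
    # Stage 2: one sign bit per Gaussian projection of the residual.
    signs = [1 if dot(row, residual) >= 0 else -1 for row in proj]
    return codes, signs, norm(residual)


def estimate_inner_product(query, code, rot, proj, bits, clip):
    """Estimate <query, x> from the compressed code of x."""
    codes, signs, res_norm = code
    q_rot = matvec(rot, query)  # rotations preserve inner products
    coarse = dot(q_rot, scalar_dequantize(codes, bits, clip))
    # For Gaussian s: E[sign(<s, r>) * <s, q>] = sqrt(2/pi) * <r, q> / ||r||,
    # so rescaling by sqrt(pi/2) * ||r|| makes the correction term unbiased.
    acc = sum(s * dot(row, q_rot) for s, row in zip(signs, proj))
    correction = math.sqrt(math.pi / 2.0) * res_norm * acc / len(proj)
    return coarse + correction


rng = random.Random(0)
d, bits = 32, 2
clip = 3.0 / math.sqrt(d)  # about 3 standard deviations of a rotated unit vector
key, query = unit_vector(d, rng), unit_vector(d, rng)
rot = random_rotation(d, rng)
proj = gaussian_matrix(d, d, rng)
code = encode(key, rot, proj, bits, clip)
print("exact   :", dot(query, key))
print("estimate:", estimate_inner_product(query, code, rot, proj, bits, clip))
```

A single estimate is noisy, but averaging over many independent projection matrices converges toward the exact inner product, which is the sense in which the residual correction removes the bias that the coarse quantizer alone would leave. In a KV-cache setting the stored state per key would be the integer codes, the sign bits, and one residual norm, while the rotation and projection matrices are shared across all keys.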