TurboQuant slashes LLM memory

Google’s TurboQuant algorithm reportedly cuts LLM KV-cache memory use about 6x and speeds up attention computation about 8x, which materially lowers the hardware needed to run big models at long context. Expect this to change the tradeoffs for local experiments and cloud cost planning. (x.com)

Google Research published TurboQuant on March 24, 2026, crediting lead authors Amir Zandieh and Vahab Mirrokni and listing the work for presentation at ICLR 2026. (research.google)

TurboQuant combines a PolarQuant stage (a random rotation followed by per-coordinate scalar quantization) with a Quantized Johnson-Lindenstrauss (QJL) correction. The result is online, data-oblivious vector quantization that requires no calibration or retraining. (openreview.net)

The authors report “absolute quality neutrality” at 3.5 bits per channel and only marginal degradation at 2.5 bits per channel in their experiments. An outlier-aware strategy mixes 2-bit and 3-bit allocations across channels to hit these fractional effective bitrates. (arxiv.org)

Evaluation ran on Llama-3.1-8B-Instruct, Gemma, and Mistral across LongBench, Needle-In-A-Haystack, ZeroSCROLLS, RULER and L-Eval. On Needle-In-A-Haystack, retrieval matched full-precision behavior out to 104,000 tokens under the reported settings. (research.google)

Google’s benchmarks measured attention-logit speedups on NVIDIA H100 hardware for quantized keys versus unquantized baselines. Independent community ports and a from-scratch PyTorch implementation appeared within days, including validation on an RTX 3060 with Qwen2.5-3B. (tomshardware.com) (github.com)

Google frames TurboQuant as a targeted fix for KV-cache bloat: the cache grows linearly with context length. A concrete engineering example shows why this matters. The 8K-token KV cache of the 36-layer Qwen2.5-3B is about 289 MB in FP16, so low-bit online quantization bears directly on per-request memory and serving cost. (research.google) (github.com)
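To make the KV-cache arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the paper or from any of the ports. The layer count (36) and context length come from the example above; the 2 KV heads and head dimension of 128 are assumptions about Qwen2.5-3B's attention configuration, and "8K" is read as 8,192 tokens. The function names and the 64/64 channel split are hypothetical and chosen only for illustration. The estimate also ignores any per-vector metadata a real quantizer would store, so the low-bit figure is a lower bound.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


def effective_bits(channels_hi, bits_hi, channels_lo, bits_lo):
    """Average bits per channel when two bit widths are mixed."""
    total_bits = channels_hi * bits_hi + channels_lo * bits_lo
    return total_bits / (channels_hi + channels_lo)


# Assumed model shape: 36 layers (from the text); 2 KV heads and
# head_dim 128 are assumptions, not figures stated in the briefing.
LAYERS, KV_HEADS, HEAD_DIM = 36, 2, 128
TOKENS = 8192  # "8K" interpreted as 8,192 tokens

MIB = 2 ** 20
fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 16)
q35 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 3.5)

print(f"FP16 cache:    {fp16 / MIB:.1f} MiB")  # 288.0 MiB
print(f"3.5-bit cache: {q35 / MIB:.1f} MiB")   # 63.0 MiB

# Hypothetical even split of a 128-channel head between 3-bit and 2-bit channels.
print(effective_bits(64, 3, 64, 2))  # 2.5
```

Under these assumptions the FP16 figure comes out at 288 MiB, in line with the roughly 289 MB cited above. The raw ratio between 16 bits and 3.5 bits is about 4.6x, so any larger headline reduction would have to come from lower bitrates or a different baseline than this simple estimate uses.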
