TurboQuant shrinks KV caches
Google’s TurboQuant method reportedly compresses key-value (KV) caches by about 10x, moving from 32-bit to 3-bit representations, without measurable accuracy loss. The report puts the resulting KV cache footprint at roughly 3GB, down from about 16GB. That change lets models hold much longer contexts on smaller GPUs, altering the memory-versus-compute tradeoffs teams weigh when sizing inference hardware (KV compression report).
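The arithmetic behind figures like these can be sketched in a few lines. The model shape below is hypothetical (the report does not give one) and was picked so that a 16-bit cache comes to 16 GiB; the function name and parameters are illustrative, and per-block metadata is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes, ignoring any per-block metadata."""
    # Keys and values are each stored once per layer, KV head, and token position.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical model shape (an assumption, not taken from the report).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
GIB = 2 ** 30

for bits in (32, 16, 3):
    size_gib = kv_cache_bytes(bits_per_value=bits, **SHAPE) / GIB
    print(f"{bits:>2}-bit cache: {size_gib:.1f} GiB")
```

Under these assumptions the cache takes 32 GiB at 32 bits, 16 GiB at 16 bits, and 3 GiB at 3 bits. A drop from 16 to 3 is therefore what a 16-bit baseline would give (about 5.3x), while going from 32 bits to 3 bits is about 10.7x.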
Large language models keep a running memory of earlier words, and that memory has become one of inference’s biggest hardware constraints. Google says its new TurboQuant method can shrink that memory sharply without lowering benchmark scores. (research.google)

That running memory is called the key-value cache: a store of vectors for past tokens that lets the model attend to earlier text without recomputing everything from scratch. Google described the cache as a high-speed “digital cheat sheet” and said it has become a bottleneck as context windows get longer. (research.google)

TurboQuant is a vector quantization method, which means it replaces precise floating-point numbers with a much smaller set of low-bit symbols. In an April 28, 2025, arXiv preprint, the authors said the method is designed for online use, including key-value cache quantization during inference. (arxiv.org)

Google published the broader TurboQuant write-up on March 24, 2026, and said the work is slated for the International Conference on Learning Representations 2026. The paper says TurboQuant reached “absolute quality neutrality” for key-value cache tests at 3.5 bits per channel and only marginal degradation at 2.5 bits per channel. (research.google, arxiv.org)

The engineering target is not model weights but runtime memory that grows with every token in a long prompt. That makes the method most relevant for chatbots, coding assistants, and document systems that need to hold long conversations or large files in context. (research.google, arxiv.org)

Google said older quantization methods often need extra full-precision scaling data for each block, adding 1 to 2 bits of overhead per value. Its earlier PolarQuant work attacked that overhead by using random preconditioning and a polar-coordinate transform so the method can skip explicit normalization. (research.google)

TurboQuant then adds a second step called Quantized Johnson-Lindenstrauss, or QJL, to correct bias in inner-product estimates after the first quantization pass. The arXiv paper says that two-stage design gets close to the theoretical lower bound on distortion, within a constant factor of about 2.7. (arxiv.org)

Google tested the method on long-context evaluations rather than a single toy task. The company cited LongBench, a 21-dataset benchmark for long-context understanding, and said the compression preserved model quality on those evaluations. (aclanthology.org, research.google)

The work lands in a crowded area of key-value cache compression research. Earlier papers such as KIVI, published in February 2024, reported 2-bit cache quantization with nearly unchanged quality and up to 2.6 times lower peak memory. SnapKV, published in April 2024, cut cache size by keeping only the prompt features each attention head appears to use most. (arxiv.org, arxiv.org)

Google’s claim is narrower and more specific than “all inference gets faster.” The company says TurboQuant reduces key-value cache memory and speeds up the similarity calculations used in attention, which changes how much long-context serving can fit on a given graphics processor. (research.google, arxiv.org)
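The 1 to 2 bits of per-value overhead that Google attributes to older block-wise methods can be illustrated with a generic min/max quantizer. This is a minimal sketch of that conventional baseline, not TurboQuant or PolarQuant. The block sizes (16 and 32) and the 32 bits of metadata per block (for example, a 16-bit scale plus a 16-bit offset) are assumptions chosen for illustration, and every function name is hypothetical.

```python
import random


def quantize_block(values, bits):
    """Uniform min/max quantization of one block to `bits`-bit integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    # `scale` and `lo` must be stored alongside the codes in higher precision.
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


def effective_bits(bits, block_size, metadata_bits=32):
    """Payload bits plus per-block metadata amortized over the block."""
    return bits + metadata_bits / block_size


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(32)]
codes, scale, lo = quantize_block(block, bits=3)
restored = dequantize_block(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(effective_bits(3, 32))  # 4.0 bits per value: 1 bit of overhead
print(effective_bits(3, 16))  # 5.0 bits per value: 2 bits of overhead
```

With these assumed sizes, a nominal 3-bit format really costs 4 to 5 bits per value once the scale and offset are counted, which is the kind of overhead a normalization-free scheme is meant to avoid.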