TurboQuant slashes KV memory
Google unveiled TurboQuant, a KV‑cache compression method that cuts LLM key‑value memory by ~6x and promises up to an 8x speedup in attention computation on H100s with no accuracy loss, a direct cost/performance lever for serving large contexts. Community ports for vLLM are already public (including one by Mitko Vasilev) and show KV caches compressed enough to fit 4M+ tokens on a GB10 mini‑PC, suggesting immediate practical wins for inference stacks. (marktechpost.com) (x.com)
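To put the ~6x figure in context, here is a back-of-the-envelope sketch of KV-cache size. The model shape (32 layers, 8 KV heads, head dimension 128), the 128,000-token context and the helper name `kv_cache_bytes` are all hypothetical, and the calculation ignores per-block metadata such as norms or scales, so treat the result as a rough upper bound on savings rather than a measured number.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Approximate KV-cache size in bytes, ignoring quantization metadata."""
    # Keys and values each store layers * kv_heads * head_dim numbers per token.
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * bits_per_value / 8


# Hypothetical model shape and context length, chosen only for illustration.
cfg = dict(layers=32, kv_heads=8, head_dim=128)
tokens = 128_000

baseline = kv_cache_bytes(tokens, bits_per_value=16, **cfg)        # 16-bit cache
compressed = kv_cache_bytes(tokens, bits_per_value=16 / 6, **cfg)  # ~6x fewer bits

print(f"16-bit cache:   {baseline / 2**30:.1f} GiB")
print(f"~6x compressed: {compressed / 2**30:.1f} GiB")
```

A 6x reduction from 16 bits corresponds to roughly 2.7 bits per stored value, which sits between the 2.5 and 3.5 bits/channel operating points the paper reports.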
TurboQuant’s authors are Amir Zandieh, Majid Daliri, Majid Hadian and Vahab Mirrokni. The formal writeup is arXiv:2504.19874 (submitted Apr 28, 2025), with a poster scheduled at ICLR 2026 on Apr 25, 2026. (arxiv.org) (iclr.cc)
The method has two stages. The first applies a PolarQuant-style random rotation and then quantizes each coordinate with a scalar quantizer. The second applies a Quantized Johnson–Lindenstrauss (QJL) correction to the residual, which restores the unbiased inner-product estimates that attention depends on. (arxiv.org) (research.google)
The paper's experiments report “absolute quality neutrality” at ~3.5 bits/channel and only marginal degradation at ~2.5 bits/channel. Google’s blog and independent coverage highlight 4‑bit TurboQuant benchmarks showing up to an 8× speedup on Nvidia H100 attention kernels. (arxiv.org) (research.google) (tomshardware.com)
vLLM’s plugin-friendly quantization API supports out‑of‑tree schemes, and several community ports of TurboQuant are already public, including mitkox’s vllm‑turboquant repository and independent implementations that target Hugging Face models. (docs.vllm.ai) (github.com 1) (github.com 2) Community benchmarks include a vLLM integration running Qwen3.5‑27B on 4× RTX 3090, where TurboQuant leaves tokens/sec essentially unchanged while cutting KV VRAM by ~21% (example repo: 0xSero), and a PyTorch demo that replicates the TurboQuant accuracy tests with Qwen2.5‑3B on an RTX 3060. (github.com) (github.com)
A pip package and multiple open‑source repos advertise drop‑in usage for Hugging Face models and an OpenAI‑compatible inference server. Integrators warn, however, that real‑world deployment hinges on GPU kernel support (Triton, FlashInfer or custom CUDA) and on platform-specific patches (H100 vs GB10/Blackwell); community DGX‑Spark/GB10 vLLM images and forum threads are already addressing SM121/Blackwell build issues. (pypi.org) (github.com) (forums.developer.nvidia.com) (github.com)
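The two-stage idea (rotate and scalar-quantize, then correct the residual with a 1-bit QJL sketch) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the uniform quantizer with a 3-sigma clipping range and the dense Gaussian projection are all assumptions made for illustration, and a real kernel would store integer codes rather than the dequantized vector returned here.

```python
import numpy as np


def random_rotation(d, rng):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # flip column signs; the result stays orthogonal


def scalar_quantize(x, bits, lo, hi):
    # Uniform coordinate-wise quantizer (an assumption, standing in for any codebook):
    # snap each value to the nearest of 2**bits evenly spaced levels in [lo, hi].
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    codes = np.clip(np.round((x - lo) / step), 0, levels - 1)
    return lo + codes * step


def encode_key(k, rot, proj, bits):
    # Stage 1: rotate the normalized key, then quantize every coordinate independently.
    norm = np.linalg.norm(k)
    y = rot @ (k / norm)
    bound = 3.0 / np.sqrt(k.size)  # rotated unit-vector coordinates have std ~ 1/sqrt(d)
    k_hat = norm * (rot.T @ scalar_quantize(y, bits, -bound, bound))
    # Stage 2: keep only the signs of a random projection of the residual, plus its norm.
    resid = k - k_hat
    return k_hat, np.sign(proj @ resid), np.linalg.norm(resid)


def estimate_dot(q, k_hat, signs, resid_norm, proj):
    # Stage-1 inner product plus the QJL estimate of <q, residual>.
    # For Gaussian rows s: E[<s, q> * sign(<s, r>)] = sqrt(2/pi) * <q, r> / ||r||,
    # so rescaling by sqrt(pi/2) * ||r|| / m makes the correction unbiased.
    m = proj.shape[0]
    correction = np.sqrt(np.pi / 2) * resid_norm / m * np.dot(proj @ q, signs)
    return np.dot(q, k_hat) + correction
```

Averaged over many independent draws of `proj`, `estimate_dot` converges to the exact `np.dot(q, k)` even though the stage-1 reconstruction `k_hat` alone carries a fixed error for any given query; that is the property the residual step is meant to provide for attention scores.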