TurboQuant memory claims
Google Research’s TurboQuant is being touted on social feeds as a major LLM memory optimizer: one roundup says it cuts LLM memory use by about 6x. Other posts push the claim further, reporting up to an 8x speed-up and inferring a cost reduction of roughly 50% for model deployments. (x.com)
Google Research published a technical blog post on March 24, 2026 announcing TurboQuant and said the work will be presented at ICLR 2026. (research.google) The original academic manuscript appeared on arXiv on April 28, 2025 and lists Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni among its authors. (arxiv.org)
The paper describes TurboQuant as a data-oblivious, online vector-quantization framework that uses a random rotation step plus two components, PolarQuant and a Quantized Johnson-Lindenstrauss (QJL) transform, to minimize distortion with formal near-optimal guarantees. (arxiv.org) In experiments reported in the manuscript, the authors say the method attains “absolute quality neutrality” when compressing KV caches to about 3.5 bits per channel, with only marginal quality loss at roughly 2.5 bits per channel. (arxiv.org)
Industry writeups and Google’s blog note that the team benchmarked TurboQuant on large-context KV-cache workloads and on modern accelerators such as the Nvidia H100. Several community groups have already started implementing ports and benchmarks in projects such as llama.cpp and in Apple Silicon toolchains. (venturebeat.com) (github.com) Independent community repositories show active implementations (for example, a turboquant_plus repo claiming end-to-end Apple Silicon and llama.cpp integration) and also report tradeoffs, such as per-token dequantization overhead at very long contexts in some decode paths. (github.com)
Several news outlets and blog summaries say code and broader tooling support are expected to follow the research disclosure in the coming quarter, while online discussion threads show both excitement and scrutiny around the low-bitwidth claims and real-world throughput tradeoffs. (renovateqr.com)
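
To put the reported bit widths in context, here is a back-of-envelope sketch of how bits per channel translate into KV-cache size. It is not taken from the paper or the blog post. The model dimensions (32 layers, 8 KV heads, head dimension 128, a 131,072-token context) and the 16-bit baseline are assumptions chosen only for illustration, and the calculation ignores any side information a real implementation might store alongside the quantized values.

```python
# Back-of-envelope KV-cache sizing.
# ASSUMPTIONS: the model dimensions and the 16-bit baseline below are
# hypothetical and are not taken from the TurboQuant paper or blog post.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bits_per_channel):
    """Bytes needed to hold keys and values for a single sequence."""
    # Factor of 2 covers one key vector and one value vector per token.
    channels = 2 * num_layers * num_kv_heads * head_dim * context_len
    return channels * bits_per_channel / 8


def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the cache is relative to the baseline width."""
    return baseline_bits / quantized_bits


if __name__ == "__main__":
    cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, context_len=131072)
    baseline_bits = 16  # assumed FP16/BF16 cache
    for bits in (16, 3.5, 2.5):
        gib = kv_cache_bytes(bits_per_channel=bits, **cfg) / 2**30
        ratio = compression_ratio(baseline_bits, bits)
        print(f"{bits:>4} bits/channel: {gib:5.2f} GiB ({ratio:.2f}x vs 16-bit)")
```

Under these assumptions the cache shrinks from 16 GiB at 16 bits to 3.5 GiB at 3.5 bits per channel (about 4.6x) and 2.5 GiB at 2.5 bits per channel (6.4x). Whether the ~6x figure circulating on social feeds was derived this way is not stated in the sources cited here.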