Google’s TurboQuant squeezes model memory

Google unveiled TurboQuant, a memory‑compression technique that the company says shrinks the key‑value (KV) cache used during inference by at least 6×, enough to rattle memory suppliers and prompt a selloff in Samsung and SK Hynix. Software‑driven compression of this kind changes the economics of model deployment and lowers the bar for running larger models on commodity hardware. (nikkei.com) (meyka.com)

Google Research published the TurboQuant blog post on March 24, 2026. Authored by Amir Zandieh and Vahab Mirrokni, it announces a suite of quantization methods (PolarQuant, QJL) that reduce vector and key‑value (KV) cache overhead. (research.google) The team says TurboQuant compresses KV‑cache representations down to a few bits per value and reports at least a 6× reduction in KV cache size while maintaining downstream accuracy. (research.google) The formal paper is available on arXiv (TurboQuant: Online Vector Quantization with Near‑optimal Distortion Rate, submitted Apr 28, 2025), and the method is scheduled for presentation at ICLR 2026. (arxiv.org)

Benchmarks published alongside the release claim up to an 8× speedup on attention‑logit computation on NVIDIA H100 GPUs at 4‑bit precision compared with a 32‑bit JAX baseline. Multiple technical outlets cited the figure while noting that the measurement applies only to the attention inner‑product step, not to end‑to‑end inference. (aihola.com)

Markets reacted within hours: on March 26, 2026, after the TurboQuant disclosure, South Korea’s KOSPI fell nearly 3%, and shares of Samsung Electronics and SK Hynix dropped roughly 4.7% and 6.2% respectively. (cnbc.com) Sell‑side and local analysts characterized the declines as short‑term profit‑taking, arguing that TurboQuant targets inference KV caches and may not reduce the long‑run memory demand driven by larger models. (cnbc.com)

Community activity followed quickly. Ports and experiments appeared in popular local inference toolchains within 24 hours of the announcement, and coverage noted that TurboQuant’s primary impact is on inference KV cache memory rather than HBM‑intensive training workloads. (venturebeat.com)
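To put the reported 6× KV‑cache reduction in concrete terms, here is a rough back‑of‑the‑envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, a 131,072‑token context) and the 16‑bit baseline are illustrative assumptions, not figures from Google's post, and the code only does sizing arithmetic. It does not implement TurboQuant, PolarQuant, or QJL.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """Approximate KV-cache size in bytes for a transformer decoder.

    Counts one key vector and one value vector per token, per KV head,
    per layer. Ignores any per-block metadata a real quantizer may store.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys + values
    return values * bits_per_value / 8


GIB = 2 ** 30

# Hypothetical model shape and context length, chosen only for illustration.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          seq_len=131_072, bits_per_value=16)

# Apply the "at least 6x" reduction reported in the announcement.
reported_reduction = 6
compressed = baseline / reported_reduction

# Effective bits per cached value IF the 6x is measured against 16-bit storage
# (an assumption; the article does not state the baseline precision).
effective_bits = 16 / reported_reduction

print(f"baseline:   {baseline / GIB:.2f} GiB")
print(f"compressed: {compressed / GIB:.2f} GiB")
print(f"effective bits per value: {effective_bits:.2f}")
```

Under these assumptions a single long‑context request drops from about 16 GiB of cache to under 3 GiB, which is the kind of gap that separates a data‑center accelerator from a consumer GPU. The training‑side memory that analysts pointed to is untouched by this arithmetic.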

