TurboQuant+ KV cache cuts LLM memory by 4.6x
A developer posted that TurboQuant+ KV cache compression achieves about 4.6x compression at q8_0 speeds for local LLMs on Apple Silicon/Metal, promising major memory savings for on‑device models. That kind of compression could materially reduce unified‑memory pressure for edge inference. (x.com)
TheTom’s turboquant_plus repository ports Google’s TurboQuant to llama.cpp with Metal kernels and implements PolarQuant plus a Walsh‑Hadamard rotation for KV‑cache compression, with builds tested on Apple Silicon M5 Max. (github.com) Repo benchmarks show prefill throughput measured at 2,747 tokens/sec on an M5 Max Metal build and 2,694 tokens/sec in another Metal run, alongside reported perplexity shifts of −1.17% on CUDA and +1.1% on Metal. (sourcepulse.org) Independent community tests converted a 4.2 GB KV cache down to 897 MB on Apple Silicon when running TurboQuant variants in MLX/llama.cpp experiments. (aiproductivity.ai) Google Research published TurboQuant as a training‑free KV‑cache compression method on March 24, 2026, describing PolarQuant’s random rotations and sub‑byte quantization as core techniques. (research.google) Multiple open‑source ports appeared within days, including back2matching’s turboquant repo and hackimov’s PyTorch implementation, and press reporting noted community ports surfacing within 24 hours of Google’s release. (github.com) The turboquant_plus README documents an end‑to‑end run of Qwen 3.5 35B‑A3B MoE with a 3‑bit KV cache on an M5 Max via llama.cpp+Metal, and the author lists follow‑on plans such as adaptive bit allocation and expert‑aware MoE compression. (github.com)