Apple Silicon gets TurboQuant optimization
Thin Signal implemented TurboQuant KV cache compression on Apple Silicon, using vector rotation to preserve attention quality after quantization, an efficiency technique for on-device inference. Algorithmic memory compression of this kind can raise the usable model size or context length without new silicon, which matters for Fremont’s on-device AI targets. (x.com)
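To make the memory argument concrete, here is a rough back-of-the-envelope sketch of KV cache size at two bit widths. The model shape (32 layers, 8 KV heads, head dimension 128, 32,768 tokens) and the helper name `kv_cache_bytes` are hypothetical and chosen only for illustration. The calculation ignores any per-block metadata and is not meant to reproduce reported figures.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return n_values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)

print(f"fp16 cache:  {fp16 / 2**30:.2f} GiB")
print(f"3-bit cache: {q3 / 2**30:.2f} GiB")
print(f"reduction:   {fp16 / q3:.2f}x")
```

Under these assumptions the cache shrinks by the ratio of the bit widths, so the freed memory can go toward a longer context or a larger model on the same device.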
Google Research released TurboQuant with an accompanying blog post this week, presenting the method as an open research result for KV-cache compression. (research.google) TurboQuant combines a PolarQuant stage (a random orthogonal rotation of the vectors into polar coordinates) with a Quantized Johnson–Lindenstrauss (QJL) residual to reach roughly 3–4 bits per channel. The paper and blog report about a 6x reduction in KV-cache memory and up to 8x attention speedups on H100-class GPUs. (arxiv.org)
Within 24–48 hours, community ports appeared. llama.cpp discussion threads show new cache types (turbo3/turbo4) with Apple Metal kernels validated end-to-end on Apple Silicon, independent repos report running Qwen 3.5 35B-A3B with 3-bit TurboQuant on M5 Max, and a PyPI package for TurboQuant was published on Mar 25, 2026. (github.com) Open implementations use orthogonal rotations (including fast Walsh–Hadamard variants) and precomputed Lloyd–Max codebooks to avoid metadata overhead. Community notes also flag earlier Apple Silicon issues where quantized KV caches interacted poorly with flash-attention kernels, an interoperability risk that the ports are explicitly addressing. (github.com)
Apple’s public on-device strategy (Apple Intelligence and the Foundation Models framework, unveiled at WWDC and extended in later updates to expose on-device foundation models to developers) creates a concrete path for TurboQuant-style KV compression to extend local context lengths without extra DRAM per device. (apple.com) Market and infrastructure signals are already visible: coverage and analysis note rapid community adoption in MLX, llama.cpp, and other stacks, and multiple open-source repos and papers surfaced on Mar 25–26, 2026, indicating an immediate engineering push to integrate TurboQuant into Apple Silicon inference stacks. (venturebeat.com)
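The rotate-then-quantize idea can be illustrated with a minimal pure-Python sketch. This is not the TurboQuant algorithm or any port's code. It assumes a randomized Walsh–Hadamard rotation (random sign flips followed by the transform) and swaps in a plain uniform 3-bit quantizer where the open implementations reportedly use Lloyd–Max codebooks. It also omits the polar-coordinate step and the QJL residual entirely. All function names are hypothetical.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def rotate(vec, signs):
    """Randomized Hadamard rotation: flip signs, then apply the transform."""
    return fwht([s * x for s, x in zip(signs, vec)])

def unrotate(vec, signs):
    """Inverse rotation: the orthonormal transform is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]

def quantize(vec, bits):
    """Uniform scalar quantizer (a stand-in, not a Lloyd-Max codebook).

    Unlike a precomputed codebook, this keeps per-vector metadata (lo, step).
    """
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to approximate real values."""
    return [lo + c * step for c in codes]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
dim = 128
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
query = [rng.gauss(0, 1) for _ in range(dim)]
key = [rng.gauss(0, 1) for _ in range(dim)]

# Rotation alone leaves the attention logit (q . k) unchanged up to rounding.
exact = dot(query, key)
rotated = dot(rotate(query, signs), rotate(key, signs))

# Store the key as 3-bit codes in the rotated basis, then reconstruct it.
codes, lo, step = quantize(rotate(key, signs), bits=3)
key_hat = unrotate(dequantize(codes, lo, step), signs)
approx = dot(query, key_hat)

print(f"exact logit:          {exact:.4f}")
print(f"logit after rotation: {rotated:.4f}")
print(f"logit with 3-bit key: {approx:.4f}")
```

Because the rotation is orthogonal, it preserves inner products, so any error in the attention logit comes only from the quantizer. The fast transform costs O(n log n) per vector rather than the O(n^2) of a dense rotation matrix, which is one plausible reason the ports favor Walsh–Hadamard variants.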