Low‑latency inference gains

A recent open-source update, TurboQuant+, bundles optimizations (Higgs + PolarQuant) aimed at faster model inference in quant setups, a sign that inference latency is getting more attention in research-to-trading stacks. The project positions itself as a modern alternative to older toolchains and is pitched at teams that need quicker signal evaluation and lower execution turnaround. For algo desks, that kind of infrastructure work can matter as much as a marginal signal improvement. (x.com)

Every time a model generates the next token, it has to reread a fast temporary memory called the key-value cache, and that cache can grow so large that memory traffic, not math, becomes the bottleneck. Google Research said on March 24, 2026 that TurboQuant was built to shrink that cache without losing accuracy, so the model spends less time hauling data around. (research.google)

Quantization is the trick behind this: instead of storing every number with full precision, you round it into a much smaller code, like packing a long decimal into a short label. The tradeoff is usually damage to quality, because every rounding step throws away some detail. (research.google)

TurboQuant tries to avoid that damage by rotating the data into a friendlier shape before compressing it. Google’s writeup says the method leans on PolarQuant, which turns those rotated coordinates into something close to a bell-curve distribution that standard quantizers handle better. (research.google)

That rotation step is not cosmetic. The PolarQuant paper says Walsh-Hadamard rotation alone explained 98% of the quality gain in one test on a 9-billion-parameter model, cutting perplexity from 6.90 with a basic five-bit method to 6.40, which was only 0.03 worse than half-precision weights. (arxiv.org)

The new open-source project is TurboQuant+, a GitHub repository that describes itself as an implementation and research workspace for TurboQuant inside llama.cpp, the lightweight inference stack many local-model users rely on. The repository says it is experimental, not a separate long-term fork, and that stable pieces are meant to be upstreamed in smaller patches. (github.com)

Its pitch is speed under memory pressure. The README says it compresses transformer key-value cache data by 3.8x to 6.4x, offers formats called turbo2, turbo3, and turbo4, and reaches near q8_0 prefill speed with about 0.9x decode throughput at long context on Apple Silicon. (github.com)

The interesting engineering claim is that not every part of the cache is equally fragile. TurboQuant+ says multiple researchers validated that compressing the value side of the cache, even to 2 bits, had no measurable effect on attention quality when the key side stayed at higher precision. (github.com) It also says the first 2 and last 2 layers are unusually sensitive, so keeping those boundary layers at higher precision recovered 37% to 91% of the lost quality in its tests. That is the kind of result infra teams like, because it turns a blunt compression knob into a targeted one. (github.com)

This is why low-latency inference work keeps showing up in trading-adjacent research stacks. If a desk can evaluate the same model with less memory movement and shorter turnaround, the gain can come from infrastructure rather than a better forecast, the same way a faster order router can matter even when the signal stays unchanged. (research.google, github.com)

TurboQuant+ is still early-stage code, but the direction is clear: more of the performance fight is moving below the model into cache layout, compression formats, and backend-specific kernels. In practice, that means the next edge may come from shaving milliseconds off inference plumbing, not from squeezing another basis point out of the model itself. (github.com, research.google)
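
For readers who want to see why rotating before rounding can help, here is a minimal Python sketch of the general rotate-then-quantize idea. It is an illustration under stated assumptions, not the TurboQuant, PolarQuant, or TurboQuant+ code: the function names (`fwht`, `quantize_roundtrip`, `compare`, `demo_vector`) are hypothetical, the quantizer is a plain symmetric uniform one standing in for whatever those projects actually use, and the random sign flip before the transform is a common companion to Walsh-Hadamard rotation that is assumed here rather than taken from the sources.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse).

    The length of vec must be a power of two.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_roundtrip(vec, bits):
    """Round to a signed uniform grid with one shared scale, then decode.

    Stand-in quantizer (an assumption), not the scheme used by any project
    named in the article.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2")
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec)
    if peak == 0.0:
        return list(vec)
    step = peak / levels
    return [round(x / step) * step for x in vec]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def compare(vec, bits, seed=0):
    """Return (plain_error, rotated_error) for the same bit budget."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]

    # Path 1: quantize the raw values directly.
    plain = quantize_roundtrip(vec, bits)

    # Path 2: flip signs, rotate, quantize, then undo rotation and signs.
    rotated = fwht([s * x for s, x in zip(signs, vec)])
    decoded = fwht(quantize_roundtrip(rotated, bits))
    restored = [s * x for s, x in zip(signs, decoded)]

    return mse(vec, plain), mse(vec, restored)


def demo_vector(n=64, outlier=40.0, seed=1):
    """Hypothetical test data: small Gaussian values plus one large outlier."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    vec[n // 2] = outlier
    return vec


if __name__ == "__main__":
    plain_err, rotated_err = compare(demo_vector(), bits=4)
    print(f"plain MSE:   {plain_err:.4f}")
    print(f"rotated MSE: {rotated_err:.4f}")
```

The made-up input has one large outlier among small values, which is the case where a shared scale hurts most: the outlier stretches the grid so far that the small values mostly round to zero. The rotation spreads the outlier's energy across all coordinates, so the same number of bits covers a narrower range, and the rotated path should show the lower error on this kind of data. Real key-value caches will behave differently, so treat this as intuition, not a benchmark.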
