vLLM picks up 'HIGGS' quant trick and caching tips
vLLM merged TurboQuant (called 'HIGGS' in some community posts) for low-bit KV-cache quantization into its main branch, promising lower memory use and higher throughput for cached attention states. (x.com) Red Hat outlined prefix-caching techniques that can cut time-to-first-token by 30–80% and noted FP8 on H100 can deliver up to a 2x latency reduction, highlighting practical operator wins without adding GPUs. (x.com)
Large language models keep a running memory of each prompt, called the key-value cache, and that memory often becomes the bottleneck before the model weights do. This week, vLLM’s main branch added new low-bit cache formats for TurboQuant, also called HIGGS in community posts, alongside its existing floating-point 8-bit cache options. (github.com, docs.vllm.ai)
That cache stores the attention state from earlier tokens so the model does not recompute the whole prompt on every next token. vLLM’s docs say quantizing that cache cuts memory use, lets more tokens stay resident, and can raise throughput or support longer context windows. (docs.vllm.ai)
The new cache dtypes now listed in vLLM’s main-branch config include `turboquant_k8v4`, `turboquant_4bit_nc`, `turboquant_k3v4_nc`, and `turboquant_3bit_nc`. The same file shows the change landed in a commit dated two days ago on the repository’s main branch. (github.com)
TurboQuant comes from an April 28, 2025 arXiv paper on online vector quantization, a way to squeeze high-dimensional data into fewer bits while trying to preserve the geometry the model uses for attention. The paper reports “absolute quality neutrality” for key-value cache quantization at 3.5 bits per channel and only marginal degradation at 2.5 bits per channel. (arxiv.org)
That is the immediate appeal for operators running long prompts, retrieval-heavy chatbots, or many concurrent sessions on fixed GPU memory. If the cache shrinks, a server can keep more active requests in memory instead of spilling, recomputing, or capping context length. (docs.vllm.ai, arxiv.org)
Red Hat has been pushing the same point from the operations side: most user-visible delay starts before generation gets going. In a March 9, 2026 tuning guide, Red Hat said time to first token includes both queueing delay and the prefill phase, where the engine processes the input prompt. (developers.redhat.com) That makes prefix caching the other half of the story.
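Before turning to prefix caching, the memory argument can be made concrete with back-of-envelope arithmetic. The Python sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128) and a hypothetical 16 GiB cache budget; neither figure comes from vLLM or the TurboQuant paper. It also ignores the scale factors and metadata that real quantized formats store alongside the values, so treat the results as upper bounds.

```python
# Rough KV-cache sizing under assumed (hypothetical) model dimensions.
NUM_LAYERS = 32     # assumption, not taken from the article
NUM_KV_HEADS = 8    # assumption
HEAD_DIM = 128      # assumption


def kv_bytes_per_token(bits_per_value: float) -> float:
    """Bytes of cache one token occupies across all layers.

    The factor 2 accounts for storing both a key and a value vector.
    Quantization scales and other metadata are ignored.
    """
    values_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
    return values_per_token * bits_per_value / 8


def tokens_that_fit(budget_bytes: int, bits_per_value: float) -> int:
    """Whole tokens that fit in a fixed cache budget at a given bit width."""
    return int(budget_bytes // kv_bytes_per_token(bits_per_value))


if __name__ == "__main__":
    budget = 16 * 2**30  # hypothetical 16 GiB reserved for the cache
    for bits in (16, 8, 4, 3.5):
        print(f"{bits:>4} bits/value: {tokens_that_fit(budget, bits):,} tokens")
```

Under these assumptions, each halving of the bits per value doubles the number of tokens that can stay resident, which is the headroom operators can spend on longer contexts or more concurrent sessions.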
Prefix caching reuses work for repeated prompt beginnings, and Red Hat said vLLM’s V1 engine made prefix caching the default in version 0.8.0 with constant-time eviction and lower object-creation overhead. (developers.redhat.com)
vLLM’s current cache docs also show a more conservative path already in production: floating-point 8-bit, or FP8, cache quantization. The docs say FP8 can use either per-tensor or per-attention-head scales, and recommend calibration to preserve accuracy. (docs.vllm.ai)
On NVIDIA Hopper hardware such as the H100, Red Hat said FP8 inference in vLLM can deliver up to a 2x latency reduction with minimal accuracy degradation by using fourth-generation Tensor Cores built for FP8 math. That is a hardware-specific gain, but it shows why operators are stacking software tricks with newer datatypes instead of waiting for more GPUs. (developers.redhat.com)
The practical takeaway is that vLLM’s performance work is moving on two tracks at once: reuse more prompt work with prefix caching, and store more attention state in fewer bits with quantized caches. Both target the same pain point in serving: getting the first token out faster and keeping more requests alive on the same hardware. (developers.redhat.com, github.com, docs.vllm.ai)
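To see why prefix caching shortens prefill, here is a toy Python sketch of the general idea: split a prompt into fixed-size blocks, key each block by its tokens plus the key of the block before it, and skip computation for the leading blocks already seen. The class `ToyPrefixCache`, its block size of 4, and the token IDs are all hypothetical. This is an illustration of the technique, not vLLM's implementation, and it omits eviction and the actual attention state.

```python
from typing import List, Optional, Set


class ToyPrefixCache:
    """Illustrative block-level prefix cache (hypothetical, no eviction)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self._seen: Set[int] = set()  # keys of blocks already "computed"

    def _block_keys(self, token_ids: List[int]) -> List[int]:
        """Chain-hash each full block so a key identifies the whole prefix."""
        keys: List[int] = []
        parent: Optional[int] = None
        full_len = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full_len, self.block_size):
            block = tuple(token_ids[start:start + self.block_size])
            key = hash((parent, block))  # hash collisions ignored in this toy
            keys.append(key)
            parent = key
        return keys

    def prefill(self, token_ids: List[int]) -> int:
        """Return how many tokens still need computing, then cache the blocks."""
        keys = self._block_keys(token_ids)
        hit_blocks = 0
        for key in keys:
            if key not in self._seen:
                break  # the first miss ends the reusable prefix
            hit_blocks += 1
        self._seen.update(keys)
        return len(token_ids) - hit_blocks * self.block_size


if __name__ == "__main__":
    cache = ToyPrefixCache(block_size=4)
    system_prompt = list(range(10))  # stand-in token IDs shared by requests
    print(cache.prefill(system_prompt + [100, 101]))       # first request
    print(cache.prefill(system_prompt + [200, 201, 202]))  # shares the prefix
```

In this toy, the second request reuses only the full blocks that lie entirely inside the shared prefix, so the block size determines how much of a repeated system prompt can actually be skipped.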