vLLM 0.19 and a big throughput datapoint

vLLM published v0.19.0 with zero-bubble async scheduling, Model Runner V2, CPU KV cache offloading, and NVIDIA B300 support, all aimed at high-throughput deployments. (x.com) An independent benchmark of Qwen3.5-35B reported a peak of about 3,426 tokens/sec with a 4.7M-token KV cache using TurboQuant3.5, a concrete throughput number for infra planning. (x.com)

The open-source serving engine vLLM just shipped version 0.19.0, and this release is aimed at a very specific problem: how to keep large models busy instead of letting expensive accelerators sit idle. The update adds zero-bubble async scheduling for speculative decoding, a more mature Model Runner V2 path, general CPU KV-cache offloading, and support for NVIDIA’s B300 and GB300 generation. It also moves the project onto PyTorch 2.10 and expands model support again, which matters because vLLM has become one of the default back ends for people trying to turn open models into production systems. (github.com)

That list sounds like plumbing, but the plumbing is the story. Serving an LLM is less about raw benchmark glory than about keeping memory, batching, and token generation from tripping over each other. Zero-bubble scheduling is meant to reduce the dead time between steps when speculative decoding is enabled, so the hardware spends more time doing useful work. Model Runner V2 pushes in the same direction with piecewise CUDA graphs for pipeline parallelism, better rejection sampling support for speculative decoding, streaming inputs, and other changes that shave overhead from the hot path. DBO, or dual-batch overlap, is also generalized beyond a narrow set of models, which is another way of saying the release tries to turn more edge-case optimizations into normal behavior. (github.com)

The more practical addition may be CPU KV-cache offloading. KV cache is the growing memory footprint of a conversation as the model reads more tokens. It is often the real limit on concurrency and long context, not the model weights themselves. vLLM 0.19.0 adds a general offloading mechanism for that cache on the V1 engine, with pluggable policies and block-level preemption handling. That gives operators another lever when GPU memory is the bottleneck. It also helps explain why the release landed alongside excitement about a separate benchmark that pushed context length and throughput at the same time. (github.com)

That benchmark, posted independently, used Qwen3.5-35B with TurboQuant3.5 and reported a 4.7 million-token KV cache, 1 million-token context via YaRN, and a peak batch throughput of 3,426.7 tokens per second on vLLM 0.19.0. Even if you treat that as an upper-end datapoint rather than a universal expectation, it is unusually concrete. Most infrastructure talk around long-context inference stays vague. This one gives operators a number they can actually reason about. (gist.github.com)

The model choice matters here. Qwen3.5-35B-A3B is a mixture-of-experts model with 35 billion total parameters but only about 3 billion active per token, which makes it much cheaper to run than its headline size suggests. vLLM’s own Qwen3.5 recipe notes a native 262,144-token context window and recommends YaRN when stretching beyond that, including for throughput-focused serving. In other words, the benchmark did not come from nowhere. It sits right on top of the model’s design and the serving stack’s new emphasis on long-context, high-concurrency work. (docs.vllm.ai)

TurboQuant is the other half of the datapoint. Google described TurboQuant last week as a compression method for model data, including KV cache compression, built to cut memory use without accuracy loss. An open-source TurboQuant implementation for LLM inference also appeared on PyPI in late March. That does not verify every detail of the benchmark setup, but it does explain why a 4.7 million-token KV cache is suddenly part of a serious performance conversation instead of a lab curiosity. Memory compression is turning context length from a hard wall into a budgeting exercise. (research.google)

vLLM 0.19.0 also adds tuned support for NVIDIA B300 and GB300 class hardware, with all-reduce fusion enabled by default. That is a small line in the release notes, but it points to the real audience for this update. This is software for people who count tokens per second, GPU residency, and cache blocks because those numbers decide whether a deployment needs eight accelerators or six. On that score, the most memorable fact from this release is not a feature name. It is that someone ran Qwen3.5-35B with a 1 million-token context and a 4.7 million-token KV cache, and the graph peaked at 3,426.7 tokens per second.
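
The "budgeting exercise" framing can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the layer count, KV-head count, head dimension, bit widths, and 16 GiB budget are hypothetical placeholders, not Qwen3.5's actual configuration or the benchmark's settings, and the function names are invented for this example. It assumes plain attention that stores one key and one value vector per KV head per layer, and it ignores any extra metadata a real quantization scheme would keep.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_value: float) -> float:
    """Approximate KV cache bytes one token occupies.

    Assumes one key vector and one value vector per KV head per layer
    (hence the factor of 2) and no quantization metadata overhead.
    """
    values_per_token = 2 * num_layers * num_kv_heads * head_dim
    return values_per_token * bits_per_value / 8


def tokens_that_fit(budget_gib: float, bytes_per_token: float) -> int:
    """How many tokens of KV cache fit in a memory budget given in GiB."""
    return int(budget_gib * 1024**3 // bytes_per_token)


# Hypothetical model shape and budget, chosen only to keep the arithmetic easy.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET_GIB = 16

for bits in (16, 8, 4):
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits)
    capacity = tokens_that_fit(BUDGET_GIB, per_token)
    print(f"{bits:>2} bits/value: {per_token / 1024:.0f} KiB per token, "
          f"{capacity:,} tokens in {BUDGET_GIB} GiB")
```

Under these made-up numbers, dropping from 16 to 4 bits per value quadruples the number of tokens that fit in the same memory. That is the kind of shift that can change how many accelerators a long-context deployment needs, though real capacity depends on the model's actual attention layout and the overhead of the compression scheme.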
