vLLM 0.19 Boost
vLLM has released version 0.19.0 with Gemma 4 support and zero-bubble async scheduling that now works with speculative decoding, both aimed at higher throughput. The update also brings Model Runner V2 improvements (piecewise CUDA graphs and streaming inputs), CUDA graph capture for ViT vision encoders, CPU KV-cache offloading, and support for NVIDIA B300/GB300, signaling a focus on squeezing more performance from mixed CPU/GPU stacks. (x.com)
vLLM, one of the main engines people use to serve open models in production, has shipped version 0.19.0. The release looks like a routine point update only if you ignore what it is actually about. It is a concentrated push on one problem: how to make the same hardware do more work before you buy more hardware.

That matters because vLLM sits in the unglamorous middle of the AI stack. It is not a model. It is the software layer that keeps GPUs fed, batches requests together, manages KV cache memory, and turns a pile of accelerators into an API endpoint. The project began in UC Berkeley’s Sky Computing Lab and has since become a community project with broad industry use. Its whole pitch is speed, especially through techniques like PagedAttention, continuous batching, CUDA graphs, and speculative decoding.

Version 0.19.0 adds support for Google’s new Gemma 4 family, and that alone makes the release more than housekeeping. According to the vLLM release notes, the update brings full Gemma 4 architecture support, including mixture-of-experts variants, multimodal features, reasoning, and tool use. Gemma 4 itself is Google DeepMind’s new open model line, with small edge-oriented models and larger 26B and 31B variants aimed at workstation-class and server-class deployments. vLLM’s own Gemma 4 recipe says those models can expose structured reasoning, function calling, and dynamic vision resolution through vLLM’s OpenAI-compatible API.

But the more interesting part of the release is not model coverage. It is scheduling. vLLM says async scheduling now works with speculative decoding using “zero-bubble overlap,” a phrase that sounds like marketing until you unpack it. Inference systems often waste time in the seams between steps. One stage waits for another. One batch finishes and the next has not fully arrived. Speculative decoding tries to generate tokens faster by letting a smaller draft model guess ahead, then having the larger target model verify the guesses.
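To make the draft-and-verify idea concrete, here is a minimal sketch of one speculative decoding round under greedy decoding. It is a toy, not vLLM's implementation: the "models" are plain Python functions that return the next token for a context, and the names `speculative_step`, `generate`, `draft_next`, and `target_next` are hypothetical, chosen only for illustration. A real engine would verify all guesses in a single batched forward pass of the target model rather than in a loop.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The cheap draft model guesses k tokens ahead.
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The target model checks the guesses in order.
    #    (Toy loop; a real engine scores all k positions in one pass.)
    accepted: List[Token] = []
    ctx = list(prefix)
    for guess in draft:
        expected = target_next(ctx)
        if guess != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(guess)
        ctx.append(guess)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new_tokens: int) -> List[Token]:
    """Repeat speculative rounds until max_new_tokens tokens are produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new_tokens]
```

Under greedy decoding the output is identical to what the target model would produce on its own; the draft only changes how many target steps are needed. Each round yields between one and k + 1 tokens, which is why any idle gap between the draft and verify phases eats directly into the gain.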
If you can overlap those steps without idle gaps, throughput goes up. That is what zero-bubble is trying to do. Not a new model. Not a new chip. Less dead air.

The rest of the release keeps pressing on the same weak points. Model Runner V2, vLLM’s newer execution path, picks up piecewise CUDA graphs for pipeline parallelism, streaming inputs, and better speculative decoding support, including rejection sampling with greedy decoding and logprobs. Vision encoders now get full CUDA graph capture too. That matters because graph capture cuts launch overhead on GPUs, and overhead becomes a larger tax as systems get more complicated and more multimodal.

Then there is the CPU. For years, the implicit rule in high-end inference was simple: keep everything important on the GPU or pay the price. vLLM 0.19.0 bends that rule a little further by adding a more general CPU KV-cache offloading mechanism, with pluggable cache policies and block-level preemption handling. In plain terms, the system can spill some of the attention cache off the GPU when memory gets tight, instead of treating GPU memory as an all-or-nothing boundary. That is slower than keeping everything on-device. It is still often better than failing to serve a longer context or buying another accelerator.

The hardware support tells the same story. vLLM 0.19.0 adds support for NVIDIA’s B300 and GB300 generation, with all-reduce fusion enabled by default and a tuned communicator for those chips. That is the kind of line you might skip in release notes. It is also the line that reveals who this update is for. This is software for operators trying to stretch mixed CPU and GPU systems, juggle larger multimodal models, and keep utilization high on expensive new hardware that does not forgive idle time.

The release notes count 448 commits from 197 contributors. The concrete detail is simpler: even the vision encoder now gets a CUDA graph.
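For readers who want a picture of what spilling the attention cache to the CPU means, the following is a rough sketch of a two-tier block cache. It is not vLLM's API or its actual mechanism: the class `TieredBlockCache`, its fields, and the least-recently-used eviction rule are all assumptions made for illustration, and plain dictionaries stand in for GPU and CPU memory.

```python
from collections import OrderedDict
from typing import Any, Dict, Hashable


class TieredBlockCache:
    """Toy KV-block cache: a small 'GPU' tier that spills to a 'CPU' tier.

    Hypothetical illustration only. Eviction here is plain LRU; a real
    system could plug in a different policy at the same point.
    """

    def __init__(self, gpu_blocks: int) -> None:
        if gpu_blocks < 1:
            raise ValueError("gpu_blocks must be at least 1")
        self.gpu_blocks = gpu_blocks
        # Ordered from least to most recently used.
        self.gpu: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.cpu: Dict[Hashable, Any] = {}
        self.offloads = 0  # number of blocks moved from GPU to CPU

    def _make_room(self) -> None:
        # Offload least recently used blocks until one GPU slot is free.
        while len(self.gpu) >= self.gpu_blocks:
            victim, data = self.gpu.popitem(last=False)
            self.cpu[victim] = data
            self.offloads += 1

    def put(self, block_id: Hashable, data: Any) -> None:
        """Store a block on the GPU tier, offloading an old block if needed."""
        self.cpu.pop(block_id, None)
        if block_id not in self.gpu:
            self._make_room()
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)

    def get(self, block_id: Hashable) -> Any:
        """Fetch a block, swapping it back onto the GPU tier if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu.pop(block_id)  # raises KeyError for unknown blocks
        self._make_room()
        self.gpu[block_id] = data
        return data
```

The trade-off from the article shows up directly: a `get` that hits the CPU tier costs a swap, which in a real system means a transfer over the host-device link, but the request survives instead of failing when the GPU tier is full.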