vLLM ships v0.21.0

- The vLLM project released v0.21.0 on May 16, adding speculative decoding support tied to thinking budgets, new EAGLE and MTP paths, and build changes.
- GitHub’s release note said the update bundled 367 commits from 202 contributors, while speculative decoding now “respects reasoning/thinking budgets” for reasoning models.
- The release notes and latest documentation list new speculative decoding options, migration details and examples for operators deploying v0.21.0 in production.

vLLM released version 0.21.0 on May 16, according to the project’s GitHub releases page. The open-source inference and serving engine said the update includes 367 commits from 202 contributors, deprecates Transformers v4 support in favor of Transformers v5, and now requires a C++20-compatible compiler for builds. The same release adds changes to speculative decoding, including support for thinking budgets in reasoning models, plus new model support for EAGLE and Multi-Token Prediction, or MTP.

The release lands as vLLM continues to position speculative decoding as a way to reduce inter-token latency in memory-bound workloads. The project’s documentation says model-based methods such as EAGLE and MTP can deliver the strongest latency gains, though results depend on model family, hardware, traffic pattern and sampling settings.

### Which parts of v0.21.0 change decode behavior?

GitHub’s v0.21.0 note said “speculative decoding now respects reasoning/thinking budgets,” a change aimed at making speculative decode work correctly with reasoning models. (github.com) The same note also lists “independent drafter attention backend selection” and “per-step allocation elimination” under speculative decoding changes.

vLLM’s reasoning documentation says the server can cap reasoning generation with a configured thinking token budget. (docs.vllm.ai) Once that count is reached, vLLM forces the model to emit the configured reasoning end string, ending the reasoning block before the final answer continues.

### Why does a thinking budget matter for inference teams?

vLLM’s documentation says reasoning models expose a separate reasoning field and can run with model-specific thinking settings, including defaults that vary by family such as Qwen3, Granite and Holo2. (github.com) In practice, that means operators can change how long a model spends in its reasoning phase, which affects token generation behavior at runtime. The release note ties that control directly to speculative decoding.
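
The budget behavior described above (count reasoning tokens, then force the reasoning end string) can be illustrated with a toy decode loop. This is a minimal sketch, not vLLM code: `generate_with_budget`, `toy_model`, the `</think>` marker and the `<eos>` token are all hypothetical stand-ins, and the actual end string is model-specific.

```python
from typing import Callable, List

END_MARKER = "</think>"  # assumed marker; the real end string depends on the model
EOS = "<eos>"            # hypothetical end-of-sequence token for this toy


def generate_with_budget(
    next_token: Callable[[List[str]], str],
    budget: int,
    max_tokens: int = 32,
) -> List[str]:
    """Toy decode loop that caps the reasoning phase at `budget` tokens."""
    out: List[str] = []
    reasoning = True  # generation starts inside the reasoning block
    used = 0          # reasoning tokens emitted so far
    while len(out) < max_tokens:
        if reasoning and used >= budget:
            tok = END_MARKER  # budget reached: force the end string
        else:
            tok = next_token(out)  # otherwise ask the (toy) model
        if tok == EOS:
            break
        out.append(tok)
        if reasoning:
            if tok == END_MARKER:
                reasoning = False  # the final answer follows
            else:
                used += 1
    return out


def toy_model(history: List[str]) -> str:
    """Stand-in model: reasons indefinitely until it sees the end marker."""
    if END_MARKER not in history:
        return "step"
    if history[-1] == END_MARKER:
        return "answer"
    return EOS


print(generate_with_budget(toy_model, budget=3))
```

With a budget of three, the toy model emits three reasoning tokens, the loop injects the end marker, and the answer follows. A speculative decoder that ignored the budget could draft reasoning tokens past that point, which is the mismatch the release note says v0.21.0 addresses.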

By making speculative decode respect reasoning or thinking budgets, vLLM is aligning a latency optimization path with the output controls already used for reasoning models, according to the project’s own documentation and release materials. (docs.vllm.ai)

### What is new on EAGLE and MTP in this release?

The v0.21.0 release note lists new speculative decoding support for “EAGLE for Mistral,” “Gemma4 MTP,” “MTP for MiMo-V2.5,” and “Cohere Eagle.” Those additions expand the set of model families that can use model-based speculative decoding paths inside vLLM. (docs.vllm.ai)

vLLM’s speculative decoding guide describes EAGLE and MTP as model-based methods that generally offer the best latency reduction. The project says EAGLE uses a draft model path, while MTP is designed for models with native multi-token prediction capability and does not require a separate draft model. (github.com)

The MTP documentation says Gemma 4 assistant checkpoints use vLLM’s Gemma 4 MTP path and share KV cache with the target model. (github.com) The page also says operators should use `"method":"mtp"` and start with a small speculative depth such as one token.

### What else in the release affects serving stacks?

The same GitHub note says KV offloading now integrates with the Hybrid Memory Allocator, including scheduler-side sliding window group support and full HMA enablement. (docs.vllm.ai) It also says RayExecutorV2 is enabled by default and cites a “two-phase pause” change to prevent scheduler deadlock.

The release also adds a TOKENSPEED_MLA backend for DeepSeek-R1 and Kimi-K25 prefill and decode on Blackwell GPUs, according to the GitHub note. (docs.vllm.ai) Those changes sit alongside the speculative decoding updates in a release aimed at inference and serving infrastructure rather than end-user model features.

### Where do operators go next?

vLLM’s release process document says the project aims for a regular release every two weeks, with minor versions used for regular feature releases. (github.com) The v0.21.0 release page, the speculative decoding guide and the reasoning outputs documentation now carry the migration details, supported methods and configuration examples for teams testing the update. (github.com)
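
For teams trying the MTP guidance quoted earlier (`"method":"mtp"` with a speculative depth of one), the launch command can be sketched as follows. The `"method":"mtp"` setting comes from the documentation cited in the article; the `--speculative-config` flag and the `num_speculative_tokens` key reflect how recent vLLM versions accept speculative settings and should be checked against the v0.21.0 docs. The helper `mtp_serve_command` and the model name are hypothetical placeholders.

```python
import json
import shlex


def mtp_serve_command(model: str, depth: int = 1) -> str:
    """Build a `vllm serve` command line that enables MTP speculative decoding."""
    if depth < 1:
        raise ValueError("speculative depth must be at least 1")
    # "method": "mtp" is the setting named in the MTP documentation;
    # "num_speculative_tokens" is assumed to be the depth key.
    spec = {"method": "mtp", "num_speculative_tokens": depth}
    args = ["vllm", "serve", model, "--speculative-config", json.dumps(spec)]
    return shlex.join(args)  # shell-quotes the JSON argument


# Placeholder model name; substitute an MTP-capable checkpoint.
print(mtp_serve_command("your-org/your-mtp-model"))
```

Starting at a depth of one keeps the draft overhead small; the article notes that latency gains depend on model family, hardware, traffic pattern and sampling settings, so deeper settings are worth measuring rather than assuming.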
