vLLM ships v0.20.0 release

- vLLM shipped v0.20.0 on April 27, adding DeepSeek V4 support, making FlashAttention 4 the default MLA prefill path, and switching default CUDA builds to 13.0. (github.com)
- The release also adds TurboQuant’s 2-bit KV cache with 4× capacity, plus Model Runner V2 upgrades like full-CUDA-graph EAGLE prefill and fused sampling kernels. (github.com)
- It matters because vLLM is the serving layer many teams tune first, and these changes target the exact bottlenecks behind giant MoE models. (github.com)

vLLM is the open-source engine a lot of teams use to actually serve large language models: not train them, but keep them fast, cheap, and stable in production. That layer […] weirdly large, sparse, and memory-hungry. The old problem was just raw compute. The newer problem is everything around it: prefill speed, KV cache size, CUDA graph […]. v0.20.0, released April 27, is basically a “make the plumbing less painful” update, but it lands in exactly the parts of the stack that decide whether giant MoE models feel usable or not. (github.com)

### What actually shipped?

The headline items are pretty concrete: initial DeepSeek V4 support, FlashAttention 4 re-enabled as the default MLA prefill backend, TurboQuant 2-bit KV cache compression, and a batch of Model Runner V2 upgrades including full-CUDA-graph EAGLE prefill. vLLM also moved its default CUDA wheel and container image to CUDA 13.0, upgraded to PyTorch 2.11, and added support for transformers v5 and Python 3.14. (github.com)

### Why is FA4 a big deal?

Prefill is the expensive “read the prompt” phase before token-by-token generation […]. FlashAttention 4 becoming the default MLA prefill backend means vLLM is now leaning on a newer attention path for the DeepSeek-style MLA setup, with head-dim 512 and paged-KV support on SM90+ GPUs. In plain English: better defaults for the hardware and model shapes people are actually trying to run now. (github.com)

### What’s the 2-bit KV cache thing?

KV cache is the running memory […] becomes one of the first walls you hit on large models. TurboQuant’s 2-bit KV cache compresses that memory footprint enough to claim 4× capacity. That does not magically make every workload 4× faster. But it can let you hold much longer contexts or more concurrent requests before memory becomes the bottleneck, which is often the real limiter in serving. (github.com)

### Why does Model Runner V2 matter?

Because a lot of inference […] adds full-CUDA-graph EAGLE prefill, auto-resolving CUDA graph settings from the attention backend, fused rejection-sampling kernels, and fixes for stale token accuracy regressions. That sounds niche, but it’s the difference between “great kernel benchmarks” and “the whole system is actually faster.” (github.com)

### Is there proof this helps real models?

There’s at least a strong signal. DigitalOcean said this week it hit 230 […] time-to-first-token for 10,000 input tokens, and said the stack used vLLM plus NVFP4 quantization, kernel fusion, and speculative decoding tricks like MTP and EAGLE3. That’s not a pure vLLM benchmark, it’s a whole-stack result, but it shows the exact kind of deployment these v0.20.0 changes are aiming at. (digitalocean.com)

[…] models are where inference gets painful. DeepSeek V3.2 and Qwen 3.5 397B are the kind of sparse, giant systems that expose every weakness in a serving engine: memory pressure, long-context prefill, and draft-model coordination. vLLM’s recipes and docs now explicitly cover both families, which tells you where user demand is going. (docs.vllm.ai)

### What changed for developers?

The catch is that some of the […] PyTorch 2.11 is standard, and users on CUDA 12.9 are nudged toward a specific backend install path. So the upgrade story is “better defaults and newer kernels,” but with some dependency churn attached. (github.com)

### Bottom line?

This is not a flashy end-user feature release. It’s a serving-stack release for people trying to make giant open models […] speculative decoding less painful, a lot of downstream AI products get cheaper and snappier without changing the model at all. (github.com)
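
To make the 2-bit KV cache point concrete, here is a rough back-of-the-envelope sketch of why fewer bits per cached value means more tokens in the same memory, not more speed. Everything in it is an assumption for illustration: the model dimensions are hypothetical, the layout is a plain per-head key/value cache rather than DeepSeek-style MLA, the 8-bit baseline is a guess (the release summary does not say what the 4× is measured against), and the per-block scale metadata a real quantizer stores is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bits_per_token(shape: KVShape, bits_per_value: int) -> int:
    # One key vector and one value vector per KV head per layer (factor 2).
    # Assumption: plain K/V layout; MLA stores a compressed latent instead.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bits_per_value


def tokens_that_fit(budget_bytes: int, shape: KVShape, bits_per_value: int) -> int:
    # Whole tokens that fit in the budget, ignoring quantization metadata.
    return (budget_bytes * 8) // kv_bits_per_token(shape, bits_per_value)


shape = KVShape(num_layers=32, num_kv_heads=8, head_dim=128)  # hypothetical
budget = 16 * 2**30  # 16 GiB set aside for KV cache (assumed)

baseline_tokens = tokens_that_fit(budget, shape, bits_per_value=8)  # assumed baseline
two_bit_tokens = tokens_that_fit(budget, shape, bits_per_value=2)

print(baseline_tokens)                    # 262144
print(two_bit_tokens)                     # 1048576
print(two_bit_tokens // baseline_tokens)  # 4
```

Under these assumptions the same 16 GiB holds four times as many cached tokens, which can be spent on longer contexts or on more concurrent requests. Nothing here says anything about per-token latency, which is why 4× capacity is not the same as 4× faster.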
