vLLM Model Runner V2
vLLM released Model Runner V2, a rebuilt execution core with a modular design, GPU-native prep, async processing, and a Triton sampler, and reports throughput gains and better speculative decoding behavior. The new runner is enabled by setting VLLM_USE_V2_MODEL_RUNNER=1 and appears aimed at higher-density inference and multi-tenant packing. (x.com)
vLLM published a Model Runner V2 (MRV2) design document that describes the execution core as a from-scratch reimplementation and warns that MRV2 is not yet feature-complete or fully tested. The v0.18.0 release notes list MRV2 work, including UVA block tables (#31965), M-RoPE (#32143), and logit_bias/allowed_token_ids/min_tokens support (#32163), and explicitly mark the new runner as experimental and disabled by default.

A community technical writeup reports a 56% throughput uplift when switching to MRV2 in CPU-bound workloads, citing the project's own benchmarks. Several open issues show that enabling MRV2 has triggered model-specific problems in the wild, including a crash with Qwen3.5 mixed-attention models and a pooling/embedding race condition that produced partially zeroed vectors.

An RFC proposing migration from the V1 runner documents a targeted refactor that reduces the old GPUModelRunner monolith (≈6,283 lines) to a ~1,168-line core plus ~40 submodules, and recommends a phased migration in which v1 and v2 coexist behind a feature flag. Ecosystem integration is already underway: Docker Model Runner previously added a vLLM backend for safetensors models and high-throughput containerized inference, so MRV2's performance and stability could quickly affect container-first serving stacks once the experimental runner is hardened.
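Because the runner is opt-in through the VLLM_USE_V2_MODEL_RUNNER environment variable, a deployment can keep both paths available and switch per process. The following is a minimal sketch of that pattern, not vLLM's own tooling: `build_launch` is a hypothetical helper, the model name is a placeholder, and it assumes the server is started with the `vllm serve` command.

```python
import os

FLAG = "VLLM_USE_V2_MODEL_RUNNER"  # environment variable named in the text above


def build_launch(model, use_v2, base_env=None):
    """Return (command, environment) for a vLLM server launch.

    Hypothetical helper for illustration; it is not part of vLLM.
    """
    # Copy the environment so the caller's process is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    if use_v2:
        env[FLAG] = "1"      # opt in to the experimental runner
    else:
        env.pop(FLAG, None)  # leave the default runner in place
    return ["vllm", "serve", model], env


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder model name.
    cmd, env = build_launch("my-org/my-model", use_v2=True)
    print(" ".join(cmd), f"{FLAG}={env[FLAG]}")
    # To actually start the server: subprocess.run(cmd, env=env, check=True)
```

Setting the flag per process rather than globally makes it easier to roll back a single replica if one of the model-specific issues noted above appears.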