LMSYS and vLLM deliver day-0 runtime speedups for Google's Gemma 4 via speculative decoding
- Google released Gemma 4 MTP drafters on May 5, and vLLM plus LMSYS-backed SGLang shipped same-day runtime support for speculative decoding.
- The headline number is speed: Google says up to 3x faster inference, while DFlash showed 3.13x on TPU v5p and about 2x on RTX 3090.
- That matters because open-model launches usually wait on serving stacks; this time, Gemma 4’s low-latency path landed almost immediately.

Open-model inference is usually a two-step story. First the model ships, then the serving stack catches up. This week, that gap got a lot smaller for Google’s Gemma 4. Google released Multi-Token Prediction drafters for Gemma 4 on May 5, and the open runtime side (vLLM and LMSYS’s SGLang ecosystem) had support ready basically right away, which means the speedup is not just theoretical. (blog.google)

### What actually shipped?

Google’s release was not a new base model. It was a set of Gemma 4 MTP drafters: lightweight companions for the existing E2B, E4B, 31B, and 26B A4B models. Google lists the MTP release separately from the original Gemma 4 launch, with Gemma 4 itself arriving on March 31, 2026 and the MTP add-on arriving on April 16, 2026; the public push around faster inference landed on May 5. (ai.google.dev)

### What is MTP in plain English?

Normally a language model writes one token at a time, and every token needs a full pass through the big model. MTP changes that. A smaller draft model guesses several next tokens, then the full model checks them in parallel. If the guesses are right, you get the whole drafted chunk, plus one extra token from the target model itself, for roughly the cost of one normal step. (ai.google.dev)

### Why does that speed things up so much?

The bottleneck is usually memory bandwidth, not raw math. The hardware spends a lot of time hauling weights around just to emit one more token, which leaves compute underused. Speculative decoding uses that idle headroom. Basically, the drafter keeps the machine busy while the big model does fewer expensive serial steps. (ai.google.dev)

### So where do vLLM and LMSYS come in?

A model feature is only useful if serving frameworks can run it. vLLM’s Gemma 4 recipe now explicitly supports MTP speculative decoding, including NVIDIA, AMD, and Google Cloud TPU setups.
On the LMSYS side, SGLang has already been building out MTP as a production inference feature, so a serving path was ready when developers wanted to deploy. (docs.vllm.ai)

### Are the “3x faster” claims real?

They are real, but they need context. Google’s own framing is “up to 3x” faster inference with no quality loss. Ars Technica’s writeup highlighted DFlash results showing a 3.13x speedup on TPU v5p and roughly 2x on an RTX 3090. Those are strong numbers, but they are workload- and hardware-dependent; “up to” is doing real work here. (blog.google)

### Is every Gemma 4 model equally good at this?

No, and this is the catch. Dense models benefit more cleanly because the same weights get reused when verifying drafted tokens. Mixture-of-Experts models like Gemma 4 26B A4B can lose some of that advantage because different tokens may trigger different experts, which means more weight movement from memory. So MTP works across the family, but the payoff is not identical. (ai.google.dev)

### Why is this a bigger deal than one benchmark?

Because open models live or die on deployability. Gemma 4 was already positioned as a broadly usable open family across workstations, edge devices, and cloud hardware, and Google says it passed 60 million downloads within weeks. Faster inference makes that story more practical for chat, agents, and other latency-sensitive apps where waiting on every token kills the experience. (blog.google)

### Bottom line?

The news is not just that Gemma 4 got a speculative decoding upgrade. It’s that Google, vLLM, and the LMSYS orbit closed the loop fast enough for developers to use it immediately. In open AI, that kind of day-0 runtime support is what turns a neat model feature into an actual product advantage. (blog.google)
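
For readers who want to see the draft-then-verify loop from “What is MTP in plain English?” in code, here is a minimal toy sketch. It is not Google’s MTP implementation or any runtime’s API: `draft_next` and `target_next` are hypothetical stand-in functions that return the next token for a context, and acceptance is a simplified greedy exact-match check. Real runtimes verify all drafted tokens in one batched forward pass of the target model and typically use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # hypothetical "model": context -> next token


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens the target accepts."""
    # 1) The cheap drafter guesses k tokens, one after another.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) The target checks the guesses. A real runtime does this in a single
    #    batched pass; the loop here only emulates that check position by position.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every guess matched: the target also contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextTokenFn, target_next: NextTokenFn,
             n_tokens: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_tokens and count how many verification rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + n_tokens], rounds


def greedy(prefix: List[Token], target_next: NextTokenFn, n_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # counts 0..9 in a loop


def toy_draft(ctx: List[Token]) -> Token:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt  # deliberately wrong whenever the true token is 5


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, n_tokens=20, k=3)
    print("same output as plain greedy:", tokens == greedy([0], toy_target, 20))
    print("verification rounds:", rounds, "vs. 20 one-token steps")
```

The point of the sketch is the invariant: under greedy verification the output is identical to what the target model would have produced alone, and only the number of serial target rounds changes. How many rounds are saved depends on how often the drafter guesses right, which is one reason the reported speedups vary by workload.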