Gemma 4 gets up to 3x speed

- Google released Multi-Token Prediction drafters for the Gemma 4 family on May 5, adding a new speculative decoding path for faster inference.
- The headline claim is up to 3x faster generation with no quality drop, with Google showing gains across LiteRT-LM, MLX, Transformers, and vLLM.
- That matters because open models increasingly compete on serving cost and latency, not just benchmark scores.

(blog.google)

Open models have a boring problem that ends up mattering more than people expect: they’re often bottlenecked not by raw math, but by how fast hardware can move model weights around for each generated token. That makes inference feel slower and more expensive than it “should” be. Google’s new Gemma 4 update is aimed straight at that problem. On May 5, it released Multi-Token Prediction drafters for Gemma 4, a speculative decoding path that Google says delivers up to 3x faster generation with no quality drop, at least in Google’s reported tests.

### What actually changed?

Gemma 4 now ships with a dedicated draft model for Multi-Token Prediction, or MTP. Instead of having the main model generate one token, stop, and then do the whole expensive loop again, the draft model proposes several future tokens ahead of time. The main model then checks those guesses in parallel and accepts the ones that match. That is the core speculative decoding trick. Google says all four Gemma 4 variants (E2B, E4B, 31B, and 26B A4B) include this setup.

### Why does that speed things up?

Standard decoding is memory-bandwidth bound. The hardware spends most of each next-token step fetching weights, and comparatively little time doing useful work. If a draft model can correctly guess multiple upcoming tokens, the larger model gets to validate several at once instead of trudging through them one by one. Think of it like having a fast shorthand typist sketch the next few words while the editor only checks which ones hold up.

### Is this a new model?

Not really. It’s more of a new inference path than a brand-new flagship release. Gemma 4 itself launched in April as Google’s open model family with four sizes, support for over 140 languages, a context window up to 256K tokens, and positioning for on-device and edge use. The MTP release layers a speed optimization on top of that family rather than replacing it.

The broad headline is “up to 3x faster,” but the fine print matters. The company says gains were measured as tokens per second across
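The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not Gemma’s or Google’s implementation: `target_next`, `draft_next`, `speculative_generate`, and `greedy_generate` are hypothetical names, the "models" are simple arithmetic rules, and the parallel verification pass is simulated with an ordinary loop. It also assumes greedy decoding, where a draft token is accepted only if it equals the token the main model would have picked; sampling-based variants use a probabilistic acceptance rule instead.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (context[-1] * 3 + len(context)) % 11


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the drafter: agrees with the target except when
    the context length is a multiple of 4, where it guesses wrong."""
    guess = target_next(context)
    return (guess + 1) % 11 if len(context) % 4 == 0 else guess


def speculative_generate(
    prompt: List[Token],
    target: NextToken,
    draft: NextToken,
    new_tokens: int,
    k: int,
) -> Tuple[List[Token], int]:
    """Generate `new_tokens` tokens, drafting `k` at a time.

    Returns the full sequence and the number of target verification passes.
    """
    tokens = list(prompt)
    goal = len(prompt) + new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1) The cheap drafter proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) One target pass scores all k + 1 positions. A real system does
        #    this in a single batched forward pass; here it is a plain loop.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3) Accept the longest prefix of drafted tokens the target agrees with.
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1

        # 4) Keep the accepted drafts, then the target's own token at the
        #    first mismatch (or one extra token if every draft matched).
        tokens.extend(proposal[:accepted])
        tokens.append(verified[accepted])
    return tokens[:goal], target_passes


def greedy_generate(prompt: List[Token], target: NextToken, new_tokens: int) -> List[Token]:
    """Baseline: the target model alone, one token per pass."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(target(tokens))
    return tokens


if __name__ == "__main__":
    out, passes = speculative_generate([1, 2], target_next, draft_next, new_tokens=12, k=3)
    baseline = greedy_generate([1, 2], target_next, new_tokens=12)
    print("same output as target-only decoding:", out == baseline)
    print("target passes:", passes, "vs", len(baseline) - 2, "without drafting")
```

Because every kept token is either confirmed or produced by the main model, the output matches what the main model would have generated alone; the saving comes from needing fewer expensive passes. How much is saved depends on how often the drafter guesses right, which is why a headline figure like "up to 3x" will vary by workload.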
