DFlash weights land on Hugging Face

- Z Lab pushed Gemma 4 DFlash draft weights to Hugging Face, turning Google’s 26B and 31B open models into easier local inference targets.
- The new repos are tiny beside the base models (roughly 0.4B and 2B draft models) and already wired for vLLM, SGLang, and MLX.
- That matters because open models are getting cheaper to run locally, while agent runtimes like OpenClaw make swapping providers less painful.

Open-weight model use is getting a lot more practical, not because the base models changed, but because the plumbing around them just got better. Z Lab has now published DFlash draft weights for Gemma 4 on Hugging Face, including `gemma-4-26B-A4B-it-DFlash` and `gemma-4-31B-it-DFlash`. The point is speed and cost. These are small speculative draft models that sit next to a bigger verifier model and help it generate tokens faster with less hardware pain. Meanwhile, OpenClaw is pushing the other half of the stack: the runtime layer that makes model swapping less tied to one provider. Together, those two moves make “run it yourself” a lot less annoying. (github.com)

### What actually landed?

The concrete news is simple. Z Lab added Gemma 4 support to DFlash and published two Hugging Face repos for it in the last day: one for Google’s Gemma 4 26B A4B instruction model and one for Gemma 4 31B instruction. The DFlash GitHub README now lists both as supported models, and the Hugging Face pages show fresh uploads and ready-to-run snippets for local serving. (github.com)

### What is DFlash doing here?

DFlash is a speculative decoding method. Basically, instead of asking the full-size model to generate every next token one by one, you let a much smaller draft model guess a block of tokens in parallel, then let the larger model verify or reject them. If the draft model is good enough, you get more throughput without paying full price for every token. The project describes DFlash as a block diffusion model for “flash speculative decoding” and says it is meant for efficient parallel drafting. (github.com)

### Why do the model sizes matter?

Because the draft models are small enough to change the economics. The Z Lab collection page lists the Gemma 4 26B DFlash model at about 0.4B parameters and the 31B DFlash model at about 2B, while the Hugging Face file page for the 31B draft shows a single `model.safetensors` around 3.07 GB. That is the whole trick: you are not downloading another […]. You are adding a much smaller sidecar. (huggingface.co)

### Where can people run it?

Right away, in the usual local inference stacks. Z Lab’s README lists Transformers, SGLang, vLLM, and MLX support. The Gemma 4 path is still a little rough around the edges: the repo says Gemma 4 DFlash currently needs a temporary vLLM Gemma 4 build, with a Docker image and a fallback Git install from a vLLM pull request. So this is usable now, but not fully “pip install and forget it” yet. (github.com)

### Is this the only Gemma 4 speculator?

No, and that is part of why this matters. Red Hat AI also has a Gemma 4 31B DFlash-style speculator on Hugging Face, marked preliminary, with deployment instructions for vLLM nightly and benchmark details. That means Gemma 4 speculative inference is turning into an ecosystem, not a one-off demo. Once multiple groups publish compatible draft […] fast. (huggingface.co)

### Where does OpenClaw fit?

OpenClaw is tackling a different bottleneck. Fast local models are useful, but agents get sticky when the runtime assumes one provider’s auth, quirks, and model catalog. OpenClaw’s own docs describe a provider system where plugins handle onboarding, auth mapping, transport normalization, usage reporting, and model catalog […] agent logic can stay put while the model backend changes underneath it. (github.com)

### Why do these two things belong together?

Because they solve opposite halves of the same adoption problem. DFlash makes open models cheaper and lighter to run. OpenClaw makes the application layer less brittle when you swap between local, hosted, or vendor-specific models. One reduces inference friction. The other reduces integration […] deploy. (github.com)

### What is the bottom line?

The big change is not a new frontier model. It is that Gemma 4 just got easier to accelerate locally, and the surrounding agent stack is getting less provider-bound. That is how open models spread in practice: not through one giant launch, but through smaller pieces that make them cheaper, faster, and easier to swap in. (github.com)
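
For readers who want the draft-and-verify idea from the DFlash section made concrete, here is a minimal Python sketch of greedy speculative decoding. It is not DFlash's implementation. Both "models" are toy deterministic functions invented for illustration, the draft guesses its block sequentially rather than via block diffusion, and the acceptance rule is a simple greedy match (keep guesses until the first disagreement, then take the verifier's token). Every name here (`target_next`, `draft_next`, `speculative_generate`, `greedy_generate`) is hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]

VOCAB_SIZE = 50


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large verifier model's greedy next token."""
    return (3 * context[-1] + len(context)) % VOCAB_SIZE


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the small draft model: usually agrees with the target."""
    guess = target_next(context)
    if len(context) % 4 == 0:
        return (guess + 1) % VOCAB_SIZE  # deliberate mistake to exercise rejection
    return guess


def speculative_generate(
    prompt: List[Token],
    num_new_tokens: int,
    block_size: int,
    draft: NextTokenFn,
    target: NextTokenFn,
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification rounds."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    verify_rounds = 0
    while len(tokens) < goal:
        # 1) The draft model proposes a block of tokens.
        proposal: List[Token] = []
        for _ in range(block_size):
            proposal.append(draft(tokens + proposal))

        # 2) The target checks the block. A real system scores all positions
        #    in one forward pass; here we call the toy function per position
        #    and count the whole check as one round.
        verify_rounds += 1
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(guess)
        else:
            # Every guess matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], verify_rounds


def greedy_generate(prompt: List[Token], num_new_tokens: int, target: NextTokenFn) -> List[Token]:
    """Baseline: the target model generates one token per call."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target(tokens))
    return tokens


if __name__ == "__main__":
    out, rounds = speculative_generate([1, 2, 3], 20, 4, draft_next, target_next)
    print("same as plain greedy:", out == greedy_generate([1, 2, 3], 20, target_next))
    print("verification rounds:", rounds, "for 20 new tokens")
```

The property to notice is that the output matches what the big model would have produced on its own, but it takes fewer verification rounds than tokens generated. How much that saves in practice depends on how often the draft's guesses are accepted, which this toy does not model.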
