DFlash lands in runtimes
A speculative-decoding technique called DFlash, which uses diffusion drafts verified by autoregressive checks, reported a 6.2× speedup on Qwen3-8B and has been integrated into both SGLang and vLLM. That means popular open-source serving stacks can immediately benefit from the draft+verify decoding pattern without custom forks. The result looks like a practical throughput win for medium-sized models on existing inference stacks. (x.com)
A large language model normally writes one token at a time, like a person filling a crossword left to right, and that forces the graphics processor to wait on every next guess. The DFlash paper says that serial loop is the main reason inference stays slow even on fast hardware. (arxiv.org)

Speculative decoding is the usual workaround: a smaller helper model drafts a few tokens, and the bigger target model checks them in parallel instead of writing each one itself. The catch is that most helper models still draft one token after another, so the shortcut keeps part of the old bottleneck. (arxiv.org)

Diffusion is a different way to generate text, closer to filling in a whole blurry phrase at once and then cleaning it up, instead of typing character by character. Earlier diffusion-based drafting papers used that parallelism, but the DFlash authors note that diffusion models usually lag behind autoregressive models on text quality or cost too much to serve. (arxiv.org 1) (arxiv.org 2)

DFlash changes the helper model, not the final judge. The paper describes a lightweight block diffusion drafter that proposes an entire block of tokens in one forward pass and conditions that draft on context features taken from the target model. (arxiv.org)

That design matters because acceptance rate decides whether speculative decoding saves time or wastes it. Z Lab reports that DFlash gets higher acceptance rates while keeping generation lossless, which means the final output matches what the target model would have produced on its own. (arxiv.org) (z-lab.ai)

The headline number is up to 6× lossless acceleration on Qwen3-8B, and the paper says that is up to 2.5× higher speedup than EAGLE-3, the autoregressive drafting method it compares against. Z Lab’s project page phrases the same result as roughly 6× on Qwen3-8B and says EAGLE-3 usually tops out around 2–3× because its drafter is still sequential. (arxiv.org) (z-lab.ai)

The news this week is not just the paper result.
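As a reference point for the runtime discussion that follows, the draft-and-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration of lossless greedy speculative decoding, not DFlash itself: `target_next` and `draft_block` are hypothetical functions invented for this example, the "models" are simple arithmetic rules, and the parallel verification pass is simulated with an ordinary loop.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(context: List[int]) -> int:
    """Toy stand-in for the target model's greedy next-token choice."""
    return (sum(context) * 31 + len(context) * 7) % VOCAB


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Toy stand-in for a block drafter: returns block_size tokens per call.

    It imitates the target but is deliberately wrong whenever the context
    length is a multiple of 4, so that some drafts get rejected.
    """
    ctx = list(context)
    block = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def autoregressive(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the target writes every token itself, one pass per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative(prompt: List[int], n_new: int, block_size: int = 4) -> Tuple[List[int], int]:
    """Draft a block, keep the longest prefix the target agrees with,
    then append the target's own token. Returns (tokens, verify_passes)."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n_new:
        draft = draft_block(out, block_size)
        verify_passes += 1  # a real target scores all draft positions in one pass
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: take the target's token and stop
                break
            ctx.append(tok)  # draft token accepted
        else:
            ctx.append(target_next(ctx))  # whole block accepted: one bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n_new], verify_passes


if __name__ == "__main__":
    tokens, passes = speculative([1, 2, 3], 20)
    print("identical to baseline:", tokens == autoregressive([1, 2, 3], 20))
    print("target verification passes:", passes, "for 20 tokens")
```

Because every emitted token is either a draft token the target agrees with or the target's own choice, the output matches the baseline exactly. The saving comes only from needing fewer target passes, which is why the acceptance rate matters so much.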
The DFlash GitHub repository now shows launch commands for both SGLang and vLLM, which are two of the most widely used open-source serving stacks for large language models. (github.com)

In SGLang, DFlash is already wired into the runtime with a `--speculative-algorithm DFLASH` option and dedicated runtime code under `sglang/srt/speculative/dflash_utils.py`. That means an engineer can swap in a DFlash draft model at serve time instead of maintaining a custom fork around the scheduler and verifier. (github.com 1) (github.com 2)

In vLLM, the implementation has moved from feature request to merged support. A vLLM issue asking for DFlash was closed as completed on April 9, 2026, and the DFlash repository now includes a `vllm serve` example with `"method": "dflash"` inside the speculative config. (github.com 1) (github.com 2)

The practical result is that DFlash now looks less like a clever paper benchmark and more like a deployment option for medium-size models such as Qwen3-8B and Qwen3.5-27B. The repository already lists released DFlash draft checkpoints for Qwen3, Qwen3.5, Qwen3-Coder, GPT-OSS, and Llama 3.1 families, so the path from paper to production is mostly “download model, flip runtime flag, serve.” (github.com)

There is still one limit hiding in the fine print: DFlash needs a separately trained draft model for each target family, so it is not a universal speed switch you can bolt onto any checkpoint. But once that paired drafter exists, the hard part has shifted from runtime engineering to model training, and that is a much easier problem for open-source serving stacks to absorb. (github.com) (github.com)
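To make the "flip runtime flag" step concrete, here is a small Python sketch that assembles the two launch commands from the pieces named above. The `--speculative-algorithm DFLASH` flag and the `"method": "dflash"` key come from the text. The draft checkpoint path is a placeholder, the `num_speculative_tokens` value is an illustrative assumption, and the remaining option names follow the general speculative-decoding options of each runtime, so they should be checked against the DFlash repository and the installed runtime version before use.

```python
import json
import shlex
from typing import List


def vllm_command(target: str, draft: str, num_speculative_tokens: int) -> List[str]:
    """Build a `vllm serve` argument list with a DFlash speculative config."""
    spec = {
        "method": "dflash",  # key/value quoted in the DFlash repository example
        "model": draft,      # path or hub id of the paired draft checkpoint
        "num_speculative_tokens": num_speculative_tokens,  # assumed value, tune per model
    }
    return ["vllm", "serve", target, "--speculative-config", json.dumps(spec)]


def sglang_command(target: str, draft: str) -> List[str]:
    """Build an SGLang server argument list with the DFLASH algorithm selected."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-algorithm", "DFLASH",
        "--speculative-draft-model-path", draft,
    ]


TARGET = "Qwen/Qwen3-8B"
DRAFT = "path/to/dflash-draft-checkpoint"  # placeholder, not a real checkpoint name

if __name__ == "__main__":
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(vllm_command(TARGET, DRAFT, 8)))
    print(shlex.join(sglang_command(TARGET, DRAFT)))
```

In both cases the target model stays untouched and only the paired draft checkpoint is added, which reflects the limit noted above: the command only helps when a draft model trained for that target family exists.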