Diffusion models decode faster
A new preprint on Introspective Diffusion Language Models (I-DLM) shows a strided-decoding approach that matches autoregressive quality while increasing throughput on standard benchmarks. (x.com)
Most language models write one token at a time. A paper posted to arXiv on April 13 says a diffusion model can now check old tokens and add new ones in the same pass, narrowing the gap between the two approaches in speed and quality. (arxiv.org)
Diffusion language models start with many masked-out tokens and refine them in parallel, more like filling in a crossword than typing left to right. The problem has been quality: the authors say these models often fail to agree with their own earlier guesses, a property they call “introspective consistency.” (arxiv.org)
The new system is called Introspective Diffusion Language Model, or I-DLM. The authors, from Together AI, the University of Illinois Urbana-Champaign, the University of Texas at Austin, Princeton, and Stanford, say its “introspective strided decoding” verifies prior tokens while advancing the sequence in the same forward pass. (arxiv.org)
The paper says I-DLM is the first diffusion language model to match a same-size autoregressive model across 15 benchmarks. According to the abstract, the 8 billion-parameter model scores 69.6 on AIME 2024 and 45.7 on LiveCodeBench version 6, topping LLaDA 2.1 Mini by more than 26 and 15 points, respectively. (arxiv.org 1) (arxiv.org 2)
The repository published with the paper says the 8 billion-parameter I-DLM matches Qwen3-8B on ARC-C at 95.8 and IFEval at 84.7, while trailing slightly on MMLU at 82.4 versus 83.5 and on HumanEval at 93.3 versus 95.1. The same repository says the model was trained by converting Qwen3-8B with 4.5 billion tokens on eight Nvidia H100 graphics processors. (github.com)
Speed is the other claim. The GitHub page says that, at concurrency 32 on one H100 using SGLang, I-DLM serves about 5,900 tokens per second versus roughly 1,600 for SDAR-8B, and delivers 186 to 193 tokens per second per request versus 43 to 52 for SDAR-8B. (github.com) (arxiv.org)
That comparison matters because proponents of diffusion language models have spent the past year arguing that parallel decoding should beat left-to-right generation in production, while papers such as LLaDA and SDAR showed gains that came with tradeoffs in quality, model size, or serving complexity. I-DLM’s pitch is that it can reuse standard autoregressive serving tricks such as paged key-value cache, continuous batching, and CUDA graphs. (arxiv.org 1) (arxiv.org 2) (github.com)
The paper is still a preprint, and the strongest numbers come from the authors’ own benchmarks and serving stack. But the code, model weights, and benchmark tables were posted alongside the paper, which means outside labs can now test whether the speedups hold beyond one H100 setup and one converted model from the Qwen family. (arxiv.org) (github.com)
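
To put the serving figures quoted from the GitHub page in proportion, the short Python sketch below turns them into speedup ratios. It uses only the numbers reported above. The variable names and the `ratio_range` helper are illustrative and do not come from the repository. The per-request bounds assume the least and most favorable pairings of the two reported ranges, which the source does not specify, and the comparison is against SDAR-8B, another diffusion model, not against an autoregressive baseline.

```python
# Figures as quoted from the project's GitHub page
# (one H100, SGLang, concurrency 32). All values are approximate.
IDLM_TOTAL_TPS = 5900          # aggregate tokens per second, I-DLM
SDAR_TOTAL_TPS = 1600          # aggregate tokens per second, SDAR-8B
IDLM_PER_REQUEST = (186, 193)  # tokens per second per request, (low, high)
SDAR_PER_REQUEST = (43, 52)    # tokens per second per request, (low, high)


def ratio_range(fast, slow):
    """Return (min, max) speedup for two (low, high) ranges.

    Assumption: the minimum pairs the fast system's low value with the
    slow system's high value, and the maximum does the reverse.
    """
    return fast[0] / slow[1], fast[1] / slow[0]


aggregate_speedup = IDLM_TOTAL_TPS / SDAR_TOTAL_TPS
per_request_low, per_request_high = ratio_range(IDLM_PER_REQUEST, SDAR_PER_REQUEST)

print(f"aggregate: about {aggregate_speedup:.1f}x")
print(f"per request: about {per_request_low:.1f}x to {per_request_high:.1f}x")
```

On the reported numbers this works out to roughly 3.7 times the aggregate throughput and between about 3.6 and 4.5 times the per-request rate, all measured on the authors' own setup.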