I-DLM: diffusion matches AR quality

Researchers report an Introspective Diffusion Language Model (I-DLM) that they say reaches the quality of a same-scale autoregressive model while decoding in strides rather than one token at a time. (x.com) Public notes on the release emphasize parity on core generation metrics alongside the decoding-efficiency claim. (x.com)

Most chatbots write one token at a time. A new paper posted April 13 says a diffusion language model can now match same-scale autoregressive quality while generating in larger strides. (arxiv.org)

Diffusion language models work more like iterative editing than left-to-right typing: they start with masked text and fill pieces in over several passes. That parallel setup can raise throughput, but earlier systems usually trailed autoregressive models on quality. (arxiv.org) (openreview.net)

The new system is called Introspective Diffusion Language Model, or I-DLM. The authors trace the quality gap to a lack of “introspective consistency”: autoregressive models tend to agree with the tokens they already produced, while diffusion models often do not. (arxiv.org)

I-DLM tries to fix that with “introspective strided decoding,” which checks earlier generated tokens while advancing new ones in the same forward pass. The paper says that lets the model keep diffusion-style parallel decoding without giving up the consistency baked into autoregressive training. (arxiv.org) (yifan1130.github.io)

On the headline comparison, the authors say I-DLM-8B is the first diffusion language model to match the quality of its same-scale autoregressive counterpart across 15 benchmarks. In the paper’s table, I-DLM-8B scored 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, versus 73.1 and 50.3 for Qwen3-8B, the autoregressive model it was converted from. (arxiv.org) (github.com)

The project page reports a stronger AIME-24 number, 72.5, for I-DLM-8B, and says the model beat LLaDA-2.1-mini, a 16-billion-parameter diffusion baseline, by 29 points on that test and about 14 points on LiveCodeBench-v6. The paper abstract gives a slightly different comparison, saying I-DLM exceeded LLaDA-2.1-mini by more than 26 points on AIME-24 and 15 on LiveCodeBench-v6. (yifan1130.github.io) (arxiv.org 1) (arxiv.org 2)

The speed claim is about serving many requests at once, not just one prompt in isolation. The paper says I-DLM delivers about 3 times higher throughput than prior state-of-the-art diffusion language models at large concurrency, and the code repository reports about 5,900 tokens per second at concurrency 32 on one Nvidia H100 versus about 1,600 for SDAR-8B. (arxiv.org) (github.com)

The authors also say I-DLM can plug into standard autoregressive serving stacks because it keeps strict causal attention. Their repository says the inference engine reuses paged key-value cache, continuous batching, CUDA graphs, and SGLang integration instead of requiring a separate diffusion-specific stack. (arxiv.org) (github.com)

One extra claim is narrower but notable: with a gated low-rank adapter, or LoRA, the project says its decoding can be “bit-for-bit identical” to the base autoregressive model while still accelerating generation. That result appears in the project materials and repository, not in the abstract’s benchmark summary. (yifan1130.github.io) (github.com)

The paper is new, the code was posted this week, and the central results are the authors’ own. If the benchmarks and throughput numbers hold up in outside testing, the pitch is simple: a model that edits several words at once may no longer have to give up the quality of one that writes one word at a time. (arxiv.org) (github.com)
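
For readers who want a feel for what "checking earlier tokens while advancing new ones" can mean, here is a toy sketch in Python. It is not the authors' algorithm or code. It borrows the accept-the-longest-agreeing-prefix rule familiar from speculative decoding, and every name in it (`reference_next`, `noisy_draft`, `strided_decode`) is hypothetical. The "model" is a small arithmetic function, and the verification that a real causal model would do in one forward pass is simulated here position by position.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def reference_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for a causal model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def noisy_draft(prefix: List[int], stride: int) -> List[int]:
    """Hypothetical parallel proposal of `stride` tokens.

    Mostly agrees with reference_next, but is deliberately wrong
    whenever the context length is 4 modulo 5.
    """
    draft: List[int] = []
    ctx = list(prefix)
    for _ in range(stride):
        tok = reference_next(ctx)
        if len(ctx) % 5 == 4:
            tok = (tok + 1) % VOCAB  # injected disagreement
        draft.append(tok)
        ctx.append(tok)
    return draft


def strided_decode(prompt: List[int], n_new: int, stride: int) -> Tuple[List[int], int]:
    """Propose a stride of tokens, keep the prefix the model agrees with."""
    out = list(prompt)
    target = len(prompt) + n_new
    passes = 0
    while len(out) < target:
        passes += 1
        accepted: List[int] = []
        for tok in noisy_draft(out, stride):
            expected = reference_next(out + accepted)
            if tok != expected:
                # First disagreement: keep the model's own token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        out.extend(accepted)  # every pass adds at least one token
    return out[:target], passes


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time baseline for comparison."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(reference_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = strided_decode([1, 2, 3], n_new=12, stride=4)
    print("same as greedy:", tokens == greedy_decode([1, 2, 3], n_new=12))
    print("passes used:", passes, "for 12 new tokens")
```

In this toy, the strided loop produces the same tokens as the one-at-a-time baseline because every kept token is one the reference function would have chosen itself, and it needs fewer passes whenever drafts are mostly right. Whether and how the real system achieves the same property is a matter for the paper and repository, not this sketch.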
