Stanford Researchers Announce Mercury 2 Diffusion LLM

Researchers have announced Mercury 2, described as the first reasoning diffusion large language model (LLM). The model is reportedly more than five times faster than existing speed-optimized language models. It applies diffusion techniques, commonly used in image generation, to text generation.

- The model was developed by Inception, an AI startup founded by researchers from Stanford, UCLA, and Cornell, including Stefano Ermon, a co-inventor of the diffusion methods used in image and video generation.
- Unlike traditional autoregressive models (like GPT) that generate text token-by-token, Mercury 2 uses a diffusion-based approach to refine entire passages in parallel, a method common in image generation.
- This parallel refinement process allows Mercury 2 to achieve speeds of over 1,000 tokens per second on NVIDIA Blackwell GPUs, which is more than five times faster than speed-optimized models like Haiku (at 89 tokens/sec) and GPT-5 Mini (at 71 tokens/sec).
- The diffusion technique provides a form of built-in error correction, as the model iteratively revisits and refines its output, which can improve reasoning and reduce the cascading errors sometimes seen in sequential generation.
- On performance benchmarks, Mercury 2 is competitive with speed-optimized autoregressive models, tying with GPT-5 Mini on the AIME 2025 benchmark with a score of 91.1.
- The model supports a 128K context window and is designed for latency-sensitive applications such as voice assistants, real-time coding tools, and agentic workflows that require multiple steps of tool use.
- Inception Labs is making Mercury 2 available via an OpenAI-compatible API, targeting developers building applications where low latency is a critical production concern.
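
To put the quoted throughput figures in terms of user-facing latency, here is a minimal sketch that converts tokens per second into generation time and a relative speedup. The three rates come from the announcement. The 500-token response length is an assumption chosen for illustration, and the calculation ignores network, queueing, and time-to-first-token.

```python
# Throughput figures quoted in the announcement (tokens per second).
# Mercury 2 is reported as "over 1,000", so 1000 is treated as a lower bound.
THROUGHPUT_TPS = {
    "Mercury 2": 1000.0,
    "Haiku": 89.0,
    "GPT-5 Mini": 71.0,
}

# Assumption for illustration only: a 500-token response.
RESPONSE_TOKENS = 500


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady rate, ignoring network and queueing."""
    return tokens / tokens_per_second


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times higher the first rate is than the second."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    mercury_tps = THROUGHPUT_TPS["Mercury 2"]
    for name, tps in THROUGHPUT_TPS.items():
        secs = generation_seconds(RESPONSE_TOKENS, tps)
        ratio = speedup(mercury_tps, tps)
        print(f"{name:<11} {secs:5.2f} s for {RESPONSE_TOKENS} tokens "
              f"(Mercury 2 is {ratio:.1f}x this rate)")
```

On these numbers the ratio works out to roughly 11x over Haiku and 14x over GPT-5 Mini, which is consistent with the "more than five times faster" claim. For a 500-token reply, that is about half a second versus several seconds of generation time.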
