Stanford Researchers Announce Mercury 2 Diffusion LLM

Published by The Daily Scout

What happened

Researchers at Inception have announced Mercury 2, described as the first reasoning diffusion large language model (LLM). The model is reportedly more than five times faster than existing speed-optimized language models. The announcement points to progress in applying diffusion techniques, best known from image generation, to language processing.

Why it matters

- The model was developed by Inception, an AI startup founded by researchers from Stanford, UCLA, and Cornell, including Stefano Ermon, a co-inventor of the diffusion methods used in image and video generation.
- Unlike traditional autoregressive models (like GPT) that generate text token-by-token, Mercury 2 uses a diffusion-based approach to refine entire passages in parallel, a method common in image generation.
- This parallel refinement process allows Mercury 2 to achieve speeds of over 1,000 tokens per second on NVIDIA Blackwell GPUs, which is more than five times faster than speed-optimized models like Haiku (at 89 tokens/sec) and GPT-5 Mini (at 71 tokens/sec).
- The diffusion technique provides a form of built-in error correction, as the model iteratively revisits and refines its output, which can improve reasoning and reduce the cascading errors sometimes seen in sequential generation.
- On performance benchmarks, Mercury 2 is competitive with speed-optimized autoregressive models, tying with GPT-5 Mini on the AIME 2025 benchmark with a score of 91.1.
- The model supports a 128K context window and is designed for latency-sensitive applications such as voice assistants, real-time coding tools, and agentic workflows that require multiple steps of tool use.
- Inception Labs is making Mercury 2 available via an OpenAI-compatible API, targeting developers building applications where low latency is a critical production concern.
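To make the contrast between token-by-token generation and parallel refinement concrete, here is a toy sketch in Python. It is not Inception's algorithm: the "model" is a stand-in that simply copies from a known target, and the function names and the `tokens_per_pass` parameter are invented for illustration. It only shows why filling several positions per pass needs fewer sequential steps than emitting one token per pass.

```python
MASK = "_"  # placeholder for a position that has not been filled yet


def sequential_decode(target):
    """Token-by-token: one (toy) model pass per generated token."""
    output, passes = [], 0
    for token in target:
        output.append(token)  # stand-in for "predict the next token"
        passes += 1
    return output, passes


def parallel_refine_decode(target, tokens_per_pass):
    """Start fully masked; each pass fills several positions at once."""
    output = [MASK] * len(target)
    passes = 0
    while MASK in output:
        masked = [i for i, tok in enumerate(output) if tok == MASK]
        for i in masked[:tokens_per_pass]:
            output[i] = target[i]  # stand-in for a parallel prediction
        passes += 1
    return output, passes


target = "the quick brown fox jumps over the lazy dog".split()
seq_out, seq_passes = sequential_decode(target)
par_out, par_passes = parallel_refine_decode(target, tokens_per_pass=3)

print(seq_passes, par_passes)  # 9 3
```

Both functions produce the same nine-word sentence, but the sequential version needs nine passes and the parallel version needs three. A real diffusion LLM uses a learned denoiser rather than a lookup, and it may also revise positions it has already filled, which is the error-correction behavior described above; this toy leaves that out.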

Key numbers

  • Over 1,000 tokens per second on NVIDIA Blackwell GPUs, reported as more than five times faster than speed-optimized models such as Haiku (89 tokens/sec) and GPT-5 Mini (71 tokens/sec).
  • 91.1 on the AIME 2025 benchmark, tying with GPT-5 Mini.
  • 128K context window.
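As a quick check on the speed figures, the short Python sketch below computes the throughput ratios from the numbers quoted in this article. It treats 1,000 tokens per second as a lower bound for Mercury 2 and takes the baseline figures at face value; the variable and function names are illustrative, and real-world throughput depends on hardware and measurement setup.

```python
# Figures as reported in the article; 1,000 is a lower bound ("over 1,000").
mercury_tps = 1000
baselines = {"Haiku": 89, "GPT-5 Mini": 71}


def speedup(fast_tps, slow_tps):
    """Ratio of two throughputs measured in tokens per second."""
    return fast_tps / slow_tps


ratios = {name: round(speedup(mercury_tps, tps), 1) for name, tps in baselines.items()}
print(ratios)  # {'Haiku': 11.2, 'GPT-5 Mini': 14.1}
```

On these figures the ratio is roughly 11x to 14x, which is consistent with the "more than five times faster" claim and suggests that claim is a conservative one.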
