Diffusion LLMs racing tokens

Non‑autoregressive approaches are making real speed claims—Mercury 2 hits 1,000+ tokens/sec in benchmarks, while models like Mamba‑2/xLSTM and byte‑level BLT promise big FLOP savings (BLT touts ~50% fewer FLOPs). These are concrete alternatives to classic autoregressive LLMs for high‑throughput inference. (x.com)

Inception publicly unveiled Mercury 2 on Feb. 24, 2026 and lists product pricing at $0.25 per 1M input tokens and $0.75 per 1M output tokens, plus a 128K‑token context window on its product page. (rits.shanghai.nyu.edu)) Inception’s technical notes and launch materials say their benchmarks ran on NVIDIA Blackwell GPUs and emphasize p95 latency behavior under load, while an independent writeup measured roughly 1.7 seconds of end‑to‑end latency on their Blackwell test setup. (inceptionlabs.ai)) The Byte Latent Transformer (BLT) paper by Artidoro Pagnoni et al. reports a FLOP‑controlled scaling study up to 8B parameters and 4T training bytes and states BLT can cut inference FLOPs by roughly half compared with comparable tokenized models. (aclanthology.org)) Meta Research published BLT code and demonstrations alongside the paper, and public coverage highlights BLT matching Llama‑3‑class performance while claiming ~50% fewer inference FLOPs in their benchmarks. (github.com)) Mamba‑2’s authors describe an SSD (structured state space duality) algorithm that restructures SSM computation into batched matmuls to leverage tensor cores — the blog notes matmul TFLOPS can outpace non‑matmul arithmetic by up to ~16x on A100/H100 hardware — and the original Mamba work reported up to ~5× higher inference throughput versus Transformers. (pli.princeton.edu)) xLSTM, accepted as a NeurIPS spotlight, introduces exponential gating and matrix‑memory variants and the project has open‑source code plus a published 7B xLSTM model on Hugging Face; authors’ efficiency tests report the highest prefill and generation throughput and the lowest GPU memory footprint among their comparisons. (proceedings.neurips.cc)) Taken together, the three lines of work use different levers for throughput: BLT reduces token count via entropy‑based byte patching, Mamba‑2 converts sequence ops into tensor‑core‑friendly matmuls via SSD, and xLSTM improves recurrent memory and parallelism through exponential gating — all projects ship code or models that enable independent throughput and FLOP verification. (aclanthology.org))

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.