MLPerf Endpoints results
- MLPerf Endpoints posted results showing models like DeepSeek‑R1, Llama 3.1 8B, and Qwen 3 Coder 480B across tested systems. (x.com)
- The benchmark release lists those models running across 12 endpoint systems in the published comparison. (x.com)
- Endpoint benchmarks help teams choose models for low‑latency inference and real‑time workloads. (x.com)
MLCommons has published the first MLPerf Endpoints results, a new benchmark meant to compare generative artificial intelligence services the way users actually hit them: through an API. (mlcommons.org)

The public dashboard shows models including DeepSeek-R1, Llama-3.1-8B, Llama-3.1-70B, GPT-OSS 120B, and Qwen3 Coder 480B running across multiple systems. MLCommons said the first demonstration version launched on March 19, 2026, at Nvidia GTC with submissions from AMD, Google, Intel, Krai, and Nvidia. (endpoints.mlcommons.org, mlcommons.org)

MLCommons said the release includes results on “nearly a dozen different systems,” and the live site lists systems such as Google Ironwood, Intel Arc Pro B60 and B50 nodes, HPE Cray XD670, GB300 NVL72, and H200 configurations. Each run links to a report with hardware, software, concurrency, throughput, and latency data. (mlcommons.org, endpoints.mlcommons.org)

The basic problem MLPerf Endpoints is trying to solve is that a chatbot or coding assistant is not judged by one speed number. Buyers care about how long the first token takes to appear, how many tokens per second each user gets, and how total throughput holds up as more people hit the same endpoint. (mlcommons.org, mlcommons.org)

That is why the dashboard emphasizes tradeoffs instead of a single winner. MLCommons says users can compare “Throughput vs. Interactivity,” “Throughput vs. Concurrency,” and latency curves, then filter by model, accelerator, and software stack. (mlcommons.org, endpoints.mlcommons.org)

The benchmark also marks a shift from MLPerf’s older setup, where the load generator and model server were tightly coupled in one local process. In the new design, the benchmark client talks to any model-serving endpoint over standard interfaces such as Hypertext Transfer Protocol or gRPC, and the system under test is effectively just a URL. (mlcommons.org)

MLCommons is pitching that change at a moment when its regular inference suite is also expanding toward newer large language models. In September 2025, MLPerf Inference v5.1 added DeepSeek-R1 and Llama 3.1 8B, and in April 2026, v6.0 added GPT-OSS 120B and a new low-latency DeepSeek-R1 interactive scenario. (mlcommons.org, mlcommons.org, mlcommons.org)

The point of the endpoints project is less about crown-a-winner benchmark theater than about procurement. MLCommons says the service is built for rolling submissions rather than fixed twice-a-year drops, so vendors can publish updated numbers as models, software stacks, and serving systems change. (mlcommons.org, mlcommons.org)

For teams buying infrastructure for chatbots, coding tools, and other real-time workloads, the new results offer a more direct question than peak benchmark scores: how a specific model behaves on a specific endpoint when real users show up at once. (mlcommons.org, endpoints.mlcommons.org)
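
To make the three quantities buyers care about concrete (time to first token, per-user tokens per second, and total throughput under concurrent load), here is a minimal Python sketch that derives them from token arrival timestamps. The `RequestTrace` record, the function names, and the exact formulas are illustrative assumptions. They are not MLCommons' official metric definitions or its benchmark client.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestTrace:
    """Timing for one streamed response (hypothetical record; all times in seconds)."""
    sent_at: float            # when the client sent the request
    token_times: list[float]  # arrival time of each output token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and seeing the first token."""
    return trace.token_times[0] - trace.sent_at


def per_user_tokens_per_second(trace: RequestTrace) -> float:
    """Decode rate a single user sees once the first token has arrived."""
    decode_window = trace.token_times[-1] - trace.token_times[0]
    if decode_window <= 0:
        return 0.0  # a single-token response has no measurable decode rate
    return (len(trace.token_times) - 1) / decode_window


def aggregate_throughput(traces: list[RequestTrace]) -> float:
    """Total output tokens per second across all overlapping requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Collapse a batch of concurrent requests into three headline numbers."""
    return {
        "median_ttft_s": median(time_to_first_token(t) for t in traces),
        "median_user_tok_per_s": median(per_user_tokens_per_second(t) for t in traces),
        "system_tok_per_s": aggregate_throughput(traces),
    }
```

Running `summarize` on traces collected at several concurrency levels would give the kind of points that a throughput-versus-interactivity curve plots: as more requests overlap, the system-wide rate tends to rise while the per-user rate falls.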