MLPerf Endpoints Results

- MLCommons published early MLPerf Endpoints inference results covering model-system submissions.
- Initial listings mentioned DeepSeek-R1, Llama 3.1 8B, and Qwen 3 Coder 480B across different systems.
- MLPerf will accept rolling submissions through Q2 2026, signalling ongoing benchmarking activity for inference workloads (x.com).

MLCommons has started publishing early MLPerf Endpoints results, a new benchmark meant to measure how generative artificial intelligence systems perform as live application programming interfaces, not just as local test runs. (mlcommons.org) The public dashboard shows recent submissions for models including DeepSeek-R1, Llama-3.1-8B, Llama-3.1-70B, GPT-OSS 120B, and Qwen3 Coder 480B, alongside systems such as Google Ironwood, Intel Battlemage variants, HPE Cray XD670, NVIDIA GB300 NVL72, and H200 configurations. (endpoints.mlcommons.org)

MLCommons said it released the first demonstration version of MLPerf Endpoints on March 19, 2026 at NVIDIA’s GTC conference, with support from more than 30 organizations and submissions from five member organizations: Advanced Micro Devices, Google, Intel, Krai, and NVIDIA. (mlcommons.org)

Inference is the step where a trained model answers prompts, and endpoint benchmarking measures the service people actually buy: a URL that returns tokens, not a model binary running in a lab. MLCommons said the new system treats the “system under test” as an endpoint reached over standard interfaces such as Hypertext Transfer Protocol or gRPC. (mlcommons.org)

That changes what gets measured. Instead of one headline score, MLPerf Endpoints plots throughput, time to first token, tokens per second per user, and concurrency so buyers can see how a system behaves as more users hit it at once. (mlcommons.org)

MLCommons said older MLPerf inference setups were tightly coupled, with the load generator and model server running as one local process. The new endpoint design decouples the client from the server so managed cloud services and on-premises systems can be tested with the same framework. (mlcommons.org)

The group is also changing the release cadence. MLCommons said MLPerf Endpoints uses a continuous rolling submission process instead of the fixed twice-a-year schedule used by traditional MLPerf rounds, so vendors can add scores for new models and hardware more quickly. (mlcommons.org)

That puts the new benchmark alongside a separate April 1, 2026 release of MLPerf Inference v6.0, which added tests for text-to-video, GPT-OSS 120B, vision-language models, DLRMv3, and YOLOv11 in the main suite. MLCommons has also been expanding the older inference benchmark with newer language models, including DeepSeek-R1 and Llama 3.1 8B in version 5.1. (mlcommons.org, mlcommons.org)

For now, the endpoint site reads more like a live scoreboard than a finished league table: the model list is short, the system list is still growing, and each point links to a detailed run report with hardware, software, latency, and throughput data. MLCommons is betting that a benchmark built around live endpoints will keep pace with a market that now ships new models and serving stacks far faster than a six-month benchmark cycle. (endpoints.mlcommons.org, mlcommons.org)
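
For readers who want a feel for the metrics the benchmark plots (throughput, time to first token, and tokens per second per user at a given concurrency), the sketch below shows one common way such numbers can be derived from client-side timestamps. It is an illustration only, not the MLPerf Endpoints load generator: the `RequestTrace` record, the function names, and the exact metric definitions are assumptions made for this example, and MLCommons's own rules may define them differently.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestTrace:
    """Hypothetical client-side timing record for one streamed request."""
    sent_at: float             # seconds: when the client sent the request
    token_times: list[float]   # seconds: arrival time of each output token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def tokens_per_second_per_user(trace: RequestTrace) -> float:
    """Streaming rate one user sees after the first token arrives.

    Assumed definition: tokens after the first, divided by the time
    between the first and last token. Needs at least two tokens.
    """
    decode_time = trace.token_times[-1] - trace.token_times[0]
    return (len(trace.token_times) - 1) / decode_time


def system_throughput(traces: list[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock span of the run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """One point on a curve; assumes all traces were in flight together."""
    return {
        "concurrency": len(traces),
        "median_ttft_s": median(time_to_first_token(t) for t in traces),
        "median_tokens_per_s_per_user": median(
            tokens_per_second_per_user(t) for t in traces
        ),
        "throughput_tokens_per_s": system_throughput(traces),
    }


# Two made-up concurrent requests; the timestamps are invented for illustration.
example = [
    RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5]),
    RequestTrace(sent_at=0.0, token_times=[1.0, 1.5, 2.0, 2.5, 3.0]),
]
print(summarize(example))
```

Repeating the summary at several concurrency levels would give the kind of curve the article describes: per-user speed and first-token delay on one side, total system throughput on the other.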
