MLPerf adds Endpoints test

MLPerf posted a new Endpoints benchmark that measures GenAI serving performance using verified API points and no interpolation. The update targets model-serving measurements for API endpoints rather than raw pretraining or hardware throughput. (x.com)

Generative artificial intelligence benchmarks usually test chips or servers in a lab. MLCommons’ new MLPerf Endpoints test instead measures model services the way customers use them: through an application programming interface, or API. (mlcommons.org)

MLCommons released the first demonstration version of MLPerf Endpoints on March 19, 2026, at Nvidia’s GTC conference, and said more than 30 organizations backed the effort. Five member organizations submitted early results: Advanced Micro Devices, Intel, Google, Krai, and Nvidia. (mlcommons.org)

The benchmark sends requests to a model-serving URL over standard web interfaces such as Hypertext Transfer Protocol, or HTTP, and gRPC, instead of requiring the load generator and model server to run as one local process. MLCommons said that design lets the same test evaluate managed cloud services and on-premises systems. (mlcommons.org)

Serving is the part of artificial intelligence that answers a live user request, not the part that trains a model. MLPerf Inference has long measured that stage, but its published results have typically come on a roughly six-month cycle and focused on fixed benchmark setups rather than live API endpoints. (mlcommons.org)

MLCommons said endpoint testing needs more than one headline number because large language model traffic changes with prompt length, output length, and the number of simultaneous users. Its new dashboard plots throughput, time to first token, latency, and per-user token speed across different concurrency levels. (mlcommons.org)

The public results site shows those tradeoffs as curves rather than a single peak score. MLCommons says buyers can filter runs by model, accelerator, and software stack, then open a run report with hardware, software, concurrency, and token-rate details. (endpoints.mlcommons.org)

MLCommons says MLPerf Endpoints is moving to rolling submissions instead of the fixed twice-a-year schedule used by MLPerf Inference. The group says that should make it easier to add new models and platforms closer to product launches and procurement cycles. (mlcommons.org)

The organization frames the project as an extension of MLPerf, which it says already includes more than 90,000 reproducible results and is recognized by the Institute of Electrical and Electronics Engineers and by International Organization for Standardization and International Electrotechnical Commission Subcommittee 42. MLCommons says the endpoint version is meant to keep that benchmark role as generative artificial intelligence systems shift toward API-delivered services. (mlcommons.org)

The practical change is simple: the system under test is now a URL. For cloud model vendors and enterprise buyers comparing live services, MLCommons is trying to make the benchmark look more like the product they actually deploy. (mlcommons.org)
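
For readers who want a feel for the four dashboard metrics named above (throughput, time to first token, latency, and per-user token speed), here is a minimal Python sketch that derives them from recorded token arrival times. It is an illustration only: the `RequestTrace` structure, the function names, and the exact formulas are assumptions based on common industry conventions, not MLCommons’ official definitions or the MLPerf Endpoints load generator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps, in seconds, for one streamed request (hypothetical structure)."""
    sent_at: float            # when the client sent the request
    token_times: List[float]  # arrival time of each output token, in order


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first output token."""
    return trace.token_times[0] - trace.sent_at


def end_to_end_latency(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the last output token."""
    return trace.token_times[-1] - trace.sent_at


def per_user_token_speed(trace: RequestTrace) -> float:
    """Output tokens per second seen by one user after the first token arrives.

    Assumed convention: the first token is excluded, so this reflects
    steady-state streaming speed rather than initial wait.
    """
    decode_time = trace.token_times[-1] - trace.token_times[0]
    if decode_time <= 0:
        return 0.0  # a single-token response has no streaming phase to measure
    return (len(trace.token_times) - 1) / decode_time


def system_throughput(traces: List[RequestTrace]) -> float:
    """Total output tokens per second across all concurrent requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    # Two simulated users sending requests at the same moment (made-up numbers).
    user_a = RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0, 2.5])
    user_b = RequestTrace(sent_at=0.0, token_times=[1.0, 2.0, 3.0, 4.0])
    for name, trace in (("A", user_a), ("B", user_b)):
        print(name, time_to_first_token(trace), end_to_end_latency(trace),
              per_user_token_speed(trace))
    print("throughput", system_throughput([user_a, user_b]))
```

Repeating a calculation like this at several concurrency levels would produce the kind of tradeoff curve the article describes: total throughput typically rises as more users are served at once, while each user’s time to first token and token speed tend to get worse.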
