AMD shows 1M+ tokens/sec
AMD's Instinct MI355X GPUs posted throughput above 1 million tokens per second in MLPerf Inference v6.0, highlighting scale-out efficiency and ROCm software scaling. The result was shared by AMD as evidence of strong inference performance for large workloads. (x.com)
Large language model inference is the work of turning a prompt into the next word, then the next one after that, fast enough to feel interactive. In the latest MLPerf Inference results, AMD said its Instinct MI355X systems cleared 1 million tokens per second on multi-node tests. (mlcommons.org)
MLPerf Inference v6.0 is an industry benchmark run by MLCommons, and the new round was released on April 1, 2026. MLCommons said five of the 11 datacenter tests were new or updated, including a new open-weight large language model benchmark based on GPT-OSS 120B and an expanded DeepSeek-R1 reasoning test. (mlcommons.org)
AMD’s submission paired Instinct MI355X accelerators with the company’s EPYC 9575F CPUs and the ROCm software stack, its platform for running and tuning AI workloads on its chips. AMD said an 87-GPU cluster running Llama 2 70B topped 1 million tokens per second, and a 94-GPU cluster running GPT-OSS 120B also passed that mark. (rocm.blogs.amd.com)
AMD said the Llama 2 70B multi-node run used 11 systems and the GPT-OSS 120B run used 12 systems, with eight MI355X chips per system. That works out to 88 and 96 installed GPUs, respectively. AMD also said some GPUs in those clusters were not healthy, so the submissions did not use every chip installed in the racks. (rocm.blogs.amd.com)
The benchmark is designed to measure inference, which is the serving step after a model has already been trained. MLCommons said the suite is meant to be architecture-neutral and reproducible, so buyers can compare systems on real deployment jobs rather than vendor demos. (mlcommons.org)
AMD framed the result as a software-and-scale story as much as a hardware one. In its technical write-up, AMD said the submission highlighted ROCm scaling across single-node and multi-node systems and included models run in low-precision formats such as MXFP4 to raise throughput. (rocm.blogs.amd.com)
AMD also said MLPerf Inference v6.0 was its fourth submission round and that nine partners filed “Available” category results on Instinct platforms that could be rented or purchased immediately. Those partner submissions matter because MLPerf separates in-lab tuning from systems customers can actually order. (rocm.blogs.amd.com)
The public code and result repository includes AMD’s optimized implementations for Llama 2 70B, GPT-OSS 120B and Wan 2.2 text-to-video, alongside scripts and measurement files for the submitted systems. That gives outside engineers a way to inspect how the runs were assembled, even if reproducing an 11-node or 12-node cluster is out of reach for most buyers. (github.com)
For AMD, the headline number lands in a market where cloud providers and model developers are buying inference capacity in clusters, not single chips. The MLPerf release does not settle every real-world buying decision, but it does put AMD’s MI355X systems on the latest public scorecard for large-model serving. (mlcommons.org)
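The cluster figures AMD reported can be cross-checked with simple arithmetic. The sketch below is illustrative only: it takes the system counts and GPU counts quoted in this article, computes how many installed GPUs sat out each run, and derives a lower bound on average per-GPU throughput from the 1 million tokens-per-second mark. The function name `summarize` and the dictionary layout are hypothetical, and the per-GPU figure is a floor implied by the threshold, not a number AMD or MLCommons reported.

```python
# Back-of-the-envelope check of the cluster figures quoted in the article.
# All inputs come from the article; the derived per-GPU value is only a
# lower bound implied by "above 1 million tokens per second".

GPUS_PER_SYSTEM = 8          # eight MI355X chips per system
THRESHOLD_TOKENS_PER_SEC = 1_000_000

# Hypothetical layout: one entry per multi-node run described by AMD.
runs = {
    "Llama 2 70B": {"systems": 11, "gpus_used": 87},
    "GPT-OSS 120B": {"systems": 12, "gpus_used": 94},
}


def summarize(systems, gpus_used, threshold=THRESHOLD_TOKENS_PER_SEC):
    """Return (installed GPUs, GPUs left out, per-GPU throughput floor)."""
    installed = systems * GPUS_PER_SYSTEM
    left_out = installed - gpus_used
    # If the cluster exceeded `threshold`, the average GPU exceeded this.
    per_gpu_floor = threshold / gpus_used
    return installed, left_out, per_gpu_floor


for name, run in runs.items():
    installed, left_out, floor = summarize(**run)
    print(
        f"{name}: {run['gpus_used']} of {installed} GPUs used "
        f"({left_out} left out), more than {floor:,.0f} tokens/sec per GPU on average"
    )
```

The gap between installed and used GPUs is small in both runs, which fits AMD's explanation that a few chips were unhealthy rather than deliberately held back.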