MLPerf v6 adds mixture‑of‑experts
- MLCommons said on May 5 that MLPerf Training v6.0 adds a DeepSeek-V3-based mixture-of-experts benchmark to its standard suite for large-language-model training.
- The benchmark uses DeepSeek-V3 at 671 billion total parameters with 37 billion activated per token, adding sparse routing, Multi-head Latent Attention (MLA) and load balancing.
- MLCommons published benchmark details and reference code on its site and GitHub, with a smaller GPT-OSS 20B MoE test added May 7.

MLCommons has added mixture-of-experts, or MoE, models to MLPerf Training v6.0, extending the industry benchmark suite beyond dense large language models and into sparse architectures now used in frontier model training. The change was announced in a May 5 technical post describing a new large-scale pretraining benchmark built on DeepSeek-V3, and was followed on May 7 by a smaller MoE benchmark based on GPT-OSS 20B.

MLPerf Training is the benchmark family companies use to compare how quickly systems can train models to a target quality level. MLCommons says the training working group sets the reference implementations, rules, policies and procedures used for those tests.

### Why does adding MoE to MLPerf matter for hardware teams?

DeepSeek-V3 is not a dense model, and that changes what a benchmark is measuring. MLCommons said the new test is built on a 671 billion-parameter MoE architecture in which 37 billion parameters are activated per token, so the full model does not run on every token. (mlcommons.org)

That matters because sparse models stress systems differently. MLCommons said the DeepSeek-V3 benchmark captures features that are “now standard in the industry,” including Multi-head Latent Attention, fine-grained expert segmentation and auxiliary-loss-free load balancing. (mlcommons.org) Those are the kinds of mechanisms that affect memory traffic, expert routing and interconnect behavior during training, not just raw matrix throughput. (mlcommons.org)

### What exactly is MLCommons benchmarking in the new DeepSeek-V3 test?

MLCommons said the task is defined as LLM pretraining with a mixture-of-experts objective. The benchmark uses the C4 dataset, a Llama-3-compatible tokenizer with a 128,000-token vocabulary, and a sequence length of 4,096 tokens.

The architecture details are specific. MLCommons said DeepSeek-V3 expands beyond a more typical top-2 routing setup over 16 experts to a design with 160 routed experts plus shared experts.
(mlcommons.org) The benchmark also includes a two-token prediction objective, which the group said increases the compute-to-memory ratio during the backward pass. (mlcommons.org)

### Why didn’t MLCommons just reuse a checkpoint and start timing from there?

MLCommons said early MoE training can spend substantial time in token imbalance, where experts are not yet receiving a representative distribution of work. In the benchmark slice the group studied, that imbalance accounted for about 50% of benchmarking time, which it said was not representative of steady-state MoE training. (mlcommons.org)

To deal with that, the group adopted a warm-start approach. MLCommons said it fine-tuned the checkpoint for 50 steps so the token-per-expert distribution would better match the original DeepSeek tokenizer behavior before formal benchmarking began.

### Why add a second, smaller MoE benchmark right after the DeepSeek-V3 one?

MLCommons published GPT-OSS 20B on May 7 as a smaller sparse pretraining benchmark that can run on a single 8-GPU node. (mlcommons.org) The organization said that was meant to lower the barrier to entry for participants that cannot field the large multi-node systems required by dense frontier-model tests.

The GPT-OSS benchmark is also sparse, but at a different scale. (mlcommons.org) MLCommons said the model has 21 billion total parameters and activates 3.6 billion per token. The reference code is built on AMD’s Primus framework, and MLCommons said primary validation was conducted on AMD Instinct MI355X and Nvidia B200 systems.

### What should readers watch next?

MLCommons hosts the benchmark descriptions on its site and the reference implementations in the mlcommons/training repository on GitHub. (mlcommons.org) The training benchmark dashboard page currently shows published v5.1 results, indicating the next visible step for v6.0 will be formal submissions and posted results rather than just benchmark specifications. (github.com)
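
The sparsity figures and the expert-imbalance problem described above can be made concrete with a short Python sketch. It is an illustration only, not MLCommons' reference code: the function names, the toy router and the skewed weights are assumptions made here, and the 16-expert, top-2 setup is used simply because it is the configuration the article cites as more typical. The first part computes the share of parameters active per token from the published totals; the second compares how evenly tokens land on experts under an even router and a deliberately skewed one.

```python
import random
from collections import Counter


def active_fraction(total_billion, active_billion):
    """Share of a sparse model's parameters that run for each token."""
    return active_billion / total_billion


def route_tokens(num_tokens, num_experts, top_k, weights, seed=0):
    """Toy router (hypothetical): each token picks top_k distinct experts,
    sampled in proportion to the given weights. Real MoE routers use learned
    gating scores; this only mimics the resulting load pattern."""
    rng = random.Random(seed)
    experts = list(range(num_experts))
    load = Counter()
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=weights, k=1)[0])
        load.update(chosen)  # one count per chosen expert
    return load


def imbalance(load, num_experts):
    """Busiest expert's load divided by the mean load (1.0 = perfectly even)."""
    mean_load = sum(load.values()) / num_experts
    return max(load.values()) / mean_load


# Parameter totals as reported in the article (billions).
print(f"DeepSeek-V3: {active_fraction(671, 37):.1%} of parameters active per token")
print(f"GPT-OSS 20B: {active_fraction(21, 3.6):.1%} of parameters active per token")

# Assumed toy setup: 16 experts, top-2 routing, 2,000 tokens.
NUM_EXPERTS, TOP_K, NUM_TOKENS = 16, 2, 2000
even_weights = [1.0] * NUM_EXPERTS
skewed_weights = [1.0 / (i + 1) for i in range(NUM_EXPERTS)]  # assumed skew

even_load = route_tokens(NUM_TOKENS, NUM_EXPERTS, TOP_K, even_weights)
skewed_load = route_tokens(NUM_TOKENS, NUM_EXPERTS, TOP_K, skewed_weights)

print(f"even router imbalance:   {imbalance(even_load, NUM_EXPERTS):.2f}")
print(f"skewed router imbalance: {imbalance(skewed_load, NUM_EXPERTS):.2f}")
```

A ratio near 1.0 means experts receive similar amounts of work. Larger values mean a few experts are overloaded while others sit idle, which is the kind of unrepresentative early-training behavior the warm start is meant to move past before timing begins.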