New MLPerf 4.0 Benchmarks Expand Generative AI Testing
The latest MLPerf Training v4.0 and Inference v4.0 results have been released, including results for NVIDIA's new H200 GPU. The new rounds add generative AI workloads such as Llama 2 70B and Stable Diffusion XL alongside the existing GPT-3 175B test, and bring power and energy measurement into the training results for the first time.
NVIDIA continues its dominance in the latest MLPerf Training v4.0 results, setting new records in five of the nine benchmark categories. The company showcased its performance at massive scale, using 11,616 H100 GPUs to train the GPT-3 175B model in just 3.4 minutes, a significant increase in both speed and scale from previous rounds. The new NVIDIA H200 GPU, equipped with 141GB of HBM3e memory, also made its debut, demonstrating up to a 47% performance increase over the H100 in certain benchmarks. This performance gain is not solely reliant on hardware; NVIDIA also highlighted a 27% speedup on its year-old 512-GPU H100 systems, attributing the gain entirely to software stack optimizations.
Intel's Gaudi 2 accelerator is positioned as the primary benchmarked alternative to the NVIDIA H100, with Intel submitting results from a 1,024-accelerator cluster. That system trained the GPT-3 175B model in 66.9 minutes, with Intel emphasizing Gaudi's value in offering enterprises a more cost-effective and scalable option for GenAI projects.
Google also submitted strong results for its Cloud TPU v5p, demonstrating near-linear scaling efficiency up to 6,144 chips on the GPT-3 benchmark. Each TPU v5p pod is composed of 8,960 chips, but single training jobs can scale to 6,144 chips, showcasing a powerful infrastructure option for large-scale model training.
The v4.0 benchmarks expanded to include more realistic enterprise AI tasks. A key addition is a test for fine-tuning the Llama 2 70B model using Low-Rank Adaptation (LoRA), a common technique for customizing pretrained models. NVIDIA completed this new fine-tuning benchmark in 1.5 minutes using 1,024 GPUs, while Intel's eight-accelerator Gaudi 2 submission finished in 78.1 minutes. Inference benchmarks also grew, with the 70-billion-parameter Llama 2 and the Stable Diffusion XL text-to-image model being added to reflect the industry's shift toward generative AI.
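For readers unfamiliar with LoRA, the technique behind the new fine-tuning benchmark, the sketch below shows the core idea in NumPy: the pretrained weight matrix stays frozen and only a small low-rank update is trained. This is an illustrative toy, not the MLPerf reference implementation; the layer sizes, rank, and scaling value are assumptions chosen for readability and have nothing to do with Llama 2 70B's real dimensions.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Frozen layer output plus the scaled low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def trainable_fraction(d_out, d_in, r):
    """Trainable LoRA parameters relative to the full d_out x d_in matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 4, 8.0       # toy sizes, assumed for illustration

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init

x = rng.normal(size=(5, d_in))               # batch of 5 input vectors
y_base = x @ W.T                             # output of the frozen layer alone
y_lora = lora_forward(x, W, A, B, alpha)     # output with the LoRA branch attached

print("matches base model at init:", np.allclose(y_lora, y_base))
print("trainable fraction:", trainable_fraction(d_out, d_in, r))
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training only has to update `A` and `B`. The saving grows with layer size, since the trainable count scales with `r * (d_in + d_out)` rather than `d_in * d_out`.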
The new benchmark workloads are designed to measure how quickly systems can handle the larger, more complex models now being deployed.
For the first time, the training benchmarks also included power and energy consumption measurements, which submitters could report on an optional basis. This adds a critical metric for assessing the total cost of ownership and environmental impact of AI infrastructure, and it encourages a greater focus on performance-per-watt.
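Time-to-train alone says little about cost, because the submissions quoted above differ in size by orders of magnitude. One rough way to put them on a common footing is accelerator-minutes: the number of chips multiplied by the wall-clock minutes. The sketch below computes that figure using only the numbers reported in this article. Accelerator-minutes is not an official MLPerf metric, and the `Submission` class and its field names are hypothetical helpers introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    system: str
    benchmark: str
    accelerators: int
    minutes: float

    @property
    def accelerator_minutes(self) -> float:
        """Chips multiplied by wall-clock minutes; a crude proxy for resources used."""
        return self.accelerators * self.minutes


# Figures as quoted in the article above.
RESULTS = [
    Submission("NVIDIA H100", "GPT-3 175B pretraining", 11_616, 3.4),
    Submission("Intel Gaudi 2", "GPT-3 175B pretraining", 1_024, 66.9),
    Submission("NVIDIA", "Llama 2 70B LoRA fine-tuning", 1_024, 1.5),
    Submission("Intel Gaudi 2", "Llama 2 70B LoRA fine-tuning", 8, 78.1),
]

for s in RESULTS:
    print(f"{s.benchmark:<30} {s.system:<14} {s.accelerator_minutes:>12,.1f}")
```

On these figures, the GPT-3 runs come to roughly 39,500 accelerator-minutes for NVIDIA and 68,500 for Intel, while the LoRA runs come to about 1,540 and 625 respectively. The comparison ignores per-chip price and power draw, so it cannot stand in for performance-per-watt; that requires the measured energy data the new power results are meant to supply.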