New MLPerf Benchmarks Demand Efficiency
The latest MLPerf 4.0 benchmarks are raising the bar for AI performance. The new rounds add generative AI workloads, bring power measurement to the training suite, and feature submissions that rely on FP8 precision. It's a clear signal that enterprise customers now demand transparency on both speed and cost.
MLPerf 4.0's embrace of generative AI is evident in its new inference tests, which add the Llama 2 70B large language model and the Stable Diffusion XL text-to-image model. This is a major step up in complexity from previous suites, which relied on smaller models, and it reflects the rapid evolution of real-world AI workloads.
Much of the efficiency story rests on FP8, an 8-bit floating-point format jointly developed by NVIDIA, Arm, and Intel. Compared with 16-bit formats, FP8 halves memory requirements and, on hardware with native support, can roughly double throughput. The result is performance comparable to INT8 with accuracy close to FP16, a crucial combination for both training and inference efficiency.
For the first time, the MLPerf Training benchmarks include power consumption measurements, with Sustainable Metal Cloud (SMC) providing the inaugural submission. This industry-wide effort to standardize power measurement from microwatts to megawatts aims to bring transparency to the total cost and environmental impact of AI systems.
NVIDIA's submissions showcased significant performance gains achieved solely through software optimizations on its Hopper architecture, with H100 performance increasing by up to 27% in the last year alone. The company also demonstrated massive scale, using a cluster of 11,616 H100 GPUs for its GPT-3 175B training benchmark.
Intel positioned its Gaudi 2 accelerator as the main competitor to NVIDIA's H100, emphasizing performance per dollar and the use of standard Ethernet networking for flexible scaling. To demonstrate performance on ultra-large language models, Intel submitted results from a 1,024-accelerator Gaudi 2 cluster.
The results drew on a wide range of industry players, with 23 organizations submitting to the inference benchmarks. Notably, these included a joint submission from Red Hat and Supermicro using OpenShift AI, signaling the increasing importance of Kubernetes-native infrastructure for orchestrating large-scale AI workloads.
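To make the FP8 memory claim above concrete, here is a rough back-of-the-envelope sketch in Python. It counts model weights only and assumes a nominal, rounded 70 billion parameters for Llama 2 70B. The function name and the byte table are illustrative, not part of any MLPerf tooling, and real deployments also need memory for activations, optimizer state, and caches.

```python
# Back-of-the-envelope weight memory at different numeric precisions.
# Assumption: weights only; activations, optimizer state, and KV cache are ignored.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Assumption: nominal parameter count, rounded to 70 billion.
LLAMA2_70B_PARAMS = 70_000_000_000

fp16_gb = weight_memory_gb(LLAMA2_70B_PARAMS, "fp16")
fp8_gb = weight_memory_gb(LLAMA2_70B_PARAMS, "fp8")

print(f"FP16: {fp16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB")
# prints "FP16: 140 GB, FP8: 70 GB"
```

Under these assumptions the FP16 weights alone exceed the 80 GB of memory on a single H100, while the FP8 weights fit. This is one reason the format matters for both cost and speed.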
Google's TPU v5p also featured in the results, with each pod containing 8,960 chips interconnected with a high-speed fabric. Google says the v5p can train large language models up to 2.8 times faster than its predecessor, the TPU v4, highlighting the intense competition in purpose-built AI hardware.
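The link between speed and power measurement can be sketched with simple arithmetic: if the quantity of interest is the total energy consumed over a training run, a faster system that draws more power can still come out ahead. The Python sketch below assumes average power is constant over the run. Both systems and all their numbers are hypothetical, chosen only for illustration, and are not taken from any MLPerf submission.

```python
# Energy consumed by a training run = average power x duration.
# Assumption: average power draw is constant over the run.


def energy_to_train_kwh(avg_power_kw: float, minutes: float) -> float:
    """Return the energy used by a run, in kilowatt-hours."""
    return avg_power_kw * minutes / 60.0


# Hypothetical systems (illustrative numbers only).
baseline_kwh = energy_to_train_kwh(avg_power_kw=10.0, minutes=120.0)
faster_kwh = energy_to_train_kwh(avg_power_kw=14.0, minutes=60.0)

print(f"baseline: {baseline_kwh:.1f} kWh, faster system: {faster_kwh:.1f} kWh")
# prints "baseline: 20.0 kWh, faster system: 14.0 kWh"
```

In this made-up case the second system draws 40% more power but finishes in half the time, so it uses 30% less energy. Time-to-train and power have to be reported together to make that comparison.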