New MLPerf 4.0 Benchmarks Focus on Energy and Efficiency

The latest MLPerf 4.0 benchmarks have been released, and they include results from NVIDIA's Blackwell architecture. The new round shifts emphasis from raw performance toward real-world metrics, making energy efficiency, vLLM throughput, and TensorRT-LLM latency "increasingly mandatory" for competitive evaluation.

The MLPerf 4.0 results signal an industry shift toward operational efficiency, with power consumption benchmarks now running alongside performance tests. For the first time, organizations such as SMC are publishing comprehensive power data. Those figures show that an immersion-cooled NVIDIA H100 HGX server can draw as little as 6.58 kW under test conditions, compared with 9 to 11 kW for typical air-cooled systems. Reporting power alongside speed gives a more complete view of AI hardware performance.

NVIDIA's Blackwell architecture debuted in the preview category, with submissions showing up to a 2.2x performance increase over Hopper in Llama 2 70B fine-tuning. The GB300 NVL72 system, built on the Blackwell Ultra GPU, completed the Llama 3.1 405B benchmark 1.9x faster than the GB200 NVL72 at the same 512-GPU scale, for a cumulative performance gain of up to 4.2x over the Hopper architecture. The gains are attributed to architectural enhancements and software optimizations.

The new benchmarks focus heavily on generative AI, adding the Llama 2 70B and Stable Diffusion XL models. In the Llama 2 70B inference test, NVIDIA's memory-enhanced H200 GPUs, running TensorRT-LLM, set a record by producing up to 31,000 tokens per second.

Intel also submitted its Gaudi 2 accelerator for the Llama 2 70B benchmark, where it delivered 8,035 tokens per second in the offline scenario. The company is positioning Gaudi 2 as a cost-effective alternative to NVIDIA's offerings in the expanding generative AI market, and it also submitted results for a large 1,024-accelerator Gaudi 2 system on the GPT-3 175B parameter model. Although Intel still trails NVIDIA's top-tier performance, its participation underscores the growing competition in the AI accelerator space.

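
The power and speedup figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and multiplying the 2.2x and 1.9x factors assumes they compound, even though the article quotes them for different workloads (Llama 2 70B fine-tuning and Llama 3.1 405B). Treat the result as a rough consistency check on the "up to 4.2x" claim, not as a derivation of it.

```python
def percent_reduction(baseline_kw: float, measured_kw: float) -> float:
    """Percentage by which measured_kw is lower than baseline_kw."""
    return (baseline_kw - measured_kw) / baseline_kw * 100.0


def compound_speedup(*factors: float) -> float:
    """Multiply successive speedup factors (assumes they compound)."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result


# Figures taken from the article.
IMMERSION_KW = 6.58             # immersion-cooled H100 HGX server
AIR_COOLED_KW = (9.0, 11.0)     # typical air-cooled range

low = percent_reduction(AIR_COOLED_KW[0], IMMERSION_KW)
high = percent_reduction(AIR_COOLED_KW[1], IMMERSION_KW)
print(f"Power reduction vs. air-cooled: {low:.1f}% to {high:.1f}%")

# Blackwell over Hopper (2.2x), then Blackwell Ultra over Blackwell (1.9x).
print(f"Compounded speedup: {compound_speedup(2.2, 1.9):.2f}x")
```

Under these assumptions the immersion-cooled figure works out to roughly 27% to 40% below the air-cooled range, and the compounded speedup comes to about 4.18x, in line with the quoted "up to 4.2x".
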
Benchmarks covering inference engines such as vLLM and TensorRT-LLM are another key development. TensorRT-LLM is an NVIDIA-specific library designed to maximize performance on NVIDIA GPUs. vLLM, by contrast, is a more flexible, open-source engine that is not tied to one vendor's hardware, trading some peak performance for broader compatibility. Comparing the two allows more nuanced evaluation of hardware and software combinations for real-world deployment scenarios.

The MLPerf 4.0 suite also introduced a benchmark for fine-tuning large language models with Low-Rank Adaptation (LoRA), using the Llama 2 70B model. NVIDIA's platform scaled well on this task, completing the benchmark in 1.5 minutes in a large-scale submission. The addition reflects the growing importance of model customization and fine-tuning in enterprise AI applications.

These latest benchmarks highlight the rapid pace of innovation in both AI hardware and software. For large-scale AI deployments, the interplay between raw performance, energy efficiency, and the flexibility of software tools like vLLM and TensorRT-LLM will be critical to informed infrastructure decisions. The continued evolution of the MLPerf suite provides increasingly valuable insight for navigating this landscape.
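
To illustrate why LoRA fine-tuning is attractive, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original matrix and trains two small factors of rank r in its place, so the trainable count drops from d_out * d_in to r * (d_out + d_in). The layer size and rank used here are assumptions chosen for illustration, not the configuration of the MLPerf benchmark, and the function names are hypothetical.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed, illustrative values: one square projection layer and a small rank.
D_OUT, D_IN, RANK = 8192, 8192, 16

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {RANK}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")
```

With these assumed sizes, LoRA trains well under 1% of the layer's parameters, which is one reason fine-tuning runs can complete far faster than full training.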
