New Playbook for H100 Cluster Operations

An operational playbook for managing large-scale H100 and Blackwell GPU clusters has been released, identifying liquid cooling failures as the top cause of incidents. The guidance highlights the use of NVIDIA DCGM 3.3+ for improved failure prediction and NVLink diagnostics. It also recommends monitoring ECC correction patterns for proactive error detection and maintenance.
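One way to picture "monitoring ECC correction patterns" is a small trend check over periodically sampled correctable-error counters. The sketch below is illustrative only: it assumes the cumulative counts have already been collected per GPU (for example from DCGM or NVML), and the function name, the `burst_threshold` and `rising_intervals` parameters, and their default values are hypothetical choices, not figures from the playbook or from NVIDIA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EccFinding:
    gpu_id: str
    deltas: List[int]  # new correctable errors seen in each sampling interval
    reason: str        # "burst" or "rising"


def flag_ecc_trends(samples: Dict[str, List[int]],
                    burst_threshold: int = 100,
                    rising_intervals: int = 3) -> List[EccFinding]:
    """Flag GPUs whose correctable-ECC counters hint at degrading memory.

    `samples` maps a GPU id to cumulative correctable error counts, oldest
    first. Both thresholds are illustrative assumptions, not vendor guidance.
    """
    findings = []
    for gpu_id, counts in sorted(samples.items()):
        # Turn cumulative counters into per-interval increases. A counter that
        # goes backwards (e.g. after a reset) is treated as restarting at zero.
        deltas = [cur - prev if cur >= prev else cur
                  for prev, cur in zip(counts, counts[1:])]
        if not deltas:
            continue
        # A single large jump is reported on its own.
        if max(deltas) >= burst_threshold:
            findings.append(EccFinding(gpu_id, deltas, "burst"))
            continue
        # Otherwise look for a strictly increasing run in the latest intervals.
        recent = deltas[-rising_intervals:]
        if len(recent) == rising_intervals and all(
                a < b for a, b in zip(recent, recent[1:])):
            findings.append(EccFinding(gpu_id, deltas, "rising"))
    return findings


if __name__ == "__main__":
    history = {
        "gpu0": [0, 0, 1, 1, 1],     # occasional single corrections
        "gpu1": [0, 2, 5, 11, 20],   # error rate climbing every interval
        "gpu2": [0, 3, 250, 251],    # one large burst
    }
    for finding in flag_ecc_trends(history):
        print(finding.gpu_id, finding.reason, finding.deltas)
```

In practice the sampling interval and thresholds would need tuning against a fleet's own baseline, since a steady trickle of corrected errors can be normal.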

- High-density GPU clusters are increasingly adopting liquid cooling as traditional air cooling is insufficient for racks drawing 50-100kW, a common requirement for modern AI hardware. Liquid cooling is approximately 3,000 times more effective at heat transfer than air.
- The NVIDIA Blackwell architecture, announced in March 2024, succeeds the Hopper architecture and is built on a custom TSMC 4NP process. Blackwell GPUs feature 208 billion transistors, a significant increase from the H100's transistor count.
- The GB200 NVL72 rack-scale system connects 72 Blackwell GPUs with 36 Grace CPUs in a liquid-cooled design, functioning as a single, massive GPU to accelerate real-time inference for trillion-parameter models.
- NVLink is a high-speed, direct GPU-to-GPU interconnect that provides significantly higher bandwidth than traditional PCIe links, which is critical for multi-GPU communication in large model training. The fourth-generation NVLink offers up to 900 GB/s of GPU-to-GPU interconnect bandwidth.
- ECC (Error-Correcting Code) memory is a feature in data center GPUs that detects and corrects in-memory data corruption, which is crucial for the accuracy of large-scale scientific computations and AI model training. On modern NVIDIA GPUs like the Ampere and Hopper series, ECC is enabled by default and cannot be disabled.
- A study of a 2,048-GPU A100 cluster over six months found that thermal degradation accounted for 41% of failures requiring replacement, and memory subsystem issues accounted for 28%. GPUs in the upper third of the racks failed 2.3 times more frequently than those in the lower third.
- NVIDIA's DCGM is a suite of tools for managing and monitoring NVIDIA data center GPUs, offering features like active health monitoring, system alerts, and integration with cluster management software. It can be used to track metrics like NVLink bandwidth, errors, and utilization across GPUs.
- Starting with the Ampere architecture, NVIDIA introduced features like dynamic page offlining and row remapping to handle uncorrectable ECC errors more gracefully. These features can isolate memory faults to the specific application that caused them, preventing a full GPU reset and allowing other workloads to continue unaffected.
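
The row-remapping behavior described in the last point can feed a simple triage policy. The following is a minimal sketch under stated assumptions: the `RemapState` fields are modeled on the kind of information NVML reports for remapped rows (counts of remapped rows, a pending flag, a failure flag); it assumes a pending remap takes effect only at the next GPU reset, so the reset can be scheduled after workloads drain; and `max_uncorrectable_rows` is a hypothetical site policy threshold, not an NVIDIA limit. None of these names come from the playbook.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action"
    RESET = "drain workloads, then reset GPU"
    REPLACE = "schedule replacement"


@dataclass(frozen=True)
class RemapState:
    # Hypothetical container for a GPU's row-remapping status.
    correctable_rows: int    # rows remapped after correctable errors
    uncorrectable_rows: int  # rows remapped after uncorrectable errors
    pending: bool            # a remap is waiting for the next GPU reset
    failure_occurred: bool   # a remap was attempted and could not be done


def triage(state: RemapState, max_uncorrectable_rows: int = 8) -> Action:
    """Pick a maintenance action from row-remapping status.

    The ordering is an assumed policy: hard failures first, then an
    illustrative cap on uncorrectable remaps, then pending remaps.
    """
    if state.failure_occurred:
        return Action.REPLACE
    if state.uncorrectable_rows >= max_uncorrectable_rows:
        return Action.REPLACE
    if state.pending:
        return Action.RESET
    return Action.NONE


if __name__ == "__main__":
    print(triage(RemapState(2, 0, pending=False, failure_occurred=False)).value)
    print(triage(RemapState(2, 1, pending=True, failure_occurred=False)).value)
    print(triage(RemapState(2, 1, pending=False, failure_occurred=True)).value)
```

Keeping the decision in a pure function makes the policy easy to test separately from whatever collector supplies the counters.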
