Meta Open-Sources GPU Cluster Monitoring Tool 'gcm'

Meta has open-sourced gcm (GPU Cluster Monitoring), a suite of tools for managing large-scale GPU infrastructure. The release provides utilities for monitoring hardware, running health checks, and collecting Slurm telemetry. It is designed to help engineers operate and maintain large GPU clusters efficiently.

- The tool is purpose-built for High-Performance Computing (HPC) clusters that use the Slurm workload manager, allowing it to anchor hardware metrics to specific Slurm Job IDs for precise attribution of resource usage.
- It runs proactive health checks using NVIDIA's Data Center GPU Manager (DCGM) both before a job starts ('Prolog') and after it ends ('Epilog'), which helps to automatically identify and drain faulty nodes before they waste compute cycles.
- GCM is designed to address "silent failures," a common issue in large clusters where a single GPU's performance degrades and compromises an entire training run without crashing the node.
- The telemetry processor converts low-level hardware data, such as GPU temperatures and NVLink errors, into the standardized OpenTelemetry (OTLP) format, enabling integration with modern observability platforms.
- This toolset is used internally at Meta to manage the Fundamental AI Research (FAIR) team's workloads across clusters of hundreds of thousands of GPUs.
- The project's architecture is a monorepo containing components written in both Python and Go.
- Potential future expansions for the project include adding support for other job schedulers beyond Slurm and integrating with non-NVIDIA hardware from vendors like AMD and Intel.
- This release follows Meta's pattern of open-sourcing its internal infrastructure tools, such as `dynolog` for CPU-GPU system profiling and `RCCLX` for enhancing communication on AMD GPUs.
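
To make the job-attribution and Prolog/Epilog ideas above concrete, here is a minimal Python sketch. It is not gcm's code or API: `GpuSample`, `tag_with_job`, `drain_command`, the attribute keys, and the 90 °C threshold are all hypothetical names and values chosen for illustration. The only real interfaces assumed are the `SLURM_JOB_ID` and `SLURMD_NODENAME` environment variables that Slurm exposes to Prolog/Epilog scripts and the `scontrol update ... State=DRAIN` command.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One hypothetical per-GPU reading; field names are illustrative only."""
    gpu_index: int
    temperature_c: float
    nvlink_errors: int


def tag_with_job(sample, env):
    """Attach Slurm context so a reading can be attributed to a specific job."""
    return {
        "slurm.job_id": env.get("SLURM_JOB_ID", "unknown"),
        "node": env.get("SLURMD_NODENAME", "unknown"),
        "gpu.index": sample.gpu_index,
        "gpu.temperature_c": sample.temperature_c,
        "gpu.nvlink_errors": sample.nvlink_errors,
    }


def drain_command(node, samples, max_temp_c=90.0):
    """Return an scontrol argv that drains the node, or None if all GPUs pass.

    The threshold and the "any NVLink error fails" rule are assumptions,
    not gcm's actual health-check policy.
    """
    bad = [
        s.gpu_index
        for s in samples
        if s.temperature_c > max_temp_c or s.nvlink_errors > 0
    ]
    if not bad:
        return None
    reason = "gpu_health_failed:" + ",".join(str(i) for i in bad)
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


if __name__ == "__main__":
    # Example readings; a real check would query the hardware instead.
    readings = [GpuSample(0, 65.0, 0), GpuSample(1, 95.0, 0)]
    for reading in readings:
        print(tag_with_job(reading, os.environ))
    # Only print the command here; a real Prolog script would execute it.
    print(drain_command(os.environ.get("SLURMD_NODENAME", "node001"), readings))
```

Keeping the decision logic in a pure function that returns the command, rather than running it, makes the policy easy to test without a Slurm cluster.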

