PyTorch publishes NCCL watchdog debug guide

PyTorch posted a practical guide to debugging NCCL watchdog timeouts, aimed at anyone running distributed GPU training or experimenting with multi-node setups. It addresses a recurring pain point when scaling model training: working out which rank or collective caused a job to hang. (x.com)

“Flight Recorder: A New Lens for Understanding NCCL Watchdog Timeouts” was published by Phillip Liu, Uttam Thakore, Junjie Wang and Justin Yang on March 25, 2026. (pytorch.org)

The Flight Recorder mechanism continuously records collective-operation events into an in-memory ring buffer and can dump per-rank trace files when a watchdog timeout or manual trigger occurs. (docs.pytorch.org/tutorials/unstable/flight_recorder_tutorial.html) An analyzer script in the PyTorch tools/flight_recorder directory runs heuristics on the dumped traces to identify culprits such as stuck ranks, and the tutorial notes that the dump files are saved to /tmp by default. (github.com/pytorch/pytorch/tree/main/torch/distributed/flight_recorder)

The documentation lists specific environment toggles that control collection and dumps: TORCH_NCCL_TRACE_BUFFER_SIZE (recommended example value 2000), TORCH_NCCL_DUMP_ON_TIMEOUT, TORCH_NCCL_DEBUG_INFO_TEMP_FILE and TORCH_NCCL_TRACE_CPP_STACK. (docs.pytorch.org/docs/stable/torch_nccl_environment_variables.html)

The blog spells out common root causes that Flight Recorder targets (CPU-side divergence, GPU hangs from CUDA/NCCL API stalls, and misconfigured collectives) and shows how the collected telemetry can distinguish the timing and stack-trace patterns associated with each cause. (pytorch.org/blog/flight-recorder-a-new-lens-for-understanding-nccl-watchdog-timeouts/)

PyTorch treats Flight Recorder as a prototype feature (historically available behind flags and not always included in binary distributions), the project plans integration with TorchComm, and the post describes design and usage experience drawn from internal use at Meta. (pytorch.org/blog/pytorch2-5/)
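As a rough illustration of the environment toggles named above, the Python sketch below builds the settings as a dictionary and applies them to the process environment. The variable names come from the brief. The helper names (flight_recorder_env, apply_flight_recorder_env), the "true" string values, and the /tmp/nccl_trace_rank_ prefix are assumptions made for this example, so check them against the PyTorch documentation for your version before relying on them.

```python
import os


def flight_recorder_env(buffer_size=2000,
                        dump_prefix="/tmp/nccl_trace_rank_",
                        cpp_stack=False):
    """Return Flight Recorder settings as a dict of strings.

    Hypothetical helper: only the variable names come from the brief;
    the value formats and the default prefix are assumptions.
    """
    if buffer_size <= 0:
        # A non-positive buffer size would leave collection switched off.
        raise ValueError("buffer_size must be a positive integer")
    env = {
        # Number of entries kept in the in-memory ring buffer.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # Ask for per-rank dump files when the watchdog times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true",
        # File prefix for the per-rank dump files (assumed default).
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }
    if cpp_stack:
        # Request C++ stack traces in the recorded entries.
        env["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
    return env


def apply_flight_recorder_env(environ=os.environ, **kwargs):
    """Write the settings into `environ` and return what was applied."""
    settings = flight_recorder_env(**kwargs)
    environ.update(settings)
    return settings
```

In a training script, apply_flight_recorder_env() would presumably need to run before the NCCL process group is created (for example, before torch.distributed.init_process_group), or the same variables could be exported in the launcher's shell instead.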
