Nvidia posts Colossus 2 model-loading update

- Nvidia said on May 18 it had posted a model-loading update tied to xAI’s Colossus 2 training system and very large checkpoint handling.
- xAI’s Colossus page says the system has 200,000 GPUs today and a roadmap to 1 million GPUs, with storage capacity above 1 exabyte.
- Nvidia’s post is on its NVIDIA AI X account, while xAI’s Colossus page lists current scale and roadmap details.

Nvidia used a May 18 post on its NVIDIA AI account on X to highlight a model-loading update tied to xAI’s Colossus 2 training system, according to the post referenced in the company’s social briefing. The post described model loading, I/O pressure and throughput requirements for petabyte-scale checkpoints at roughly “one million H100-equivalents,” according to the briefing and the linked X post. Nvidia did not publish the same material in an easily searchable blog post or documentation page that could be independently opened in this reporting. xAI’s Colossus page, however, says its current system has 200,000 GPUs, more than 1 exabyte of storage, and a roadmap to 1 million GPUs.

### What exactly did Nvidia say it had updated?

Nvidia’s May 18 social post, as described in the source briefing, focused on model loading rather than model architecture or benchmark scores. The briefing says Nvidia linked the update to “large-scale training work with SpaceXAI and Colossus 2” and said the setup involved petabyte-scale checkpoints, heavy I/O demands and throughput requirements at about one million H100-equivalents.

Nvidia has recently published related material on checkpointing and model state management in its developer channels. An April 9 Nvidia technical blog said synchronous checkpointing during large-model training can leave GPUs idle and that optimizer state is often the largest part of a checkpoint. Nvidia documentation for Megatron Bridge and NeMo also describes distributed checkpointing as a way to shard training state across files and reduce save-load overhead.

### What is Colossus 2 in this context?

xAI’s Colossus page says the company built its original Colossus system in 122 days and then doubled it in 92 days to 200,000 GPUs. The same page says xAI is “running jobs with 150K+ GPUs and 99% uptime” and has a roadmap to 1 million GPUs.

xAI’s published figures also give a sense of the storage and network scale behind the claim. The Colossus page lists 194 petabytes per second of total memory bandwidth, 3.6 terabits per second of network bandwidth per server and storage capacity above 1 exabyte.

### Why does model loading become a separate engineering problem at that size?

Nvidia documentation says a checkpoint is the full saved state of a training run, including model weights, optimizer states and metadata needed to resume training. At large scale, that means loading is not just a file-read problem; it becomes a storage, network and coordination problem across many nodes.

Megatron Core documentation says that at 256 nodes and beyond, primary bottlenecks in data loading can include index building and barrier synchronization, not only raw bandwidth. Nvidia’s checkpointing documentation also says distributed checkpointing shards state across multiple files to reduce memory overhead and improve GPU utilization during save and load operations.

### Did Nvidia verify the “million H100-equivalents” figure elsewhere?

xAI’s own public Colossus page does not say the current live system is already at one million H100-equivalents. It says the present system is 200,000 GPUs and that the company has a roadmap to 1 million GPUs.

That means the “roughly one million H100-equivalents” phrasing in Nvidia’s social post appears to refer to the scale target or equivalent compute framing used around Colossus 2, rather than a figure duplicated on the xAI page now. Reuters could not independently verify that exact equivalence claim from an Nvidia blog post or xAI technical paper available in search results.

### Where can readers check the underlying public material?

xAI’s public Colossus page is the clearest source for the system’s currently published infrastructure numbers. Nvidia’s latest public documentation on checkpointing, Megatron Core data loading and checkpoint compression provides the closest official technical context for the issues cited in the social post.

Nvidia has not, in the material reviewed here, published a fuller standalone explainer matching the social post’s wording. The next public confirmation would likely come through Nvidia developer documentation, an Nvidia technical blog, or an xAI infrastructure update that spells out how Colossus 2 handles checkpoint loading at larger scale.
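
For readers who want a concrete picture of the sharding idea that Nvidia’s checkpointing documentation describes, the following is a minimal Python sketch. It is an illustration only and is not Nvidia’s or xAI’s implementation: the function names (`save_sharded`, `load_shard`, `load_all`), the JSON file layout and the round-robin key assignment are all assumptions made for this example, and it uses only the Python standard library rather than any Megatron or NeMo API. It shows training state split across several files plus a small metadata file, so that one worker can read only its own shard.

```python
import json
import tempfile
from pathlib import Path


def save_sharded(state: dict, directory: Path, num_shards: int) -> None:
    """Write `state` as `num_shards` JSON files plus a small metadata file.

    Hypothetical layout for illustration; real frameworks use their own formats.
    """
    directory.mkdir(parents=True, exist_ok=True)
    keys = sorted(state)
    for shard_id in range(num_shards):
        # Round-robin assignment: shard i holds keys i, i + n, i + 2n, ...
        shard = {k: state[k] for k in keys[shard_id::num_shards]}
        (directory / f"shard_{shard_id}.json").write_text(json.dumps(shard))
    (directory / "metadata.json").write_text(json.dumps({"num_shards": num_shards}))


def load_shard(directory: Path, shard_id: int) -> dict:
    """Read a single shard, as one worker would, without touching the others."""
    return json.loads((directory / f"shard_{shard_id}.json").read_text())


def load_all(directory: Path) -> dict:
    """Reassemble the full state by reading every shard named in the metadata."""
    meta = json.loads((directory / "metadata.json").read_text())
    state = {}
    for shard_id in range(meta["num_shards"]):
        state.update(load_shard(directory, shard_id))
    return state


if __name__ == "__main__":
    # Toy stand-in for model weights; the values are made up for the example.
    toy_state = {f"layer{i}.weight": [float(i)] * 4 for i in range(8)}
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "ckpt"
        save_sharded(toy_state, ckpt, num_shards=4)
        assert load_all(ckpt) == toy_state
        print(sorted(load_shard(ckpt, 0)))
```

In this toy version each worker reads one small file instead of the whole checkpoint. It leaves out the coordination costs the Megatron Core documentation points to, such as index building and barrier synchronization, which is where the harder engineering sits at the scale discussed above.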
