120B LLM on one H200

A social post claims a 120‑billion‑parameter language model was trained on a single H200 GPU by storing parameters in CPU memory and using the GPU only for compute. The post asserts the setup delivered a 1.84× speedup over standard approaches and suggests the technique could lower the barrier to large‑model training for mid‑market teams building custom agents. (x.com)

Large language model training usually keeps the whole model close to the chip, because graphics memory is fast but small. A new paper posted April 6 says it trained models up to 120 billion parameters on one NVIDIA H200 by keeping most of the model in 1.5 terabytes of host memory instead. (arxiv.org) (nvidia.com)

The system is called MegaTrain, and the paper lists Zhengqing Yuan and Yanfang Ye of the University of Notre Dame and Hanchi Sun and Lichao Sun of Lehigh University as authors. Their code repository was public on GitHub by April 9 and describes the design as “RAM-centric,” with the graphics processor used as a temporary compute engine. (arxiv.org) (github.com)

In plain terms, the method treats host memory like a warehouse and the graphics processor like a workbench. For each layer, MegaTrain streams weights into the H200, runs the math, and pushes gradients back out instead of leaving the full model resident on the device. (arxiv.org)

That matters because the H200 has 141 gigabytes of on-package HBM3e memory, far less than what full-precision training needs once parameters, gradients, and optimizer state are counted. NVIDIA says the H200’s memory bandwidth is 4.8 terabytes per second, which is fast, but its capacity still caps what fits entirely on the chip. (nvidia.com)

The paper says MegaTrain uses two tricks to keep the slower host memory from stalling the run. One overlaps fetching weights, doing computation, and offloading gradients across multiple CUDA streams. The other swaps persistent autograd graphs for stateless layer templates that bind weights as they arrive. (arxiv.org)

The headline speed claim is narrower than the social post makes it sound. The authors report 1.84 times the training throughput of DeepSpeed ZeRO-3 with CPU offloading on a 14 billion parameter model, not on the 120 billion parameter configuration itself. (arxiv.org) (deepspeed.readthedocs.io)

DeepSpeed’s own documentation says ZeRO-Offload and ZeRO-3 Offload already move optimizer state, gradients, and sometimes parameters to host memory to stretch limited graphics memory. MegaTrain’s claim is that a system built around host memory from the start can use that hierarchy more efficiently on a single device. (deepspeed.ai) (deepspeed.readthedocs.io) (arxiv.org)

The paper also frames the target workload as post-training rather than giant pretraining runs. It says instruction tuning, alignment, domain adaptation, and agent specialization are increasingly node-scale jobs, but still get blocked by memory demands and scarce access to top-end graphics processors. (arxiv.org)

The work is still at the preprint stage, and the performance numbers come from the authors and their open-source repository rather than an outside benchmark. The paper’s central claim is concrete, though: if host memory can hold the model and the graphics processor can stay busy, one H200 can handle training jobs that normally spill onto many accelerators. (arxiv.org) (github.com)
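The gap between the H200's 141 gigabytes and what training needs can be checked with back-of-envelope arithmetic. The sketch below is illustrative only. The byte counts per parameter are assumptions (2 bytes for 16-bit weights; 16 bytes for 32-bit weights, 32-bit gradients, and two 32-bit Adam moments) and are not taken from the paper, and the function name is hypothetical.

```python
GB = 10**9  # decimal gigabytes, the unit vendors use for memory capacity

H200_HBM_GB = 141  # on-package memory figure cited in the article


def training_footprint_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed, in GB, for a given per-parameter byte budget (hypothetical helper)."""
    return num_params * bytes_per_param / GB


# Assumed per-parameter budgets; illustrative, not the paper's accounting.
BF16_WEIGHTS_ONLY = 2            # 16-bit weights, nothing else
FP32_WITH_ADAM = 4 + 4 + 8       # fp32 weights + fp32 gradients + two fp32 Adam moments

cases = [
    ("120B, 16-bit weights only", 120 * 10**9, BF16_WEIGHTS_ONLY),
    ("120B, fp32 weights + grads + Adam", 120 * 10**9, FP32_WITH_ADAM),
    ("14B, fp32 weights + grads + Adam", 14 * 10**9, FP32_WITH_ADAM),
]

for label, params, bytes_per_param in cases:
    need = training_footprint_gb(params, bytes_per_param)
    fits = need <= H200_HBM_GB
    print(f"{label}: {need:,.0f} GB needed, fits in {H200_HBM_GB} GB: {fits}")
```

Under these assumptions, none of the three cases fits on the device. Even the 16-bit weights of a 120 billion parameter model alone exceed 141 gigabytes, and the 14 billion parameter benchmark model overflows once gradients and optimizer state are counted, which is consistent with the baseline being an offloading system.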
