Cloud pre‑training sticker shock

Tom Boyle warned that pre‑training a 405B model can cost about $53M in cloud GPUs and said many teams run their GPUs only 2 hours a day. He also flagged inference as memory‑bound and argued that Big Cloud is poorly suited to ML engineers. The thread frames runaway pre‑training costs and low utilization as common pain points. (x.com)

Meta’s published training metrics show the Llama 3.1 family consumed about 39.3 million H100 GPU hours for pretraining, with the 405B run staged across ~16,000 H100-class GPUs. (build.nvidia.com) Meta reported on a 54‑day snapshot of the final pretraining run that recorded hundreds of interruptions, with GPU hardware and HBM3 memory accounting for the largest share of unexpected failures. (datacenterdynamics.com)

Applying public cloud single‑GPU rental rates to Meta’s family‑wide 39.3M GPU‑hour figure produces a wide range of sticker prices: about $59M at roughly $1.50/GPU‑hr, about $78.6M at $2.00/GPU‑hr, and about $117.9M at $3.00/GPU‑hr. Each extra dollar per GPU‑hour adds roughly $39M to the bill.

The 405B model’s raw parameter footprint (405 billion parameters at 2 bytes each, about 810 GB) translates to roughly three‑quarters of a terabyte of weights for unquantized BF16/FP16 serving, prompting broad use of FP8/AWQ/GPTQ quantization recipes to reduce the GPU count for inference. (notebookcheck.net) NVIDIA benchmarking showed that TensorRT‑LLM and a post‑training FP8/INT4 optimizer can increase throughput (up to ~1.44× in published tests) and, in some quantized setups, allow the 405B model to be hosted on far fewer H200 GPUs than unquantized serving would require. (developer.nvidia.com)

Independent studies and industry blogs continue to document pervasive under‑use of accelerators: Microsoft’s empirical study found many deep‑learning jobs ran at ≤50% GPU utilization, while practitioner reports estimate that whole clusters often average utilization in the teens to twenties percent. (microsoft.com) Market pricing across major clouds and specialist GPU providers diverged markedly by 2025, with single‑GPU H100 rental quotes ranging from low single‑digit dollars to double digits per hour depending on vendor, region, and commitment level, a spread large enough to change multi‑million‑dollar training totals. (intuitionlabs.ai)
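
The rental and memory arithmetic above is easy to reproduce. The sketch below is illustrative only: the function names are made up for this example, the rates are the three sample price points from the text rather than quotes from any vendor, and the memory figure counts weights only (no KV cache, activations, or runtime overhead).

```python
# Back-of-envelope estimates; all names here are illustrative, not from any library.

GPU_HOURS = 39.3e6   # Llama 3.1 family pretraining total cited in the text
N_PARAMS = 405e9     # parameter count of the 405B model

# Assumed storage width per parameter for each serving precision.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def rental_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """On-demand bill if every GPU-hour is rented at a flat hourly rate."""
    return gpu_hours * usd_per_gpu_hour


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for rate in (1.50, 2.00, 3.00):
        cost_m = rental_cost_usd(GPU_HOURS, rate) / 1e6
        print(f"${rate:.2f}/GPU-hr: about ${cost_m:.1f}M")
    for precision, width in BYTES_PER_PARAM.items():
        print(f"{precision}: about {weight_memory_gb(N_PARAMS, width):.0f} GB of weights")
```

Real bills depend on committed‑use discounts, failed or restarted runs, storage, and networking, so these numbers are a floor for intuition rather than a quote.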
