Model parallelism for trillion‑param models
- Model parallelism is the engineering trick that lets one neural network span many chips, splitting either its layers or its math when a trillion-parameter model will not fit on one device. - NVIDIA’s Megatron stack combines tensor, pipeline, data, expert, and context parallelism, while Google’s GSPMD compiler reported 50% to 62% utilization on 2,048 TPU v3 cores for models up to one trillion parameters. - The field shifted from “can it fit?” to “can it stay efficient?” as sparse models like Switch and systems like Pathways pushed larger runs across thousands of accelerators. (research.google)
A trillion-parameter model is too large for one accelerator, so engineers split the model itself across many chips instead of copying the whole thing everywhere. (docs.nvidia.com) (research.google) The simplest version is data parallelism: every chip holds the same model and works on different training examples. That breaks down once the model no longer fits in a single chip’s memory. (docs.nvidia.com) Model parallelism fixes that by slicing the network itself. Pipeline parallelism assigns different layer blocks to different GPUs, so one chip handles early layers while another handles later ones. (docs.nvidia.com) (deepwiki.com)) Tensor parallelism cuts up the math inside a single layer. Instead of one GPU multiplying a huge matrix, several GPUs each compute a shard and then exchange the results. (docs.nvidia.com) (github.com) Those two ideas are usually combined, because layers alone are not enough and splitting every operation alone is not enough. NVIDIA’s Megatron-LM paper reported 502 petaFLOP/s for a one-trillion-parameter model on 3,072 GPUs using tensor, pipeline, and data parallelism together. (arxiv.org) Google attacked the same problem with a compiler. GSPMD lets researchers write code closer to a single-device program, add partitioning hints to a small number of tensors, and let the compiler infer the rest. (research.google) (arxiv.org) In that paper, Google said GSPMD reached 50% to 62% compute utilization on as many as 2,048 Cloud TPU v3 cores for models up to one trillion parameters. Google later said the same approach could train a dense 4 trillion-parameter Transformer on 2,048 TPU cores. (arxiv.org) (cloud.google.com) Another route is sparsity, where the full parameter count is huge but only part of the network activates for each token. Switch Transformer used that design to build 395 billion and 1.6 trillion parameter models, combining expert, model, and data parallelism. (jmlr.org) (ar5iv.labs.arxiv.org) That is why “trillion parameters” is not one technique but a stack of them: memory sharding, communication scheduling, compiler partitioning, and sparse routing. Google’s PaLM, for example, trained a 540 billion-parameter dense Transformer on 6,144 TPU v4 chips with Pathways. (s10251.pcdn.co) The hard part now is less about proving a model can be split and more about keeping thousands of chips busy while they constantly exchange activations, gradients, and parameter shards. That is what model parallelism really names: not one trick, but the operating system for very large models. (docs.nvidia.com) (arxiv.org)