Scale tensor pipelines across GPUs

- PyTorch, NVIDIA and vLLM documentation now converge on the same playbook for scaling large language models: split layers across GPUs, feed micro-batches, and overlap work.
- The key tuning knobs are 1F1B scheduling, interleaving, and micro-batch count; PyTorch lists all three, while Colossal-AI says 1F1B beats GPipe on memory.
- The pattern is becoming standard across training and inference stacks, from TorchTitan to vLLM multi-node serving. (docs.pytorch.org)

Tensor parallelism chops a single layer across several GPUs, while pipeline parallelism stacks different layer blocks on different GPUs and keeps them busy at the same time. (docs.pytorch.org 1) (docs.pytorch.org 2) PyTorch’s distributed pipelining docs describe the basic trick: split one batch into micro-batches so different model stages can run concurrently on different devices. The runtime handles the micro-batch splitting, scheduling, communication, and gradient propagation. (docs.pytorch.org)

That setup exists because a giant model often does not fit cleanly on one card, and ordinary data parallelism can stall on communication as clusters get large. PyTorch’s tensor-parallel tutorial says Fully Sharded Data Parallel collectives can become dominated by ring latency once world size pushes past roughly 128 to 256 GPUs. (docs.pytorch.org)

A pipeline has a built-in waste problem called the bubble: early on, some GPUs wait for the first micro-batch to arrive, and late in the step others wait for gradients to come back. More micro-batches usually shrink that idle window, because the pipeline has more work to overlap. (docs.pytorch.org)

The scheduling choice changes memory use. Colossal-AI’s pipeline docs say GPipe runs all forward passes before any backward passes, while 1F1B starts one backward pass as soon as a micro-batch clears the pipeline, which is generally more efficient on memory and sometimes time. (colossalai.org)

That memory gain comes from releasing activations earlier. Instead of storing every intermediate tensor for every micro-batch until the end of the batch, 1F1B lets each stage start consuming and freeing them sooner. (colossalai.org)

Interleaving pushes the idea further by giving each GPU multiple smaller model chunks instead of one contiguous block of layers. Colossal-AI says this requires the number of micro-batches to be an integer multiple of the pipeline stage count and can improve both memory efficiency and time efficiency. (colossalai.org)

Tensor parallelism solves a different bottleneck inside each layer. PyTorch’s tutorial says it shards matrix-heavy modules such as linear layers and embeddings, then uses collectives like all-reduce, all-gather, and reduce-scatter to stitch results back together. (docs.pytorch.org)

Once activations, not weights, become the main pressure point, sequence parallelism enters the picture. PyTorch describes it as a tensor-parallel variant that shards work along the sequence dimension for LayerNorm or RMSNorm to save activation memory during training. (docs.pytorch.org)

These methods are increasingly used together rather than as rivals. PyTorch’s pipelining docs point to TorchTitan as a “3D parallel” example, and vLLM recommends tensor parallelism within a node and pipeline parallelism across nodes when one node is no longer enough. (docs.pytorch.org) (docs.vllm.ai)

vLLM makes the deployment rule explicit: if a model fits on one node, use tensor parallelism across that node’s GPUs; if it does not, combine tensor parallelism with pipeline parallelism across nodes. Its example uses tensor_parallel_size=8 and pipeline_parallel_size=2 for two nodes with eight GPUs each. (docs.vllm.ai)

The throughline is simple: shard the math inside layers, stagger the layers across devices, and tune micro-batches so memory spikes do not wipe out throughput. That is the operating manual most modern large-model stacks now appear to share. (docs.pytorch.org 1) (docs.pytorch.org 2) (docs.vllm.ai)
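
To put rough numbers on the bubble and on the GPipe versus 1F1B memory difference, here is a minimal back-of-the-envelope sketch. It is an idealized model, not code from PyTorch or Colossal-AI: it assumes every stage takes the same time per micro-batch, ignores communication, and applies only to the non-interleaved schedules. The function names are hypothetical.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle share of one step in a non-interleaved pipeline.

    Assumption: equal time per stage per micro-batch, so a step spans
    (micro_batches + stages - 1) slots and (stages - 1) of them are idle.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)


def peak_in_flight(stages: int, micro_batches: int,
                   stage_index: int, schedule: str) -> int:
    """Micro-batches whose activations a stage holds at its peak.

    GPipe finishes every forward pass first, so each stage holds all of them.
    1F1B begins backward passes early, so stage i (0-based) holds at most
    (stages - i) at a time in this idealized model.
    """
    if schedule == "gpipe":
        return micro_batches
    if schedule == "1f1b":
        return min(micro_batches, stages - stage_index)
    raise ValueError(f"unknown schedule: {schedule}")


def interleaving_allowed(stages: int, micro_batches: int) -> bool:
    """Colossal-AI's stated constraint: micro-batches divisible by stage count."""
    return micro_batches % stages == 0


if __name__ == "__main__":
    for m in (4, 8, 16, 32):
        print(m, round(bubble_fraction(4, m), 3),
              peak_in_flight(4, m, 0, "gpipe"),
              peak_in_flight(4, m, 0, "1f1b"))
```

In this model, raising the micro-batch count shrinks the idle share, and the 1F1B activation count stays bounded by the stage count while the GPipe count grows with every added micro-batch.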

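The sharding of a linear layer can also be illustrated without any GPU. The sketch below is a toy in plain Python lists, not the PyTorch API: each "device" is just a list entry, concatenating shard outputs stands in for all-gather, and summing partial products stands in for all-reduce. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(m, parts):
    """Cut a matrix into `parts` equal column slices."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m]
            for i in range(parts)]


def split_rows(m, parts):
    """Cut a matrix into `parts` equal row slices."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def column_parallel(x, w, parts):
    """Each shard owns a column slice of w; outputs are concatenated
    along the feature axis (the all-gather step)."""
    shards = [matmul(x, w_shard) for w_shard in split_columns(w, parts)]
    return [sum((shard[r] for shard in shards), []) for r in range(len(x))]


def row_parallel(x, w, parts):
    """Each shard owns a row slice of w and the matching column slice of x;
    partial outputs are summed elementwise (the all-reduce step)."""
    partials = [matmul(x_shard, w_shard)
                for x_shard, w_shard in zip(split_columns(x, parts),
                                            split_rows(w, parts))]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


if __name__ == "__main__":
    x = [[1, 2, 3, 4], [5, 6, 7, 8]]          # 2 tokens, 4 features
    w = [[1, 0], [0, 1], [2, 1], [1, 2]]      # 4 inputs, 2 outputs
    print(matmul(x, w))
    print(column_parallel(x, w, 2))
    print(row_parallel(x, w, 2))
```

Both sharded versions reproduce the unsharded product. The difference is which collective stitches the result together, which is why the choice of sharding axis determines the communication pattern.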

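vLLM's deployment rule reduces to a two-line decision. The helper below is a hypothetical sketch, not part of vLLM: it only computes the two sizes named in the vLLM example (tensor_parallel_size and pipeline_parallel_size), and it assumes the caller already knows how many nodes the model needs.

```python
def plan_parallelism(gpus_per_node: int, nodes_needed: int) -> dict:
    """Pick parallel sizes following the rule described above.

    nodes_needed == 1 means the model fits on one node, so only tensor
    parallelism is used. Otherwise tensor parallelism spans the GPUs
    inside each node and pipeline parallelism spans the nodes.
    """
    if gpus_per_node < 1 or nodes_needed < 1:
        raise ValueError("gpus_per_node and nodes_needed must be positive")
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": nodes_needed,
    }


if __name__ == "__main__":
    # Two nodes with eight GPUs each, as in the vLLM example.
    print(plan_parallelism(gpus_per_node=8, nodes_needed=2))
```

The product of the two sizes equals the total GPU count, which is a quick sanity check before launching a multi-node job.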