Tensor Parallelism and Costs
- Tensor parallelism splits one neural-network layer across several graphics processors so each chip holds only part of the math, cutting per-chip memory use for very large models.
- NVIDIA’s Megatron Core says tensor parallelism is best for “large layers” and memory limits, while vLLM said on April 22 that long-context serving becomes KV-cache memory-bound at 128K-plus prompts.
- Cloud H100 prices have fallen to roughly $1.49 to $6.98 per GPU-hour in 2026, shifting long-context cost math toward memory and interconnect efficiency. (perffeco.com)
Tensor parallelism is a way to split one giant model layer across several graphics processors instead of copying the whole layer onto each chip. (docs.nvidia.com) NVIDIA’s Megatron Core guide says tensor parallelism is used for “individual layers” and is best when layers are too large for one GPU’s memory. (docs.nvidia.com)

The basic trick is to cut a weight matrix into pieces, let each GPU do part of the multiply, and then combine the partial answers with collective communication. NVIDIA NeMo says that sharding reduces model-state memory and also shrinks per-GPU activation sizes. (docs.nvidia.com)

That memory relief comes with a network bill. NVIDIA’s TensorRT-LLM team said traditional multi-GPU all-reduce gets slower as GPU count rises because every step requires synchronization around the ring. (developer.nvidia.com) NVIDIA said in November 2024 that its MultiShot method on NVLink Switch can make all-reduce nearly three times faster by breaking it into reduce-scatter plus all-gather and using multicast. (developer.nvidia.com)

Long-context inference makes the tradeoff harsher because memory shifts from model weights to the key-value cache, the running notebook of prior tokens the model must keep nearby. The vLLM team wrote on April 22, 2026 that standard full-attention decoders become KV-cache-memory-bound at 128K-token contexts and above. (vllm.ai)

vLLM said halving KV-cache storage with FP8 can raise concurrency or support longer contexts at the same hardware cost, and in the best memory-bound decode cases the per-token KV-cache cost fell to 54% of the BF16 version. (vllm.ai)

That is why tensor parallelism is only one lever in the bill. Teams also juggle cache precision, sequence length, data parallelism, pipeline parallelism, and context parallelism, which Megatron Core lists separately for long sequences of 8,000 tokens and up. (docs.nvidia.com)

DeepSeek’s V3 technical report shows how far those system choices can move training economics. The company said the 671-billion-parameter mixture-of-experts model, with 37 billion parameters activated per token, used multi-token prediction and finished full training in 2.788 million H800 GPU-hours. (arxiv.org) (github.com)

The cloud side has changed fast too. Perffeco’s March 2026 pricing survey put on-demand H100 SXM 80GB rates between $1.49 and $6.98 per GPU-hour, down more than 70% from 2023 peaks of roughly $7.50 to $11.00. (perffeco.com)

Cheaper H100 time does not erase the architecture problem. As contexts get longer, the winning setup is often the one that spends fewer bytes moving activations and KV cache over the interconnect, not just the one renting the cheapest GPUs. (vllm.ai) (developer.nvidia.com)
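
To make the memory-versus-price arithmetic above concrete, here is a minimal Python sketch of how KV-cache size, tensor-parallel degree, and cache precision might feed into a per-sequence cost estimate. Everything in it is an assumption for illustration: the `ModelShape` class, the 80-layer shape with 8 KV heads, and the 40 GiB per-GPU cache budget are hypothetical and are not taken from vLLM, NVIDIA, or any model named in this article. It also assumes full attention and KV heads that divide evenly across the tensor-parallel group, and it models storage only, not the decode-speed effects vLLM reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, bytes_per_value: int, tp_degree: int = 1) -> int:
    """Per-GPU KV-cache bytes needed for one token of context.

    Assumes full attention and KV heads sharded evenly across the
    tensor-parallel group (no head replication).
    """
    if tp_degree < 1 or shape.num_kv_heads % tp_degree != 0:
        raise ValueError("sketch assumes num_kv_heads is divisible by tp_degree")
    heads_per_gpu = shape.num_kv_heads // tp_degree
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * shape.num_layers * heads_per_gpu * shape.head_dim * bytes_per_value


def max_sequences(shape: ModelShape, bytes_per_value: int, tp_degree: int,
                  context_tokens: int, kv_budget_bytes_per_gpu: int) -> int:
    """How many full-length sequences fit in each GPU's KV-cache budget."""
    per_sequence = kv_bytes_per_token(shape, bytes_per_value, tp_degree) * context_tokens
    return kv_budget_bytes_per_gpu // per_sequence


def cost_per_sequence_hour(gpu_hour_usd: float, tp_degree: int, sequences: int) -> float:
    """Rental cost of the whole tensor-parallel group, split across resident sequences."""
    if sequences <= 0:
        raise ValueError("no sequence fits in the KV-cache budget")
    return gpu_hour_usd * tp_degree / sequences


if __name__ == "__main__":
    shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)  # assumed shape
    context = 128 * 1024          # 128K-token context
    budget = 40 * 2**30           # assumed 40 GiB per GPU left over for KV cache
    tp = 8                        # assumed tensor-parallel degree

    for label, width in (("BF16", 2), ("FP8", 1)):
        fits = max_sequences(shape, width, tp, context, budget)
        for price in (1.49, 6.98):  # price range quoted in the article
            cost = cost_per_sequence_hour(price, tp, fits)
            print(f"{label}: {fits} sequences, ${cost:.3f} per sequence-hour at ${price}/GPU-hour")
```

Under these assumed numbers, the BF16 cache holds 8 concurrent 128K-token sequences and the FP8 cache holds 16, so the rental cost per resident sequence is halved at any given GPU-hour price. The sketch deliberately leaves out interconnect time, which the article identifies as the other half of the bill.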