DeepSpeed Enables Automatic Tensor Parallelism for HuggingFace Models

DeepSpeed has introduced a new feature for automatic tensor parallelism, allowing HuggingFace models that exceed single-GPU memory to be scaled across multiple GPUs. The update aims to reduce manual configuration and optimize throughput for large model inference. This functionality simplifies the process of deploying very large models for high-performance serving.
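For readers who want a sense of what this looks like in practice, the following is a minimal sketch rather than an official recipe. It assumes the script is started with the `deepspeed` launcher (which exports `LOCAL_RANK` and `WORLD_SIZE`), that `torch`, `transformers`, and `deepspeed` are installed, and that the checkpoint belongs to a supported architecture. The helper names `launch_settings` and `build_tp_pipeline` and the `model_name` argument are hypothetical and exist only for illustration.

```python
import os


def launch_settings():
    """Read the rank and world size exported by the deepspeed launcher.

    Falls back to a single-process setup when the variables are not set.
    """
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    return local_rank, world_size


def build_tp_pipeline(model_name, task="text-generation"):
    """Build a HuggingFace pipeline whose model is sharded by DeepSpeed.

    `model_name` is a placeholder for any checkpoint from a supported
    architecture; no injection policy is passed, so DeepSpeed is left to
    work out the split on its own.
    """
    # Heavy imports stay inside the function so this file can be imported
    # on a machine without GPUs or without these packages installed.
    import torch
    import transformers
    import deepspeed

    local_rank, world_size = launch_settings()

    # Load the ordinary HuggingFace pipeline on this process's GPU.
    pipe = transformers.pipeline(task=task, model=model_name, device=local_rank)

    # Wrap the model with the DeepSpeed inference engine. The tensor
    # parallel degree is set to the number of launched processes.
    pipe.model = deepspeed.init_inference(
        pipe.model,
        tensor_parallel={"tp_size": world_size},
        dtype=torch.float16,
    )
    return pipe
```

A script that calls `build_tp_pipeline(...)` would typically be started with something like `deepspeed --num_gpus 2 script.py`, so that one process runs per GPU. Older DeepSpeed releases expressed the same setting as `mp_size=world_size`.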

- Before this update, enabling tensor parallelism in DeepSpeed for HuggingFace models required developers to manually create and provide an injection policy, a complex configuration step that specified how to split the model's layers across GPUs. The new automatic feature determines and applies this policy at runtime.
- The core technology behind DeepSpeed's memory optimization is the Zero Redundancy Optimizer (ZeRO), which partitions the model's weights, gradients, and optimizer states across multiple GPUs to reduce memory overhead. This approach has been shown to train models with up to 200 billion parameters significantly faster than previous methods.
- This automation is part of a broader trend to simplify the deployment of massive models, as manual parallelization is a significant bottleneck. Manually implementing tensor parallelism requires a deep understanding of how to split individual layers ("row-wise" or "column-wise") to ensure the mathematical operations remain correct across devices.
- While DeepSpeed focuses on memory-saving for very large models, other popular inference engines like vLLM are known for a balance of speed and ease of use, and TensorRT-LLM offers maximum performance but is limited to NVIDIA hardware. DeepSpeed's strength lies in its ability to handle models that are too large for other frameworks.
- The automatic tensor parallelism is initially available for inference, but DeepSpeed is also developing it for training workflows where it can be combined with ZeRO to lower communication costs compared to other sharded data parallelism methods.
- The push for such optimizations is driven by the massive hardware requirements and operational costs of deploying private large language models, which often require multi-GPU servers with specialized networking and cooling, leading to high capital expenditures.
- Microsoft Research is the original developer of DeepSpeed, an open-source library for PyTorch designed to make distributed training more efficient. It was instrumental in training large models like the 176-billion parameter BLOOM model.
- This feature supports a range of popular HuggingFace models, but notably excludes some architectures like GPT-2, XLNet, and DeBERTa in its initial release.
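The "row-wise" and "column-wise" splits mentioned above can be illustrated with a toy example. The sketch below uses plain Python lists in place of GPU tensors and is not DeepSpeed's implementation; all function names are hypothetical. It shows why the two splits need different ways of recombining results: column shards produce slices of the output that are concatenated, while row shards produce partial sums that are added together.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split matrix m into `parts` equal blocks of columns."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split matrix m into `parts` equal blocks of rows."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def column_parallel(x, w, parts):
    """Column-wise split: each simulated device holds some columns of w.

    Every device sees the full input and produces a slice of the output
    columns; the slices are concatenated (an all-gather on real hardware).
    """
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def row_parallel(x, w, parts):
    """Row-wise split: each simulated device holds some rows of w.

    Every device sees only the matching columns of the input and produces
    a full-size partial result; the partial results are summed element by
    element (an all-reduce on real hardware).
    """
    x_shards = split_columns(x, parts)
    w_shards = split_rows(w, parts)
    partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


# Small integer example: a 2x4 input and a 4x2 weight matrix.
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]
reference = matmul(x, w)  # result computed on a single "device"

assert column_parallel(x, w, 2) == reference
assert row_parallel(x, w, 2) == reference
```

Picking the wrong recombination step, for example concatenating the outputs of a row-wise split, gives a result of the wrong shape or value, which is the kind of mistake an automatically derived policy is meant to rule out.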
