OpenAI Scales Kubernetes to 25,000 GPUs
OpenAI is now orchestrating 25,000 GPUs in production using Kubernetes, achieving 97% hardware utilization. The achievement demonstrates Kubernetes's extended capabilities for managing large-scale machine learning clusters, well beyond the historical 5,000-node limit. Key practices include topology-aware scheduling to minimize latency and building robust failure domain boundaries.
- Before scaling to 25,000 GPUs, OpenAI had already scaled a single Kubernetes cluster to 7,500 nodes to support large models like GPT-3, CLIP, and DALL-E. This earlier effort involved managing over 200,000 IP addresses and transitioning from Flannel to more advanced Azure-native networking solutions to handle the scale. - For efficient GPU utilization, OpenAI developed Triton, a Python-based language and compiler for writing highly optimized custom GPU kernels. Triton allows developers to fuse operations, which can accelerate PyTorch code by reducing memory transfers between the GPU and slower main memory. - The infrastructure for training these large models runs exclusively on Microsoft Azure, utilizing tens of thousands of NVIDIA A100 GPUs. This partnership has led to the creation of the Azure OpenAI Service, which provides API access to OpenAI's models. - To manage resource allocation and prevent idle nodes from being deallocated by the cluster autoscaler, OpenAI employs a "balloon" Deployment technique. This involves running low-priority pods that reserve resources, ensuring capacity is readily available for new high-priority training jobs without waiting for new virtual machines to spin up. - A significant challenge at this scale is the load on the Kubernetes API servers, which can use up to 70GB of heap memory in a 7,500-node cluster. OpenAI mitigated this by optimizing WATCHes on Endpoints, as services like 'kubelet' and 'node-exporter' created significant traffic with each node change. - Given the scale of GPU clusters, companies like OpenAI must comply with emerging regulations like the EU AI Act, which requires documentation and public summaries of copyrighted training data used for large-scale models. This adds a significant administrative layer to the management of training infrastructure. - In a move toward enhanced data privacy, Microsoft Azure now offers confidential computing options for AI, including confidential GPUs. This technology ensures that data and models remain encrypted even while being processed on the GPU, protecting them from access by cloud administrators. The Azure OpenAI Service's Whisper model is one of the first services to leverage this for confidential speech-to-text transcription.