Playbook: Preemptible GPUs Can Cut Costs by Up to 91%

A new infrastructure playbook details how preemptible GPU instances on AWS, GCP, and Azure can cut costs by 70-91% relative to on-demand pricing. The guide warns, however, that utilization can fall as low as 30% without tight orchestration, and it recommends dynamic job schedulers and checkpointing to mitigate interruption risk for training and non-critical inference.
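
One way to see why utilization matters as much as the discount is to compute the cost per useful GPU-hour. The sketch below is an illustrative model rather than anything taken from the playbook: it assumes "utilization" means the fraction of paid instance time that does productive work, and the $10.00 on-demand rate and the function names are hypothetical.

```python
def cost_per_useful_hour(on_demand_rate: float, discount: float, utilization: float) -> float:
    """Price paid per hour of productive GPU work on a discounted instance."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    discounted_rate = on_demand_rate * (1.0 - discount)
    # Idle or wasted time is still billed, so divide by the useful fraction.
    return discounted_rate / utilization


def break_even_utilization(discount: float) -> float:
    """Utilization at which a discounted instance costs the same per useful
    hour as fully utilized on-demand capacity."""
    return 1.0 - discount


ON_DEMAND_RATE = 10.00  # hypothetical on-demand price in dollars per GPU-hour

for discount in (0.70, 0.91):
    for utilization in (0.30, 0.90):
        cost = cost_per_useful_hour(ON_DEMAND_RATE, discount, utilization)
        print(f"discount={discount:.0%} utilization={utilization:.0%} "
              f"cost per useful hour=${cost:.2f}")
```

Under these assumptions, a 70% discount at 30% utilization costs about the same per useful hour as fully utilized on-demand capacity, which is why the orchestration advice carries as much weight as the headline discount.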

Preemptible GPUs are unused cloud computing capacity sold at a steep discount. AWS sells them as Spot Instances, Azure as Spot Virtual Machines, and Google Cloud as Spot VMs (formerly Preemptible VMs). Providers keep this spare capacity on hand for demand spikes and maintenance, and offer it at savings of up to 91% compared to on-demand prices. The catch is that the provider can reclaim an instance on very short notice.

The notice period varies by provider: AWS Spot Instances give a two-minute warning, while Google Cloud gives 30 seconds. Interruption rates also differ by GPU type and region; for example, H100 GPUs have a higher hourly interruption rate (around 4.1%) than older V100 GPUs (about 0.8%). Google's older Preemptible VMs had a 24-hour maximum lifetime, but its newer Spot VMs, like the spot offerings on AWS and Azure, have no such restriction.

Checkpointing is crucial for ML model training on these instances. The training job regularly saves the model's state, so that after an interruption it can resume from the last saved point and lose only the work done since then. Frameworks like TensorFlow and orchestration platforms like Kubeflow have built-in support for checkpointing, making them well suited to preemptible environments.

For inference, preemptible instances are best reserved for non-critical, asynchronous workloads such as batch image processing or secondary analysis of user data. Real-time, user-facing inference is generally not recommended, because an interruption directly degrades the user experience.

Beyond checkpointing, successful strategies diversify the instance types and regions used. Managed instance groups or services like AWS's EC2 Fleet can automatically re-provision preempted instances, helping to maintain a desired level of capacity.
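
The checkpoint-and-resume pattern described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the playbook's implementation: the toy state dictionary, the function names, and the `interrupt_at` parameter (which simulates a preemption) are all hypothetical, and a real training job would normally use its framework's own checkpoint facilities instead of a JSON file.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so a preemption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path: str, total_steps: int, checkpoint_every: int, interrupt_at=None) -> dict:
    """Toy training loop that resumes from the last checkpoint on startup."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated preemption")
        state["weight"] += 0.5  # stand-in for one optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        try:
            train(ckpt, total_steps=40, checkpoint_every=10, interrupt_at=25)
        except RuntimeError as exc:
            print("interrupted:", exc)
        print("resuming from step", load_checkpoint(ckpt)["step"])
        print("final state:", train(ckpt, total_steps=40, checkpoint_every=10))
```

The temporary-file-then-rename step is a design choice worth keeping in any variant: with notice periods as short as 30 seconds, a checkpoint that is half written when the instance disappears should never replace the last good one.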

Spreading workloads across multiple instance types and regions builds resilience against the unpredictable nature of preemptible capacity. Companies like Spotify and Snap have used these techniques to achieve significant cost savings. Spotify, for instance, reportedly cut its machine learning costs from $8.2 million to $2.4 million, a reduction of about 71%, by using AWS Spot Instances. These examples show that, with the right architecture and workload, the cost benefits of preemptible GPUs can be substantial.
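
To gauge which workloads fit, the hourly interruption rates quoted earlier can be turned into a rough estimate of how a long-running job will fare. The sketch below is a simplification under stated assumptions: it treats each hour as an independent trial with a constant interruption probability, assumes an interrupted instance is replaced immediately, and assumes interruptions fall uniformly within a checkpoint interval. The function names and the 24-hour, 30-minute example values are illustrative.

```python
def survival_probability(hourly_rate: float, hours: float) -> float:
    """Probability that a single instance runs `hours` without interruption."""
    return (1.0 - hourly_rate) ** hours


def expected_interruptions(hourly_rate: float, hours: float) -> float:
    """Expected number of interruptions over `hours` of running time."""
    return hourly_rate * hours


def expected_lost_hours(hourly_rate: float, hours: float, checkpoint_interval_hours: float) -> float:
    """Expected work lost, at half a checkpoint interval per interruption on average."""
    return expected_interruptions(hourly_rate, hours) * checkpoint_interval_hours / 2.0


# Hourly interruption rates as quoted in the article.
RATES = {"H100": 0.041, "V100": 0.008}
JOB_HOURS = 24.0            # illustrative job length
CHECKPOINT_INTERVAL = 0.5   # illustrative: checkpoint every 30 minutes

for gpu, rate in RATES.items():
    print(f"{gpu}: P(no interruption in {JOB_HOURS:.0f}h)="
          f"{survival_probability(rate, JOB_HOURS):.3f}, "
          f"expected interruptions={expected_interruptions(rate, JOB_HOURS):.2f}, "
          f"expected lost hours="
          f"{expected_lost_hours(rate, JOB_HOURS, CHECKPOINT_INTERVAL):.3f}")
```

Under this model, a 24-hour job has about a 37% chance of finishing uninterrupted on H100 capacity and about 82% on V100 capacity, while frequent checkpoints keep the expected lost work to a small fraction of an hour in both cases.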


Shared from Scout - Be the smartest in the room.