GKE Autopilot adds GPU sharing
Google Cloud’s GKE Autopilot now supports GPU sharing, letting multiple workloads share a single accelerator and giving teams a way to run AI infrastructure more efficiently. The capability is pitched as a competitive hybrid/cloud alternative for customers seeking better GPU economics. (x.com)
Google Cloud now lets Google Kubernetes Engine Autopilot workloads share a single graphics processing unit, extending GPU time-sharing to its managed Kubernetes mode. (docs.cloud.google.com)
The change appears in Google’s Autopilot GPU documentation, which says workloads on version 1.29.4-gke.1427000 and later can request GPUs and “also use GPU sharing capabilities, like time-sharing.” A separate setup page says Autopilot clusters can use time-sharing starting with version 1.29.3-gke.1093000. (docs.cloud.google.com 1) (docs.cloud.google.com 2)
A graphics processing unit is an accelerator chip used for artificial intelligence training and inference, but Kubernetes normally makes teams ask for whole GPUs in integer units. Google’s documentation says that can leave one container holding an entire physical GPU even when it needs only a fraction of the capacity. (docs.cloud.google.com)
Google offers three sharing methods in Google Kubernetes Engine: multi-instance GPU, GPU time-sharing, and NVIDIA Multi-Process Service. In time-sharing, the company says NVIDIA hardware switches between processes so each workload gets a timeslice on the same device. (docs.cloud.google.com)
Autopilot is Google’s managed operating mode for Google Kubernetes Engine, where Google handles nodes, scaling, security settings, upgrades, and much of the infrastructure plumbing. That matters for teams that want shared accelerator capacity without also running their own Kubernetes node pools. (docs.cloud.google.com)
Google has been widening where Autopilot can run. At KubeCon + CloudNativeCon Europe on March 24, 2026, Google said Autopilot compute classes were available for Standard clusters too, letting customers turn on Autopilot on a per-workload basis instead of choosing one cluster mode at creation time. (cloud.google.com)
Google’s own March 6, 2026 example for “cost-effective AI” pairs Autopilot GPU time-sharing with virtual clusters from vCluster so separate teams can run isolated model-serving environments on the same shared GPU nodes. The post uses Ollama and describes Legal Research and Customer Support teams sharing hardware while keeping separate control planes and admin access. (cloud.google.com)
The tradeoff is that not every sharing model gives the same isolation. Google says multi-instance GPU provides hardware isolation and predictable quality of service, while GPU time-sharing provides software-level isolation for address space, performance, and error handling. (docs.cloud.google.com)
Google is pitching the feature around utilization and cost, not raw exclusivity. In its documentation, the company says GPU sharing is meant for workloads that do not need all of a device’s resources and can “save running costs” by reducing underused accelerator capacity. (docs.cloud.google.com 1) (docs.cloud.google.com 2)
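For readers who want to see what a time-shared GPU request might look like in practice, here is a minimal sketch in Python that builds a Pod manifest and prints it as JSON. It is an illustration under stated assumptions, not Google's published example: the three `cloud.google.com/...` node selector labels, the `nvidia-tesla-t4` accelerator name, and the limit of two clients per GPU are recalled from GKE's GPU time-sharing documentation rather than taken from the sources quoted above, and should be checked against the current docs before use. The helper name `timeshared_gpu_pod`, the Pod name, and the `ollama/ollama` image are hypothetical choices for this sketch.

```python
import json


def timeshared_gpu_pod(name, image, gpu_type="nvidia-tesla-t4", max_clients=2):
    """Build a Pod manifest (as a dict) that asks for one time-shared GPU.

    Assumption: the node selector labels below are the ones GKE uses to
    opt a workload into GPU time-sharing; verify them in the GKE docs.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {
                # Assumed GKE labels: accelerator type, sharing strategy,
                # and how many containers may share one physical GPU.
                "cloud.google.com/gke-accelerator": gpu_type,
                "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
                "cloud.google.com/gke-max-shared-clients-per-gpu": str(max_clients),
            },
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Kubernetes still takes an integer GPU count; the
                    # sharing happens on the node, not in this number.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical workload name and image, for illustration only.
    manifest = timeshared_gpu_pod("legal-research", "ollama/ollama")
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and passed to `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. The request remains a whole-number GPU, which matches the integer-unit constraint described above; under the assumed labels, it is the node selector that tells the platform the device may be shared.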