Kubernetes 1.35 Pitched as 'OS for AI'

The Kubernetes 1.35 release is being described as a pivotal update positioning the platform as the "operating system for AI." Key features include enhanced GPU scheduling, improved workload isolation, and better support for multi-tenant clusters, enabling more efficient and secure orchestration of AI services.

- The introduction of native "Gang Scheduling" via a new Workload API is a significant change, moving beyond the previous pod-by-pod scheduling model. This "all-or-nothing" approach ensures that tightly-coupled distributed workloads, such as those for AI model training, only commence when all necessary pods can be scheduled simultaneously, preventing resource waste from partially-started jobs.
- Security for multi-tenant AI environments is enhanced through several new and graduating features. Pods can now run with isolated user and group ID mappings via user namespaces, allowing a container to operate as root internally while being mapped to an unprivileged user on the host. Additionally, a new kubelet-level credential verification for cached images prevents a pod from using a private image it isn't authorized for, even if it's already on the node.
- The Dynamic Resource Allocation (DRA) framework, which became generally available in version 1.34, is now considered stable and production-ready in 1.35. This allows for more granular, topology-aware scheduling of specialized hardware like GPUs and other AI accelerators, moving beyond the previous, less efficient device plugin model.
- In-place resource updates for pods have graduated to General Availability (GA), a long-requested feature. This allows for the modification of a running container's CPU and memory requests and limits without requiring a pod restart, which is critical for minimizing disruption to long-running AI training jobs and stateful applications.
- The release advances workload identity with the beta graduation of native pod certificates. The kubelet can now natively handle the request and filesystem mounting of certificates for pods, which includes automatic rotation, simplifying the setup of mTLS and service mesh architectures without relying on external controllers.
- For privacy-preserving AI, the ecosystem is increasingly integrating Confidential Computing capabilities. Projects like Confidential Containers (CoCo), a CNCF sandbox project, aim to run Kubernetes pods within hardware-level Trusted Execution Environments (TEEs). This encrypts data while in-use in memory, protecting sensitive models and data from access by the underlying cloud infrastructure or even cluster administrators.
- With the EU AI Act's full applicability deadline of August 2, 2026, approaching, the orchestration of "High-Risk" AI systems on Kubernetes now requires a focus on auditable governance. Features in Kubernetes 1.35 that provide stronger identity, resource isolation, and logging are foundational for meeting the Act's mandates on risk management, human oversight, and the technical documentation required for compliance.
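
To make the "all-or-nothing" idea behind gang scheduling concrete, here is a toy sketch in Python. It is not the Kubernetes scheduler's algorithm and does not use the Workload API; the function name `gang_schedule`, the node names, and the CPU figures are all hypothetical, and the first-fit heuristic is an assumption chosen for brevity (it can reject a gang that a smarter packing would fit).

```python
from typing import Dict, List, Optional


def gang_schedule(free_cpu: Dict[str, int], pod_cpu: List[int]) -> Optional[List[str]]:
    """Place every pod in the gang, or none of them (illustrative only).

    free_cpu maps a node name to its free CPU in millicores (made-up data).
    pod_cpu lists the CPU request of each pod in the gang.
    Returns the node chosen for each pod, in input order, or None if the
    heuristic cannot place the whole gang.
    """
    remaining = dict(free_cpu)  # work on a copy so a failed attempt changes nothing
    placement: Dict[int, str] = {}

    # First-fit decreasing: try the largest pods first, nodes in name order.
    order = sorted(range(len(pod_cpu)), key=lambda i: pod_cpu[i], reverse=True)
    for i in order:
        node = next((n for n in sorted(remaining) if remaining[n] >= pod_cpu[i]), None)
        if node is None:
            return None  # one pod does not fit, so no pod in the gang is started
        remaining[node] -= pod_cpu[i]
        placement[i] = node

    free_cpu.update(remaining)  # commit capacity only after every pod has a node
    return [placement[i] for i in range(len(pod_cpu))]


nodes = {"node-a": 4000, "node-b": 2000}
print(gang_schedule(nodes, [2000, 2000, 2000, 2000]))  # None: four pods do not fit
print(gang_schedule(nodes, [2000, 2000, 2000]))        # ['node-a', 'node-a', 'node-b']
```

The point of the sketch is the commit step: capacity is deducted only once every pod has a node, so a gang that cannot fully start leaves the cluster untouched. A pod-by-pod model would instead have started three of the four pods and left them idle while holding resources.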
