Kubernetes Evolves for Agentic Workloads
Kubernetes usage for ML is shifting toward agent-level resource negotiation, in which GPU resources are allocated dynamically according to workflow demand. This approach, discussed in recent MLOps demonstrations, allows for more granular cost control and budget enforcement per tenant. Alongside it, AI gateways are emerging as a pattern for managing API access and routing requests across multiple models within a cluster.
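To make the gateway pattern concrete, here is a minimal sketch of routing a request to one of several model backends. It assumes an OpenAI-style JSON body with a `model` field and a per-backend queue depth reported by the model servers; the `Backend` type, the `pick_backend` function, and the URLs are hypothetical and do not reflect any particular gateway's API.

```python
import json
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical model-server endpoint and its latest reported load."""
    url: str
    model: str
    queue_depth: int  # pending requests, assumed to be reported by the server


def pick_backend(request_body: str, backends: list[Backend]) -> Backend:
    """Return the least-loaded backend that serves the requested model."""
    # Assumption: the body is OpenAI-style JSON with a top-level "model" field.
    model = json.loads(request_body).get("model")
    candidates = [b for b in backends if b.model == model]
    if not candidates:
        raise LookupError(f"no backend serves model {model!r}")
    # Among matching backends, prefer the one with the shortest queue.
    return min(candidates, key=lambda b: b.queue_depth)


if __name__ == "__main__":
    pool = [
        Backend("http://pod-a:8000", "llama-3-8b", queue_depth=7),
        Backend("http://pod-b:8000", "llama-3-8b", queue_depth=2),
        Backend("http://pod-c:8000", "mistral-7b", queue_depth=0),
    ]
    body = json.dumps({"model": "llama-3-8b", "messages": []})
    print(pick_backend(body, pool).url)
```

A real gateway would make this decision in its proxy layer and refresh the load figures continuously; the sketch only shows the selection step.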
- A core enabler for this shift is Kubernetes' Dynamic Resource Allocation (DRA), a feature that fundamentally redesigns how specialized hardware is managed. It moves beyond the limitations of the older device plugin framework, allowing for fine-grained resource claims, such as requesting a GPU with a specific memory size or compute capability.
- The new allocation models allow for more sophisticated GPU sharing techniques beyond assigning a whole device. These include time-slicing, where multiple containers share a GPU by taking turns, and Multi-Instance GPU (MIG), which partitions a single GPU into multiple isolated, hardware-backed instances with dedicated resources.
- An official Kubernetes project called the Gateway API Inference Extension is being developed to standardize routing for AI workloads. This allows for model-aware routing (e.g., based on the model name in an OpenAI API request body) and optimized load balancing based on real-time metrics from model servers.
- Mature API gateways like Kgateway (formerly Gloo) are being repurposed as "AI Gateways." These can manage and apply security policies to traffic directed at multiple external LLM providers, such as OpenAI and Anthropic, from within the cluster.
- One experimental architecture gives every AI agent its own virtualized Kubernetes control plane to act as an independent economic actor. These agents then bid against each other in real-time auctions for GPU access, with their budgets determined by business priority, reportedly increasing GPU utilization to over 90%.
- To address the security risks of running autonomous agent code, Google is contributing to a new Kubernetes primitive called Agent Sandbox. It uses technologies like gVisor and Kata Containers to provide strong, kernel-level isolation for agent workloads, reducing the risk of container escapes or data exfiltration.
- The push for dynamic allocation is a direct response to massive underutilization of expensive hardware, with some 2026 platform surveys indicating that average GPU utilization in Kubernetes clusters is less than 40%. Adopting FinOps practices to right-size resources and automate cleanup can often reduce Kubernetes costs by 30-40%.
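The bidding idea in the experimental architecture above can be illustrated with a small sketch. The source does not describe the auction mechanism, so this assumes a sealed-bid, first-price round in which each agent bids for a single GPU and cannot bid more than its remaining budget; the `Agent` type, the `run_auction` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    budget: float  # remaining credits, assumed to be set from business priority
    bid: float     # credits offered for one GPU in this round


def run_auction(agents: list[Agent], free_gpus: int) -> dict[str, float]:
    """Award GPUs to the highest affordable bids; return {agent name: price paid}."""
    # Bids above an agent's remaining budget are rejected (budget enforcement).
    eligible = [a for a in agents if 0 < a.bid <= a.budget]
    ranked = sorted(eligible, key=lambda a: a.bid, reverse=True)
    awards: dict[str, float] = {}
    for agent in ranked[:max(free_gpus, 0)]:
        agent.budget -= agent.bid  # first-price: each winner pays its own bid
        awards[agent.name] = agent.bid
    return awards


if __name__ == "__main__":
    agents = [
        Agent("etl", budget=10, bid=4),
        Agent("chat", budget=50, bid=9),
        Agent("batch", budget=3, bid=5),   # bid exceeds budget, so it is skipped
        Agent("eval", budget=20, bid=6),
    ]
    print(run_auction(agents, free_gpus=2))
```

Other mechanisms, such as second-price auctions or bids for fractional GPU shares, would change the payment and allocation steps but keep the same overall shape of collecting bids, filtering by budget, and awarding capacity.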