Kubernetes as the Control Surface
Threads argue Kubernetes is becoming the default control surface for new AI services while virtualization remains essential for governance and tenancy, pushing architectures toward mixed estates. The guidance emphasizes stacks combining vLLM, Ray Serve/K8s, Kafka, vector DBs and observability to manage distributed AI workflows. (x.com)
Kubernetes is becoming the operating layer for new artificial intelligence services, while virtual machines still handle the isolation, policy, and tenancy many enterprises require. (kubernetes.io 1) (kubernetes.io 2) (techdocs.broadcom.com)
Kubernetes is open-source software for deploying, scaling, and managing containerized applications, which makes it a natural fit for inference services that need to start, stop, and scale across many graphics processing units. Google Cloud, Amazon Web Services, and Microsoft all publish current guidance for serving large language models on managed Kubernetes services. (kubernetes.io) (docs.cloud.google.com) (docs.aws.amazon.com) (techcommunity.microsoft.com)
A control surface is the layer operators use to declare what should run, where it should run, and how it should recover after failures. In current artificial intelligence stacks, that often means Kubernetes schedules the containers, while model servers such as vLLM handle token generation and request batching. (kubernetes.io) (docs.vllm.ai)
Ray Serve enters when a single model server on one machine is not enough. Ray's Kubernetes example shows KubeRay, Ray Serve, and vLLM working together to deploy a Qwen 2.5 7B Instruct model with an OpenAI-compatible interface on Kubernetes. (docs.ray.io) (docs.vllm.ai)
The rest of the stack fills in the pieces Kubernetes does not provide by itself. Apache Kafka moves events between services in real time, vector databases store and search numerical representations for retrieval, and OpenTelemetry collects traces, metrics, and logs across the request path. (kafka.apache.org) (pinecone.io) (opentelemetry.io) That combination matches how many production artificial intelligence systems actually work: one service receives a prompt, another retrieves context, another calls a model, and another evaluates or stores the result.
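The four-stage request path described above can be sketched in a few lines of Python. This is a minimal, in-process illustration under stated assumptions, not a production design: the `Request` type, the `DOCS` lookup table, and every stage function are hypothetical stand-ins. In a real deployment each stage would typically be a separate service, the retrieval step would query a vector database, and the generation step would call a model server over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical record that travels through every stage of the pipeline."""
    prompt: str
    context: list = field(default_factory=list)
    answer: str = ""
    trace: list = field(default_factory=list)  # stand-in for real tracing spans


# Hypothetical keyword-to-passage table standing in for a vector database.
DOCS = {
    "kafka": "Kafka moves events between services in real time.",
    "kubernetes": "Kubernetes deploys, scales, and manages containers.",
}


def receive(req: Request) -> Request:
    """Stage 1: accept and normalize the incoming prompt."""
    req.prompt = req.prompt.strip()
    req.trace.append("receive")
    return req


def retrieve(req: Request) -> Request:
    """Stage 2: look up context (keyword match here, similarity search in practice)."""
    words = set(req.prompt.lower().split())
    req.context = [text for key, text in DOCS.items() if key in words]
    req.trace.append("retrieve")
    return req


def generate(req: Request) -> Request:
    """Stage 3: placeholder for a call to a model server such as vLLM."""
    req.answer = f"[{len(req.context)} context passage(s)] {req.prompt}"
    req.trace.append("generate")
    return req


def evaluate(req: Request) -> Request:
    """Stage 4: reject empty answers before the result is stored."""
    if not req.answer:
        raise ValueError("model returned an empty answer")
    req.trace.append("evaluate")
    return req


def run_pipeline(prompt: str) -> Request:
    """Run the stages in order, as separate services would in production."""
    req = Request(prompt=prompt)
    for stage in (receive, retrieve, generate, evaluate):
        req = stage(req)
    return req


if __name__ == "__main__":
    result = run_pipeline("  what does kafka do  ")
    print(result.trace)
    print(result.answer)
```

Keeping the stages as independent functions mirrors the reason the surrounding stack exists: each one can be scaled, replaced, or traced on its own once it runs as a separate workload.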
Amazon's "AI on EKS" materials and Google's inference guidance both frame Kubernetes as the orchestration layer around those distributed components, not just the place where a single model binary runs. (aws.amazon.com) (docs.cloud.google.com)
Virtualization has not gone away in that model. Kubernetes documentation says shared clusters bring security, fairness, and "noisy neighbor" problems, and VMware positions its current cloud stack as a way to deliver self-service infrastructure for both Kubernetes and virtual-machine-based applications in multi-tenant private clouds. (kubernetes.io) (techdocs.broadcom.com)
That is why mixed estates keep showing up in vendor architectures. VMware's private artificial intelligence reference materials pair vSphere and Tanzu Kubernetes, and Microsoft's Azure Kubernetes Service guidance highlights Multi-Instance GPU (MIG) partitioning to isolate several model deployments on shared hardware. (github.com) (techcommunity.microsoft.com)
The practical split is becoming clearer in 2026 documentation. If a team needs a single high-throughput endpoint, vLLM on Kubernetes is often enough; if it needs multi-node serving, multiple models, event pipelines, retrieval, and end-to-end tracing, the stack expands around Kubernetes rather than replacing it. (docs.aws.amazon.com) (docs.vllm.ai) (docs.ray.io) (kafka.apache.org) (opentelemetry.io)
The result is not a winner-take-all shift from virtual machines to containers. It is a layered architecture in which Kubernetes increasingly acts as the control surface for artificial intelligence services, while virtualization remains the boundary many enterprises still use for governance, tenancy, and risk control. (kubernetes.io) (techdocs.broadcom.com)
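To make the idea of a control surface concrete, the sketch below builds a Kubernetes Deployment manifest for the single-endpoint case as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. It is an illustration under assumptions, not a vendor-recommended configuration: the helper `vllm_deployment`, the deployment name, the `vllm/vllm-openai:latest` image tag, the model identifier, port 8000, the `/health` probe path, and the probe timings are all choices made for this example and should be checked against the vLLM and cluster documentation before use.

```python
import json


def vllm_deployment(name, model, replicas=1, gpus=1,
                    image="vllm/vllm-openai:latest"):
    """Return a Deployment manifest (as a dict) for one vLLM endpoint.

    Every default here is an assumption for illustration only.
    """
    labels = {"app": name}
    container = {
        "name": "vllm",
        "image": image,                      # what should run
        "args": ["--model", model],
        "ports": [{"containerPort": 8000}],  # assumed serving port
        # Where it should run: only on nodes that can grant this many GPUs.
        "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
        # How it should recover: traffic is withheld until the probe passes.
        "readinessProbe": {
            "httpGet": {"path": "/health", "port": 8000},
            "initialDelaySeconds": 60,
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # Kubernetes replaces pods that fail
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = vllm_deployment("qwen-chat", "Qwen/Qwen2.5-7B-Instruct")
    print(json.dumps(manifest, indent=2))
```

The manifest only declares desired state. Tenancy and isolation boundaries, whether enforced by virtual machines or by GPU partitioning, sit outside this object, which is consistent with the layered picture described above.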