Kubernetes for AI inference

The CNCF is pushing Kubernetes as the standard platform for running AI inference, adding llm-d support, in-place pod resizing, and AI-specific cluster controls to make large-model serving more Kubernetes-native. This signals a shift toward portable, cluster-first inference stacks for multi-modal models and edge-to-cloud deployments. (cloudnativenow.com)

IBM Research, Red Hat and Google Cloud announced the contribution of llm-d to the CNCF Sandbox at KubeCon Europe on March 24, 2026. (cloud.google.com) llm-d’s founding contributors include NVIDIA and CoreWeave, and the project lists additional backers such as AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, UC Berkeley and the University of Chicago. (research.ibm.com)

Google detailed a GKE Inference Gateway that leverages llm-d’s Endpoint Picker to route requests based on KV-cache hit rates, in-flight requests and queue depth, yielding a >35% Time-to-First-Token reduction on Qwen Coder and a 52% P95 improvement on a DeepSeek workload while doubling prefix cache hit rates from 35% to 70%. (cloud.google.com)

The CNCF’s Kubernetes AI Conformance Program has nearly doubled certified platforms since November, growing from 18 to 31 certified offerings and adding OVHcloud, SpectroCloud, JD Cloud and China Unicom Cloud. (prnewswire.com) Kubernetes AI Requirements (KAR) v1.35 makes stable in-place pod resizing a mandatory conformance check and adds requirements for high-performance pod-to-pod communication, advanced inference ingress, and disaggregated inference support. (cloudnativenow.com) CNCF said it will move beyond self-assessments by building a “Verify Conformance Bot” for automated validation and plans to extend KAR later in the year to include Sovereign AI standards emphasizing sandboxing and enhanced data privacy. (cloudnativenow.com)

llm-d, launched in 2025, positions itself as a Kubernetes-native, vendor-neutral distributed inference framework intended to compose with inference engines like vLLM to deliver low-latency, high-throughput serving across accelerators and clouds. (research.ibm.com)
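
To make the routing idea concrete, here is a toy sketch of how a picker might weigh the three signals Google names for the Endpoint Picker: KV-cache hit rate, in-flight requests and queue depth. It is not llm-d's or GKE's actual algorithm or API. The `Endpoint` type, the `score` and `pick_endpoint` functions, and the weights are all hypothetical, chosen only to show the trade-off between cache affinity and load.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical snapshot of one model-server replica's state."""
    name: str
    prefix_cache_hit_rate: float  # expected share of the prompt already in KV cache, 0.0 to 1.0
    inflight_requests: int        # requests currently being processed
    queue_depth: int              # requests waiting to start


# Illustrative weights (assumptions, not values from llm-d or GKE).
W_CACHE = 1.0
W_INFLIGHT = 0.05
W_QUEUE = 0.1


def score(ep: Endpoint) -> float:
    """Higher is better: reward cache reuse, penalize load."""
    return (
        W_CACHE * ep.prefix_cache_hit_rate
        - W_INFLIGHT * ep.inflight_requests
        - W_QUEUE * ep.queue_depth
    )


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Endpoint:
    """Return the replica with the best combined score."""
    if not endpoints:
        raise ValueError("no endpoints available")
    return max(endpoints, key=score)


if __name__ == "__main__":
    replicas = [
        Endpoint("pod-a", prefix_cache_hit_rate=0.70, inflight_requests=4, queue_depth=1),
        Endpoint("pod-b", prefix_cache_hit_rate=0.35, inflight_requests=2, queue_depth=0),
        Endpoint("pod-c", prefix_cache_hit_rate=0.90, inflight_requests=10, queue_depth=5),
    ]
    print(pick_endpoint(replicas).name)
```

With these made-up numbers the picker chooses `pod-a`: `pod-c` has the warmest cache but is too heavily loaded, while `pod-b` is idle but would recompute most of the prompt. A real scheduler would tune or replace the linear weighting; the sketch only shows why cache-aware routing can beat plain load balancing.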
