AI infra is splitting into two planes

The industry is converging on a two‑plane model: Kubernetes for service-level concerns and tenancy, and HPC schedulers like Slurm for dense GPU job packing and throughput. (developer.nvidia.com) Efforts to standardize AI deployment, such as llm-d and broader CNCF conformance work, aim to make model runtimes portable and predictable across clouds. (thenewstack.io) Observability is becoming the central control surface for AI cost and reliability, with new tooling that targets GPU bottlenecks and failed jobs. (itbrief.asia) Some inference use cases are explicitly moving to serverless edge approaches to avoid Kubernetes overhead. (markaicode.com)

A year ago, a lot of companies talked as if one control system would run all artificial intelligence workloads. In April 2026, the stack is breaking in two: Kubernetes is handling shared services and tenants, while Slurm is handling the part where thousands of graphics processors need to be packed tightly and kept busy. (developer.nvidia.com)

Kubernetes is the software world’s apartment manager. It decides which app gets which room, keeps services running, and separates one team’s workload from another team’s workload on the same cluster. (thenewstack.io)

Slurm comes from supercomputing, where the job is less “keep this web service alive” and more “feed 8,000 graphics processors without gaps.” NVIDIA said this week that Slurm still schedules more than 65% of the Top500 supercomputers, which is why big training teams already have years of scripts, quotas, and accounting built around it. (developer.nvidia.com)

The surprise is that companies are no longer choosing one or the other. NVIDIA’s new Slinky project runs Slurm on top of Kubernetes, and NVIDIA says it already uses that setup in production on clusters with more than 1,000 worker nodes and over 8,000 graphics processors. (developer.nvidia.com)

That split exists because training a giant model and serving answers to users are different jobs. NVIDIA said its Kubernetes-plus-Slurm setup can run large language model training and multinode inference while matching the communication performance of non-containerized Slurm in its internal benchmarks. (developer.nvidia.com)

Once two control planes show up, portability becomes the next fight. The Cloud Native Computing Foundation said on March 24 that llm-d entered its Sandbox program to push “any model, any accelerator, any cloud,” and The New Stack reported that conformance work is trying to make artificial intelligence behavior more predictable across vendors. (cncf.io) (thenewstack.io)

That matters because a model runtime is turning into a shipping container. If the container shape is standard, a company can move inference software across platforms from Google, Red Hat, IBM, CoreWeave, NVIDIA, and other contributors in the llm-d ecosystem without rewriting the whole deployment story each time. (cloud.google.com) (cncf.io)

The new bottleneck is not only scheduling. It is seeing which graphics processor sat idle, which job failed halfway through, and which team burned money on low utilization, so observability is moving from a dashboard on the side to the main control surface. (virtana.com) (tmcnet.com)

Virtana’s April 8 release is a clean example of where the market is going. Its Nutanix integration promises a single view of graphics processor spend, utilization, efficiency, and wasted cost across both infrastructure and artificial intelligence environments, which tells you buyers now want cost control and failure analysis in the same pane. (virtana.com) (finance.yahoo.com)

And then there is a third move, which is to skip the whole cluster manager for some inference jobs. Cloudflare says Workers AI runs models on serverless graphics processors on its network, and recent deployment guides are pitching sub-50 millisecond edge inference with no Kubernetes layer at all for lightweight, latency-sensitive workloads. (developers.cloudflare.com) (markaicode.com)

So the shape of the market in April 2026 looks less like one winner and more like a map. Kubernetes is becoming the front desk, Slurm is becoming the engine room, observability is becoming the steering wheel, and edge serverless is becoming the shortcut when the trip is short enough. (developer.nvidia.com) (thenewstack.io) (developers.cloudflare.com)

