AI breaks cloud‑native models
A new AppDevANGLE video argues that AI workloads are straining standard cloud‑native assumptions: GPU locality, interconnect topology, power and cooling density, and storage throughput now matter. Orchestration decisions therefore need to be hardware‑aware and deterministic rather than purely elasticity‑driven, which creates room for infrastructure control planes that manage mixed VM/container and accelerator estates. (youtube.com)
Cloud software was built on the idea that one server is much like another, so the scheduler can grab any free machine and move on. Artificial intelligence jobs break that rule: an eight-chip training run can slow down if the chips sit on the wrong board or are connected through the wrong link. (kubernetes.io)
Graphics processing units are math engines built to do many small calculations at once, which is why model training and model serving both pile onto them. A Kubernetes workload can request a graphics processing unit as a resource, but the project’s own documentation says the implementation still has limits, which becomes a problem when the exact device layout affects speed. (kubernetes.io)
That device layout is called topology: the map of which processor talks directly to which other processor. Kubernetes added a Topology Manager because its earlier resource managers made allocation decisions independently and could place central processing units and devices on different Non-Uniform Memory Access nodes, adding latency on multi-socket machines. (kubernetes.io)
On modern artificial intelligence hardware, that map is no longer a minor optimization. NVIDIA says its sixth-generation NVLink gives each Rubin graphics processing unit 3.6 terabytes per second of bandwidth, and its NVL72 rack links 72 graphics processing units in an all-to-all design for 260 terabytes per second of total bandwidth. (nvidia.com)
Google’s Tensor Processing Unit pods show the same shift from generic servers to carefully arranged hardware. Google says a Tensor Processing Unit version 4 pod contains 4,096 chips tied together with reconfigurable high-speed links, and version 5p pods scale to 8,960 chips with multiple three-dimensional layouts. (cloud.google.com, cloud.google.com) Once a training job is spread across that kind of fabric, placement stops being a simple “find me eight accelerators” request.
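
To make the difference from a plain “find me eight accelerators” request concrete, here is a minimal sketch that contrasts taking any free devices with requiring all of them to share one fast-interconnect domain. The `Accelerator` record, the `island` label, and both function names are hypothetical illustrations, not part of Kubernetes or any vendor API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """One free accelerator; `island` is a hypothetical label for its fast-interconnect domain."""
    name: str
    island: str


def naive_pick(free, count):
    """Elasticity-style placement: take the first `count` free devices, wherever they are."""
    return free[:count] if len(free) >= count else None


def island_aware_pick(free, count):
    """Return `count` devices that all share one island, or None if no island has enough."""
    by_island = defaultdict(list)
    for acc in free:
        by_island[acc.island].append(acc)
    # Visit islands in sorted order so the same inventory always yields the same answer.
    for island in sorted(by_island):
        members = sorted(by_island[island], key=lambda a: a.name)
        if len(members) >= count:
            return members[:count]
    return None


# Toy inventory (assumed for illustration): island "a" has 4 free devices, island "b" has 8.
free = [Accelerator(f"a{i}", "a") for i in range(4)] + \
       [Accelerator(f"b{i}", "b") for i in range(8)]

naive = naive_pick(free, 8)
aware = island_aware_pick(free, 8)
print(sorted({acc.island for acc in naive}))  # islands touched by the naive pick
print(sorted({acc.island for acc in aware}))  # islands touched by the island-aware pick
```

In this toy inventory the naive pick spans both islands, while the island-aware pick stays inside island "b". A real placer would read the island map from the hardware rather than from hand-written labels.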
A job may need eight chips inside one fast island instead of eight chips scattered across a cluster, because the slow part is often chip-to-chip communication rather than raw compute. (nvidia.com, cloud.google.com)
Storage has the same problem. Large model training streams giant datasets and checkpoints through the system, so the bottleneck can move from processors to input and output. SiliconANGLE’s March 26, 2026 report on cloud-native artificial intelligence infrastructure says production execution is shifting attention from model choice to the machinery that feeds and runs it. (siliconangle.com)
Power and cooling density turn the software problem into a building problem, because the unit being planned is now a rack rather than a server. NVIDIA’s Grace Blackwell rack-scale guide says these systems combine NVLink inside the rack with InfiniBand and Ethernet across racks, which means operators are designing around rack-scale communication patterns instead of treating every rack as a generic pool. (nvidia.com)
That is why the control layer is changing shape. Kubernetes now has device plugins, node resource managers, and topology policies to line up processors, memory, and accelerators, but those tools were added to help Kubernetes understand hardware its scheduler originally treated as mostly interchangeable. (kubernetes.io, kubernetes.io)
The opening for new infrastructure companies is not “containers for artificial intelligence.” It is software that can see a mixed estate of virtual machines, containers, graphics processing units, and specialized accelerators, then make deterministic placement decisions based on the physical map of the hardware instead of following the old cloud habit of chasing any idle server. (siliconangle.com, youtube.com)
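
As a rough illustration of what “deterministic placement” over a mixed estate could mean, the sketch below filters hosts on hard constraints and then applies a fixed tie-break, so the answer does not depend on the order in which the inventory is listed. The `Host` fields, the `place` function, and the host names are all hypothetical. They do not describe any product named in the article.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record for one machine in a mixed estate."""
    name: str
    runtime: str            # "vm" or "container"
    accelerator: str        # e.g. "gpu", "tpu", or "none"
    island: str             # label for the fast-interconnect domain the host sits in
    free_accelerators: int


def place(hosts, runtime, accelerator, count):
    """Pick one host that satisfies the hard constraints, using a fixed tie-break.

    Preference order: tightest fit (fewest leftover accelerators), then island,
    then host name, so the result never depends on inventory ordering.
    """
    candidates = [
        h for h in hosts
        if h.runtime == runtime
        and h.accelerator == accelerator
        and h.free_accelerators >= count
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda h: (h.free_accelerators - count, h.island, h.name))


# Toy estate (assumed for illustration only).
estate = [
    Host("vm-01", "vm", "gpu", "rack-1", 8),
    Host("k8s-01", "container", "gpu", "rack-1", 8),
    Host("k8s-02", "container", "gpu", "rack-2", 4),
    Host("k8s-03", "container", "tpu", "rack-3", 8),
    Host("k8s-04", "container", "none", "rack-2", 0),
]

first = place(estate, "container", "gpu", 4)

# Shuffle the inventory and place again: the chosen host should not change.
shuffled = estate[:]
random.Random(0).shuffle(shuffled)
second = place(shuffled, "container", "gpu", 4)
print(first.name, second.name)
```

The tightest-fit preference is one possible policy, chosen here only to keep larger blocks of accelerators free for bigger jobs. The property that matters for determinism is the explicit, total ordering in the sort key.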