JigarShahDC maps inference tiers
- Former DOE official Jigar Shah laid out a three-tier AI inference map, arguing serving workloads will spread across giant hubs, metro sites, and edge nodes.
- His key claim was economic and physical: 100–500 MW Tier 1 campuses could absorb 60–80% of inference spend, but latency still pulls work closer.
- That matters because AI buildouts are no longer just a power story: network fabric, replication, and geography now shape where models run.

AI inference is turning into an infrastructure placement problem. Not just a chip problem, and not just a power problem. The useful shift in Jigar Shah’s recent thread is that he tries to map inference the way people already map power grids or cloud regions: by tiers. Big centralized campuses do one job. Metro sites do another. Edge nodes do a third. That framing matters because the industry keeps talking as if all AI demand ends up in giant campuses, when the real picture looks more layered.

### What did Shah actually add?

Shah’s contribution was a simple deployment model for inference: a small number of very large hubs, a wider layer of metro facilities, and then local edge capacity near users. The rough logic is familiar from cloud and telecom, but he applies it to AI serving rather than generic compute. In his broader recent posts, he has also argued that inference does not need the same giant campuses as frontier training and that NVIDIA is pushing more distributed inference footprints, including sub-5 MW facilities closer to users. (mckinsey.com)

### Why split inference into tiers?

Because “inference” is not one workload. Some requests are huge, batchy, and cost-sensitive. Some are interactive and latency-sensitive. Some need a giant model spread across many accelerators. Others fit on one box. McKinsey has been making essentially the same macro point from the hyperscaler side: training is pulling toward large high-density campuses, while inference is pulling builds toward metro areas with low round-trip time and strong interconnects. (bsky.app)

### So what belongs in Tier 1?

The biggest campuses handle the heavy stuff: frontier models, large shared serving pools, and workloads where utilization matters more than being physically close to the user. That is where you can justify extreme power density, custom networking, and the operational pain of running very large GPU fleets.
Shah’s 100–500 MW idea is aggressive, but the direction fits the market: hyperscalers are concentrating the most power-hungry AI infrastructure into a relatively small number of giant sites. (mckinsey.com)

### Why isn’t that enough?

Because latency is stubborn. If you want fast interactive responses, voice, agents, or regional data handling, you keep getting pulled back toward metro deployments. That is the Tier 2 logic. You give up some scale efficiency, but you cut round-trip time and place capacity nearer demand centers. OpenAI has been making a similar “right systems for the right workloads” point in its inference partnerships and latency guidance: smaller or specialized serving footprints can matter as much as raw model size. (mckinsey.com)

### What is the network warning here?

The interesting part of Shah’s thread is the reminder that AI networking is increasingly east-west, not just north-south. In plain English, the hard traffic is often inside the cluster (GPU to GPU, node to node), not just user requests coming in and answers going out. Dell’s AI networking guide describes this backend fabric as a dedicated high-bandwidth, low-latency inter-node network, often 400 GbE or higher for distributed jobs. (openai.com)

### But isn’t backend fabric mostly a training issue?

Sometimes, yes. If a model fits on one server, inference can avoid a lot of that complexity. Dell says exactly that for single-server inference designs. But once models get too large, or operators use tensor parallelism, sharding, or distributed serving, inference starts looking a lot more like a networking problem too. NVIDIA’s own distributed tooling now treats sharding and domain parallelism as relevant to both training and inference, which is the key technical reason Shah’s warning lands. (infohub.delltechnologies.com)

### What’s the practical takeaway?

Stop planning inference as one monolithic fleet.
Teams need to decide which workloads want giant centralized economics, which need metro latency, and which deserve true edge placement. And they need to budget for internal GPU traffic much earlier, because the bottleneck may be the fabric inside the cluster, not the internet pipe outside it. (infohub.delltechnologies.com)

### Bottom line

Shah’s tier map is useful because it turns “AI datacenters” into a routing question. The winners will not just have more megawatts. They will know which inference jobs belong where, and they will build the network fabric to match. (bsky.app) (mckinsey.com)
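
To make the routing question concrete, here is a minimal Python sketch of a placement rule that follows the tier logic described above. Everything in it is an assumption made for illustration, not something taken from Shah's thread or any vendor guide: the names `Workload`, `place`, and `needs_backend_fabric` are hypothetical, and so are the latency and memory thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HUB = "tier1_hub"      # large centralized campus
    METRO = "tier2_metro"  # metro facility near a demand center
    EDGE = "tier3_edge"    # local capacity near users


@dataclass(frozen=True)
class Workload:
    name: str
    latency_budget_ms: float  # end-to-end response target
    model_memory_gb: float    # memory needed to hold the model for serving


# Hypothetical thresholds, chosen only for illustration.
EDGE_LATENCY_MS = 20.0     # budgets this tight are assumed to need edge placement
METRO_LATENCY_MS = 100.0   # budgets this tight are assumed to need at least metro
EDGE_MAX_MODEL_GB = 80.0   # assumed ceiling for a model served from one edge box
SERVER_MEMORY_GB = 640.0   # assumed accelerator memory of a single server


def place(w: Workload) -> Tier:
    """Pick the most centralized tier that still meets the latency budget."""
    fits_edge = w.model_memory_gb <= EDGE_MAX_MODEL_GB
    if w.latency_budget_ms <= EDGE_LATENCY_MS and fits_edge:
        return Tier.EDGE
    if w.latency_budget_ms <= METRO_LATENCY_MS:
        return Tier.METRO
    # Latency-tolerant work goes where utilization and scale are best.
    return Tier.HUB


def needs_backend_fabric(w: Workload) -> bool:
    """A model that does not fit on one server must be sharded across nodes,
    which is when east-west (node-to-node) traffic has to be budgeted."""
    return w.model_memory_gb > SERVER_MEMORY_GB


if __name__ == "__main__":
    jobs = [
        Workload("voice-assistant", latency_budget_ms=15, model_memory_gb=40),
        Workload("regional-chat", latency_budget_ms=80, model_memory_gb=300),
        Workload("batch-summaries", latency_budget_ms=5000, model_memory_gb=1200),
    ]
    for job in jobs:
        print(job.name, place(job).value, "fabric:", needs_backend_fabric(job))
```

The rule deliberately prefers the most centralized tier that still meets the latency target, which mirrors the trade-off in the article: scale efficiency by default, proximity only when the workload demands it. A real planner would also weigh cost, data residency, and available capacity, none of which are modeled here.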