Inference now dominates AI compute spend

- Deloitte and McKinsey now describe a real crossover: inference has moved from the side job to the main AI workload as models hit production.
- Deloitte put inference at half of AI compute in 2025 and two-thirds in 2026; AWS, Microsoft, and Nvidia are now shipping chips built around that math.
- That matters because serving tokens never stops, so AI economics now hinge less on one giant training run than on power, latency, and cost per query.

AI compute used to mean training: one huge run, one giant cluster, one eye-watering bill. But the center of gravity is shifting. The expensive part now is often serving models after they launch, not building them in the first place. That is the basic idea behind the “inference dominates spend” claim, and by 2026 it looks less like a hot take and more like the shape of the market.

### What changed?

The cleanest signal came from Deloitte’s 2025 forecast, which put inference at about half of all AI compute in 2025 and two-thirds in 2026. McKinsey is a little more conservative over a longer horizon, but it points the same way: inference becomes the larger share of AI workloads and starts driving where and how data centers get built. Different models, same direction. (computerworld.com)

### What exactly is inference?

Inference is the moment a trained model does useful work: it writes a reply, ranks ads, generates code, answers a support question, or reasons through a workflow. Training is the boot camp. Inference is the day job. And the day job is relentless, because every user prompt, every refresh, every agent step, and every background task burns more tokens. That recurring load is why spend can flip so fast once products reach real usage. (computerworld.com)

### Why does production change the math?

A frontier training run is huge, but it is episodic. Inference is smaller per request, but it runs all day, every day, for millions of people and systems at once. The catch is that users feel latency immediately. So providers cannot optimize only for raw peak performance. They also need low delay, predictable throughput, and much better performance per watt. That pushes the whole stack toward serving efficiency, not just training bragging rights. (mckinsey.com)

### Why are custom chips suddenly everywhere?

Because inference is a margin problem.
AWS says Inferentia is built specifically to deliver the lowest-cost inference in EC2, with first-gen systems offering up to 70% lower cost per inference than comparable EC2 instances, and Inferentia2 adding higher throughput and much lower latency. Microsoft is even more explicit about Maia 200, describing it as an inference accelerator designed to shift the economics of large-scale AI. (mckinsey.com) These are not vanity chips. They are attempts to stop token serving from eating the business.

### Why is Nvidia still talking about “AI factories”?

Because inference now looks industrial. Nvidia’s pitch around Blackwell is that the product is not really a chip but a full system for turning power, networking, cooling, and compute into “intelligence” at scale. That framing sounds grandiose, but it fits the new bottleneck. Once inference becomes the steady-state workload, the winning system is the one that keeps responses fast while squeezing more useful output from every rack and every megawatt. (aws.amazon.com)

### Why does this spill into data-center power?

Inference-heavy infrastructure wants different geography and design. McKinsey argues training pushes giant high-density campuses, while inference also pulls build-outs toward metro areas where low latency matters. At the same time, total US data-center power demand is projected to climb sharply this decade. So when companies chase power deals, cooling advances, and faster site approvals, that is not separate from AI product demand. It is the physical consequence of inference becoming continuous utility work. (blogs.nvidia.com)

### What are hyperscalers seeing already?

Alphabet’s April 29, 2026 earnings call showed the demand side in plain numbers: Google Cloud revenue grew 63% year over year, topped $20 billion for the first time, and backlog jumped past $460 billion. That does not isolate inference by itself, but it does show what happens when AI moves from demo to deployment. Usage compounds.
Infrastructure commitments pile up. The spend stops looking temporary. (mckinsey.com)

### So what’s the bottom line?

The important shift is not “training is over.” Frontier training still matters. But the business that gets built on top of those models is increasingly governed by inference economics: cost per token, watts per query, latency per response, and uptime under real traffic. Basically, AI is moving from a research-compute story to an operations-compute story. And that is why chips, power, and data-center design suddenly look like the whole game. (abc.xyz) (computerworld.com)
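
To make the episodic-versus-continuous point concrete, here is a rough back-of-the-envelope sketch in Python. It is illustrative only: the training bill, traffic, tokens per query, and token price are hypothetical round numbers chosen for easy arithmetic, not figures from Deloitte, McKinsey, or any vendor. It also tracks cumulative dollars for a single made-up model, which is a different measure from the industry-wide compute shares quoted above.

```python
import math

TOKENS_PER_MILLION = 1_000_000


def daily_serving_cost(daily_queries: float, tokens_per_query: float,
                       price_per_million_tokens: float) -> float:
    """Dollars spent per day on serving, given traffic and a per-token price."""
    daily_tokens = daily_queries * tokens_per_query
    return daily_tokens / TOKENS_PER_MILLION * price_per_million_tokens


def crossover_day(training_cost: float, serving_cost_per_day: float) -> int:
    """First whole day on which cumulative serving spend reaches the training bill."""
    if serving_cost_per_day <= 0:
        raise ValueError("serving cost per day must be positive")
    return math.ceil(training_cost / serving_cost_per_day)


def inference_share(training_cost: float, serving_cost_per_day: float,
                    days: int) -> float:
    """Inference's fraction of cumulative spend (training + serving) after `days`."""
    serving_total = serving_cost_per_day * days
    return serving_total / (training_cost + serving_total)


# Hypothetical inputs, assumed purely for illustration.
TRAINING_COST = 100_000_000.0      # one-time training bill, in dollars
DAILY_QUERIES = 500_000_000.0      # requests served per day
TOKENS_PER_QUERY = 1_000.0         # average tokens processed per request
PRICE_PER_MILLION_TOKENS = 2.0     # all-in serving cost per million tokens, in dollars

per_day = daily_serving_cost(DAILY_QUERIES, TOKENS_PER_QUERY, PRICE_PER_MILLION_TOKENS)
day = crossover_day(TRAINING_COST, per_day)

print(f"serving cost per day: ${per_day:,.0f}")
print(f"serving spend matches training spend on day {day}")
for horizon in (30, 100, 365):
    share = inference_share(TRAINING_COST, per_day, horizon)
    print(f"day {horizon}: inference is {share:.0%} of cumulative spend")
```

Under these assumed inputs, the training bill is paid once while the serving bill recurs daily, so inference's share of cumulative spend only rises with time. Changing the traffic or the per-token price moves the crossover date but not the direction, which is why providers focus on cost per token and performance per watt.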

Shared from Scout - Be the smartest in the room.