Google and hyperscalers squeeze GPUs
- Alphabet, Amazon, and Meta are still pouring tens of billions into AI infrastructure, while Google and AWS push custom inference chips to ease GPU pressure.
- Google’s Ironwood TPU is built specifically for inference, and AWS says Trainium2 can deliver 30% to 40% better price performance than GPU instances.
- The real choke point is memory: SK hynix says AI demand keeps HBM tight, so cloud GPU prices and contract timing stay exposed.


AI compute is splitting into two markets at once. One market is the glamorous one: frontier models, giant training runs, Nvidia racks everywhere. The other is the one that actually hurts budgets every day: inference, the steady stream of tokens you have to serve after the model is built. That second market is where the squeeze is showing up now. Google, Amazon, Meta, and the rest are still spending aggressively on AI infrastructure, but the supply chain underneath that buildout is tighter than the headlines make it sound.

### Why are people suddenly talking about inference?

Training is a spike. Inference is a utility bill. Once a model ships, every query burns accelerator time and high-bandwidth memory, and that cost repeats all day. That is why Google framed its latest TPU generation, Ironwood, as the first TPU designed specifically for inference rather than just general AI acceleration.

### What did Google actually do?

Google used Cloud Next to show where its head is. Ironwood scales to 9,216 liquid-cooled chips and comes with higher HBM capacity and bandwidth, plus faster interconnects between chips. The important part is not just raw speed. It is that Google is openly optimizing for serving “thinking” and agentic models at large scale, basically the workloads that keep cloud bills running long after training ends.

### Is AWS doing the same thing?

Basically, yes. AWS is pushing Trainium2 as a way to route some demand away from scarce top-end GPUs. AWS says Trn2 instances are built for both training and inference, and it claims 30% to 40% better price performance than its GPU-based P5e and P5en instances. That matters because hyperscalers are no longer just buying Nvidia. They are trying to substitute around bottlenecks with their own silicon.

### So where is the real bottleneck?

Memory. Not just accelerators. HBM is the hard part because advanced AI chips need stacks of very fast memory sitting right beside the compute.
If HBM is tight, the whole accelerator market stays tight. SK hynix said on April 23 that strong AI infrastructure investment kept demand elevated even in a seasonally weak quarter, and it tied that demand directly to inference.

### How tight is “tight”?

Tight enough that memory makers are printing records. Micron posted record fiscal Q2 2026 results on March 18 and said the quarter was driven by strong demand and tight industry supply. SK hynix posted record quarterly revenue and operating profit in April, with operating margin hitting 72%. When both the cloud buyers and the memory suppliers are talking this way, it is a seller’s market in the parts that matter most.

### Are hyperscalers still spending through it?

Yes, and that is the point. Amazon said trailing-12-month free cash flow fell sharply because property and equipment purchases jumped by $59.3 billion, primarily reflecting AI investment. AWS revenue still grew 28% in Q1 2026. The message is simple: the biggest buyers are not backing off just because supply is constrained. They are spending through the shortage.

### What does that mean for everyone else?

If you are a model builder, a startup, or a company bidding a big inference contract, your risk is not just “can I get compute?” It is “can I get predictable compute at a stable price for long enough to sign customers?” When hyperscalers lock up accelerators, memory, and power at scale, everybody below them gets pushed into the leftovers, or into custom-chip ecosystems that are cheaper but less portable.

### Bottom line?

This is not just a GPU shortage story anymore. It is an inference economics story. Google and AWS are trying to dodge the crunch with custom silicon, but the memory layer is still tight, and hyperscaler spending is keeping it that way.
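
For readers pricing an inference contract, it may help to translate the "30% to 40% better price performance" claim into a bill. The sketch below assumes "price performance" means work delivered per dollar, which is one common reading and not a definition taken from AWS. The $100,000 baseline and the function names are hypothetical, chosen only for illustration. Under that assumption, a 30% gain cuts the cost of the same workload by about 23%, not 30%.

```python
def cost_ratio(price_perf_gain: float) -> float:
    """Relative cost of the same workload after a price-performance gain.

    Assumes a gain of 0.30 means 1.30x as much work per dollar, so the
    same work costs 1 / 1.30 of the baseline.
    """
    if price_perf_gain <= -1.0:
        raise ValueError("price_perf_gain must be greater than -1")
    return 1.0 / (1.0 + price_perf_gain)


def monthly_bill(baseline_usd: float, price_perf_gain: float) -> float:
    """Projected bill for an unchanged workload on the cheaper option."""
    return baseline_usd * cost_ratio(price_perf_gain)


if __name__ == "__main__":
    baseline = 100_000.0  # hypothetical monthly GPU inference spend
    for gain in (0.30, 0.40):  # the range quoted in the article
        bill = monthly_bill(baseline, gain)
        saving = 1.0 - cost_ratio(gain)
        print(f"gain {gain:.0%}: bill ${bill:,.0f}, saving {saving:.1%}")
```

This ignores migration cost, portability risk, and any change in throughput per request, all of which affect whether the quoted gain shows up in a real contract.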