AI's new bottlenecks
- AI demand is still surging, but the limits are shifting from raw chip counts to memory, power, and utilization.
- Reports say millions of expensive GPUs are sitting underused, leaving billions of dollars of idle compute on company books.
- That mismatch makes memory bandwidth, power design, and deployment efficiency the central constraints for scaling AI today.
The tightest limit in artificial intelligence is no longer just getting more chips; it is keeping costly chips fed with memory, power, and real work. (forbes.com)

Nvidia chief executive Jensen Huang said at GTC in March 2026 that the company sees a $1 trillion market for data-center systems through 2028, with inference, the live serving of models after training, taking center stage. Forbes reported on April 20 that the industry’s constraint is shifting from raw processor counts to the architecture around them. (forbes.com)

At the same time, Cast AI said on April 21 that average GPU utilization in the enterprise environments it studied was 5%, based on data from tens of thousands of Kubernetes clusters across Amazon Web Services, Google Cloud, and Microsoft Azure. Its methodology page says the report covers January 1 to December 31, 2025, and extends GPU data through April 2026. (cast.ai)

A graphics processing unit, or GPU, does AI math fast, but only when data arrives fast enough and jobs are scheduled cleanly. If memory is the chip’s workbench and bandwidth is the width of the hallway bringing in parts, a bigger chip still stalls when the bench is too small or the hallway is jammed. (nvidia.com)

That is why Nvidia’s current pitch leans on memory and interconnect as much as raw compute. The company says DGX B300 systems use Blackwell Ultra GPUs with 288 gigabytes of HBM3e memory and 8 terabytes per second of memory bandwidth, while GB200 NVL72 racks link 72 Blackwell GPUs and 36 Grace central processing units in one liquid-cooled rack. (nvidia.com, nvidia.com)

Power and cooling now shape how much AI hardware a data center can actually run. Nvidia says GB200 NVL72 relies on liquid cooling and a 72-GPU NVLink domain, a design choice aimed at packing more performance into the same power envelope than older air-cooled H100 systems. (nvidia.com)

The cost pressure is rising with those engineering demands. The Register reported on January 5 that Amazon Web Services raised EC2 Capacity Block prices for machine-learning instances by about 15%, with p5e.48xlarge moving from $34.61 to $39.80 an hour in most regions; AWS said the adjustment reflected expected supply and demand patterns for the quarter. (theregister.com)

Cast AI argues that overbuying is structural, not accidental, because teams reserve for peak demand, tolerate idle buffers, and still lack mature automation for placing GPU jobs. The report says its utilization figures were measured before customers enabled Cast AI’s optimization tools. Selling those tools is the company’s business, which is a reason to read the findings as both research and marketing. (cast.ai)

Nvidia and cloud providers are responding by selling systems that treat the rack, not the single chip, as the product. Nvidia says GB200 NVL72 delivers 130 terabytes per second of GPU communication bandwidth inside the rack, because large models increasingly hit communication and memory limits before they hit arithmetic limits. (nvidia.com)

The result is a market where demand can surge and waste can rise at the same time. Companies are still racing to secure AI capacity in 2026, but the harder problem is turning reserved silicon into useful tokens, queries, and model responses hour after hour. (forbes.com, cast.ai)
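
To make the gap between reserved and useful capacity concrete, the price and utilization figures quoted above can be combined in a few lines of arithmetic. The Python sketch below is illustrative only. It assumes that Cast AI's 5% average utilization applies to a p5e.48xlarge Capacity Block, and that utilization translates linearly into useful work; neither source says so. The function names are hypothetical and not taken from any vendor tool.

```python
# Figures quoted in the article (The Register, Cast AI).
OLD_PRICE_USD_PER_HOUR = 34.61  # p5e.48xlarge Capacity Block, before the change
NEW_PRICE_USD_PER_HOUR = 39.80  # p5e.48xlarge Capacity Block, after the change
AVERAGE_UTILIZATION = 0.05      # Cast AI's reported average GPU utilization

# Assumption: the 5% average applies to this instance type, and utilization
# maps linearly to useful work. Neither source makes that claim.


def percent_increase(old_price: float, new_price: float) -> float:
    """Relative price change, in percent."""
    return (new_price - old_price) / old_price * 100.0


def cost_per_useful_hour(price_per_hour: float, utilization: float) -> float:
    """Effective price of one fully used hour at the given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return price_per_hour / utilization


def idle_spend_per_hour(price_per_hour: float, utilization: float) -> float:
    """Portion of the hourly price that pays for idle capacity."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in the range [0, 1]")
    return price_per_hour * (1.0 - utilization)


if __name__ == "__main__":
    increase = percent_increase(OLD_PRICE_USD_PER_HOUR, NEW_PRICE_USD_PER_HOUR)
    useful = cost_per_useful_hour(NEW_PRICE_USD_PER_HOUR, AVERAGE_UTILIZATION)
    idle = idle_spend_per_hour(NEW_PRICE_USD_PER_HOUR, AVERAGE_UTILIZATION)
    print(f"Price increase: {increase:.1f}%")
    print(f"Effective cost per fully used hour: ${useful:.2f}")
    print(f"Idle spend per reserved hour: ${idle:.2f}")
```

Under those assumptions, the $39.80 hourly price works out to roughly $796 for each hour of fully used capacity, and about $37.81 of every reserved hour pays for idle time. Doubling utilization would halve the effective cost, which is why scheduling and placement matter as much as the price of the hardware.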