Inference bottlenecks keep moving

- Providers report that cutting raw accelerator costs doesn’t solve deployment economics at scale.
- New pressure is appearing on power delivery, management chips, and support components in AI servers.
- The industry is shifting from accelerator scarcity to system‑level bottlenecks that raise operational and procurement complexity for teams deploying inference (theregister.com).

The shortage hitting artificial intelligence servers has moved past graphics chips and into the smaller parts that keep whole machines powered and controlled. (theregister.com)

TrendForce said on April 15 that 2026 server shipment growth is now forecast at 13% year over year, down from an earlier expectation near 20%, because lead times for several general-server parts have stretched sharply. The firm said suppliers are steering capacity toward higher-margin artificial intelligence systems instead. (trendforce.com)

The parts now under pressure include power management integrated circuits (PMICs), which regulate electricity inside a server, and baseboard management controller (BMC) chips, which let operators monitor and remotely manage a machine. TrendForce said PMIC lead times have stretched to 35 to 40 weeks, while BMC lead times have risen to 21 to 26 weeks. (trendforce.com)

Inference is the stage where a trained model answers prompts, ranks results, or generates tokens for users, and it runs continuously once a service is live. Nvidia said this side of the market is growing fast enough that its GB300 NVL72 system is being pitched on “tokens per watt” and “cost per token,” not just raw performance. (nvidia.com)

That changes what buyers have to optimize for. A cheaper accelerator does not remove the need for denser power delivery, more cooling, rack-level networking, and the support silicon required to keep a large inference cluster online. (nvidia.com; theregister.com)

TrendForce said cloud service providers are still expected to drive about 28% year-over-year growth in artificial intelligence server shipments in 2026, with custom-chip systems likely to grow faster than graphics-processing-unit servers. That means the squeeze is landing at the same time demand is broadening beyond one chip vendor and into full server platforms. (trendforce.com)

The supply problem is also showing up on older manufacturing lines, not just at the most advanced chip plants. The Register reported that PMICs and similar low-complexity chips often rely on 8-inch wafer fabs, where capacity is tighter and less attractive for new investment than leading-edge processor production. (theregister.com)

The Register also reported that Samsung is planning to shut an 8-inch fab in Korea, a move TrendForce said would tighten PMIC capacity further for general servers; Samsung had not confirmed the closure as of April 23. That leaves server makers dealing with a market where the glamorous chips may be easier to source than some of the parts around them. (theregister.com)

Nvidia’s sales pitch says newer inference systems can deliver 50 times the tokens per watt and 35 times lower cost per token than Hopper-based gear in the same power budget. Even if those gains hold, the industry’s current constraint is no longer only the accelerator card but the full stack of power, control, and procurement needed to deploy it at scale. (nvidia.com)
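
For readers who want to see how the “tokens per watt” and “cost per token” metrics respond to system-level overhead, here is a minimal Python sketch. It is an illustration under stated assumptions, not Nvidia’s methodology: the `InferenceSystem` class, its field names, and every number in the example are hypothetical, and “tokens per watt” is taken here to mean sustained tokens per second divided by total power draw.

```python
from dataclasses import dataclass


@dataclass
class InferenceSystem:
    """Hypothetical system profile; all values passed in are illustrative."""
    tokens_per_second: float   # sustained throughput of the whole system
    accelerator_watts: float   # power drawn by the accelerators alone
    overhead_watts: float      # assumed cooling, networking, power-delivery losses
    cost_per_hour_usd: float   # assumed amortized hardware plus energy cost

    @property
    def total_watts(self) -> float:
        # System power is the accelerators plus everything around them.
        return self.accelerator_watts + self.overhead_watts

    def tokens_per_watt(self) -> float:
        # Throughput per watt of total system power (tokens/s per W).
        return self.tokens_per_second / self.total_watts

    def cost_per_million_tokens(self) -> float:
        # Hourly cost spread over the tokens produced in one hour.
        tokens_per_hour = self.tokens_per_second * 3600
        return self.cost_per_hour_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Made-up figures: identical accelerators, different surrounding overhead.
    lean = InferenceSystem(10_000, 80_000, 20_000, 300.0)
    heavy = InferenceSystem(10_000, 80_000, 40_000, 330.0)
    for name, system in (("lean", lean), ("heavy", heavy)):
        print(name, system.tokens_per_watt(), system.cost_per_million_tokens())
```

With the accelerator figures held constant, raising only `overhead_watts` and the hourly cost lowers tokens per watt and raises cost per million tokens, which is the system-level effect the reporting above describes.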
