Compute shortages and workarounds

Posts report CPU bottlenecks for agent workloads with Intel/AMD lead times of 6–10 months, GPU prices up ~48%, and token demand surging; operators are slicing GPUs and using systems like HAMi to improve utilization by 30–50%.

A new bottleneck is showing up in artificial intelligence infrastructure: the central processing unit, not just the graphics processor, is now limiting how fast many agent systems can run. (amd.com) An AI agent is a model with tools and steps, not a single reply. Google Cloud’s 2026 agent report defines agents as systems that use models plus access to tools to take actions, and AMD said those multistep workflows push more scheduling, data movement, memory, and control work onto central processors. (services.google.com, amd.com)

That shift is now shaping hardware designs. On April 8, 2026, Intel and SambaNova said their new agentic inference blueprint would use graphics processors for prefill, SambaNova chips for decode, and Intel Xeon 6 chips as the host and “action” central processors. (newsroom.intel.com)

The reason is utilization. OpenAI’s developer docs describe throughput in tokens per minute or tokens per second, while Microsoft said one Azure ND GB300 v6 rack reached 1,100,000 tokens per second on Llama 2 70B in an MLPerf Inference v5.1 submission. (developers.openai.com, techcommunity.microsoft.com) When token throughput climbs, operators need more than raw graphics capacity. AMD said agentic systems keep central processors busy with orchestration and data handling, and Intel said the industry is running into the limits of “GPU only” inference architectures. (amd.com, newsroom.intel.com)

At the same time, the graphics market is staying tight. TrendForce said in a report published October 30, 2025 that new Blackwell rack systems were expected to lift artificial intelligence server revenue by almost 48% in 2025, with AI server revenue rising more than 30% again in 2026 as cloud providers expand inference infrastructure. (trendforce.com) That pressure is pushing companies to split one graphics card into smaller slices.
HAMi, a Kubernetes middleware project, lets a workload take part of a device instead of the whole card, and its core controller can set hard limits on graphics memory and streaming multiprocessor usage through time slicing. (github.com, github.com)

The practical appeal is that many jobs do not need a full card all the time. In a Cloud Native Computing Foundation case study published March 17, 2026, NIO said its continuous integration and testing jobs spent most of their time on compilation, file fetching, and preprocessing, with average GPU utilization of only 5% to 10% under full-card allocation. (cncf.io) NIO said it used a hybrid sharing setup with HAMi, NVIDIA Multi-Instance GPU (MIG), and time slicing across about 600 graphics processors on roughly 80 nodes. The company reported a 10-fold utilization improvement in continuous integration pipelines and a 30% reduction in GPU hours for simulation workloads. (cncf.io)

Other HAMi case studies point in the same direction. The project’s case-study page says KE Holdings reported a 3-fold improvement in platform GPU utilization, DaoCloud reached average utilization above 80% after virtual GPU adoption, and SF Technology reported up to 57% GPU savings in production and test clusters. (project-hami.io)

The near-term workaround is not one new chip but tighter packing of the hardware already installed. As agent systems turn one prompt into a chain of tool calls, data fetches, and model passes, operators are treating idle GPU time and underpowered host processors as costs they can no longer ignore. (amd.com, cncf.io)
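To make the "tighter packing" idea concrete, here is a minimal Python sketch of fractional allocation: each job declares a hard cap on memory and on compute share, and a first-fit pass places jobs onto as few cards as possible. This is an illustration only, not HAMi's scheduler. The `Job` and `Card` classes, the `pack_first_fit` function, the 24 GiB card size, and the ten sample jobs are all hypothetical and do not come from any of the cited reports.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    mem_mb: int    # hypothetical hard cap on device memory, in MiB
    core_pct: int  # hypothetical hard cap on compute share, in percent of one card


@dataclass
class Card:
    free_mem_mb: int
    free_core_pct: int = 100
    jobs: list = field(default_factory=list)

    def fits(self, job: Job) -> bool:
        # A job fits only if both of its caps can be honored on this card.
        return job.mem_mb <= self.free_mem_mb and job.core_pct <= self.free_core_pct

    def place(self, job: Job) -> None:
        self.free_mem_mb -= job.mem_mb
        self.free_core_pct -= job.core_pct
        self.jobs.append(job.name)


def pack_first_fit(jobs, card_mem_mb):
    """Place jobs on the first card with room, opening a new card when none fits."""
    cards = []
    # Handle the largest memory requests first so small jobs fill the gaps.
    for job in sorted(jobs, key=lambda j: j.mem_mb, reverse=True):
        if job.mem_mb > card_mem_mb or job.core_pct > 100:
            raise ValueError(f"{job.name} cannot fit on a single card")
        target = next((c for c in cards if c.fits(job)), None)
        if target is None:
            target = Card(free_mem_mb=card_mem_mb)
            cards.append(target)
        target.place(job)
    return cards


# Ten made-up CI jobs, each capped at 4 GiB of memory and 10% of a card.
jobs = [Job(f"ci-{i}", mem_mb=4096, core_pct=10) for i in range(10)]
cards = pack_first_fit(jobs, card_mem_mb=24576)

print("whole-card allocation:", len(jobs), "cards")
print("fractional allocation:", len(cards), "cards")
```

In this made-up case memory, not compute, is the binding limit: six 4 GiB jobs fill a 24 GiB card, so the ten jobs land on two cards instead of ten. Real clusters add constraints this sketch ignores, such as isolation guarantees, node topology, and bursty demand.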

