Bastani predicts roughly 50% capacity gains from disaggregated inference

- OpenInfer chief executive Behnam Bastani argued this week that agentic AI will run on mixed chip stacks, not single-vendor GPU boxes.
- He pointed to Intel plus SambaNova, Nvidia Rubin plus Groq LPX, and Meta plus AWS Graviton, saying disaggregation can lift capacity roughly 50%.
- The pitch tracks a broader shift toward split prefill, decode, and orchestration systems for inference. (intel.com)

AI inference is splitting into specialized jobs, and Behnam Bastani says that split can raise capacity by about 50% for agent workloads. (x.com) Bastani, the chief executive of OpenInfer, pointed to three recent examples: Intel with SambaNova, Nvidia Vera Rubin with Groq LPX, and Meta with Amazon Web Services Graviton. (x.com) (intel.com) (developer.nvidia.com) (aboutamazon.com)

The basic idea is simple: one chip family handles prompt ingestion, another handles token-by-token generation, and CPUs run the tool use, routing, and validation around the model. (pytorch.org) (developer.nvidia.com) That differs from the older pattern where the same graphics processor does nearly everything. Nvidia said disaggregated serving separates prefill from decode because running both on the same graphics processors creates bottlenecks. (blogs.nvidia.com)

Intel and SambaNova announced a planned multiyear collaboration on February 24, 2026, built around Intel Xeon infrastructure and SambaNova systems for cost-efficient inference. Intel said the partnership is aimed at heterogeneous AI data centers, not a single-chip stack. (intel.com)

Nvidia’s March 16, 2026 post described Groq 3 LPX as a rack-scale accelerator co-designed with Vera Rubin NVL72. Nvidia said Rubin handles prefill and decode attention, while LPX takes latency-sensitive decode work for agentic systems. (developer.nvidia.com)

Amazon and Meta disclosed a separate version of the same trend on April 24, 2026. Meta said it will deploy tens of millions of AWS Graviton cores, and Amazon said agentic AI creates heavy demand for CPU-intensive tasks such as reasoning, search, and multi-step orchestration. (aboutamazon.com) (about.fb.com)

Bastani’s 50% figure appears to be his estimate, not a company benchmark published in the source materials for those three partnerships. The official announcements describe higher efficiency, lower latency, or larger scale, but they do not state a common industrywide 50% capacity gain. (x.com) (intel.com) (developer.nvidia.com) (aboutamazon.com)

The wider backdrop is that model serving is no longer just a race for bigger graphics processors. PyTorch and vLLM said prefill-decode disaggregation is already enabled in Meta’s internal stack, while Nvidia now calls disaggregated serving essential for large reasoning models. (pytorch.org) (blogs.nvidia.com)

So the thread running through Bastani’s note is less “one winner” than “more plumbing.” The emerging inference stack pairs CPUs, GPUs, and specialty chips so each does the part of the job it handles best. (x.com) (intel.com) (developer.nvidia.com)
