AI Puts Network Architecture Back in Focus
AI's demand for high-bandwidth, low-latency data transfer is making networking a critical, high-value component of the tech stack again. After years of being abstracted away by the cloud, network architecture is now a key performance lever for AI/ML workloads. This creates new opportunities for infra startups focused on optimized and programmable networking.
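A rough estimate shows why link speed becomes a performance lever. The sketch below uses the bandwidth term of the standard ring all-reduce cost model: each GPU sends and receives about 2(N-1)/N times the gradient payload per synchronization. The function name, the 7B-parameter fp16 model, the 8-GPU group, and the premise that every byte crosses a 400G or 800G network link are hypothetical assumptions for illustration. Latency, protocol overhead, and overlap with compute are ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_gbps: float) -> float:
    """Hypothetical helper: bandwidth-only estimate of one ring all-reduce.

    Assumes each GPU moves 2 * (N - 1) / N * S bytes over a link of
    `link_gbps` gigabits per second; ignores latency and compute overlap.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize with a single GPU
    bytes_per_second = link_gbps * 1e9 / 8  # convert Gb/s to bytes/s
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bytes_per_second


# Hypothetical workload: 7B parameters with 2-byte (fp16) gradients = 14 GB per sync
payload = 7e9 * 2
for gbps in (400, 800):
    t = ring_allreduce_seconds(payload, num_gpus=8, link_gbps=gbps)
    print(f"{gbps}G: {t:.3f} s per full gradient all-reduce")
# prints "400G: 0.490 s ..." and "800G: 0.245 s ..."
```

Under these assumptions, doubling the link speed halves the time GPUs spend waiting on each gradient synchronization. That idle time is what the network now competes to shrink.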
The performance bottleneck for large AI models is shifting from raw compute to data movement: expensive GPUs sit idle while they wait for data. Technologies like Remote Direct Memory Access (RDMA) have become critical because they let a network card read and write memory on another server directly, bypassing the CPU and sharply reducing latency. This has driven widespread adoption of RDMA over Converged Ethernet (RoCE) in major AI clusters, including those at Meta.
The shift has also ignited a battle between two networking standards: InfiniBand and Ethernet. InfiniBand, historically dominant in high-performance computing, offers ultra-low latency but often comes with higher costs and vendor lock-in. Ethernet, the ubiquitous data center standard, is catching up for AI workloads with 400G/800G speeds and RoCE, and the Ultra Ethernet Consortium is championing it for its flexibility and cost-effectiveness.
Nvidia solidified its position by acquiring Mellanox, the InfiniBand leader co-founded by Eyal Waldman, in a roughly $7 billion deal completed in 2020. The purchase gave Nvidia an end-to-end AI infrastructure stack, from its GPUs to the high-performance fabric connecting them, a move seen as crucial as Moore's Law slows.
Challengers are betting on an open, Ethernet-based ecosystem. Broadcom's Jericho3-AI chip is designed to connect up to 32,000 GPUs in a single fabric, with the aim of reducing job completion times for AI training. Arista Networks, meanwhile, focuses on software-driven networking, using its EOS (Extensible Operating System) to manage large-scale AI and cloud data centers.
In India, the government's IndiaAI Mission is investing over $1 billion to build sovereign compute capacity. The initiative involves collaborations with cloud providers such as Yotta and E2E Networks to build AI factories using tens of thousands of Nvidia GPUs. Corporations are also planning their own vertically integrated infrastructure at massive scale. The Adani Group, for example, has announced a $100 billion plan to build 5 GW of renewable-powered, AI-ready hyperscale data centers in India by 2035, partnering with major cloud providers.