AMD and partners hand MRC transport to OCP

- AMD, OpenAI, Microsoft, Nvidia, Intel, and Broadcom released Multipath Reliable Connection to the Open Compute Project on May 6, opening a new AI-cluster transport.
- MRC targets clusters with tens to hundreds of thousands of accelerators, spreading traffic across multiple paths and rerouting around failures instead of stalling jobs.
- The real bottleneck is shifting from chips to transport software, and OCP is becoming the venue where Ethernet AI standards get set.

AI networking is getting its own standards war, but this one is less about raw speed and more about not falling over when a giant training job hits a bad link. The news is that AMD and a group of heavyweight partners pushed a transport protocol called Multipath Reliable Connection, or MRC, into the Open Compute Project on May 6. That matters because frontier AI training now runs across tens or even hundreds of thousands of accelerators, and one flaky network path can waste a shocking amount of compute. MRC is basically an attempt to make Ethernet behave like a calmer, more fault-tolerant fabric for those giant clusters.

### What actually got handed over?

AMD did not quietly donate some internal side project. It joined OpenAI, Microsoft, Nvidia, Intel, and Broadcom in releasing MRC through OCP under an open framework, with AMD saying it co-led the spec and contributed congestion-control work. So the handoff is really an ecosystem move, not a solo AMD drop.

### What is MRC in plain English?

MRC is a transport protocol for AI clusters. (amd.com) Instead of sending traffic down one path and hoping nothing goes wrong, it spreads packets across multiple paths at once, then adapts quickly when congestion or failures show up. The point is to keep synchronized training moving even when the network is behaving like a real data center instead of a lab demo. (amd.com)

### Why is that such a big deal?

Because large AI training jobs are brutally synchronized. Thousands of GPUs or other accelerators have to exchange data in lockstep, and the slowest straggler can hold back the whole group. At that scale, network tail latency and fault recovery matter almost as much as peak bandwidth. A single link failure can turn into a cluster-wide pause if the transport layer is brittle.

### What was wrong with the old approach?

The coalition is pretty explicit here: the weak point is the transport layer most Ethernet AI clusters use today, especially RoCEv2. That protocol came out of assumptions that made more sense for earlier RDMA and storage-style environments. In giant AI fabrics, those assumptions can force in-order, lossless behavior that becomes awkward and fragile under bursty, many-to-many traffic. (amd.com)

### Why put this into OCP?

Because OCP is where a lot of the open AI infrastructure plumbing is now getting hammered into shape. OCP launched the Ethernet for Scale-Up Networking, or ESUN, workstream in October 2025 to focus on headers, error handling, lossless transfer, and link resiliency for accelerator clusters. In other words, the standards venue was already there, and MRC gives it a concrete transport building block. (broadcom.com)

### Is this the same thing as Ultra Ethernet?

Not exactly. Ultra Ethernet is the broader push to make Ethernet better for AI and HPC. MRC is a specific transport contribution that fits that same direction of travel. OCP’s ESUN workstream also says it plans to align with groups like the Ultra Ethernet Consortium and IEEE, so this looks more like convergence than another clean-slate rival standard. (opencompute.org)

### Why is AMD involved so heavily?

Because AMD is trying to win AI systems, not just sell chips. If customers build clusters around open Ethernet fabrics instead of tightly integrated proprietary interconnects, that can help AMD, cloud operators, and other vendors meet on more neutral ground. Open transport standards also make it easier to argue that the network should not lock buyers into one accelerator stack. That last part is an inference, but it fits AMD’s broader open-standards posture around OCP and UALink. (opencompute.org)

### So what should you watch next?

Watch whether MRC turns into actual interoperable products: NICs, switches, and software stacks that multiple vendors ship, not just blog-post alignment. OCP’s ESUN process already has public specs and operator requirements in flight, which means the battleground is moving from slogans to implementation details.

The bottom line is simple: AI clusters are now big enough that networking failures waste chip time at absurd scale. (amd.com) MRC is a sign that the industry knows the bottleneck is no longer just the accelerator. It is the fabric that keeps all those accelerators moving together. (opencompute.org)
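
To put rough numbers on the straggler argument, here is a toy Python model. It is not the MRC specification or its congestion-control algorithm, just the arithmetic behind "pin each flow to one path" versus "spread every flow across all paths". The function names, the four-path topology, and the rates are all hypothetical, and the model ignores real-world effects such as reordering, retransmission, and the time it takes to detect a failure.

```python
# Toy model only: illustrates the straggler effect, not how MRC actually works.
# All names and numbers are hypothetical.

def pinned_step_time(packets_per_flow, path_rates):
    """Flow i is pinned to path i; the synchronized step ends with the slowest flow."""
    times = []
    for rate in path_rates:
        # A dead path (rate 0) stalls its flow until something reroutes it.
        times.append(float("inf") if rate == 0 else packets_per_flow / rate)
    return max(times)


def sprayed_step_time(packets_per_flow, path_rates):
    """Every flow spreads packets over all paths and shares total capacity evenly."""
    total_rate = sum(path_rates)
    if total_rate == 0:
        return float("inf")
    num_flows = len(path_rates)  # one flow per path, as in the pinned case
    return packets_per_flow * num_flows / total_rate


healthy = [100, 100, 100, 100]   # packets per millisecond on each path
degraded = [100, 100, 10, 100]   # one congested path
failed = [100, 100, 0, 100]      # one dead path

for name, rates in [("healthy", healthy), ("degraded", degraded), ("failed", failed)]:
    pinned = pinned_step_time(1000, rates)
    sprayed = sprayed_step_time(1000, rates)
    print(f"{name}: pinned={pinned} ms, sprayed={sprayed:.1f} ms")
```

In this idealized setup, one congested path stretches the pinned step from 10 ms to 100 ms, while the sprayed step only grows to about 12.9 ms. With a dead path, the pinned step never finishes on its own, and the sprayed step takes about 13.3 ms on the three surviving paths.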

Shared from Scout - Be the smartest in the room.