OpenAI leads MRC protocol push

- OpenAI said it has released MRC, a new open networking protocol for AI supercomputers, with AMD, Broadcom, Intel, Microsoft, and NVIDIA.
- The key claim is scale: MRC is meant for clusters above 100,000 GPUs and is already running in OpenAI’s biggest systems.
- If it sticks, Ethernet gets stronger against InfiniBand in frontier AI buildouts, and cluster design gets a shared playbook.

AI training has a networking problem, not just a chip problem. Once you wire together tens of thousands of GPUs, the hard part is keeping all of them fed with data at the same pace for weeks or months. One slow path, one failed link, or one unlucky traffic collision, and expensive accelerators sit around waiting. That is the gap OpenAI is trying to close with MRC, a new transport protocol it released through the Open Compute Project this week with AMD, Broadcom, Intel, Microsoft, and NVIDIA. (openai.com)

### What is MRC, exactly?

MRC stands for Multipath Reliable Connection. In plain English, it is a way to move training traffic across many network paths at once while still keeping the reliability guarantees AI clusters need. The spec says it extends the familiar Reliable Connection model used in RDMA networking, but adds active multip[…]one congested route. (opencompute.org)

### Why was the old approach breaking down?

A giant training run behaves like a marching band: everybody has to stay in step. Traditional Ethernet-based RDMA setups, especially RoCEv2, can work well, but they get touchy at very large scale because a few colliding flows or a single failure can stall collective operations and waste […]o longer Ethernet switching hardware; it is the transport behavior on top of it. (broadcom.com)

### What changed this week?

The big change is that MRC is no longer just an internal OpenAI trick. OpenAI published the protocol through OCP, and the 1.0 specification lists AMD, Broadcom, Intel, Microsoft, NVIDIA, and OpenAI as joint contributors. OpenAI also said MRC is already deployed across its largest supercomputers and has been used to train multiple frontier models. (opencompute.org)

### Why does multipath matter so much?

Because modern AI clusters are built with lots of parallel links and multiple network planes. In theory that gives you redundancy and more bandwidth. In practice, old transport protocols do not fully exploit that structure. MRC is designed to “spray” traffic across many paths and reroute aroun[…]nks fail or hot spots form. (cdn.openai.com)

### Is this an Ethernet story?

Yes, and that is why the partner list matters. NVIDIA, Broadcom, AMD, Intel, Microsoft, and OpenAI do not usually line up around one networking spec unless the pain is real. The shared bet here is that Ethernet can scale deeper into frontier AI training if the transpo[…]asier for many cloud and enterprise buyers to source than a fully proprietary stack. (broadcom.com)

### Does this replace InfiniBand?

Not overnight. InfiniBand still has a strong position in high-performance AI clusters. But MRC is a serious attempt to narrow the gap by making Ethernet behave more like a purpose-built AI fabric at very large scale. The real signal is not that OpenAI inve[…]t makes adoption much more plausible. (siliconangle.com)

### What should people watch next?

Watch for NIC support, switch support, and whether hyperscalers outside this group start implementing the spec. Also watch whether OCP turns MRC into a real multivendor standard instead of a paper standard. If that happens, datacenter buyers may get a clearer recipe for building huge GPU clusters without locking themselves into one networking camp. (opencompute.org)

### Bottom line?

OpenAI did not just publish a protocol. It tried to turn a private scaling lesson into shared infrastructure. If MRC works outside OpenAI’s own clusters, the next bottleneck in AI may shift again: away from moving data, and back to getting enough GPUs in the first place.
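
For readers who want to see the multipath argument in numbers, here is a toy Python sketch. It is not MRC's algorithm and uses nothing from the spec: the flow ids, packet counts, path count, and the function names `pinned_loads` and `sprayed_loads` are all made up for illustration. It only contrasts pinning each flow to a single path with spreading every flow's packets across whichever paths are healthy.

```python
from collections import Counter

# Hypothetical traffic pattern: flow id -> number of packets. The ids are chosen
# so that three of the four flows collide on one path under modulo placement.
FLOWS = {0: 100, 4: 100, 8: 100, 1: 100}
NUM_PATHS = 4


def pinned_loads(flows, num_paths):
    """Pin every flow to one path (flow id modulo path count stands in for a hash)."""
    loads = Counter({path: 0 for path in range(num_paths)})
    for flow_id, packets in flows.items():
        loads[flow_id % num_paths] += packets
    return loads


def sprayed_loads(flows, num_paths, failed=frozenset()):
    """Spread each flow's packets round-robin over the paths that are still healthy."""
    healthy = [path for path in range(num_paths) if path not in failed]
    if not healthy:
        raise ValueError("no healthy paths left")
    loads = Counter({path: 0 for path in range(num_paths)})
    for flow_id, packets in flows.items():
        for i in range(packets):
            loads[healthy[(flow_id + i) % len(healthy)]] += 1
    return loads


if __name__ == "__main__":
    # The busiest path sets the pace, so that is the number worth comparing.
    print("pinned :", max(pinned_loads(FLOWS, NUM_PATHS).values()))
    print("sprayed:", max(sprayed_loads(FLOWS, NUM_PATHS).values()))
    print("sprayed, path 2 failed:",
          max(sprayed_loads(FLOWS, NUM_PATHS, failed={2}).values()))
```

With these invented numbers, pinning puts 300 of the 400 packets on one path, spraying puts 100 on each of the four, and with path 2 marked as failed the busiest surviving path carries 134. A real transport also has to deal with reordering, retransmission, and congestion signals, none of which this sketch models.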
