OpenAI builds MRC networking
- OpenAI released MRC, a new open networking protocol co-developed with AMD, Broadcom, Intel, Microsoft, and Nvidia for giant AI training clusters. - The protocol is already running in OpenAI’s biggest supercomputers, including Microsoft Fairwater and OCI’s Abilene site, and was used on frontier-model training. - It matters because Ethernet AI networking still lacked a good transport layer at extreme scale — and rivals just standardized one.
AI training clusters have a networking problem that sounds boring until you see the stakes. Thousands of GPUs can be doing useful work, then one slow or failed network path makes the whole training step wait. That gets brutally expensive when the cluster is huge. OpenAI’s news this week is that it has turned one fix into an open spec — Multipath Reliable Connection, or MRC — built with AMD, Broadcom, Intel, Microsoft, and Nvidia and released through the Open Compute Project. ### What is MRC, in plain English? MRC is a transport protocol for moving data between GPUs across Ethernet-based AI networks. The simple idea is that one connection should not act like one fragile lane. MRC lets a single RDMA connection spread traffic across many network paths at once, so the system can keep bandwidth high, balance load, and route around trouble instead of stalling behind it. ### Why did OpenAI bother building this? Because frontier-model training is hypersensitive to delay. In these jobs, one training step can trigger millions of transfers, and one transfer arriving late can leave expensive GPUs idle. OpenAI says congestion plus routine link and switch failures become the dominant headache as clusters grow, which is why it treated networking as a core design problem for Stargate-scale systems rather than background plumbing. ### What was wrong with the old setup? The weak point was not Ethernet itself so much as the transport most people were using on top of it. Broadcom’s write-up is blunt here — RoCEv2 inherited assumptions from an older, more lossless world and starts to look awkward when you apply it to giant, multi-hop AI fabrics with violent traffic bursts. Basically, AI clusters need the network to behave than it should be. ### So what changed this week? Two things. First, OpenAI and its partners publicly released MRC as an open specification through OCP. Second, this is not just a paper design. OpenAI says MRC is already deployed across its largest supercomputers, and Nvidia says Microsoft’s Fairwater systems and Oracle Cloud Infrastructure’s Abilene data center are already relying on it. OpenAI also says it has used MRC while training multiple frontier models. ### Why does “multipath” matter so much? Because AI traffic is bursty and synchronized. Huge groups of accelerators talk at once, then wait for the slowest straggler. If traffic can only take one path cleanly, congestion and failures create long-tail delays that poison the whole run. Multipath turns the network into more of a live traffic system — packets can be sprayed across available routes and shifted away from overloaded or broken parts in real time. ### Is this just an Nvidia thing? Not exactly — and that is the interesting part. Nvidia says MRC was proven first and optimized on Spectrum-X Ethernet hardware, but the spec itself was jointly authored with AMD, Broadcom, Intel, Microsoft, OpenAI, and Nvidia, then published openly through OCP. So yes, vendors will each try to show their hardware runs it best. But the protocol is being positioned as shared infrastructure, not a closed moat. ### Why are rivals cooperating here? Because the cluster is now the product. Model companies want faster training. Cloud companies want reliable giant AI factories. Chip and switch vendors want their gear inside those systems. Everyone competes upstairs, but downstairs they still need common rules for how data moves. MRC is a good example of that — commercial rivalry on one layer, deep technical interdependence on the layer underneath. ### Bottom line? This is really a story about AI infrastructure maturing. OpenAI is no longer just consuming chips and cloud — it is helping define the networking standard for how giant training systems get built. If MRC sticks beyond the first deployments, the win is not just faster clusters. It is a more open Ethernet stack for AI at extreme scale.