OpenAI launches MRC networking protocol

- OpenAI released MRC, a new AI-cluster networking protocol, through the Open Compute Project on May 7 with AMD, Broadcom, Intel, Microsoft, and Nvidia.
- The pitch is simple: keep training runs alive when links fail, while scaling two-tier Ethernet fabrics past 100,000 GPUs instead of choking on collisions.
- That shifts the AI arms race toward networking efficiency, not just faster chips, because idle GPUs waste the most expensive part of the cluster.

AI training has a networking problem now, not just a chip problem. Once you pack tens of thousands of GPUs into one training run, the hard part is keeping all of them moving together without one bad link or one congested path stalling the whole job. That is the gap OpenAI is trying to close with MRC, short for Multipath Reliable Connection, which it released through the Open Compute Project on May 7 alongside AMD, Broadcom, Intel, Microsoft, and Nvidia.

### What is MRC, exactly?

MRC is a transport protocol for large AI clusters running over Ethernet. The basic idea is that instead of sending traffic down one path and hoping that path stays healthy, MRC spreads traffic across many paths at once, watches path health, and reroutes around failures fast enough that training can keep going. OpenAI says it extends RDMA reliable-connection semantics, but adds explicit multipath behavior and recovery tuned for giant training jobs. (openai.com)

### Why does this matter so much?

Because frontier training runs are synchronized. Thousands of GPUs have to exchange gradients in lockstep. If one network hotspot slows part of the cluster, everybody waits. If one failure takes down the job, you can lose days of compute. OpenAI says MRC is already deployed in its own and Microsoft’s largest training clusters and has been used to train recent frontier models, which is the strongest signal here: this is not a lab toy. (opencompute.org)

### Why wasn’t ordinary Ethernet good enough?

Ethernet is cheap and everywhere, but large AI clusters are unusually punishing. They create many synchronized flows that can collide on the same links, producing congestion and long tail latencies. MRC’s answer is to “spray” traffic across many available paths and actively balance load between them, so one unlucky path collision does not become a cluster-wide slowdown. Basically, it tries to make commodity-style Ethernet behave more like purpose-built supercomputer fabric.
(cdn.openai.com)

### What changed this week?

The key news is not just that OpenAI built a protocol. It published the MRC specification through OCP, which means other vendors can implement it, test against it, and push it toward broader standardization. AMD framed the move as turning AI networking into an open, programmable, production-ready foundation, and OpenAI described the release as a way for the broader industry to use the design. (cdn.openai.com)

### Why are those partner names a big deal?

Because the list spans most of the stack. Nvidia and AMD make accelerators. Broadcom and Intel matter in silicon and networking. Microsoft runs the cloud infrastructure OpenAI actually trains on. When rivals like that agree on a transport layer, the message is that the bottleneck has become too expensive to ignore. This is the industry admitting that a faster GPU is only useful if the cluster fabric lets the GPU stay busy. (opencompute.org)

### Does this replace InfiniBand?

Not directly. The point is to make Ethernet much more competitive for giant AI clusters, especially where openness, cost, and ecosystem breadth matter. OpenAI’s paper also pairs MRC with static SRv6 routing and multi-plane Clos topologies, which together let very large clusters keep redundancy while staying at two tiers even beyond 100,000 GPUs. That is a scale claim aimed straight at the biggest supercomputer builders. (openai.com)

### What is the real takeaway?

The frontier race is widening. For a while, the story was mostly about who had the best accelerator. Now the fabric between accelerators is becoming just as strategic. If MRC catches on, the win is not a flashy new chip. It is more usable throughput from the same cluster, fewer training crashes, and better economics on systems that already cost fortunes. (openai.com) (cdn.openai.com)
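
### What does spraying across paths look like in code?

For readers who want something concrete, here is a toy Python sketch of the behavior described under "What is MRC, exactly?": spread packets over several paths, notice when one stops working, and retransmit elsewhere. It is not MRC and is not taken from the specification. The `Path` and `MultipathSender` names, the round-robin policy, and the rule that a single loss marks a path as untrusted are all assumptions made for illustration. A real transport would rely on acknowledgements, timers, and congestion signals.

```python
class Path:
    """One network path between two endpoints (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.up = True        # ground truth: is the link actually working?
        self.trusted = True   # the sender's current view of path health
        self.delivered = 0    # packets successfully carried on this path


class MultipathSender:
    """Toy sender that sprays packets round-robin over trusted paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self._cursor = 0       # index of the next path to try
        self.retransmits = 0   # packets that had to be sent a second time

    def _pick(self):
        # Walk the ring once, skipping paths the sender no longer trusts.
        for _ in range(len(self.paths)):
            path = self.paths[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.paths)
            if path.trusted:
                return path
        raise RuntimeError("no usable path left")

    def send(self):
        """Deliver one packet and return the name of the path that carried it."""
        while True:
            path = self._pick()
            if path.up:
                path.delivered += 1
                return path.name
            # Assumption: a loss is detected immediately. Stop trusting the
            # path and retransmit the same packet on the next trusted one.
            path.trusted = False
            self.retransmits += 1


# Usage: four paths, one of which fails partway through the "job".
paths = [Path(f"p{i}") for i in range(4)]
sender = MultipathSender(paths)
for _ in range(8):
    sender.send()
paths[1].up = False            # simulate a link failure
for _ in range(9):
    sender.send()
print({p.name: p.delivered for p in paths}, "retransmits:", sender.retransmits)
```

In this sketch the failure costs one retransmission, after which the remaining paths absorb the load and the job keeps running. A single-path connection in the same situation would simply stall.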

