OpenAI builds MRC protocol
- OpenAI said on May 5 that it released MRC, a new Ethernet transport for giant AI training clusters, through the Open Compute Project.
- The pitch is scale and fault tolerance: one connection can use hundreds of paths, and OpenAI says it has run MRC on 100,000-plus GPUs.
- That matters because AI labs are hitting network limits, not just chip limits, as training jobs spread across bigger and more failure-prone clusters.

The bottleneck here is not the model. It’s the network moving gradients and parameters between huge piles of GPUs. Once training clusters get big enough, the fabric itself starts wasting time: packets collide on the same path, failures ripple outward, and a single bad link can slow an entire job. OpenAI’s news this week is that it has open-sourced a transport protocol called MRC, short for Multipath Reliable Connection, after already using it in production on its biggest supercomputers. (openai.com)

### What is MRC, exactly?

MRC is a networking protocol for RDMA over Ethernet: basically a way to keep GPU-to-GPU traffic reliable while also letting a single connection spread data over many paths at once. Standard reliable connections tend to pick one path per flow, which is fine at smaller scale but awkward in giant AI clusters where many “elephant” flows smash [...] (openai.com) [...] explicit multipath operation, path monitoring, and recovery logic. (opencompute.org)

### Why does multipath matter so much?

AI training traffic is bursty and synchronized. Thousands of GPUs often need to exchange data at the same time, so if too many flows land on the same route, some links get congested while others sit underused. MRC’s trick is to spray one connection across many paths concurrently, using ECMP or SRv6-based routing, so the ne[...] (opencompute.org) [...]le version is this: instead of sending one truck down one highway and praying there’s no jam, it opens a convoy across many roads. (github.com)

### Why not just use InfiniBand?

A lot of high-end AI systems already do. But Ethernet is cheaper, more common, and has a much broader vendor ecosystem. OpenAI’s framing is not “Ethernet is already perfect.” It’s the opposite: standard best-effort Ethernet needs extra transport machinery if you want it to behave well for giant synchroniz[...] (github.com) [...] economics and openness of Ethernet fabrics. (openai.com)

### What did OpenAI actually claim?

The big claims are about scale and resilience.
OpenAI says MRC is deployed across its largest supercomputers, including Microsoft Fairwater systems and Oracle’s Abilene cluster, and that it has already been used to train multiple frontier models. In the technical paper, the team says the design supports clusters well above 100,00[...] (openai.com) [...]d have interrupted training. (sdxcentral.com)

### What else changed besides the protocol?

MRC is only one piece. The paper pairs it with two other ideas: multi-plane Clos topologies and static source routing with SRv6. Together, those let operators build two-tier fabrics at very large scale, cut switch-layer depth, and give endpoints more freedom to route around broken links instead of waiting for the whole network to reconverge. (cdn.openai.com)

### Who built this with OpenAI?

This was a multi-company effort. OpenAI named AMD, Broadcom, Intel, Microsoft, and NVIDIA as collaborators, and released the specification through the Open Compute Project under an open license. That matters because a transport protocol only becomes real infrastructure if switch vendors, NIC vendors, and cloud operators all line up behind it. (openai.com)

### So what’s the real significance?

The deeper story is that frontier AI is now constrained by systems engineering as much as by model design. Labs can buy more GPUs, but if the interconnect falls apart under synchronized traffic, those GPUs just wait on each other. MRC is OpenAI saying the next scaling fight is inside the network stack, and that the answer may be open Ethernet plumbing rather than ever more bespoke fabric. (openai.com)

### Bottom line

This is infrastructure news, but it’s important infrastructure news. OpenAI is trying to turn networking from a hidden failure point into a scaling advantage, and it’s doing it in public, with a spec other builders can adopt. (openai.com)
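
For readers who want to see the multipath argument in numbers, here is a toy sketch in Python. It is not MRC's algorithm and is not taken from the spec. It assumes an idealized fabric of equal-cost paths, uses a seeded random choice as a stand-in for per-flow ECMP hashing, and splits each sprayed flow perfectly evenly. The function names (`single_path_load`, `sprayed_load`, `imbalance`) are hypothetical and exist only for this illustration.

```python
import random


def single_path_load(num_flows, num_paths, flow_size, seed=0):
    """Pin each flow to one path, chosen pseudo-randomly.

    The random choice is a stand-in (assumption) for per-flow ECMP hashing.
    """
    rng = random.Random(seed)
    load = [0] * num_paths
    for _ in range(num_flows):
        load[rng.randrange(num_paths)] += flow_size
    return load


def sprayed_load(num_flows, num_paths, flow_size, failed=frozenset()):
    """Split every flow evenly across all healthy paths (idealized spraying)."""
    healthy = [p for p in range(num_paths) if p not in failed]
    if not healthy:
        raise ValueError("no healthy paths left")
    load = [0.0] * num_paths
    share = flow_size / len(healthy)
    for _ in range(num_flows):
        for p in healthy:
            load[p] += share
    return load


def imbalance(load):
    """Busiest path's load divided by the mean load over all paths."""
    return max(load) / (sum(load) / len(load))


if __name__ == "__main__":
    flows, paths, size = 16, 64, 100  # a few large "elephant" flows
    pinned = single_path_load(flows, paths, size)
    sprayed = sprayed_load(flows, paths, size)
    degraded = sprayed_load(flows, paths, size, failed={3})
    print("pinned imbalance:  ", imbalance(pinned))
    print("sprayed imbalance: ", imbalance(sprayed))
    print("load on failed path:", degraded[3])
```

With 16 pinned flows on 64 paths, at least 48 paths carry nothing, so the busiest path runs at four times the mean load or more, while idealized spraying keeps every path at the mean. Marking a path as failed simply removes it from the set a sprayed flow uses, which is the intuition behind endpoints routing around a bad link. A real transport also has to handle reordering, congestion signals, and retransmission, none of which this sketch models.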