OpenAI launches MRC protocol with vendors

- OpenAI said on May 5 it released MRC through the Open Compute Project, built with AMD, Broadcom, Intel, Microsoft, and NVIDIA. - The key claim is scale: MRC is already running in OpenAI and Microsoft training clusters, with designs aimed at well over 100,000 GPUs. - This matters because giant training runs stall on tiny network hiccups; an open cross-vendor standard could cut wasted GPU time.

AI training networks are becoming a bottleneck in their own right. The compute is flashy — giant GPU clusters, bigger models, faster chips — but the thing that quietly breaks runs is often the network between those GPUs. OpenAI’s news this week is that it’s releasing a new transport protocol called MRC, short for Multipath Reliable Connection, and it built it with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The point is simple: keep huge training jobs moving even when links get congested or parts of the network fail. (openai.com) ### What is MRC, actually? MRC is a networking protocol for RDMA traffic inside giant AI clusters. In plain English, it changes how data moves between GPUs so one connection can use many network paths instead of effectively betting on one path behaving well. OpenAI says that improves both performance and resilience in large training clusters, (openai.com)panies can adopt it too. (openai.com) ### Why does AI training need that? Because frontier training is synchronized. Thousands — sometimes far more — of GPUs have to stay in lockstep, trading data between steps. If one transfer arrives late, the whole job can wait. That means a tiny networking hiccup can leave very expensive hardware sitting idle. The paper behind MRC says tail la(openai.com)mance at very large scale. (openai.com) ### What problem is MRC trying to fix? The old failure mode is flow collisions and hot spots. Even if a network has plenty of total bandwidth, traffic can pile onto the same links while other paths sit underused. MRC “sprays” traffic across many paths and actively load-balances around congestion. OpenAI pairs that with multi-plane Clos network(openai.com) can bypass failures without waiting for the network to reconverge the usual way. (openai.com) ### Why are all these vendors involved? Because no one wants another proprietary island in AI infrastructure. The author list on the technical paper spans OpenAI, Microsoft, AMD, Broadcom, and NVIDIA, and OpenAI’s launch post also names Intel as a partner. That mix matters — chips, switches, cloud infrastructure, and model builders all have to(openai.com)off internal trick. (openai.com) ### Is this just a spec, or is it already in use? Already in use. OpenAI says MRC is deployed across its largest supercomputers, and reporting around the launch says that includes Microsoft’s Fairwater systems and the OCI-built Abilene supercomputer in Texas. The paper is even more direct: MRC and static SRv6 routing have been used in producti(openai.com)test frontier models. (cdn.openai.com) ### Why does “open” matter here? Basically, OpenAI is arguing that the next scaling wall is infrastructure complexity. If every hyperscaler and hardware vendor solves giant-cluster networking differently, adoption slows and costs stay high. Releasing MRC via OCP is a bid to turn one company’s hard-won inter(cdn.openai.com)n — vendors still have incentives to optimize around their own stacks — but it gives the industry a common starting point. (openai.com) ### Is this an Ethernet story too? Yes — and that’s part of the subtext. NVIDIA is framing MRC as proof that high-end Ethernet fabrics can support “gigascale” AI training, not just InfiniBand-style environments. OpenAI’s write-up also leans on simpler, lower-power network designs with fewer components. So this is not just a protocol launch. It’s also a push for a specific way of building giant AI factories. (openai.com) ### Bottom line? The interesting part is not the acronym. It’s that OpenAI is saying network reliability and tail latency have become first-order limits on frontier model training — and that fixing them now takes coordination across nearly the whole stack. If MRC spreads beyond the launch partners, it could become one of those invisible standards that quietly makes giant AI systems cheaper and less fragile. (openai.com)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.