OpenAI leads MRC networking push
- OpenAI said on May 5 it released MRC through the Open Compute Project, built with AMD, Broadcom, Intel, Microsoft, and NVIDIA.
- The key trick is multipath RDMA: one connection can spread traffic across many paths, and OpenAI says it already trained frontier models with it.
- This matters because 100,000-plus-GPU clusters break old east-west networking assumptions, pushing Ethernet closer to specialized AI fabric territory.

AI training networks are turning into the real bottleneck. Not the chips by themselves, but the fabric between them. When you train a frontier model, huge numbers of GPUs have to stay in sync, and one late transfer can leave expensive accelerators waiting around. That is the gap MRC is trying to close. OpenAI said this week that it has released Multipath Reliable Connection through the Open Compute Project after building it with AMD, Broadcom, Intel, Microsoft, and NVIDIA. (openai.com)

### What is MRC, exactly?

MRC is a transport protocol for RDMA traffic in very large AI clusters. The short version is simple: instead of treating one connection like one path, MRC lets a single reliable connection spray packets across multiple network paths at once. The goal is better throughput, better load balancing, and fewer stalls when part of the network gets congested or breaks. (github.com)

### Why does that matter so much for AI?

Large training jobs are synchronized workloads. Thousands or even hundreds of thousands of GPUs exchange gradients and parameters together, step after step. If one transfer shows up late, the whole job can wait. OpenAI framed congestion, jitter, and ordinary link or switch failures as the common re[…]mining usable compute efficiency. (openai.com)

### What broke in the old model?

Traditional reliable transports assume a cleaner world: pick a path, preserve ordering, recover from failure, move on. But giant AI fabrics are messy. There is always some background rate of failed links, overloaded paths, and uneven traffic. Broadcom describes MRC as an enhancement to RoCEv2, basically a redesign for Ethernet-based AI fabrics. (broadcom.com)

### So what changed this week?

The news is not just that OpenAI talked about a networking idea. It open-sourced the specification through OCP, and partners are publicly tying it to real deployments. OpenAI says MRC has already been used to train multiple of its models.
NVIDIA says the protocol was proven first in production and […]ol work. (openai.com)

### Where is this already running?

This is the part that makes the announcement feel less theoretical. OpenAI says MRC is deployed across its largest supercomputers. Coverage around the launch points to Microsoft’s Fairwater systems and Oracle’s Abilene site in Texas as examples of the kind of infrastructure already using it. NVIDIA also explicitly names OpenAI, Microsoft, and Oracle as Spectrum-X users in this context. (sdxcentral.com)

### Is this really about Ethernet winning?

Basically, yes, or at least Ethernet refusing to stay “good enough” and becoming AI-native. The consortium is pushing an open, multi-vendor path that keeps Ethernet in the fight for frontier training rather than ceding the hardest workloads to more specialized interconnect sta[…]network gets more specialized even as the standard gets more open. (amd.com)

### What is the catch for operators?

A fabric that sprays traffic across many paths and reroutes aggressively is great for resilience, but it also makes east-west traffic behavior harder to reason about. Observability, fault isolation, and segmentation all get trickier when the network is doing more adaptation underneath the workload. OpenAI’s own p[…]ooling and telemetry keep up with the transport cleverness. That last part is more inference than stated roadmap, but it follows directly from how adaptive multipath systems behave. (openai.com)

### Bottom line?

MRC is OpenAI trying to standardize one of the hidden lessons of frontier-model training: once clusters get big enough, networking becomes part of the model stack. The interesting move is not just the protocol itself. It is that OpenAI, Microsoft, NVIDIA, AMD, Broadcom, and Intel are trying to turn that lesson into shared infrastructure before Stargate-scale systems make the old design choices impossible to live with. (openai.com)
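
### What does packet spraying look like in a toy model?

The "one connection, many paths" idea described under "What is MRC, exactly?" can be illustrated with a small simulation. This is only a sketch under stated assumptions, not MRC's actual algorithm: the function names, the `flow_id % num_paths` stand-in for path hashing, the round-robin spraying, and the workload numbers are all hypothetical. It also ignores reordering, retransmission, and link failures, which a real transport has to handle.

```python
from collections import Counter


def single_path_load(flows, num_paths):
    """Pin every flow to exactly one path.

    Assumption: `flow_id % num_paths` is a simplified stand-in for
    the flow hashing a per-flow load balancer would use.
    """
    load = Counter()
    for flow_id, packets in flows:
        load[flow_id % num_paths] += packets
    return [load[p] for p in range(num_paths)]


def sprayed_load(flows, num_paths):
    """Spread each flow's packets over all paths, round-robin.

    Assumption: plain round-robin is a toy stand-in for packet
    spraying; it is not taken from the MRC specification.
    """
    load = [0] * num_paths
    next_path = 0
    for _flow_id, packets in flows:
        for _ in range(packets):
            load[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return load


# Hypothetical workload: (flow_id, packet_count). Two large flows
# happen to map to the same path when each flow is pinned.
flows = [(0, 800), (4, 800), (1, 100), (2, 100)]

pinned = single_path_load(flows, num_paths=4)
sprayed = sprayed_load(flows, num_paths=4)

print("pinned :", pinned, "busiest path:", max(pinned))
print("sprayed:", sprayed, "busiest path:", max(sprayed))
```

In this toy workload, pinning puts 1,600 packets on one path while another sits idle, whereas spraying puts 450 on each of the four paths. Because a synchronized training step finishes only when the slowest transfer does, the busiest path is the number that matters.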