OpenAI pushes supercomputer network rethink

- OpenAI said on May 5 it is releasing MRC, a new AI supercomputer networking protocol, after using it to train recent frontier models.
- The protocol spreads traffic across many paths, targets clusters beyond 100,000 GPUs, and was co-developed with AMD, Broadcom, Intel, Microsoft, and Nvidia.
- That matters because networking, not just chips, is becoming the limiting factor for faster training and low-latency AI products. (openai.com)

The story here is not “OpenAI did a podcast.” It’s that OpenAI used the podcast to point at a real product and infrastructure move: a new networking protocol called MRC that it says is already running inside its biggest training systems. That matters because frontier AI is no longer bottlenecked only by chips. It is bottlenecked by the fabric between them. OpenAI published the MRC spec through the Open Compute Project on May 5, 2026, […] systems. (openai.com)

### What actually changed?

OpenAI formally introduced Multipath Reliable Connection, or MRC, as an open networking protocol for large AI training clusters. The company said it developed the protocol with AMD, Broadcom, Intel, Microsoft, and Nvidia, then released the specification so other builders can use it too. In other words, this was not just a research idea or a podcast riff. It was an open-standard push tied to production deployments. (openai.com)

### Why is the network suddenly the problem?

Training a frontier model means huge numbers of GPUs have to move in lockstep. Each training step can involve millions of data transfers, and one late transfer can leave expensive GPUs waiting around. At that scale, the slowest packet matters more than average speed. OpenAI’s paper is blunt about this: tail latency dominates performance in very large synchronous training jobs. (openai.com)

### What is MRC trying to fix?

Two things keep breaking the illusion that a giant GPU cluster is one coherent machine: congestion and failures. Traditional single-path transport can create hotspots when too much traffic collides on the same route. And in very large clusters, link and switch failures stop being rare events and start becoming background noise. OpenAI says MRC attacks both problems at once by spreading traffic across many paths and letting the network route around trouble. (openai.com)

### How is that different from normal networking?

The key trick is multipath transport. Instead of betting on one route, MRC “sprays” packets across many paths and actively load-balances them. Think less like a single freeway lane and more like a traffic system that keeps opening alternate lanes before a jam turns into a pileup. OpenAI and its partners also pair that with multi-plane Clos topologies and static source routing using SRv6, which […] behavior higher up the stack. (openai.com, cdn.openai.com)

### Is this just theory?

No, and that is the important part. OpenAI says MRC is deployed in its largest supercomputers and has been used to train multiple frontier models. The joint paper says it is already running in OpenAI and Microsoft production clusters, and that it helped jobs ride out failures that would previously have interrupted training. That turns the announcement from “here’s a neat protocol” into “here’s plumbing we already depend on.” (openai.com)

### Why talk about voice and realtime AI too?

Because the same underlying theme keeps showing up across OpenAI’s stack: latency is now a product problem, not just a systems problem. In a separate engineering post on May 4, OpenAI said low-latency voice at its scale means supporting more than 900 million weekly active users while keeping connection setup fast and media round-trip time low and stable. Training clusters and realtime voice are di[…] enough, the network becomes part of the user experience. (openai.com)

### Why make the protocol open?

OpenAI is arguing that shared infrastructure standards are now strategic. The company says open standards can reduce complexity and help AI systems scale across a broader partner ecosystem. That also explains the unusual coalition here: OpenAI, Microsoft, AMD, Broadcom, Intel, and Nvidia do not agree on much by accident. They agree when the bottleneck is big enough that everyone benefits from fixing the plumbing. (openai.com)

### So what is the real takeaway?

The big shift is that AI competition is moving down the stack. Better models still matter. More GPUs still matter. But now the winners also need networking that keeps giant clusters synchronized and interactive products responsive under real-world failure and congestion. OpenAI’s MRC push is basically a public admission that the next leap in AI performance may come from the wires between the chips as much as the chips themselves. (openai.com)
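
For readers who want to see the tail-latency and packet-spraying arguments in numbers, here is a toy sketch. It is not MRC's algorithm: the function names, the independence assumption, and the round-robin spraying are all illustrative stand-ins (the article describes MRC as actively load-balancing, which this does not model). It shows two things: how likely a step is to stall when any one of many transfers can be late, and how pinning whole flows to single paths loads the busiest path compared with spreading each flow's packets across all paths.

```python
import random


def prob_step_delayed(p_slow: float, n_transfers: int) -> float:
    """Chance that at least one of n independent transfers is slow.

    Assumes transfers are independent, which real networks are not;
    the point is only that tiny per-transfer odds add up at scale.
    """
    return 1.0 - (1.0 - p_slow) ** n_transfers


def busiest_path_load(n_flows: int, n_paths: int, packets_per_flow: int,
                      spray: bool, rng: random.Random) -> int:
    """Return the packet count on the most heavily loaded path (toy model)."""
    load = [0] * n_paths
    for _ in range(n_flows):
        if spray:
            # Hypothetical spraying: round-robin this flow's packets
            # over every path, starting from a random one.
            start = rng.randrange(n_paths)
            for i in range(packets_per_flow):
                load[(start + i) % n_paths] += 1
        else:
            # Single-path transport: the whole flow rides one random path.
            load[rng.randrange(n_paths)] += packets_per_flow
    return max(load)


if __name__ == "__main__":
    rng = random.Random(0)
    # One-in-a-million slow transfers, a million transfers per step.
    print("P(step delayed):", prob_step_delayed(1e-6, 1_000_000))
    print("busiest path, pinned flows:",
          busiest_path_load(64, 16, 1024, spray=False, rng=rng))
    print("busiest path, sprayed flows:",
          busiest_path_load(64, 16, 1024, spray=True, rng=rng))
```

With these made-up parameters, a one-in-a-million slow transfer still delays roughly 63% of steps. In the sprayed case every path carries exactly 64 × 1024 / 16 = 4,096 packets, while the pinned case can never do better than that and usually does worse, because random placement leaves some paths with more flows than others. That gap is the hotspot.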
