Overview
OpenAI has collaborated with AMD, Broadcom, Intel, Microsoft and Nvidia to introduce Multipath Reliable Connection (MRC), an open-standard networking protocol designed to address throughput and routing bottlenecks in very large AI training clusters. The protocol was announced via the Open Compute Project on Tuesday.
How MRC works
MRC is an extension of RDMA over Converged Ethernet (RoCE), an InfiniBand Trade Association standard that enables hardware-accelerated remote direct memory access between GPUs and CPUs. The extension divides each network interface into many smaller links, creating a parallel fabric in which a single transfer can distribute its packets across hundreds of separate paths through the network.
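The announcement does not include reference code, but the packet-spraying idea can be sketched informally. In this hypothetical Python illustration (all names and the round-robin policy are assumptions, not details from the announcement), one transfer's packets are assigned across many parallel path identifiers:

```python
from itertools import cycle

def spray_packets(packets, path_ids):
    """Assign each packet of a single transfer to one of many parallel
    fabric paths, round-robin. A hypothetical sketch of multipath
    spraying, not MRC's actual scheduling policy."""
    assignment = {}
    paths = cycle(path_ids)
    for seq, pkt in enumerate(packets):
        # Record which path carries which packet so the receiver can
        # reassemble the transfer regardless of arrival order.
        assignment[seq] = (next(paths), pkt)
    return assignment

# A single transfer spread across 256 parallel paths:
paths = list(range(256))
transfer = [f"chunk-{i}" for i in range(1000)]
plan = spray_packets(transfer, paths)
```

The point of the sketch is only that no single path carries the whole transfer, which is what lets MRC keep link utilization high and limit the blast radius of any one failed path.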
Instead of relying on traditional dynamic routing, MRC uses IPv6 Segment Routing to let the sender explicitly specify the route each packet should take. According to the announcement, the protocol detects failures in the fabric and reroutes traffic around them on a microsecond timescale, a substantial reduction compared with the seconds or tens of seconds typical of conventional network fabrics.
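Source routing and fast reroute can likewise be sketched at a high level. In this hypothetical Python example (segment addresses, function names, and the reroute policy are all illustrative assumptions; real SRv6 encodes segments in an IPv6 extension header per RFC 8754), the sender attaches an ordered segment list to each path and simply drops any path that crosses a failed segment:

```python
def build_srv6_route(segments):
    """Sender-side explicit path: an ordered list of IPv6 segment
    identifiers the packet must traverse. A simplified stand-in for
    an SRv6 segment list."""
    return {"segments": list(segments), "next": 0}

def reroute(active_paths, failed_segment, routes_by_path):
    """On a failure signal, retain only the paths whose routes avoid
    the failed segment. MRC is said to perform this kind of
    reselection on a microsecond timescale; this sketch shows only
    the selection logic, not the timing."""
    return [p for p in active_paths
            if failed_segment not in routes_by_path[p]["segments"]]

# Two sender-chosen routes sharing endpoints but not middle hops:
routes = {
    0: build_srv6_route(["fd00::a", "fd00::b", "fd00::c"]),
    1: build_srv6_route(["fd00::a", "fd00::d", "fd00::c"]),
}
healthy = reroute([0, 1], "fd00::b", routes)  # path 0 is dropped
```

Because the sender already knows every path's full route, it can react to a failure locally instead of waiting for the fabric's routing protocols to reconverge, which is the claimed source of the microsecond-versus-seconds difference.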
Deployment and usage
OpenAI reports that it has deployed MRC across its largest Nvidia GB200 supercomputers. Specific deployments cited include the company’s site hosted with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. The company says MRC has been used in the training of multiple OpenAI models and that these efforts used hardware from Nvidia and Broadcom.
As an operational note, OpenAI reported that during training of a recent frontier model it rebooted four tier-1 switches without coordinating with the teams running training jobs in the cluster, indicating that the protocol and the environment it operates in have been exercised under live operational conditions.
Scale of user activity
OpenAI states that more than 900 million people use ChatGPT every week, a usage metric the company provided in conjunction with the protocol announcement.
Implications for infrastructure
MRC’s design, which splits interfaces into multiple links and applies sender-specified IPv6 Segment Routing, is intended to increase parallelism across network fabrics and reduce the time needed to route around failures. The protocol is already in production use on OpenAI’s largest GB200-based systems and relies on hardware from multiple vendors named in the announcement.
What is limited or not stated
The announcement focuses on the technical design and current deployments at OpenAI but does not provide broader adoption timelines, detailed performance benchmarks beyond the general microsecond-scale failover claim, or a roadmap for deployment across other operators' fleets.