Overview
OpenAI has collaborated with AMD, Broadcom, Intel, Microsoft and Nvidia to introduce Multipath Reliable Connection (MRC), an open-standard networking protocol designed to address throughput and routing bottlenecks in very large AI training clusters. The protocol was announced via the Open Compute Project on Tuesday.
How MRC works
MRC is an extension of RDMA over Converged Ethernet (RoCE), an InfiniBand Trade Association standard that enables hardware-accelerated remote direct memory access between GPUs and CPUs. The extension divides each network interface into many smaller links, creating a parallel fabric in which a single transfer can distribute its packets across hundreds of separate paths through the network.
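The announcement does not include reference code, but the packet-spraying idea can be sketched informally. In this hypothetical Python illustration (all names and the round-robin policy are assumptions, not details from the announcement), one transfer's packets are assigned across many parallel path identifiers:

```python
from itertools import cycle

def spray_packets(packets, path_ids):
    """Assign each packet of a single transfer to one of many parallel
    fabric paths, round-robin. A hypothetical sketch of multipath
    spraying, not MRC's actual scheduling policy."""
    assignment = {}
    paths = cycle(path_ids)
    for seq, pkt in enumerate(packets):
        # Record which path carries which packet so the receiver can
        # reassemble the transfer regardless of arrival order.
        assignment[seq] = (next(paths), pkt)
    return assignment

# A single transfer spread across 256 parallel paths:
paths = list(range(256))
transfer = [f"chunk-{i}" for i in range(1000)]
plan = spray_packets(transfer, paths)
```

The point of the sketch is only that no single path carries the whole transfer, which is what lets MRC keep link utilization high and limit the blast radius of any one failed path.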
Instead of relying on traditional dynamic routing, MRC uses IPv6 Segment Routing to let the sender explicitly specify the route each packet should take. According to the announcement, the protocol detects failures in the fabric and reroutes traffic around them on a microsecond timescale, a substantial reduction compared with the seconds or tens of seconds typical of conventional network fabrics.
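Source routing and fast reroute can likewise be sketched at a high level. In this hypothetical Python example (segment addresses, function names, and the reroute policy are all illustrative assumptions; real SRv6 encodes segments in an IPv6 extension header per RFC 8754), the sender attaches an ordered segment list to each path and simply drops any path that crosses a failed segment:

```python
def build_srv6_route(segments):
    """Sender-side explicit path: an ordered list of IPv6 segment
    identifiers the packet must traverse. A simplified stand-in for
    an SRv6 segment list."""
    return {"segments": list(segments), "next": 0}

def reroute(active_paths, failed_segment, routes_by_path):
    """On a failure signal, retain only the paths whose routes avoid
    the failed segment. MRC is said to perform this kind of
    reselection on a microsecond timescale; this sketch shows only
    the selection logic, not the timing."""
    return [p for p in active_paths
            if failed_segment not in routes_by_path[p]["segments"]]

# Two sender-chosen routes sharing endpoints but not middle hops:
routes = {
    0: build_srv6_route(["fd00::a", "fd00::b", "fd00::c"]),
    1: build_srv6_route(["fd00::a", "fd00::d", "fd00::c"]),
}
healthy = reroute([0, 1], "fd00::b", routes)  # path 0 is dropped
```

Because the sender already knows every path's full route, it can react to a failure locally instead of waiting for the fabric's routing protocols to reconverge, which is the claimed source of the microsecond-versus-seconds difference.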
Deployment and usage
OpenAI reports that it has deployed MRC across its largest Nvidia GB200 supercomputers. Specific deployments cited include the company’s site hosted with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. The company says MRC has been used in the training of multiple OpenAI models and that these efforts used hardware from Nvidia and Broadcom.
As an operational note, OpenAI reported that during training of a recent frontier model it rebooted four tier-1 switches without coordinating with the teams running training jobs in the cluster, indicating that the protocol and the environment it operates in have been exercised under live operational conditions.
Scale of user activity
OpenAI states that more than 900 million people use ChatGPT every week, a usage metric the company provided in conjunction with the protocol announcement.
Implications for infrastructure
MRC’s design, which splits interfaces into multiple links and applies sender-specified IPv6 Segment Routing, is intended to increase parallelism across network fabrics and reduce the time needed to route around failures. The protocol is already in production use on OpenAI’s largest GB200-based systems and relies on hardware from multiple vendors named in the announcement.
What is limited or not stated
The announcement focuses on the technical design and current deployments at OpenAI but does not provide broader adoption timelines, detailed performance benchmarks beyond the general microsecond-scale failover claim, or a roadmap for deployment across other operators' fleets.