The relentless drive toward more powerful AI models is pushing the boundaries of computing infrastructure, demanding a fundamental rethink of how supercomputers communicate. In the latest OpenAI Podcast episode, host Andrew Mayne sits down with Mark Handley and Greg Steinkrecker from OpenAI's networking and workload systems teams to discuss why AI needs a new kind of supercomputer network. The core challenge, they explain, lies in the sheer scale and tight synchronization of modern AI training, which overwhelms traditional networking paradigms.
The Bottleneck of Traditional Networking
As AI models grow in complexity and data requirements, efficiently connecting massive clusters of GPUs has become paramount. Mark Handley, a professor at University College London with decades of networking research experience, highlights the limitations of existing infrastructure. "Academia does not know a whole lot about what real data center workloads look like," Handley notes, contrasting theoretical models with the practical demands of AI training. Greg Steinkrecker, from OpenAI's workload systems team, echoes this sentiment: the rapid advance of AI has outpaced conventional data center networking, which was designed for the distributed, loosely synchronized traffic of the early internet. AI training is the opposite. In synchronous data-parallel training, every GPU must exchange gradients with its peers at every step, so the slowest link in the cluster gates the progress of all of them.
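To make that contrast concrete, the sketch below simulates a ring all-reduce, one common collective pattern behind this kind of lockstep gradient exchange. The episode does not say which collectives OpenAI actually uses; this is an illustrative simulation in plain Python with hypothetical node counts and tensor sizes, not production code, which in practice would use hardware-optimized libraries such as NCCL.

```python
import numpy as np

def ring_all_reduce(grads):
    """Simulate a ring all-reduce over `grads`, a list of per-node
    gradient vectors (one per simulated GPU). Every node ends up
    holding the element-wise sum, as a real collective would produce.
    Sizes here are hypothetical; real systems also overlap these
    communication steps with computation."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: reduce-scatter. After n-1 steps, node r holds the fully
    # reduced chunk (r + 1) % n. At each step, every node passes exactly
    # one chunk to its right-hand neighbour, so all links are busy at
    # once and no node can proceed until its neighbour's data arrives.
    for step in range(n - 1):
        for sender in range(n):
            c = (sender - step) % n          # chunk id being forwarded
            recver = (sender + 1) % n
            chunks[recver][c] = chunks[recver][c] + chunks[sender][c]

    # Phase 2: all-gather. The fully reduced chunks circulate around
    # the ring until every node has a complete copy of the summed
    # gradient.
    for step in range(n - 1):
        for sender in range(n):
            c = (sender + 1 - step) % n
            recver = (sender + 1) % n
            chunks[recver][c] = chunks[sender][c]

    return [np.concatenate(c) for c in chunks]

# Four simulated GPUs, each contributing a distinct gradient vector.
nodes = [np.full(8, rank + 1) for rank in range(4)]
for out in ring_all_reduce(nodes):
    print(out)  # every node prints the same sum: [10. 10. ... 10.]
```

Note that in both phases every node is sending and receiving on every step: the pattern saturates every link in the ring simultaneously, and a single delayed hop stalls all participants. That is exactly the behavior that best-effort, internet-style data center networks were never built around.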
