Ethernet scale-up networking powers AI infrastructure
Demands for data center and networking capacity for AI are likely to surge. Learn why AI providers are collaborating on Ethernet technology to support this growth.
Fast forward three years, and the world's largest AI clusters will be largely built on Ethernet.
That's the claim Broadcom made during a presentation at ONUG's Fall 2025 AI Networking Summit. Broadcom, alongside other major companies such as Cisco, Meta and Nvidia, is collaborating on Ethernet for Scale-Up Networking (ESUN). Their goal is to advance Ethernet in the growing scale-up domain for AI systems.
Learn why Ethernet can help companies prepare for the rapid expected growth of AI networks.
The network demands of AI
According to an April 2025 McKinsey report, infrastructure investors plan to add 124 gigawatts of data center compute capacity between 2025 and 2030. OpenAI alone aims to contribute 20% of this capacity, a figure that translates to roughly 75 million XPUs -- GPUs, Tensor Processing Units and other custom accelerators -- deployed over the next five years.
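As a rough sanity check on those figures, the implied power budget per accelerator can be computed directly. This is a back-of-envelope sketch under hypothetical assumptions: that OpenAI's 20% share applies to the full 124 GW and that all of that power goes to XPUs, ignoring cooling and networking overhead.

```python
# Back-of-envelope check of the capacity figures above.
# Assumptions (hypothetical): the 20% share applies to the full 124 GW,
# and all of it powers accelerators directly.
total_capacity_gw = 124        # planned data center additions, 2025-2030
openai_share = 0.20            # OpenAI's stated contribution
xpu_count = 75_000_000         # projected XPU deployments

openai_capacity_w = total_capacity_gw * 1e9 * openai_share
watts_per_xpu = openai_capacity_w / xpu_count
print(f"~{watts_per_xpu:.0f} W per XPU")
```

The result, around 330 W per XPU, is at least in the same range as the power draw of current accelerators, which suggests the two projections are roughly consistent.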
This amount of compute is driven by the increasing complexity of large language models. LLMs not only have a growing number of parameters; they are also becoming multimodal and incorporating processing-intensive capabilities such as memory and reasoning.
In addition, networks will play an increasingly important role in this buildout. At scale, machine learning requires connecting potentially millions of disparate XPUs into large superclusters. Networks provide the glue -- load balancing, congestion control and failure recovery mechanisms -- that enables efficient job completion times across these superclusters.
"The network is the supercomputer in AI infrastructure," said Hasan Siraj, head of software products and ecosystem at Broadcom.
Dimensions of scale in AI networks
AI networks scale along three dimensions:
- Scale up.
- Scale out.
- Scale across.
Each type presents different requirements and challenges.
Scale-up networks
Scale-up network configurations house approximately 100 XPUs within a single rack. All accelerators are directly connected, creating a one-hop network in which any XPU can access another's memory with minimal latency.
Key requirements for scale-up include high networking bandwidth, efficient data transfer and reliable transport protocols. The aggregate bandwidth of high-bandwidth memory modules is expected to increase significantly over the next few years.
Scale-out networks
Scale-out networks connect multiple scale-up racks together, potentially linking thousands of XPUs within a single data center. The architecture gets more complex at this stage, especially when moving from a two-tier to a three-tier network. This makes load balancing and congestion control extremely difficult.
Two-tier architectures have significant advantages over three-tier architectures. Benefits of two-tier architectures include the following:
- Fewer optical transceivers required.
- Lower latency.
- Higher reliability.
- Better performance.
- Lower power consumption.
The increased complexity of three-tier architectures creates various challenges. Drawbacks of three-tier architectures include the following:
- More optical transceivers needed.
- Higher latency -- five hops instead of three.
- Three times the number of switches.
- More link failures.
- Increased power consumption.
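The hop counts in the lists above follow from the topology itself. In a folded Clos fabric, worst-case traffic climbs to the topmost switch tier and back down, so adding a tier adds two switch hops. A minimal sketch of that arithmetic, assuming symmetric tiers and worst-case cross-fabric traffic:

```python
# Hop-count arithmetic for a folded-Clos fabric (illustrative sketch).
# Assumption: worst-case traffic traverses the topmost tier, so the path
# crosses one switch per tier going up and per tier coming down,
# sharing the top switch: 2 * tiers - 1 switch hops.
def worst_case_switch_hops(tiers: int) -> int:
    """Worst-case switch hops between two endpoints in a `tiers`-tier Clos."""
    return 2 * tiers - 1

print(worst_case_switch_hops(2))  # two-tier (leaf-spine)
print(worst_case_switch_hops(3))  # three-tier
```

This reproduces the article's comparison: three hops for a two-tier fabric versus five for a three-tier one, which is where the latency and reliability penalties come from.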
Scale-across networks
A 10-megawatt data center can house approximately 6,000 XPUs. Larger clusters require lossless connections between multiple data center buildings through scale-across implementations. Scale-across demands switches capable of deep buffering and line-rate encryption to maintain performance across facilities.
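The 10 MW figure implies an all-in power density per accelerator, which can be checked with the same kind of back-of-envelope arithmetic. The assumption here is hypothetical: the full facility budget is divided evenly across XPUs, including cooling, networking and other overhead.

```python
# Rough density check of the figure above.
# Assumption (hypothetical): the full 10 MW facility budget is split
# evenly across XPUs, including cooling and networking overhead.
facility_mw = 10
xpus = 6_000
kw_per_xpu = facility_mw * 1000 / xpus
print(f"~{kw_per_xpu:.2f} kW per XPU, all-in")
```

At roughly 1.7 kW per XPU all-in, the facility-level number leaves headroom above the chip's own draw for cooling and the network fabric.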
Benefits of Ethernet for AI scale-up
Many large hyperscalers already rely on Ethernet today. Advantages of Ethernet include the following:
- Open architecture. Ethernet is an open, standards-based technology managed and maintained by the IEEE and other standards bodies within a large ecosystem. This encourages innovation and prevents vendor lock-in. Ethernet has standards at every layer of the stack to enable scale-up networking, while leaving room for organizations that want to incorporate their own memory semantics and scheduling.
- Reliability. Modern Ethernet technologies minimize the risk of downtime by implementing significant congestion management and flow control mechanisms for lossless communication.
- Low latency. Modern high-speed Ethernet technology -- 400 gigabits per second (Gbps) and 800 Gbps -- has the features -- such as cut-through switching and precision latency management -- to meet the demands of AI networks.
- Efficiency. Ethernet is power- and cost-efficient. It's also flexible, enabling backward compatibility and the use of various media such as copper or fiber optics.
Ethernet scale-up networking development
ESUN is helping create standards for scale-up networking development. These specifications outline the principles for designing high-performance, open, large-scale AI data center infrastructure. ESUN focuses on this development by addressing the following two key points:
- Network functionality. Focuses on how traffic is sent across network switches, including lossless data transfer, error handling and protocol headers.
- XPU-endpoint functionality. Focuses on designing aspects of XPUs that are often tightly coupled to XPU architecture, such as workload partitioning, memory ordering and load balancing. Within ESUN, the SUE-Transport workstream aims to work on this endpoint functionality.
By addressing these areas of functionality, ESUN enables the following:
- Technical collaboration between operators and manufacturers.
- Interoperability of XPU network interfaces and Ethernet switch application-specific integrated circuits.
- Resilient, lossless single-hop and multi-hop components.
- Standards and best practices that align with other bodies, such as the Ultra-Ethernet Consortium and IEEE 802.3.
- Adoption across the industry through Ethernet's mature ecosystem.
If 75 million XPUs emerge in the next five years, they are unlikely to come from a single company -- the market will have some diversity. Some hyperscalers are building their own XPUs, for example. Ethernet promises to enable this diversity and innovation in the AI infrastructure industry.
Ben Lutkevich is a site editor for Informa TechTarget. Previously, he wrote definitions and features for WhatIs.