Generative AI is taking the world by storm, and businesses are looking for ways to adopt AI technologies into their processes. In some cases, they deploy their own AI fabric and GPU scale-out network within data centers and private clouds.
Building GenAI data centers from a network perspective differs greatly from traditional data center buildouts, and even from those designed to support high-performance computing (HPC).
Four key aspects of GenAI network fabrics can set enterprises on the right path for architecting a GenAI data center that meets a business's needs today and into the future.
- Half of all the time spent analyzing AI training data occurs on the network. The focus tends to fall on the stacks of GPUs doing the processing, but half the work of analyzing AI training data happens within the network itself. This puts the speed and agility with which networks transport large data sets front and center: the pace of a GenAI application is only as fast as its slowest component. If properly built, the network can be eliminated as a potential performance bottleneck.
Building a highly scalable network is also key for GenAI data centers because it provides capacity for future growth. Switch fabrics must use hardware that can expand both horizontally and vertically, as well as network OSes that support advanced features, such as packet spraying, load awareness and intelligent traffic redirection. These features automatically reroute traffic within the network and between GPU processing units when links or GPUs become overloaded.
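The load-aware behavior described above can be illustrated with a toy sketch. The function below is a hypothetical simplification, not any vendor's implementation: instead of hashing an entire flow onto one uplink the way classic ECMP does, each chunk of traffic is steered to whichever uplink currently has the least queued bytes. All names (`spray`, `uplink-1`, the byte counts) are illustrative.

```python
def spray(flows, links):
    """Toy load-aware packet spraying.

    'flows' is a list of (flow_id, chunk_size_bytes) tuples.
    'links' maps uplink name -> bytes currently queued on that uplink.
    Each chunk is placed on the least-loaded uplink at that moment,
    rather than pinning the whole flow to one hashed path.
    """
    placement = []
    for flow_id, size in flows:
        # Choose the uplink with the smallest queue right now.
        link = min(links, key=links.get)
        links[link] += size
        placement.append((flow_id, link))
    return placement

# Four equally sized GPU-to-GPU chunks spread across three uplinks.
links = {"uplink-1": 0, "uplink-2": 0, "uplink-3": 0}
flows = [("gpu-0", 400), ("gpu-1", 400), ("gpu-2", 400), ("gpu-3", 400)]
print(spray(flows, links))
print(links)
```

The design point is that per-chunk placement keeps the load spread nearly even, whereas per-flow hashing can stack several large AI flows onto the same link.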
- Workloads are fewer in number but greater in size. Unlike HPC, which works to drive network latency to ultralow levels, AI data center buildouts must focus on high throughput capacity. HPC networks are designed to transport thousands of simultaneous workloads requiring minimal latency, while AI workloads are far fewer in number but much larger in size.
From a speed perspective, throughput matters more than latency. For GenAI data centers, the ultralow-latency advantage of the InfiniBand fabrics used in HPC is diminished, and higher-throughput Ethernet networks may soon become the norm.
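A rough back-of-envelope calculation shows why. Transfer time is propagation latency plus serialization time, and for the large flows typical of AI training, serialization dwarfs latency. The numbers below (a 100 Gbps link, 1 microsecond of latency) are assumptions chosen only to make the comparison concrete.

```python
def transfer_time_ms(size_gb, bandwidth_gbps, latency_us):
    """Total time for one flow: propagation latency + serialization time."""
    serialization_ms = (size_gb * 8) / bandwidth_gbps * 1000  # GB -> Gb -> ms
    return latency_us / 1000 + serialization_ms

# A tiny HPC-style message (1 KB): latency is a meaningful share of the total.
hpc = transfer_time_ms(size_gb=1e-6, bandwidth_gbps=100, latency_us=1)

# A large AI gradient exchange (10 GB): bandwidth completely dominates.
genai = transfer_time_ms(size_gb=10, bandwidth_gbps=100, latency_us=1)

print(f"1 KB message:   {hpc:.5f} ms")
print(f"10 GB exchange: {genai:.3f} ms")
```

For the 10 GB flow, cutting latency in half saves a fraction of a microsecond, while doubling bandwidth saves hundreds of milliseconds, which is why throughput is the figure of merit for GenAI fabrics.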
- Network needs to handle dense connectivity. Shoehorning high-density racks of GPUs into a facility for GenAI processing is no easy task. Up to four times the switch port density is required, along with all the network cabling that comes with it. According to a research report from the Dell'Oro Group, as much as 20% of all data center switch ports will be allocated to AI servers by 2027. These dense GPU deployments are also a challenge from a power and cooling perspective and may require significant facility changes, including liquid cooling options and modifications to power capacity.
Early GenAI adopters have concluded that the use of multisite or micro data centers is the best option to accommodate this level of density. And, yet again, this puts pressure on the network interconnecting these sites to be as high-performing and resilient as possible.
- Network orchestration is a must-have. With all the complexities of GenAI data center networks combined with the need for optimized performance and high reliability, GenAI networks should not be managed using traditional command-line syntax and third-party performance monitoring tools. Instead, organizations should deploy an orchestration platform that delivers several useful features and performance insights that are baked into the control plane architecture from the start.
Orchestration platforms provide several benefits that greatly enhance the management of GenAI data centers, including the following:
- The automatic creation of the data center network underlay. This removes much of the complexity around brand-new buildouts and significantly decreases the time needed to stand up a network to the point where it's ready for network and network security policy creation.
- Intuitive and automated network overlay creation and ongoing NetOps management. Using a GUI, orchestration platforms let administrators create network and network security policies in a centralized location and automatically push the commands to those data center switches that require them. This enables the creation of network policy without having to learn complex CLI commands. The policy is created using standards-based guardrails within the system that largely eliminate manual configuration errors.
- Increased performance and health visibility. Orchestration tools also collect and analyze health and performance data from network switching hardware using several traditional and modern methods. The collection and analysis of network telemetry data is the newest kid on the block when it comes to health analysis. Here, the switch is configured to stream real-time performance measurements to the orchestrator using specialized protocol standards, such as gNMI and NETCONF. These protocols are far more powerful than legacy monitoring protocols, such as Simple Network Management Protocol, and help proactively identify performance problems so they can be remediated before causing network slowdowns or outages.
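The kind of proactive analysis described above can be sketched in a few lines. This is a hypothetical stand-in, not a real gNMI client: the `samples` dict simulates per-interface rates that, in practice, an orchestrator would derive from octet counters streamed over a gNMI subscription. The function name, interface names and 80% threshold are all illustrative assumptions.

```python
def flag_hot_interfaces(samples, capacity_bps, threshold=0.8):
    """Flag interfaces whose measured rate crosses a utilization threshold.

    'samples' maps interface name -> current rate in bits per second,
    standing in for values computed from streamed telemetry counters.
    Returns the sorted list of interfaces at or above 'threshold'
    utilization, so they can be remediated before congestion causes
    slowdowns or outages.
    """
    return sorted(
        iface for iface, rate in samples.items()
        if rate / capacity_bps >= threshold
    )

# Simulated telemetry snapshot for three 100G interfaces.
samples = {"eth1": 95e9, "eth2": 40e9, "eth3": 88e9}
print(flag_hot_interfaces(samples, capacity_bps=100e9))
# eth1 and eth3 exceed 80% utilization of a 100G link
```

An orchestrator running a check like this continuously against streamed counters can redirect traffic or alert operators well before a link saturates.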
3 pillars of GenAI data center success
GenAI traffic flows differ greatly from traditional flows. Because a GenAI model cannot progress until every packet of a training data set has been analyzed, time is of the essence. To achieve efficiency within GenAI fabrics, network architects must strive for the three pillars of GenAI data center success:
- Sufficient network throughput.
- Automated optimization.
- Granular network performance insights.
Only then will the underlying network fabric foundation be set for success in enterprise GenAI endeavors.