Next-generation PCIe key to composable infrastructure progress
High-performance computing and AI environments are just two strong uses for composable architecture. Explore benefits and drawbacks of PCIe in composable systems.
IT is littered with promising ideas that never take hold. As interest in composable infrastructure flatlined over the past five years, one can be forgiven for putting the technology for disaggregating hardware resources from their host in the category of failed ideas. We're not ready to give up on the concept yet, though the little usage data that is available isn't promising.
A survey of IT executives and managers by Statista found that only 11% of respondents had production implementations of composable systems, while 52%, a majority, weren't interested in the technology. Indeed, composable infrastructure had the lowest level of interest among the 10 technologies surveyed.
Nonetheless, there have been some significant product developments in the last couple of years. These developments provide hope for composable evangelists that the concept will find a home within the enterprise, particularly in organizations that build large clusters for high-performance computing (HPC) and AI workloads.
Hardware composability: Background and technology
The idea for composable hardware goes back about a decade, to when Calxeda built a scale-out, modular Arm server with an integrated 10 Gigabit Ethernet fabric. The fabric was fast for the time and linked adjacent nodes in the chassis. Calxeda -- since defunct but whose intellectual property is now used by Silver Lining Systems -- was initially tapped by HP for its Project Moonshot servers, arguably the first attempt at building a composable hardware-software system. However, HP subsequently dropped Calxeda's design in favor of Intel's then-new Atom processors. Moonshot has since evolved into HPE's Synergy lineup.
The concept evolved further when another startup, Liqid, launched in 2015 with a new approach to composable hardware based on PCIe fabrics. The centerpieces of Liqid's system include PCIe switches based on Broadcom components. A software management system helps configure and connect bare-metal servers composed of CPU, memory, network interface card (NIC), storage, GPU and field programmable gate array (FPGA) resources pooled in attached servers and expansion chassis.
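To make the idea of composing servers from pooled devices concrete, here is a minimal sketch of the bookkeeping such a management layer might do. It is not Liqid's software or API; the `ResourcePool` class, its method names and the device IDs are all hypothetical, and a real system would also have to program the PCIe switch to attach each device to the host.

```python
from collections import defaultdict


class ResourcePool:
    """Hypothetical pool of fabric-attached devices, grouped by type."""

    def __init__(self):
        self._free = defaultdict(list)  # device type -> list of free device IDs
        self._assigned = {}             # device ID -> (device type, host name)

    def add(self, dev_type, dev_id):
        """Register a device (e.g. a GPU in an expansion chassis) as free."""
        self._free[dev_type].append(dev_id)

    def free_count(self, dev_type):
        return len(self._free[dev_type])

    def compose(self, host, request):
        """Assign request[dev_type] devices of each type to host.

        The request is all-or-nothing: nothing is assigned unless every
        requested device type has enough free devices.
        """
        for dev_type, count in request.items():
            if self.free_count(dev_type) < count:
                raise RuntimeError(f"not enough free {dev_type} devices")
        granted = []
        for dev_type, count in request.items():
            for _ in range(count):
                dev_id = self._free[dev_type].pop()
                self._assigned[dev_id] = (dev_type, host)
                granted.append(dev_id)
        return granted

    def release(self, host):
        """Return every device assigned to host to the free pool."""
        for dev_id, (dev_type, owner) in list(self._assigned.items()):
            if owner == host:
                del self._assigned[dev_id]
                self._free[dev_type].append(dev_id)


# Example: carve two GPUs and one NVMe drive out of the pool for one host.
pool = ResourcePool()
for i in range(4):
    pool.add("gpu", f"gpu{i}")
for i in range(2):
    pool.add("nvme", f"nvme{i}")
devices = pool.compose("host-a", {"gpu": 2, "nvme": 1})
```

Releasing the host with `pool.release("host-a")` returns its devices to the pool, which is the property that lets expensive accelerators be reused across workloads rather than stranded in one server.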
Liqid initially used an internally designed switch built around silicon from PLX. It later adopted Broadcom's PEX8700 and PEX9700 PCIe Gen 3.0 switch silicon. In mid-2020, Liqid and Broadcom collaborated on a PCIe Gen 4.0 reference design. The design uses Broadcom's PEX88000 switch, which doubles the throughput of its Gen 3.0 predecessor to 16 gigatransfers per second (GT/s) per lane, or 256 GT/s for an x16 port. The switches are available in 24- or 48-port configurations. Each port defaults to four PCIe lanes, configurable to x8 or x16, with 100 nanosecond port-to-port latency.
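The port figures above follow from simple arithmetic on lane count and per-lane signaling rate. The sketch below assumes the published per-lane rates of 8 GT/s for Gen 3.0 and 16 GT/s for Gen 4.0, along with the 128b/130b line encoding both generations use. The function names are illustrative, and the usable figure ignores packet and protocol overhead, so real throughput is somewhat lower.

```python
# Per-lane signaling rate in gigatransfers per second (GT/s), by PCIe generation.
GEN_RATES_GT_S = {3: 8.0, 4: 16.0}

# Gen 3.0 and Gen 4.0 both use 128b/130b encoding: 128 payload bits per 130 sent.
ENCODING_EFFICIENCY = 128 / 130


def raw_rate_gt_s(gen, lanes):
    """Aggregate signaling rate of a link, in GT/s, per direction."""
    return GEN_RATES_GT_S[gen] * lanes


def usable_gbytes_per_s(gen, lanes):
    """Approximate usable bandwidth in GB/s per direction, after line encoding.

    Ignores packet header and flow-control overhead.
    """
    return raw_rate_gt_s(gen, lanes) * ENCODING_EFFICIENCY / 8


if __name__ == "__main__":
    for gen in (3, 4):
        for lanes in (4, 8, 16):
            print(
                f"Gen {gen}.0 x{lanes}: "
                f"{raw_rate_gt_s(gen, lanes):.0f} GT/s raw, "
                f"{usable_gbytes_per_s(gen, lanes):.2f} GB/s usable"
            )
```

Under these assumptions, a default x4 Gen 4.0 port signals at 64 GT/s and an x16 port at 256 GT/s, which works out to about 31.5 GB/s of usable bandwidth per direction.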
PCIe makes an ideal interconnect for server clusters and composable infrastructure because of its ubiquity in modern processors, high bandwidth (16 GT/s per lane in Gen 4.0, or roughly 64 Gbps for a four-lane link), low latency, lossless transport and direct memory access (DMA) support. Its nontransparent bridging feature enables the host processor to see switch ports as PCIe endpoints. Gen 4.0 switches, like the Broadcom PEX88000, embed an Arm processor for configuration, management and handling hot-plug events. They provide nonblocking, line-speed performance with features such as I/O sharing and DMA.
Drawbacks of PCIe include higher port costs than Ethernet and severe limitations on cable length that confine fabrics to a server rack. Consequently, Ethernet and InfiniBand have emerged as alternatives for composable infrastructure. For example, Liqid announced multifabric support for composability of all resource types -- CPU, memory, GPU, NIC, FPGA and storage -- across all major fabric types, including PCIe Gen 3.0, Gen 4.0, Ethernet and InfiniBand. In contrast, HPE only supports Ethernet -- and Fibre Channel (FC) for storage -- in its Synergy composable product.
Applications for composable architecture
Composable infrastructure was initially proposed as a way to cost-effectively share expensive GPUs in an AI environment, particularly for more computationally intense model training. However, composable is also viable for HPC clusters and bare-metal cloud infrastructure, particularly for smaller, niche providers. It also works for multi-tenant edge compute clusters, for example, in 5G base stations or cloud "micro" regions. Multinode composable fabrics, using PCIe-to-NVMe, NVMe-oF, FC or InfiniBand, are a popular option for distributed, scale-out storage systems in which pools of NVMe disks are shared with a server cluster.
Although unrelated to PCIe fabrics, PCIe NIC, GPU and FPGA cards are increasingly shared and virtually carved up among multiple VMs through technologies such as Nvidia virtual GPU, FPGA sharing, SmartNICs and data processing units (DPUs). For example, VMware recently introduced Project Monterey to extend some features of VMware Cloud Foundation to DPUs like Nvidia's BlueField-2. The software enables the DPU's multiple Arm cores to host an ESXi instance that offloads network and storage services from the host CPU.
Longer term, Kit Colbert, VMware Cloud's CTO, sees Monterey evolving to support multiple hosts and other hardware accelerators.
"[The project] enables us to rethink cluster architecture and to make clusters more dynamic, more API-driven and more optimized to application needs," he said in a blog post. "We enable this through hardware composability."
The options to share and dynamically allocate hardware resources across servers are multiplying. They provide wider access to hardware accelerators and lower costs through greater resource utilization.