Nvidia finalizes GPUDirect Storage 1.0 to accelerate AI, HPC
Nvidia's GPUDirect Storage driver has reached 1.0 status, enabling direct memory access between GPU memory and storage to boost the performance of data-intensive AI, HPC and analytics workloads.
The Magnum IO GPUDirect Storage software that Nvidia introduced in late 2019 to accelerate AI, analytics and high-performance computing workloads has finally reached 1.0 status after more than a year of beta testing.
The Magnum IO GPUDirect Storage driver enables users to bypass the server CPU and transfer data directly between high-performance GPU memory and storage, via a PCIe switch, to lower I/O latency and increase throughput for the most demanding data-intensive applications.
Dion Harris, lead technical product marketing manager of accelerated computing at Nvidia, said GPUDirect Storage lowers CPU utilization by a factor of three and enables CPUs to focus on the work they were built for -- running processing-intensive applications.
At this week's ISC High Performance 2021 Digital conference, Nvidia announced that it added the Magnum IO GPUDirect Storage software to its HGX AI supercomputing platform, along with the new A100 80 GB PCIe GPU and NDR 400G InfiniBand networking. Nvidia had to collaborate with enterprise network and storage providers to enable GPUDirect Storage.
Storage vendors support GPUDirect
Storage vendors with generally available products integrating GPUDirect Storage include DataDirect Networks, Vast Data and WekaIO. Others with products in the works include Dell Technologies, Excelero, Hewlett Packard Enterprise, Hitachi Vantara, IBM, Micron, NetApp, Pavilion Data Systems and ScaleFlux.
Steve McDowell, a senior technology analyst at Moor Insights & Strategy, said Nvidia's GPUDirect Storage software will most often see use with high-performance storage arrays that can deliver the throughput required by the GPUs and support a high-performance remote direct memory access (RDMA) interconnect such as InfiniBand. Examples of GPUDirect Storage pairings include IBM's Elastic Storage System (ESS) 3200, NetApp's EF600 all-flash NVMe array and Dell EMC's PowerScale scale-out NAS system, he said.
"GPUDirect Storage is designed for production-level and heavy research deep-learning environments," McDowell said, noting the technology targets installations with a number of GPUs working on training algorithms where I/O is a bottleneck.
Nvidia DGX SuperPod with IBM ESS 3200
IBM announced this week that it had updated its storage reference architectures for two-, four- and eight-node Nvidia DGX Pod configurations and committed to support a DGX SuperPod with its ESS 3200 by the end of the third quarter. SuperPods start at 20 Nvidia DGX A100 systems and can scale to 140 systems.
Douglas O'Flaherty, program director of portfolio GTM and alliances at IBM, said using GPUDirect Storage on a two-node Nvidia DGX A100 can nearly double the data throughput, from 40 GB per second to 77 GB per second, with a single IBM ESS 3200 running Spectrum Scale.
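To put the "nearly double" figure in concrete terms, the short Python sketch below works through the arithmetic using the two throughput numbers IBM cited. The 1,000 GB data set size and the helper function names are assumptions made purely for illustration; they do not come from IBM or Nvidia, and real transfer times would depend on the workload.

```python
# Throughput figures quoted by IBM for a two-node Nvidia DGX A100
# with a single ESS 3200 running Spectrum Scale.
BASELINE_GB_PER_SEC = 40.0  # without GPUDirect Storage
GDS_GB_PER_SEC = 77.0       # with GPUDirect Storage

# Hypothetical data set size, chosen only to illustrate the math.
DATASET_GB = 1000.0


def speedup(baseline: float, improved: float) -> float:
    """Return the ratio of improved throughput to baseline throughput."""
    return improved / baseline


def transfer_seconds(dataset_gb: float, gb_per_sec: float) -> float:
    """Return the ideal time to move a data set at a sustained rate."""
    return dataset_gb / gb_per_sec


ratio = speedup(BASELINE_GB_PER_SEC, GDS_GB_PER_SEC)
before = transfer_seconds(DATASET_GB, BASELINE_GB_PER_SEC)
after = transfer_seconds(DATASET_GB, GDS_GB_PER_SEC)

print(f"Throughput ratio: {ratio:.3f}x")
print(f"Increase: {(ratio - 1) * 100:.1f}%")
print(f"Ideal time for {DATASET_GB:.0f} GB: {before:.1f} s -> {after:.1f} s")
```

The ratio works out to a little over 1.9, which is why the gain is described as nearly, rather than fully, double. The timing lines assume the quoted rates are sustained for the whole transfer, which is an idealization.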
"What it showcases for Nvidia is just how much data a GPU can start to work through. And what it showcases for us is that, as your developers and applications embrace this, especially for these large data environments, you really need to make sure that you haven't just moved the bottleneck down into storage," O'Flaherty said. "With our latest version of ESS 3200, we did a tremendous amount of throughput with just a very few systems."
O'Flaherty said customers most interested in Nvidia GPUDirect Storage include automobile manufacturers working on self-driving vehicles, telecommunications providers with data-heavy natural language processing workloads, financial services firms looking to decrease latency, and genomics companies with large, complex data sets.
Startup Vast Data has already received a handful of large orders for GPUDirect Storage-enabled systems, according to CMO and co-founder Jeff Denworth. Examples include a media studio doing volumetric data capture to create 3D video, financial services firms running the Apache Spark analytics engine on the Rapids open GPU data science framework, and HPC centers and manufacturers using PyTorch machine learning libraries.
Denworth claimed that using GPUDirect Storage in Rapids and PyTorch projects has enabled Vast Data to feed a standard Spark or Postgres database about 80 times faster than a conventional NAS system could.
"We've been pleasantly surprised by the amount of projects that we're being engaged on for this new technology," Denworth said. "And it really isn't simply a matter of just making certain AI applications run faster. There's a whole gamut of GPU-oriented workloads where customers are now starting to gravitate toward this GPUDirect Storage mode as the way to feed these extremely hungry machines."
Carol Sliwa is a TechTarget senior writer covering storage arrays and drives, flash and memory technologies, and enterprise architecture.