IBM has expanded its list of storage partnerships with supercomputer manufacturer Nvidia on reference architectures designed for artificial intelligence deployments.
The IBM SpectrumAI with Nvidia DGX reference architecture combines Nvidia's powerful DGX-1 servers and AI software stack with IBM's flash storage and Spectrum Scale parallel file system software. The new IBM AI converged infrastructure option will be sold exclusively through channel partners.
The partnership, revealed Tuesday, is IBM's second AI reference architecture with Nvidia. Last June, IBM made available an AI reference architecture designed for its Power-based servers with Nvidia GPUs, Spectrum storage software and flash storage.
IBM is not the only vendor to strike a deal with Nvidia on a reference architecture tailored to AI use cases. Pure Storage unveiled its AI-ready infrastructure, AIRI, in March. NetApp followed in August, and DataDirect Networks (DDN) launched its A³I platform in October.
Like IBM, Dell EMC has multiple Nvidia AI partnerships. Dell EMC in November added an AI reference architecture bundling Nvidia DGX servers with its all-flash Isilon storage. That followed Dell EMC's earlier AI Ready Solution for Deep Learning that uses Dell PowerEdge servers equipped with high-performance Nvidia GPUs and all-flash Isilon F800.
IBM AI reference architecture storage options
The IBM SpectrumAI with Nvidia DGX reference architecture gives customers two storage options. IBM Elastic Storage Server (ESS) is available with flash-based solid-state drives, capable of throughput ranging from 10 GBps to 40 GBps. In mid-2019, IBM plans to add an NVMe-based IBM FlashSystem 9100 configuration with Spectrum Scale that the company claims will deliver up to 40 GBps of throughput.
Eric Herzog, chief marketing officer and vice president of global channels for IBM storage, said the IBM AI system could scale from 300 TB on the low end to 8 exabytes or more on the high end. IBM SpectrumAI is composable, and customers can expand the system by adding servers, storage or additional components separately.
A rack with nine Nvidia DGX servers equipped with 72 Tesla V100 Tensor Core GPUs demonstrated 120 GBps of throughput, according to IBM. Herzog said bandwidth, rather than I/O or latency, is the key metric in AI deployments.
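To put the rack-level figure in per-unit terms, the arithmetic can be sketched as follows. This is a back-of-the-envelope calculation, not an IBM benchmark: it assumes the 120 GBps is spread evenly across all nine servers and 72 GPUs, which IBM did not state, and the function name `per_unit_throughput` is purely illustrative.

```python
def per_unit_throughput(total_gbps: float, servers: int, gpus: int) -> dict:
    """Split an aggregate throughput figure evenly across servers and GPUs.

    Assumption: load is uniform across units. Real workloads rarely are.
    """
    if servers <= 0 or gpus <= 0:
        raise ValueError("servers and gpus must be positive")
    return {
        "gpus_per_server": gpus / servers,
        "gbps_per_server": total_gbps / servers,
        "gbps_per_gpu": total_gbps / gpus,
    }


# Figures reported by IBM for the demonstrated rack.
rack = per_unit_throughput(total_gbps=120.0, servers=9, gpus=72)
for name, value in rack.items():
    print(f"{name}: {value:.2f}")
```

Under that even-split assumption, the rack works out to eight GPUs per server, roughly 13.3 GBps per server and about 1.7 GBps per GPU.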
One potential differentiator for the new IBM AI reference architecture is the additional storage options that a customer can tack on to extend the base reference architecture for an extra fee. For instance, a customer could add IBM Spectrum Archive or IBM Cloud Object Storage to archive data, or they could add Spectrum Discover through an API to manage data.
"It gives you a way to create a very end-to-end AI reference architecture," Herzog said. "AI needs to constantly learn, and that means the data sets just get bigger and bigger. So, having that scalability with Spectrum Discover as a way to easily catalog and archive makes that a differentiated solution."
IBM AI differentiators
Chirag Dekate, a senior director and analyst at Gartner, said another key differentiator is IBM's strategy to make data a "first-class citizen in the AI pipeline." He said connecting the SpectrumAI ecosystem to DGX through Mellanox InfiniBand networking gear would enable data access at low latency and high bandwidth in the same namespace as the GPU processors.
"That essentially means a lot of the RDMA [remote direct memory access] operations that you would commonly do on the compute side you can theoretically do on the storage side, as well," Dekate said.
Dekate said IBM also enables parallel access to data through Spectrum Scale for higher throughput and extreme scalability with machine learning and advanced deep learning models. He said most NFS-based infrastructure options today provide sequential access to data, and if multiple nodes try to access data, users can run into I/O bottlenecks. To avoid the problem, engineers replicate data to different nodes and individually access the independent replicas, he said.
"Because IBM's Spectrum Scale is an inherently parallel file system, they can actually have one copy and expose and provide parallel access to different portions of the data simultaneously," Dekate said. "And by adding metadata layers and logical separation to physical data, they enable seamless access across several nodes."
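The access pattern Dekate describes, several readers pulling different portions of one copy of the data at the same time, can be illustrated with ordinary file I/O. The sketch below is not Spectrum Scale code and uses no IBM API; it assumes a regular file reachable by path, and the helper names `read_range` and `parallel_read` are hypothetical. On a single local disk it shows only the pattern. Any throughput gain depends on a file system that stripes data across multiple storage nodes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read one byte range; each worker opens its own file handle."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path: str, workers: int = 4) -> bytes:
    """Read a single file as non-overlapping ranges fetched concurrently."""
    size = os.path.getsize(path)
    if size == 0:
        return b""
    chunk = -(-size // workers)  # ceiling division: bytes per worker
    ranges = [(off, min(chunk, size - off)) for off in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the parts reassemble correctly.
        parts = pool.map(lambda r: read_range(path, *r), ranges)
        return b"".join(parts)
```

Calling `parallel_read("dataset.bin", workers=8)` on a hypothetical file would return the same bytes as a sequential read, with no per-node replica of the data required.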
Competitor DDN also offers data access through InfiniBand and a parallel file system, but Dekate said DDN's primary focus is its Lustre-based file system that can be harder to manage, especially for organizations without experienced Lustre engineers.
Dekate said deep learning represents a small portion of the AI market. He said many enterprises will start with simple machine learning techniques that do not require systems like Nvidia DGX-1, GPUs capable of petaflop-scale processing power, NVMe-based flash or parallel I/O. He said users would look at their existing in-house infrastructure and system software to unify their data layers and address their unique AI use cases before they consider reference architectures.
Henry Baltazar, a research vice president at 451 Research, said he expects AI reference architectures to eventually gain traction among enterprises seeking to simplify their deployments. In the meantime, he said many may be scared off by the potential price tag of $1 million or more for AI reference architectures.
Baltazar noted vendors are coming out with slimmed-down AI reference architectures that have fewer Nvidia DGX servers or less storage. He said he also envisions hyper-converged infrastructure vendors adding GPUs to their products. He said some HCI players already have GPUs intended for virtual desktop infrastructure, but not for AI.
List pricing for IBM SpectrumAI with Nvidia DGX starts at about $650,000 with a single Nvidia DGX-1 server, IBM all-flash Elastic Storage Server GS1S, IBM Spectrum Scale software and IBM installation support for the ESS.