madgooch - stock.adobe.com
When it comes to storage for AI applications, the key issue isn't that these apps consume more storage than other applications -- they don't. The key issue is that they consume storage differently. With AI applications, data moves from storage to AI processing or I/O. It also moves between different storage systems and media at different points in its lifecycle.
I/O is primarily tied to throughput, regardless of the type of storage or storage media the data is stored on. AI's three modes -- machine learning, deep machine learning and neural networks -- each ingest and process data differently and, therefore, have distinctive I/O requirements. A look at each reveals how AI applications are changing storage consumption.
Speed is key to storage consumption
Machine learning, the most common AI, can potentially use millions of data points to make predictions or decisions based on human-derived algorithms. The accuracy of an AI app's outcomes is tied to the number of data points ingested within a specified timeframe. More data points lead to more accurate predictions and decisions.
Time is the limiting factor: If a prediction or decision is required in "n" milliseconds, the speed at which the machine learning algorithm can ingest and examine the data points will determine the quality of the outcome. GPUs and high-performance computing have eliminated most processing bottlenecks. That leaves the storage I/O having to keep up with the machine learning algorithm.
Deep learning applications draw on millions, or even billions, of data points, and they make multiple passes on the same data. This exacerbates the I/O bottleneck problem.
Machine learning and deep machine learning algorithms can run on modern server architectures, but neural networks are different. Neural networks, also referred to as artificial neural networks, mimic the neuron architecture of the human brain. By definition, they require extensive scale-out GPUs and CPUs, ranging from dozens to millions of processors. The key to neural network storage is to provide extremely high-performance parallel file system throughput. This is where IBM Spectrum Scale, Lustre (open source), Panasas ActiveStor and WekaIO are a good fit.
Legacy block storage systems generally aren't able to deliver the hundreds-of-gigabytes- to terabytes-per-second read throughput required. However, several newer extreme-performance storage systems can meet these needs, some better than others, including: Dell EMC PowerMax, DirectData Networks (DDN) Storage Fusion Architecture, Excelero NVMesh, Fungible, IBM FlashSystem, Oracle Exadata, Pavilion Hyperparallel Flash Array and StorCentric Vexata.
Legacy file storage systems also aren't up to the task. But newer generations of parallel file systems with global namespaces can deliver the throughput needed. They include DDN EXA5, IBM Spectrum Scale, Panasas ActiveStor and WekaIO.
Solving the data migration problem
Keep in mind that both machine learning and deep machine learning use current and historical data to examine, learn, predict and decide. The historical data these apps use is unlikely to reside on super-fast, expensive storage systems. It may start there, but as it ages, it's moved to slower, cheaper storage systems or the cloud.
This is a problem. Moving data from fast, expensive storage to slower, cheaper systems must be simple and automated, and it also must be transparent to the AI applications. In other words, as far as the AI application is concerned, the data is still there. Data movement is the primary reason many AI projects fail, so getting this right is paramount to success. There are two ways to accomplish this.
The first is to build data migration into the storage system, so data is moved within the system or to the cloud. The system knows where the data resides -- i.e., the storage it is consuming -- and feeds it to the AI to process on demand. This approach to data migration and storage consumption suffers from limited system capacity. Once again, newer scale-out technologies mitigate this problem. Some examples of this are the Nimbus Data, Oracle Exadata X8M, StorOne and Vast Data.
The Oracle Exadata X8M uses both high-performance Intel Optane DC Persistent Memory Module (PMEM) in Application Direct Mode with NVMe flash in its primary storage servers and low-cost, high-capacity spinning disks in its XT storage servers. The amount of PMEM is limited to no more than 27 TB per rack and supports as many as 18 racks. That's potentially a lot of high-performance storage, which isn't cheap and not all databases or all data within a database require PMEM performance. The Oracle Database will move older, less accessed data from PMEM to NVMe flash and then to lower-performance and lower-cost storage servers, such as its low-cost XT storage server. The Oracle Database knows where the data is and still has access to It on-demand.
Vast Data's all-flash array uses high-performance, high-endurance Optane storage class memory (SCM) drives as cache for lower cost, performance and endurance quad-level cell (QLC) flash SSDs. Clever software puts most of the writes on the SCM and limits writes to the QLC flash. The data invariably ends up on the QLC while still providing exceptional AI read performance. Storage startups Nimbus Data and StorOne claim to provide similar capabilities.
The second approach is to move the data in the background, leaving a hierarchical storage management (HSM) stub, symbolic link (symlink) redirect or global namespace middleware. The traditional HSM is problematic in that the data must be moved back to the original storage first before it's read, and that's too slow. Stubs can be brittle and break, orphaning data. Symlinks can also break, albeit less frequently. And global namespace typically means something -- probably middleware -- is in the data path that adds some latency. However, that latency is nominal and only for the first readable byte.
Symlinks and global namespaces can provide the data required from multiple resources, concurrently filling the AI machine learning or deep machine learning app with the data points it needs. There are several products able to deliver these functions, including Datadobi, iRods (open source), Komprise and StrongBox Data Solutions StrongLink.
Transparent data movement is what's needed when providing AI machine learning and deep learning apps with the data volume they need at an acceptable cost. Such transparency drives high-cost, high-performance storage consumption to lower-cost and lower-performing storage consumption. This approach is a necessity to effectively handle the vast volumes of data AI applications demand.