Enterprises Face New Storage Bottlenecks as AI Grows

For CIOs and IT leaders, AI performance depends on storage. See how strategies like NVMe and parallel file systems can prevent common AI bottlenecks.

AI and machine learning are being rapidly adopted across nearly all industries. Even so, AI cannot work without an underlying storage infrastructure that is well aligned with an AI workload's needs. It's no longer a matter of just how much storage an organization might have, but how quickly data can be accessed and processed.

Historically, enterprise storage has been primarily optimized for the data access patterns associated with transactional databases. However, this approach is unsuitable for AI workloads. AI workloads typically use massive amounts of data, much of which is unstructured. Unlike database access patterns, AI access patterns vary widely and are mostly unpredictable. Data access bottlenecks can therefore have a huge adverse effect on AI workloads, ultimately diminishing their business value.

Why Storage Matters in AI

One of the most pervasive myths related to AI is that GPU performance matters above all else. In reality, though, GPUs may sit idle while waiting for data. Storage performance ultimately matters just as much as GPU performance. Having the world's fastest GPU means nothing if the underlying storage can't keep up.
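To make the idle-GPU point concrete, here is a minimal back-of-the-envelope sketch. It assumes data loading and GPU compute overlap perfectly (prefetching), so the slower of the two stages sets the pace of each training step. The function name and all of the figures are hypothetical and serve only as illustration.

```python
def gpu_utilization(batch_bytes: float, storage_bytes_per_s: float, compute_s: float) -> float:
    """Estimate the fraction of each training step the GPU spends computing.

    Assumption: loading the next batch overlaps with computing the current
    one, so a step takes as long as the slower of the two stages.
    """
    if batch_bytes < 0 or storage_bytes_per_s <= 0 or compute_s <= 0:
        raise ValueError("sizes and rates must be positive")
    load_s = batch_bytes / storage_bytes_per_s  # time to read one batch
    step_s = max(load_s, compute_s)             # slower stage sets the pace
    return compute_s / step_s


# Hypothetical numbers: a 2 GB batch that the GPU can process in 0.5 s.
slow = gpu_utilization(2e9, 1e9, 0.5)  # storage delivers 1 GB/s
fast = gpu_utilization(2e9, 8e9, 0.5)  # storage delivers 8 GB/s
print(f"slow storage: {slow:.0%} busy, fast storage: {fast:.0%} busy")
```

Under these assumed figures, the same GPU is busy only a quarter of the time on the slower storage and fully busy on the faster storage, even though nothing about the GPU itself has changed.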

How Storage Vendors are Responding to AI Demands

AI-related storage challenges have driven storage vendors to introduce various changes. The most obvious of these changes is the move from SATA or SAS SSDs to NVMe-based flash storage. Vendors are making this move to cut microsecond-level delays that might not have mattered much in the past, but now translate into wasted GPU cycles and increased costs.

Another way that storage vendors are responding is by making greater use of NVMe over Fabrics (NVMe-oF). This is important because AI clusters are composed of multiple nodes, each of which requires access to high-performance storage. Using NVMe-oF reduces the latency associated with traditional storage networks, thereby helping networked storage perform at a level that is closer to that of local storage.

Storage vendors are also increasingly using parallel file systems and scale-out NAS. This allows storage vendors to scale more than just capacity -- they can also scale performance. The nodes within an AI cluster typically all access the same data set. The problem with this is that multiple nodes may bombard a single storage controller with IO requests. Parallel file systems solve this problem by spreading data across additional storage nodes, each with independent controllers. Now, instead of multiple compute nodes accessing a single storage array, those nodes distribute their storage requests across multiple storage nodes. This means that the load on each storage controller is reduced, thereby allowing IO requests to be handled more quickly.
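The effect of spreading requests across storage nodes can be sketched in a few lines. The example below assumes simple round-robin striping, in which consecutive fixed-size stripes of a file are placed on consecutive storage nodes. Real parallel file systems use their own layout policies, so the function names and the 1 MiB stripe size here are illustrative assumptions only.

```python
from collections import Counter
from typing import Iterable, List

MIB = 1024 * 1024


def node_for_offset(offset: int, stripe_bytes: int, num_nodes: int) -> int:
    """Return the storage node that holds the stripe containing this byte offset."""
    return (offset // stripe_bytes) % num_nodes


def load_per_node(offsets: Iterable[int], stripe_bytes: int, num_nodes: int) -> List[int]:
    """Count how many read requests land on each storage node."""
    counts = Counter(node_for_offset(o, stripe_bytes, num_nodes) for o in offsets)
    return [counts.get(node, 0) for node in range(num_nodes)]


# Eight compute-node reads, each hitting a different 1 MiB stripe of one file.
reads = [i * MIB for i in range(8)]
print(load_per_node(reads, MIB, num_nodes=1))  # one controller handles everything
print(load_per_node(reads, MIB, num_nodes=4))  # requests are shared four ways
```

With a single storage node, one controller absorbs all eight requests. With four nodes, each controller sees only two, which is the load reduction the paragraph above describes.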

Storage Prices and Bottlenecks for Businesses and Consumers

Businesses that host AI workloads must invest in high-performance storage or risk underutilized GPUs. When budgeting for storage, IT leaders must remember that, as important as access to high-performance storage is for AI workloads, capacity is also becoming increasingly important. Modern AI models require larger sets of training data and longer data retention periods.

Given the storage requirements associated with training and operating AI workloads, many businesses are opting to take advantage of purpose-built cloud storage. These high-speed storage options are ideal, but organizations will need to plan carefully in order to avoid incurring data egress fees when moving datasets across clouds.

Enterprise AI workloads are having an impact on consumer pricing. With hardware manufacturers scrambling to meet the demands of enterprise customers who want to run AI workloads, they are producing fewer consumer-grade components. This trend is driving up prices across the consumer electronics industry. The biggest price increases have been related to memory, but nearly all consumer electronic devices have been impacted to at least some degree since most electronic devices contain memory chips.

Future Outlook

Going forward, it seems likely that CIOs will increasingly view storage as a competitive differentiator and will adopt IOPS per dollar as a KPI.
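As a rough illustration of how IOPS per dollar might be used to compare options, consider the sketch below. The option names, IOPS figures and costs are invented for the example and do not describe any real product.

```python
def iops_per_dollar(iops: float, cost_dollars: float) -> float:
    """Return IOPS delivered per dollar spent."""
    if cost_dollars <= 0:
        raise ValueError("cost must be positive")
    return iops / cost_dollars


# Hypothetical storage options: (sustained IOPS, total cost in dollars).
options = {
    "option_a": (1_000_000, 250_000),
    "option_b": (400_000, 80_000),
}

# Rank the options from best to worst value on this one metric.
ranked = sorted(options, key=lambda name: iops_per_dollar(*options[name]), reverse=True)
for name in ranked:
    print(name, iops_per_dollar(*options[name]))
```

In this made-up comparison, the option with fewer total IOPS ranks first because it delivers more IOPS for each dollar. A single metric like this ignores capacity, latency and throughput, so it is best read alongside other measures.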

Given the need to supply GPUs with data in real time, it seems likely that compute will move closer to the data and that edge AI applications will drive edge storage growth.

Interestingly, although AI workloads are driving the demand for storage, AI will also help to solve some of the problems that it created. AI-powered storage optimization offers predictive caching, automated tiering and failure prediction. This type of intelligent storage management is helping AI workloads to use storage more efficiently, thereby somewhat curbing storage costs.
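The idea behind automated tiering can be sketched with a simple stand-in for the prediction step. The example below is not an AI model. It uses an exponentially decaying access score as a plain heuristic, and the tier names, decay factor and threshold are all assumptions chosen for illustration.

```python
from typing import Iterable


def update_score(score: float, accessed: bool, decay: float = 0.5) -> float:
    """Decay the old score and add 1 if the data was accessed in this interval."""
    return decay * score + (1.0 if accessed else 0.0)


def replay(history: Iterable[bool], decay: float = 0.5) -> float:
    """Fold a sequence of per-interval access flags into a single score."""
    score = 0.0
    for accessed in history:
        score = update_score(score, accessed, decay)
    return score


def choose_tier(score: float, hot_threshold: float = 1.0) -> str:
    """Place recently and frequently used data on flash, the rest on capacity storage."""
    return "flash" if score >= hot_threshold else "capacity"


# A data set read in two consecutive intervals, then left untouched for three.
print(choose_tier(replay([True, True])))
print(choose_tier(replay([True, True, False, False, False])))
```

A data set that was just read scores high and stays on flash, while one that has gone unread for several intervals decays below the threshold and becomes a candidate for cheaper capacity storage. A production system would replace the heuristic with a learned prediction of future access.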

The most important thing for organizations to remember is that simply adding flash storage is not going to solve all the storage bottleneck problems by itself. Organizations will need to rethink their storage pipelines, not just upgrade hardware.

Brien Posey is a former 22-time Microsoft MVP and Commercial Astronaut Candidate. During his 30+ year IT career, Posey has served as the CIO for a national chain of hospitals and healthcare facilities and as the lead network engineer for the U.S. Department of Defense at Fort Knox. He has also worked as a network administrator for some of the largest insurance companies in America.
