Storage innovation surges to keep pace with AI shift to inference

Several top storage vendors continue to explore ways to address AI-related challenges. Keep an eye on the evolving market for AI inferencing capabilities.

In the fast-paced, whack-a-mole world of AI infrastructure, innovation in one area has a habit of creating fresh challenges in another. As AI attention increasingly shifts to the "value" phase -- inference -- a new set of issues is emerging for those running AI workloads at large scale. Notably, these are driving a fascinating new wave of innovation in the storage and data ecosystem that will likely have implications as more mainstream enterprises get on board with advanced inferencing.

It's already clear that the underlying data infrastructure plays a critical role when running AI at scale. Indeed, research from Enterprise Strategy Group, now part of Omdia, found that data management issues are a top-two challenge for organizations deploying AI workloads. But, as the emphasis of AI rapidly shifts to advanced reasoning and agentic inferencing, those challenges are set to grow, potentially significantly.

Context windows: The data bottleneck for inference

In such advanced inferencing scenarios, much of the focus is on the specific challenges associated with growing context windows. Broadly, a context window refers to the number of tokens an AI model can consider at one time when generating a response or prediction.
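
To make "tokens" concrete, the short sketch below counts the tokens in a prompt using the open source tiktoken library -- an illustrative assumption, since every model ships its own tokenizer -- and it is this running total, across the whole conversation, that must fit inside the context window:

```python
# Illustrative only: counting prompt tokens with the open source tiktoken
# library (an assumption -- the article does not name a specific tokenizer).
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # a common GPT-style encoding

prompt = "Summarize the attached quarterly report and compare it with last year's."
tokens = encoder.encode(prompt)

print(f"{len(tokens)} tokens")  # every token in the conversation so far must
                                # fit inside the model's context window
```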

Context windows are growing in terms of both scale and complexity as users become more proficient with AI reasoning tools and expectations grow. Not only are users asking more sophisticated questions, but they are doing so over extended time periods.

Think about how you may have interacted with a generative AI tool and then returned with a follow-up question several minutes or even hours later. Did you expect the model to remember where you left off? Additionally, prompts can now include data types beyond text, with models increasingly supporting content such as PDFs, code and even video.

This is leading to an explosion in the number of tokens generated for each prompt. As an example, Llama 1 supported a context window of 2,048 tokens when Meta released it in 2023. By contrast, earlier this year, Meta released Llama 4, which supports up to 10 million tokens -- a nearly 5,000-fold increase.

The "10 million tokens" problem may be more of a theoretical limit than an actual problem that is felt today, but there is broad agreement that, regardless of the size of the environment, a wave of tokens is heading toward us. This tsunami will likely overwhelm today's infrastructure, so alternatives are required.

Enter the KV cache

The key-value (KV) cache is where the context window is built and stored and, as such, plays a critical -- and highly compute-intensive -- role in the reasoning process for large language models (LLMs). It holds the model's contextual understanding of the input request; think of it as the short-term memory of the LLM.
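
For readers who want to see the mechanics, here is a minimal toy sketch -- plain NumPy, made-up dimensions and random weights, none of it any vendor's implementation -- of why the cache exists: the keys and values for each token are computed once and then reused at every later decoding step, rather than recalculated from scratch:

```python
# A minimal, illustrative sketch of why a KV cache exists (toy dimensions,
# random weights -- not any production implementation).
import numpy as np

d = 8                                  # toy embedding size
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

k_cache, v_cache = [], []              # the "short-term memory" of the model

def decode_step(x):
    """Generate the attention output for one new token embedding x."""
    q = x @ Wq
    # Compute K and V for the new token ONCE and remember them...
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    # ...so attention only reads the cached K/V of earlier tokens,
    # instead of recomputing them for the whole prefix at every step.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(5):                          # five decode steps: the cache grows
    out = decode_step(np.random.randn(d))   # by one entry per step (per layer)

print(len(k_cache), "cached key/value entries")
```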

The challenges associated with storing hundreds of thousands or even millions of tokens mean that, in more demanding or sophisticated environments, the KV cache quickly fills up. In such a scenario, older data must be evicted to accommodate new requests.

To regain that context, older prompts must be recalculated, again and again. This is where the "GPU tax" comes into play: Rather than spending expensive GPU cycles creating new insights, the GPUs must instead spend time recalculating data they have already produced. That's an inefficient use of an expensive resource.
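
To put a rough shape on that tax, the toy sketch below tracks how often a long-running session must be re-prefilled after eviction. The session names, cache capacity and simple least-recently-used policy are all assumptions for illustration, not how any particular inference server works:

```python
# Illustrative only: how evicting KV cache entries turns into a "GPU tax".
# Session names, sizes and the LRU policy are assumptions for this sketch.
from collections import OrderedDict

CACHE_CAPACITY_TOKENS = 100_000
kv_cache = OrderedDict()     # session id -> cached prefix length (in tokens)
seen = set()                 # sessions whose context has been built at least once
recomputed_tokens = 0        # work the GPU repeats instead of doing new work

def serve_prompt(session, prefix_tokens):
    global recomputed_tokens
    if session in kv_cache:
        kv_cache.move_to_end(session)          # cache hit: reuse the stored context
        return
    if session in seen:                        # the context was built once already...
        recomputed_tokens += prefix_tokens     # ...but was evicted, so re-prefill it
    seen.add(session)
    kv_cache[session] = prefix_tokens
    while sum(kv_cache.values()) > CACHE_CAPACITY_TOKENS:
        kv_cache.popitem(last=False)           # evict the least recently used session

# A long-running chat keeps getting evicted by other traffic and re-prefilled.
for hour in range(8):
    serve_prompt("analyst-chat", 40_000)       # the user returns with a follow-up
    serve_prompt(f"other-traffic-{hour}", 90_000)

print(f"Tokens recomputed instead of reused: {recomputed_tokens:,}")
```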

The KV cache was originally designed to reside in high-bandwidth memory (HBM), which is pinned to the local GPU server. Tools such as vLLM have emerged as popular and efficient ways to manage data here. However, HBM is the most expensive form of memory, and as token counts grow, so does the need for a larger KV cache. Accordingly, it becomes necessary to tap additional memory resources. The next obvious step down the memory hierarchy is CPU memory -- DRAM -- where frameworks such as LMCache enable KV cache offloading to local CPU memory.

But, as context windows and their associated tokens grow exponentially, this is not necessarily enough either. To cost-effectively meet the performance needs of advanced AI inferencing in the future, GPUs will likely need a KV cache that taps a larger pool of high-performance storage, such as NVMe. This would allow context windows to expand dramatically, enabling GPUs to spend more of their time serving new prompts while simultaneously boosting overall efficiency. Managing KV cache data this way also potentially brings other benefits, such as handling the data on a global, more intelligent basis that can serve a wider variety of purposes.
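
Conceptually, such a hierarchy might look like the sketch below, which places KV cache blocks in the fastest tier that has room -- GPU HBM, then CPU DRAM, then a shared NVMe pool. The tier names, capacities and placement logic are illustrative assumptions, not any specific vendor's design:

```python
# A conceptual sketch of tiered KV cache placement (HBM -> DRAM -> NVMe).
# Tier names, sizes and promotion logic are assumptions for illustration;
# real offload frameworks manage this in paged blocks with far more nuance.
TIERS = [
    {"name": "GPU HBM",   "capacity_gb": 80,      "store": {}},  # fastest, smallest
    {"name": "CPU DRAM",  "capacity_gb": 1_000,   "store": {}},
    {"name": "NVMe pool", "capacity_gb": 100_000, "store": {}},  # slowest, near-"infinite"
]

def put(block_id, size_gb):
    """Place a KV cache block in the fastest tier that has room."""
    for tier in TIERS:
        used = sum(tier["store"].values())
        if used + size_gb <= tier["capacity_gb"]:
            tier["store"][block_id] = size_gb
            return tier["name"]
    raise RuntimeError("all tiers full")

def get(block_id):
    """Find a block; a hit in DRAM or NVMe avoids recomputing it on the GPU."""
    for tier in TIERS:
        if block_id in tier["store"]:
            return tier["name"]
    return None  # truly cold: the prefix must be recomputed

print(put("session-1/prefix", 60))    # lands in HBM
print(put("session-2/prefix", 60))    # HBM full -> offloaded to DRAM
print(get("session-1/prefix"))        # served from HBM, no recompute needed
```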

Vendors prepare for the token tsunami

Given the stakes, it's not surprising that multiple industry players are intently focused on addressing this challenge. Overall, the storage for AI space has seen strong development in recent years, as suppliers have worked to overcome a range of AI-related performance challenges. We've seen the emergence of AI-focused storage specialists, each with its own technical approach based on advanced data management software architectures and advancements in fast remote direct memory access (RDMA) technologies and protocols.

In addition, we're currently seeing a surge of activity around advanced inferencing specifically, including KV cache management.

For example, earlier this year, Weka announced a capability it calls Augmented Memory Grid (AMG), which is due for general availability in the fall. This feature uses Weka's parallel file system software to create an external pool of shared NVMe storage, which is attached directly to GPU servers and persists as a high-performance token warehouse. It drives memory-like performance but at NVMe costs, Weka said.

AMG can also run in conjunction with Weka's recently announced NeuralMesh Axon software, which implements its software stack entirely in the GPU server to take advantage of underused NVMe flash storage within the server itself. Indeed, Weka said customers will experience far greater value when running the two aspects together.

Though the technical implementation differs, Vast Data's Undivided Attention (VUA) feature, also announced earlier this year, offers similar capabilities. It uses Vast's Disaggregated Shared Everything architecture, in the process creating what it terms an "infinite memory space for context data." Like Weka's AMG, VUA is optimized for vLLM frameworks.

DDN is joining the party with its new Infinia object storage, which has a KV cache embedded directly within it. DDN recently detailed a test scenario it said delivers the fastest time to first token in the industry for advanced reasoning workloads.

Notably, Nvidia is also focused on addressing these challenges. Nvidia's play here, announced earlier this year, is Dynamo, a low-latency distributed framework for scaling reasoning AI models.

One of several innovations within Dynamo, which supports existing frameworks such as vLLM, is KV Cache Manager. The feature enables the offloading of older or less frequently accessed KV cache blocks to more cost-effective memory and storage, such as CPU memory, local storage or external networked storage. Nvidia said this approach enables petabytes of KV cache data to be stored at a fraction of the cost of keeping it in GPU memory.

Exactly how this will play with the various offerings from storage vendors is still to be determined. To some degree, Nvidia is offering an alternative approach here. However, Vast, Weka and others said they are working with Nvidia to offer integration between their capabilities and Dynamo.

The new Nvidia Inference Transfer Library (NIXL), a high-throughput, low-latency point-to-point communication library, provides a consistent data movement API to move data rapidly and asynchronously across different tiers of memory and storage. Optimized specifically for inference data movement, NIXL supports different types of memory, local SSDs and, crucially, networked storage from Nvidia storage partners.

Storage vendors are already working to achieve integration with NIXL. Weka has open sourced a dedicated plugin for NIXL. Meanwhile, Vast recently shared details of a test scenario that integrated the Nvidia NIXL GPUDirect Storage (GDS) plugin with the Vast AI OS. The test, Vast said, drove a single H100 GPU at 35 GBps using GDS without saturating the Vast AI OS' available throughput. In other words, storage would not be a bottleneck when offloading LLM KV caches to the Vast platform.

Looking ahead

It's tempting to view such context window challenges as something that should only concern a small number of large-scale AI builders, hyperscalers, neoclouds/AI service providers, large research institutions and so on.

Such an assumption may be dangerous. Although it's true that handling millions of tokens over thousands of GPUs is still the preserve of very few, more mainstream enterprises are starting to get their hands dirty with advanced reasoning and agentic AI. Such deployments may only amount to tens of GPUs, but the prompts will likely be just as sophisticated. So, the need to effectively marshal a large volume of tokens will be just as critical, perhaps even more so, given the resource constraints.

Hence, any infrastructure leaders contemplating their organization's inference journey should keep a close eye on developments here. The range of options for customers continues to grow and will likely expand further over the coming months. Mainstream storage providers -- including Dell with Project Lightning, NetApp and Pure Storage with FlashBlade//EXA -- are all sharpening their AI wares, and AI storage specialists, such as Hammerspace, are increasingly targeting the broader enterprise opportunity.

Ultimately, the inference-specific issues detailed here are but one aspect of a much broader set of memory-, storage- and data-related challenges that organizations will face as they scale their AI workloads. The real trick for infrastructure leaders will be to build an AI environment that can elegantly handle these issues alongside the myriad additional challenges they face. In this respect, the ongoing innovations from across the supplier ecosystem augur very well for the future.

Simon Robinson is principal analyst covering infrastructure at Enterprise Strategy Group, now part of Omdia.

Enterprise Strategy Group is part of Omdia. Its analysts have business relationships with technology vendors.
