Best Storage Strategies for AI and ML
00:01 Dave Eggleston: Welcome to the storage for AI track. I'm Dave Eggleston, I'm an independent consultant, and I'm going to be talking about some of the recent trends in storage for AI. The track features a wide variety of speakers from across the industry, talking about their companies, their products and their approaches to storage for AI. Stay tuned for the rest of the track. Hope you enjoy it.
When we think about AI, I like to think of AI as the beast, and with data, we have to feed that beast. Whether it's a small amount of data, as in inferencing, or a large amount of data, when we're doing training, we have to get that data from and to the processor in order to do all of that calculation that makes AI work, and we have to do it at ever-increasing rates. This is a big challenge. So, I like to think of AI as the beast we have to feed with data.
01:04 DE: And what about how things are changing around AI? An interesting chart here from Marvell shows that AI model complexity has grown 50 times in only the last 18 months. Now, can hardware keep up on its own? Well, if we look at transistors, we can see that a comparable 50x improvement in performance has taken 10 years. So, the answer is going to be no, we can't just throw more hardware or more hardware advances at it; we have to think about the architecture and the interaction of software and hardware.
01:46 DE: In addition to that, we have to consider more than just the CPU. We have many processing engines now doing the work for AI, and they need data. That obviously includes the GPUs, which have become a real workhorse, especially for training. A new entrant is the DPU, or data processing unit, which is sitting out there on the network, kind of an extension of SmartNICs. FPGAs, which have great programmability and flexibility, are used mainly for inferencing tasks, and then ASICs, which are more job-specific. All of these are doing part of the job, and in some cases even part of the workload, for AI. That's a diverse set of processors we have to feed data to. Now, fortunately, we do have a fabric, NVMe over Fabrics or NVMe-oF, and that fabric can tie together these different processing units, so at least that's a good starting point for how we can coordinate the various processing units.
02:58 DE: One of the speakers coming up later in the track is from Lightbits Labs. I'm sure they'll talk specifically about their approach, which is using NVMe over TCP, so stay tuned for that talk. So, what we're looking at is moving from having storage feeding a CPU, and then we've evolved to adding the GPU, with the GPU also being fed through the CPU. But what we want to move to is where we have all of these devices accessing a pool of data that can be shared. If we can have that free movement of data between all the different processors, that's the goal we want to get to.
03:43 DE: Now, the file system is definitely going to be involved here, handling both the processor variety and on-prem versus cloud; you might have a hybrid situation going on in doing your AI processing. And I think Jensen Huang at Nvidia put a nice point on it during his recent keynote, where he talks about AI being vastly more parallel and so much more compute-intensive, so we have to rethink the architecture. He puts it as, "AI requires a whole reinvention of the computing stack." So, that's the way we need to approach storage as well, as part of a reinvention of how we do computing.
So, I'm going to talk about these four main trends in storage for AI, and I'm going to talk about them in terms of different approaches. The first is a software approach by these companies listed up here, and I'll drill into how they have changed the software to optimize storage to feed the beast.
04:56 DE: Another approach is computational storage, which is moving more of that intelligence right into the SSD itself, and I've listed several companies here, some of whom are speaking in other tracks here at FMS, but that's an interesting approach for how to accelerate the processing for AI. As mentioned earlier, we have this new entrant or new concept of a DPU, a data processing unit, and I'm going to touch on that and how that's an extension of that very hot area of SmartNICs, a little different idea of where you put that intelligence in the compute network.
And then finally, what if we didn't use storage at all? What if we held everything in memory? How does that work? And I'm going to look at a couple different examples from specific companies listed here, so let's dive in first to that software approach.
05:55 DE: Now, one of the key issues for the software approach is that the CPU has largely become a traffic cop. The CPU can throttle that data, or the software stack associated with the CPU is really what's throttling the data. So, how do we address that? Well, we bypass or replace elements of that stack, which could be the file system or the drivers, like the block I/O and storage drivers. So, I'm going to look at three different approaches. Now, they also work together on this new idea of GPUDirect Storage, so let me deal with that first.
So, what Nvidia has proposed with GPUDirect Storage is certainly a way to feed that GPU beast, and those GPUs do take a lot of data and need it getting in and out of them very quickly, so they utilize DMA to skip the CPU, and I'm going to look on the next slide and show how they do that. Now, note that that can handle both remote and local storage.
07:07 DE: So, here's the software stack, and what we can see in the chart is what they've labeled a proprietary distributed file system, and we'll look at two examples in a minute, which helps you bypass some of the elements of a normal stack, like the file system driver, block I/O driver and storage driver. And by doing that, you can get much better performance. Does it work? Well, let's look at this example from Nvidia, which they showed recently at Nvidia GTC. This is an example where they're doing simulations for the NASA Mars Lander, and on the left-hand side, using the standard software stack, what they're showing is that their throughput reaches only 32 gigabytes per second without GPUDirect Storage enabled. But when they do enable GPUDirect Storage, they're able to get to 164.6 gigabytes per second, or a 5x acceleration of data in and out of the GPU.
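To make the bounce-buffer problem concrete, here's a toy Python sketch, not the real GPUDirect/cuFile API (which is a C library running on CUDA hardware): the legacy path copies data from storage into a host buffer and then again into the "device," while the direct path hands the consumer a zero-copy view of the same bytes.

```python
# Toy analogy only: GPUDirect Storage avoids the extra copy through a
# host (CPU) bounce buffer by DMA-ing straight into GPU memory.
import time

def bounce_buffer_path(src: bytes) -> bytes:
    """Simulates storage -> CPU bounce buffer -> device: two copies."""
    host_buffer = bytearray(src)        # copy into a host-side bounce buffer
    device_buffer = bytes(host_buffer)  # copy again into the 'device' buffer
    return device_buffer

def direct_path(src: bytes) -> memoryview:
    """Simulates a DMA-style direct transfer: no intermediate copy."""
    return memoryview(src)              # zero-copy view of the same bytes

data = b"x" * (16 * 1024 * 1024)        # 16 MiB payload

t0 = time.perf_counter()
out1 = bounce_buffer_path(data)
t_bounce = time.perf_counter() - t0

t0 = time.perf_counter()
out2 = direct_path(data)
t_direct = time.perf_counter() - t0

assert bytes(out2) == out1              # same payload arrives either way
print(f"bounce: {t_bounce*1e3:.2f} ms, direct: {t_direct*1e3:.2f} ms")
```

The payload is identical on both paths; only the number of copies differs, which is where the bandwidth gap in the Mars Lander example comes from.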
08:19 DE: Let's look at another example which Nvidia also showed at the recent Nvidia GTC, this one on the brand-new A100s. In this case, they were working with Vast Data, and Vast Data's box had both QLC and Optane SSDs, with the Optane SSDs, which are obviously high-performance SSDs, backed up by the QLC SSDs. The key here, though, is enabling GPUDirect Storage, GDS; they bypass that CPU and achieve three times the bandwidth. So, two examples there from Nvidia that show how it works.
Now, the speaker coming up after me is from WekaIO. And WekaIO has a file system which is optimized for NVMe, can handle that wide variety of processors we were talking about, and can also manage data whether it's on-prem or in the cloud. So, it has the versatility that's needed to deliver data and storage for AI. And here's an example that they also did using the Nvidia A100 GPUs, where they showed that with their file system they were able to achieve 152 gigabytes per second, and perhaps the speaker immediately after me will have an updated version of this.
09:53 DE: I mentioned Vast Data earlier. Vast Data has also worked with Nvidia, not only with their hardware, which involves the SSDs that we talked about, but they also have an NFS-based file system. What they show here is that with legacy NFS, they can only achieve 2 gigabytes per second; implementing their Vast NFS goes to 46 gigabytes per second; and when they enable GPUDirect, they get up to 162 gigabytes per second with both running.
Now, let's switch gears and talk about computational storage. This is a different approach that I alluded to, where you're moving processing, moving intelligence, toward storage. What you're doing there is offloading specific processes from the CPU, and those might be processes like encryption, compression, decompression or search, which are some of the common ones. In this case, you're going to consolidate that with the SSD controller, which can be an ASIC or an FPGA; different companies have different approaches, as I've listed down below. So, I'm going to dive in on only one of them, which is from NGD Systems, which has an ASIC-based SSD, and show some of the examples from them. The other companies also have some really good examples and participate in the SNIA activity promoting computational storage.
11:32 DE: So, what's the motivation behind computational storage? Well, as I mentioned already, it's moving that processor into the SSD, into what's called a computational storage device, and you can accelerate with lower power. That's one of the additional benefits here: in addition to faster processing, you achieve lower power consumption. And does it work? This is an example from NGD Systems using a Facebook image recognition system, offloading the CPU. As the chart on the right shows, you significantly reduce the loading time, the time for data to move in and out of the device, by having the CSD do some of the function. The search time also reduces by putting multiple CSDs in parallel, so parallel processing, that parallelism that Jensen Huang was talking about as being necessary for handling data for AI.
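The offload-and-parallelize idea can be sketched in a few lines of Python. This is a hypothetical illustration, not NGD Systems' actual API: each dictionary entry stands in for a CSD holding a shard of the dataset, the predicate stands in for the search the drive runs locally, and only the matching records ever cross back to the host.

```python
# Hypothetical sketch of the computational-storage search offload:
# each "drive" evaluates the predicate on its own shard, in parallel,
# so the host never pulls the full dataset across the bus.
from concurrent.futures import ThreadPoolExecutor

# Pretend each CSD holds a shard of the dataset (names are made up).
drives = {
    "csd0": ["cat.jpg", "dog.jpg", "car.jpg"],
    "csd1": ["tree.jpg", "cat2.jpg"],
    "csd2": ["boat.jpg", "cat3.jpg", "sky.jpg"],
}

def search_on_device(shard, predicate):
    """Runs on the CSD's embedded processor: only hits cross the bus."""
    return [name for name in shard if predicate(name)]

def parallel_search(predicate):
    # All drives search concurrently, like the multi-CSD parallelism
    # in the NGD Systems example above.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_on_device, shard, predicate)
                   for shard in drives.values()]
        hits = []
        for f in futures:
            hits.extend(f.result())
    return sorted(hits)

print(parallel_search(lambda n: n.startswith("cat")))
# -> ['cat.jpg', 'cat2.jpg', 'cat3.jpg']
```

Adding more drives adds more shards searched at once, which is why search time in the example drops as CSDs are added in parallel.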
12:39 DE: A second example, also for inference, is doing object tracking and moving that object tracking into the CSD, offloading the CPU. In the graph on the lower right, we see that we get similar accuracy, but at much less power. So, again, a side benefit is that the execution occurring in the CSD is very power-efficient.
Now, what about training? Here's an example, also from NGD Systems, that shows an x86 training environment where the training was moved into the in-storage processing (ISP) engine in the CSD itself. And what we see at the bottom is that by adding CSDs, the more you add in parallel, the more it scales that training speed, and does so pretty linearly, so good scaling in that example. So, that's computational storage, which is my second approach.
Now let's jump into something which is in some ways similar. It's called the DPU, and it's doing that storage processing, but doing it right at the network card itself. What that gives us is network-attached but disaggregated storage; same message, we get to bypass the CPU. And I'm going to look at two companies -- one, Nvidia and one, Fungible -- look a little bit at their roadmaps and then at how the DPU is architected.
14:19 DE: So, this graphic from Nvidia shows the more traditional approach all the way on the left, which says, "Here's the network card, only doing network control," so you can see the storage is attached through the CPU. The next step is that the network card, or SmartNIC, can be doing network and storage control by moving that storage so it's accessed through that network card. And then, finally, by having communication between the host CPU and that DPU device, you get full infrastructure control, and both Nvidia and Fungible have a lot of points to make about why this is a superior way of doing software-defined control of your data center. So, why implement a DPU? Well, we've talked about some of the reasons here: you've got that data storage for the expanding workload, feeding the beast.
15:27 DE: Also, I just mentioned briefly having that software-defined infrastructure, and this being a good control point for achieving that software control of your data center. It's integrated right with the network card. But then one of the key points is security: the DPU handles the data a little differently, in a separate way from the application itself, and can make multi-tenancy a reality. So, those are some of the benefits of a DPU.
Let's look at the chips themselves that implement them. This is a recent announcement from Nvidia of what they call the BlueField-2, and this is their DPU, which has multiple cores and, as shown here, enables multiple lanes of PCIe Gen 4 for managing a wide variety of storage devices, and can achieve 5 million NVMe IOPS. They make the aggressive claim that this would replace 125 x86 CPU cores. We shouldn't be surprised that that comes from Nvidia, though.
16:44 DE: Something I found particularly interesting was their roadmap for DPUs going forward. Here, we can see the BlueField-2 -- that's the lower left part of that graph -- and above it is the BlueField-2X, which shows pairing a DPU with a GPU on the same card. As they go forward, what's fascinating to me is integrating that GPU along with that DPU on the same piece of silicon. So, now we not only have those Arm cores they talked about for intelligence in there, but we've also added a powerful GPU integrated into a single SoC, and they talk about being able to do about 400 TOPS. So, pretty impressive performance, and it's going to be fascinating to see what Nvidia does with the processing that can be done in that DPU, with that GPU as well.
17:45 DE: As mentioned, Nvidia is not the only one. Fungible is going to be speaking quite a lot here at FMS, and they've already talked about their F1 chip. The F1 chip is the DPU itself, with MIPS cores. It allows the scale-out of storage, and the graphic in the middle gives us some idea of the DPU managing a large number of SSDs. What's interesting, though, is the hardware accelerators they've built in here for DMA, crypto, hashing, compression, decompression and erasure coding. That's a lot of capability in hardware, so you get that incredible performance. So, that's the F1 chip itself.
18:31 DE: Now, what Fungible has announced very recently is taking that same F1 chip, putting one or more F1 chips into a box built with SSDs around them, and Fungible is now calling that a storage cluster. Very interesting: they've taken their own chip and built that product, and what they're demonstrating is absolutely incredible performance for 4K reads; they're talking about getting 8.9 to 15 million IOPS out of that machine on reads. And, if you look in the table, they've also picked what they call a nearest competitor, and for both read and write IOPS, you can see about three times the performance of that nearest competitor. So, again, very strong performance out of this Fungible storage cluster. I'd encourage you to also pay attention to the Fungible keynote, where I think they're going to talk about this in some detail, as well as in some of the sessions.
19:39 DE: Now, let's go to the final trend, which is about memory. And this is really about: what if, instead of accessing that data from a storage device, block storage or object storage, we moved it into load/store memory? That would eliminate I/O completely. So, that requires a lot of memory, large memory. But where do you put it? I'm going to look at a couple of different examples here of where you put that memory for your AI processing, why that makes a difference, and some interesting differences between these approaches.
20:21 DE: So, let's look at an example with an ASIC. This is from a startup company called Groq, which is doing a single chip they call the TSP, Tensor Streaming Processor. It's a very, very large chip, 700 square millimeters, done at a pretty advanced process node, and the revolutionary thing here is the amount of memory they have on that chip: 220 megabytes of SRAM. So, that's our highest-performing memory, and it's there to feed the matrix units. So, again, feeding that beast for doing AI processing, but doing it out of SRAM, lots and lots of on-chip SRAM. They're targeting this for inference applications, so that's one approach, and it involves a very large amount of SRAM, which results in a very large chip. It's going to be interesting to see how successful Groq will be in penetrating the space.
21:24 DE: A different approach was talked about at Hot Chips this year by IBM on their Power10, and what I want to highlight coming out of the Power10 processor is a very flexible fabric, a very flexible network. In some ways, this could be thought of as memory networking. What we can see is that IBM has buffered the processor itself from the different types of memory, and you can see that on the right here, where it shows OMI, that connection, which can run at a terabyte per second, so a very high-speed connection. And they show it going to different types of memory, which could be DRAM, shown as the DDIMM over there, which is specific to IBM, but they also have it going to GDDR or storage-class memory in the future. The same kind of link, shown as PowerAxon on the left, can go to what they call integrated memory clustering. In other words, we could borrow memory from different areas in order to feed that processor. So, a pretty interesting idea, and the addressability of these links is up to 2 petabytes, an astounding amount of addressability, which shows this goes far beyond what you might have as local memory. So, a very different idea about how to feed that processor, which may be processing AI tasks, from a wide variety of memory sources.
23:06 DE: Now, Intel is going to be talking a lot at FMS, as they should be, about Optane, and at a recent Intel event they highlighted an issue with HPC applications, which could also be relevant to AI. What they talked about is that when you're trying to get data in and out for these HPC or AI tasks, one of the biggest problems in achieving top performance is small and misaligned I/O: small amounts of data that don't fill a block, or that could be misaligned, and yet you've got to throttle the compute until the data is in blocks and then commit it to your SSDs or HDDs.
So, that's the problem they laid out, and they had a pretty interesting solution. If you look at the bottom part of the chart, you can see what they've done is what they call a DAOS storage engine, a Distributed Asynchronous Object Storage engine, and that's making decisions about where to place your data. So, large I/O would flow right through to the SSDs, whether that's 3D NAND or Optane SSDs, but small I/O, less than block size or misaligned, would go to Intel Optane persistent memory. So, now persistent memory is doing part of that storage job. When they put this together, the speaker said they were surprised. They knew they had a good result on their hands, but they leapt to the top of the HPC IO-500 ranking by handling things this way, so quite an innovative result from Intel.
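As a rough sketch of that placement decision (the 4 KiB block size and the tier names here are assumptions for illustration, not DAOS's actual parameters), the routing logic might look like:

```python
# Minimal sketch of DAOS-style tiering: large, block-aligned I/O streams
# to SSD; small or misaligned I/O is absorbed by persistent memory.
BLOCK_SIZE = 4096  # assume a 4 KiB block device

def place_io(offset: int, length: int) -> str:
    """Return which tier a write of `length` bytes at `offset` should hit."""
    aligned = (offset % BLOCK_SIZE == 0) and (length % BLOCK_SIZE == 0)
    large = length >= BLOCK_SIZE
    if aligned and large:
        return "ssd"   # full blocks go straight to NAND/Optane SSDs
    return "pmem"      # small or misaligned I/O lands in Optane PMem

print(place_io(0, 1 << 20))   # 1 MiB aligned write -> ssd
print(place_io(4096, 100))    # 100-byte append     -> pmem
print(place_io(10, 8192))     # misaligned start    -> pmem
```

The point is that the compute never stalls waiting for a partial block to fill: byte-granular persistent memory takes the awkward writes, and only full blocks hit the block devices.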
25:04 DE: Now, let's look at another example from Penguin Computing, which is an HPC specialist. This one definitely has to do with AI acceleration using Optane DIMMs. In this case, Penguin took the Facebook Deep Learning Recommendation Model and replaced the SSDs with persistent memory. The orange bars are the inferencing time with SSDs, and then by accessing persistent memory instead, they drop their inference time dramatically, almost 10 times faster as a result. They used software from MemVerge, and you'll hear MemVerge also speaking here at FMS. MemVerge's software is very nice because it virtualizes that access to the persistent memory DIMMs, so you don't need to rewrite your application. So, this joint project between Penguin, MemVerge and Intel shows how moving those tasks from SSDs into persistent memory, easily and with no app rewrite, can give you a big advantage on AI tasks.
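The underlying load/store idea can be sketched with ordinary memory mapping in Python. Here a temporary file stands in for the persistent-memory region; MemVerge's actual product virtualizes Optane DIMMs, which this toy does not attempt. Once the region is mapped, the application touches it like memory, with no read()/write() call per access.

```python
# Sketch of load/store access replacing I/O: map a region once, then
# every access is a plain memory operation, not a system call.
import mmap, os, tempfile

# Create a backing file (stand-in for a persistent-memory region).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)

with mmap.mmap(fd, 4096) as mem:
    mem[0:5] = b"hello"        # a plain store, not a write() call
    first = bytes(mem[0:5])    # a plain load, not a read() call

os.close(fd)
os.remove(path)
print(first)                   # -> b'hello'
```

Because the interface is just memory, an application written against mmap-style access needs no rewrite when the backing region moves from a file to persistent-memory DIMMs, which is the "no app rewrite" benefit described above.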
26:27 DE: So, let's sum up some of the things we've talked about here. We talked about the four main trends: the software approach, with GPUDirect Storage and some of the file systems; computational storage, moving intelligence into the storage controller itself to do part of the task; the very new DPU approach, which is moving intelligence into the SmartNIC and creating what's called a data processing unit; and then, finally, what if we didn't do I/O at all and put everything in memory, with a couple of different approaches there.
So, hopefully that's been helpful for you to understand what some of these key trends are. But stay tuned, as I mentioned, for the experts in this track who are coming up next, including some panel sessions where we'll drill into some of their specific activities and the look ahead to how we can do storage for AI. Thank you very much for your attention.