Analyzing the Effects of Storage on AI Workloads
00:00 Wes Vaske: Hey, everyone! I'm Wes Vaske, a principal storage solutions engineer here at Micron, and my team does application performance analysis, so we analyze how storage can impact the performance of applications. And for the past few years, I've been focused on AI applications.
So, I was here last year talking about some of the work I've done analyzing how storage can impact AI workloads, looking at the performance of MLPerf outside of standard AI training processes. And this year, I'm going to talk about some more work that I've been doing. Like I said, I was here last year, and I'm going to give a brief recap of some of those findings because they're still relevant here. I'm going to talk about how I used Nvidia Dali as a proxy for benchmarking maximum throughput today -- image throughput from storage to your GPU -- and that's going to lead us into Nvidia GPUDirect Storage, or GDS.
01:05 WV: I'm going to go over what it is, and then I've got some performance data with local NVMe, as well as NVMe over Fabrics, which I think is going to be really interesting.
All right, so like I said, I was here last year, and I've condensed this down to just two slides to go over the most important pieces. The first one was trying to answer the fundamental question, "Does storage matter with AI workloads?" We get that question pretty frequently, especially considering that AI training is very compute-intensive. It's so compute-intensive that that's most of what we talk about, and the question is, "Does my storage even matter?" So, to answer this question, I ran a slightly contrived use case: I took the container that was running the training workload and limited, separately and together, the amount of memory available to the container as well as the performance of the disk.
02:07 WV: So, for the memory configurations, I gave it full access to the memory on the server, or I limited it to just 128 gigs. And that number is because I'm using the ImageNet data set, which is around 150 gigs. So, a 128 gigabyte limit meant that the whole data set could not fit in memory, and it was going to need to read it from disk over and over. On the disk side, the fast disk was just unlimited access to the eight NVMe drives that were in the server, and the slow disk was given a throughput limit of 500 megabytes a second. That's going to seem relatively low, especially when we're going to be talking about tens of gigabytes a second in a few slides, but it's similar to what you would get with cloud-attached storage if you're doing your training in the cloud somewhere. So, this lets us compare poorly architected and well architected systems, that sort of thing.
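The memory-versus-disk tradeoff above can be sketched as a quick back-of-envelope model. This is my own simplification, not a tool from the talk: it assumes an epoch is bounded by whichever is slower, compute or the re-reads forced by a too-small file system cache, and the 120-second compute-only epoch time is a hypothetical number.

```python
def epoch_time_s(dataset_gb: float, mem_limit_gb: float,
                 disk_gbps: float, compute_time_s: float) -> float:
    """Estimate wall time for one training epoch.

    If the data set fits under the memory limit, epochs after the first
    read from the file system cache for free; otherwise every epoch
    re-reads the whole data set from disk. Reads overlap with compute,
    so the slower side dominates.
    """
    fits_in_memory = dataset_gb <= mem_limit_gb
    io_time_s = 0.0 if fits_in_memory else dataset_gb / disk_gbps
    return max(io_time_s, compute_time_s)

# ImageNet (~150 GB) under a 128 GB container limit, assuming a
# hypothetical 120 s compute-only epoch:
slow = epoch_time_s(150, 128, 0.5, 120)   # 500 MB/s cap: storage-bound
fast = epoch_time_s(150, 128, 50.0, 120)  # eight NVMe drives: compute-bound
```

With the 500 MB/s cap, the epoch stretches to 300 seconds of pure I/O, while the unthrottled NVMe configuration stays at the 120-second compute bound, which matches the shape of the result described above.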
03:00 WV: And what we find really shouldn't be surprising -- if you have a poorly architected system, you have slow disks and not enough memory, you can drastically hurt the performance of your training system. So, that cleared up a little bit right away. It matters. The next question is, "Where does it matter? And how does it matter? Does it matter today, or is this some question that we're going to face some years down the road?"
So, the next test that I did was a little bit more realistic. So, like I said, I used MLPerf and I used four of the different benchmarks that are part of MLPerf: Image Classification, Object Detection, Single Stage Detector and RNN Translator. When we just run these and we test different types of storage, we find that so long as your throughput meets a minimum for the specific model, you're going to get full training performance.
03:54 WV: However, that's not how people actually run things in the real world; things are very rarely 100% in isolation. So, what we next tested was doing simultaneous ingest. What we find with most customers, what they're doing is they will have a caching system that holds the data set for AI training, and they will move data in and out of that system as they're training different models. We also find that after an organization gets the process defined to set up one model, they find all sorts of reasons to expand that out to more models and more places in their business where they can use AI and machine learning. So, the simultaneous ingest is representing loading that next data set while you're training off of a data set that's currently on that storage. So, now we've taken something that was 100% read and turned it into a little bit more of a mixed workload, which is what we see in the real world.
04:56 WV: You are getting new data and getting that data onto your system. The thing that was most interesting about this result... So, the bars that you see here are NVMe and SATA devices, and I don't actually have the baseline numbers on here, but Object Detection and Single Stage Detector also saw some performance regression. Single Stage Detector saw up to a 30% performance reduction. That's massive, and that's just from doing the simultaneous ingest. I do have a video up of that presentation if you want a little bit more of the details. I also have a lot of posts on our website where I go over this more in depth, but the main takeaway here was that a model's dependency on storage performance cannot be predicted from the amount of disk activity that model generates during training.
05:47 WV: Image Classification had the highest disk activity, in the 1.2 gigabyte a second range; Object Detection and Single Stage Detector were down in the tens to hundreds of megabytes a second range, so significantly lower. And the ones that were dependent on the quality of storage were Single Stage Detector and Object Detection, not the most storage-intensive model we were training. One of the reasons for that with Image Classification, specifically, is a library that Nvidia developed called Dali, the Data Loading Library, which does preloading and caching of data before it is needed to ensure that the quality of your storage doesn't impact the actual training application. It'd be great if we had that for all of the models and all types of data, but that's kind of a big ask.
06:39 WV: Now, I bring this up because, as I mentioned in the beginning, I used Dali as kind of a proxy to analyze what our maximum theoretical throughput is. And I just want to let everyone know, this is a little bit of a hand-wavy benchmark. There are more optimizations and tunings we could do if we were really just looking for the peak number, but that's a little beside the point. Anyway, what I tested here was a Dali pipeline with three steps. I kept it as simple as possible: it's going to read data from disk, it's going to do the image decoding from JPEG to tensor, and then it's going to resize the images to fit the standard ResNet-50 model that we're using here. The container running this process was memory-limited to ensure that data access goes to storage and isn't just coming from the file system cache. The setup was a RecordIO file, four threads per GPU reading from that file, and the images are the ImageNet 2012 data set, the same thing that everyone uses for actually benchmarking training hardware. So, if you see MLPerf and image classification, that is training ResNet-50 on the ImageNet data set, and that's what we're reading here.
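The three-step pipeline described above can be sketched with Dali's Python API. This is a hedged sketch, not the exact benchmark code from the talk: it assumes the nvidia-dali package and a GPU are available, and the RecordIO file and index paths are hypothetical placeholders.

```python
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def imagenet_pipeline():
    # Step 1: read encoded JPEGs (and labels) from a RecordIO file.
    jpegs, labels = fn.readers.mxnet(
        path="train.rec", index_path="train.idx",  # hypothetical paths
        random_shuffle=True, name="Reader")
    # Step 2: decode JPEG to an RGB tensor; "mixed" uses the GPU decoder.
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # Step 3: resize to the standard 224x224 ResNet-50 input.
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = imagenet_pipeline()
pipe.build()
images, labels = pipe.run()  # produces one batch of decoded, resized images
```

Running the pipeline in a loop, with no model attached, is what gives the storage-to-GPU ceiling used as the "maximum throughput" bar in the chart.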
07:56 WV: The other two bars on this chart come from the results that Nvidia posted benchmarking the performance of training ResNet-50 on ImageNet on two different frameworks, TensorFlow and PyTorch. All of this is done on a DGX A100, with eight of the A100 GPUs. And what we see here, for a very compute-intensive workload training ResNet-50, is that TensorFlow is already at 75% of the peak throughput that we get with Dali. And faster storage doesn't increase this. The eight drives that come in the DGX A100 will hit 50 gigabytes a second, but the preload throughput is only in the 2 to 3 gigabyte a second range. There's a lot of performance left on the table, but the architectural issues that we're facing just don't allow us to get that additional performance, and that's because data has to go from disk to CPU memory, as sort of a bounce buffer, before going into the GPU. So, this results in a storage bottleneck that you can't really fix with faster storage.
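A quick sanity check on the gap described above, using my own assumed numbers (roughly 20,000 images a second of training throughput and about 110 KB per ImageNet JPEG; the talk does not state these exactly):

```python
# Required ingest rate versus what the local drives can deliver.
images_per_s = 20_000      # assumed ResNet-50 training rate on 8x A100
avg_jpeg_kb = 110          # assumed average ImageNet JPEG size

required_gbps = images_per_s * avg_jpeg_kb / 1e6   # KB/s -> GB/s
drive_gbps = 50.0          # the eight local NVMe drives in a DGX A100

headroom = drive_gbps / required_gbps  # how much the drives outrun the model
```

That is the bottleneck in a nutshell: the drives can supply more than twenty times what the training loop can consume, so adding faster storage does nothing until the bounce-buffer data path itself is fixed.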
09:15 WV: And this leads us into Nvidia GPUDirect Storage, or GDS. What is it? That's the first question everyone's going to ask. It's a technology and software stack that Nvidia has been developing for a little while now. At a high level, it lets you skip the CPU memory bounce buffer, so data can move directly from storage devices to GPUs. Those storage devices can be remote, over NICs or HBAs, or they can be local NVMe devices. As long as the DMA capability is there, this theoretically should work. It's generally going to improve throughput and reduce latency. The software stack is currently in beta, open beta now, I believe. The application tie-ins are still under development, as are tie-ins with the different AI frameworks, with Dali, things like that.
10:13 WV: It is designed to be general-purpose, it's not tied to any specific application, so it's going to be really exciting to see how integrations down the road really enable some cool use cases. There's a growing ecosystem of storage partners, so there's going to be a lot of options available for you. If you have more specific questions, I'm going to refer you to Nvidia for implementation details. They've got a series of blog posts and pages and presentations that they've done. I've got links to a couple of those here, as well as a link to a blog post that I did earlier this year, in July, where I'm going to go over some of the data that I have here. I'm going to go more in-depth in that blog post, as well as a little bit more in-depth into what GDS is actually doing. Magnum IO is the umbrella project for the different pieces for the accelerated I/O system that Nvidia is developing. I highly recommend checking out CJ Newburn's GTC presentation. He goes over great detail into what GDS is, how you can use it, how to implement it, as well as some performance data from different partners, including us here at Micron.
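For anyone who wants to try the GDS data path from Python, one option is Nvidia's kvikio library, which wraps the cuFile API. This is a sketch under assumptions, not code from the talk: it requires kvikio, CuPy, and a GDS-capable driver stack, and "/mnt/nvme/data.bin" is a hypothetical file path.

```python
import cupy
import kvikio

# Allocate a 1 MiB buffer directly in GPU memory.
buf = cupy.empty(1 << 20, dtype=cupy.uint8)

with kvikio.CuFile("/mnt/nvme/data.bin", "r") as f:
    # read() moves file data straight into the GPU buffer; when GDS is
    # available it DMAs past the CPU bounce buffer, and otherwise kvikio
    # falls back to a POSIX read plus a host-to-device copy.
    nbytes = f.read(buf)
```

The fallback behavior is what makes this convenient for experimentation: the same code runs on systems without GDS, just without the direct data path.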
11:27 WV: All right, so let's get to the good part. What does it actually do for us? The first testing that I'm going to talk about was done with local NVMe storage. This is actually the server that I'd been doing most of my testing on up until just a couple of months ago, a Supermicro-based box. It has eight Nvidia V100s in the SXM2 form factor. So, from a GPU perspective, this is going to be very similar to a DGX-1, as the same hybrid cube mesh NVLink architecture is connecting all of the GPUs. The CPUs, memory and drives, though, are a little beefier than a DGX-1 would have. I've got two Intel Skylake Platinum 8180Ms and 3 terabytes of memory in here. We're a memory company; if I have it available, I'm going to use it. And then I have eight NVMe SSDs. And I just want to point out that the NVMe SSDs in this system are connected to the CPUs, not to PCIe switches as in a DGX-2 or a DGX A100. That'll come into play a little bit later.
12:33 WV: The workload that I'm showing here is a random read, where I am scaling the I/O transfer size. Each GPU is using 16 workers, which is kind of a "medium" workload. The blog post that I mentioned, and that's linked here, does show a little bit more scaling across worker counts, so you can see what that actually means. But what we're looking at here, for that medium workload, is that at small to medium block sizes, we see pretty dramatic increases in throughput and decreases in latency. At the top end, performance is similar with and without GDS because we're not really bottlenecked anywhere; that's just the performance limit of the drives that we had in the system at the time.
13:20 WV: What's interesting here is that we see right away how GDS can accelerate your storage, how you can get data into your GPUs faster and with lower latency. However, local NVMe can only provide as much capacity as you have slots in the server. And, like I said, this server only had eight slots for NVMe devices, and that's going to cap the total capacity that you can get. And then there's other issues with having local caches on separate individual systems. So, one of the questions that we're starting to hear more and more from customers, partly, I think, because NVMe over Fabrics is getting talked about, is "How can we get NVMe-like performance, local NVMe-like performance with the benefits of remote storage? Does NVMe over Fabrics actually provide that?" The short answer is yes.
Well, let's go into the long answer a little bit more. So, what I'm testing here is an EBOF, an Ethernet bunch of flash, where we've partnered with Marvell and Foxconn for the chassis, the converter controllers and the Ethernet switches. The architecture here is that each drive is connected to a converter controller that presents the NVMe device as an NVMe over Fabrics device, and those are then connected to the internal switches. And then we directly connected six of the 100-gig Ethernet ports on one of those switches to the DGX A100.
14:52 WV: So, we've got a total top-end theoretical throughput of 600 gigabit. We are going to have more information available on the EBOF and this architecture. There's a blog post on micron.com by Joe Steinmetz -- I highly recommend checking that out. There has also been an FMS presentation by one of my colleagues, Sajid, where he did more testing on the EBOF and dug into the performance details. So, if you have questions on the EBOF, I highly recommend checking out that session here at FMS, as well as reading that blog post by Joe.
The test case that I set up, I wanted to keep things simple, especially for these first tests; I didn't want a lot of complexity that could muddy the waters. So, we did a direct connect, we didn't go through another switched architecture. Each of the six links was basically responsible for four NVMe over Fabrics targets. We have IP addresses assigned to all 24 drives and IP addresses on each of the six NICs, and it's pretty easy to set up that mapping. Each individual NVMe over Fabrics target was then individually formatted and mounted on the DGX A100. We did that, again, to keep some of the complexity out of the initial testing, and it also allowed us to do testing where we scaled the number of drives pretty arbitrarily. I don't have that data here for you today, but this setup enabled that testing.
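The mapping described above is simple enough to write down. A minimal sketch, assuming a plain round-robin assignment (my choice; the talk only says each of the six links carried four targets):

```python
N_LINKS, N_TARGETS = 6, 24   # six 100 GbE ports, 24 NVMe-oF drives

def link_for_target(t: int) -> int:
    """Round-robin the 24 targets across the six direct-connected links."""
    return t % N_LINKS

# Every link ends up responsible for exactly four targets.
per_link = [sum(1 for t in range(N_TARGETS) if link_for_target(t) == n)
            for n in range(N_LINKS)]

# Six 100-gigabit links also give the 600 gigabit theoretical ceiling
# mentioned earlier, i.e. 75 gigabytes a second.
ceiling_gb_per_s = 6 * 100 / 8
```

Any balanced assignment works the same way here; round-robin just makes the four-targets-per-link split obvious.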
16:24 WV: And then the way that we ran load: in the DGX A100, there are four PCIe switches, each with two GPUs and two NICs attached to that same switch. So, GPUs would target NVMe over Fabrics devices that were coming across a NIC on that same PCIe switch. We weren't going across the PCIe fabric or the NVLink fabric -- again, just to keep things simple. And what we found for small blocks here is that the results are really similar to what we had with local NVMe. This chart is going to look a little different than the one I had before because this is just a small block size, while the previous one looked at a range of transfer sizes. There is going to be some data in the blog post that I linked showing local NVMe with GDS where I scale the number of workers, and that's what we're seeing here: the number of worker threads per storage target. We get consistent reductions in latency and consistently higher throughput with GDS versus the legacy data path. And the peak here is significantly higher -- we're looking at 4.1x higher peak throughput with GDS than without. We also get better scaling: as we scale up the amount of load that we're putting against the system, we see an increase in performance as well.
17:52 WV: When we look at large transfers, we actually see something new. The architecture of the DGX A100 results in large I/O throughput being higher with GDS than without. With the local NVMe that we saw before, our peak was basically the same either way. But because the NICs in the DGX A100 are on the PCIe switches, the path through system memory is significantly longer than it would be if those NICs were on the CPUs. Furthermore, the performance that we get out of local storage in the DGX A100 tops out around 50, 52 gigabytes a second. And we're able to beat that with just one EBOF full of mainstream NVMe devices, the Micron 7300. And, here, we're getting top-end performance 83% higher with GDS than without. That's a dramatic difference -- we're getting throughput that the non-GDS case just can't keep up with. And this is just one EBOF; this is something that you can scale out and scale up for more performance and additional capacity.
19:05 WV: So, to summarize: AI applications are hitting storage bottlenecks, or some of them will soon. This is an issue that we need to be aware of and addressing today, so that when it becomes a big bottleneck, we have a solution in place. And that's one of the places GDS comes in: GDS consistently improves throughput and latency getting data from storage to GPUs.
And then finally, the NVMe over Fabrics solution that we're looking at provides near-local NVMe performance with the benefits of remote storage while being compatible with Nvidia GPUDirect Storage out of the box. If you have questions that you think of maybe after the question and answer session that we'll have here, feel free to reach out to me on LinkedIn or Twitter. I've got my LinkedIn link right there, my Twitter handle. There's also the Micron Tech Twitter handle if you have any broader questions or if you just want to follow us. Also check out the Micron blog. We'll have more information coming out over the following months and quarters as we do more and more with Nvidia GPUDirect Storage.
All right, from here we're going to go to questions. That's the end of my presentation. Thank you very much.