Optimizing NVMe Drives for Your Applications
00:01 Andy Walls: Hi, and welcome to this session on "Optimizing NVMe drives for your applications." My name is Andy Walls; I am an IBM Fellow and the CTO and chief architect for FlashSystem at IBM. It's really a privilege and an honor to be with you today. I know that this year is much more complicated than in previous years, when we could meet together, discuss things, and have drinks and meals together -- things are much different.
00:34 AW: And yet I hope that you are taking as much advantage of the summit this year, as you can . . . And the virtual format has advantages, and I hope that it is working well for you.
So, we're going to talk about how to optimize NVMe drives for your application. Actually, the ideal is that your application doesn't have to do anything; the goal is that the drive itself is optimized for the applications that are going to use it, so that an application shouldn't have to know much about the SSD. What's important is that we optimize those NVMe drives internally, or with a thin driver, so that they are ready for your application. And a lot has been done: there's a lot of work going on in the NVMe standard, in various protocols, inside drive manufacturers, and in various other places to make sure NVMe drives are optimized for various use cases. Now, let's look at the landscape a little bit for NVMe today: First and foremost are NVMe SSDs. By and large, these come in U.2, 2.5-inch form factors.
01:54 AW: EDSFF does have some form factors today, and they are gaining in popularity; over the next few years, I think we can expect to see more of them. Right now, it's the 2.5-inch form factor that dominates. Little by little, NVMe is taking over from SAS, and I think in years to come NVMe SSDs over PCI Express will dominate. But it takes a while, and there are always long tails as we shift to new protocols. So, by and large, when we talk about NVMe SSDs, we are talking about SSDs that are U.2 and connect over NVMe running on PCI Express.
Now, NVMe also includes enclosures, and there is support today for NVMe over Fabrics -- that can be NVMe over Fibre Channel, NVMe over RoCE, even NVMe over InfiniBand. Those enclosures usually have NVMe SSDs in them. Now, there is a trend that has started over the last couple of years, and that is to have NVMe over Fabrics SSDs -- Ethernet directly on the SSD, so that you can essentially get direct-attached performance from network attachment. These are called Ethernet Bunch of Flash, as opposed to just-a-bunch-of-flash enclosures that might connect over NVMe.
03:34 AW: Is it an enabler for composability? I think so. I think we'll see this gaining popularity as well, but not taking over from the other NVMe over Fabrics options that are out there today, as well as just Fibre Channel. So, the key thing here is that regardless of the form factor, regardless of whether it is attached with NVMe over Fibre Channel or NVMe over Fabrics, whether it is just an NVMe SSD, optimizing the SSD is similar in all of these cases.
So, let's look at the many variables that have to be optimized in order to get as much out of that SSD as possible. First and foremost, we need it to have the lowest possible cost. Secondly, it has to have the highest efficiency. What do I mean by efficiency? An SSD might provide you 4.8 or 9.6 or 15.36 terabytes, but how much of that can you really use? In order to get all the cost benefit that you can, you need to be able to use as much of that capacity as possible -- so, as low an overprovisioning as possible in order to give you the highest efficiency. The third parameter, of course, is the highest performance. And as we look at the highest performance, I think it's important that it's not just hero numbers, not just IOPS that can make your eyes pop.
05:10 AW: What matters is, how does it perform in real-life applications, where reads and writes are mixed, where block sizes are mixed, where you might have some sequential going on as well as some random, or what appears to be random? So, in real-life workloads, getting as high a performance as possible, and really allowing for multiple applications to attach to and use that SSD, especially as SSDs grow in capacity.
The fourth variable, and it is also extremely important, is the endurance. I have to be able to get enough endurance that that drive can last as long as I need it, while doing real-life workloads that do involve a lot of writes.
06:02 AW: And then lastly, but certainly not least, is not just to get consistent, good performance, but consistent low latency in real-life applications. So, optimizing to get all of these is certainly quite the challenge. Now, what is it about flash technology that causes this to be challenging? Well, as we all know, flash has the unique attribute that you have to do an erase before you can write, and that necessitates a log-structured array and garbage collection -- and both of those things can involve high write amplification.
And, so, the very nature of flash requires outstanding garbage collection; it requires using all kinds of techniques to reduce that write amplification as much as possible. And, of course, as thrilling as garbage collection is, and as difficult as it is to ensure that you have an efficient and effective garbage collection algorithm, we all know that flash wears out. The higher the write amplification, the worse that effect is, so we need to get all of the program/erase cycles that we possibly can, so that the flash will have a useful life that allows it to be economical.
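The relationship just described -- a log-structured design, garbage collection, and the write amplification they produce -- can be seen in a toy simulation. This is only an illustrative sketch (the FTL model, greedy victim policy, and all parameters are my assumptions, not IBM's design); it writes random single pages and measures write amplification as NAND writes divided by host writes. Note how the overprovisioning fraction, the efficiency knob from earlier, trades directly against write amplification.

```python
import random

def simulate_waf(num_blocks=64, pages_per_block=128, overprovision=0.125,
                 host_writes=50_000, seed=7):
    """Toy log-structured FTL: random single-page overwrites with greedy
    garbage collection (always evict the block holding the fewest valid
    pages). Returns WAF = total NAND page writes / host page writes."""
    rng = random.Random(seed)
    user_pages = int(num_blocks * pages_per_block * (1 - overprovision))
    where = [None] * user_pages                  # lpn -> block with its valid copy
    valid = [set() for _ in range(num_blocks)]   # lpns currently valid per block
    free = list(range(num_blocks))
    open_blk, fill = free.pop(), 0
    nand_writes = 0

    def new_open_block():
        nonlocal nand_writes
        if free:
            return free.pop(), 0
        # Greedy GC: erase the block with the least valid data, then
        # relocate its survivors back into it -- each copy is an extra write.
        victim = min((b for b in range(num_blocks) if b != open_blk),
                     key=lambda b: len(valid[b]))
        survivors = sorted(valid[victim])
        valid[victim] = set()
        for lpn in survivors:
            valid[victim].add(lpn)
            where[lpn] = victim
            nand_writes += 1
        return victim, len(survivors)

    def write_page(lpn):
        nonlocal open_blk, fill, nand_writes
        if fill == pages_per_block:
            open_blk, fill = new_open_block()
        if where[lpn] is not None:
            valid[where[lpn]].discard(lpn)       # invalidate the old copy
        valid[open_blk].add(lpn)
        where[lpn] = open_blk
        fill += 1
        nand_writes += 1

    for _ in range(host_writes):
        write_page(rng.randrange(user_pages))
    return nand_writes / host_writes
```

With uniform random overwrites and ~12.5% overprovisioning, the measured WAF comes out well above 1 -- every extra unit of it is flash wear and background work the host never asked for.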
07:29 AW: Now, as we've gone to 3D flash, and as we fit more and more bits into a NAND cell, going from TLC to QLC, factors like cell-to-cell interference are even more challenging, because now you're going to have a higher bit error rate, which can result in having to reread data or write it again somewhere else. All of this adds to the challenge.
Of course, there's retention itself: the device has to be able to retain data even if powered off for a certain period of time. And as we try to drive cost down, we want to be able to use QLC, four bits per cell. But, of course, four bits per cell means a longer programming time, and read latencies go up.
08:33 AW: So, all of these things combined mean that using flash these days, and making it optimized for real-life applications, is not for the faint of heart. So, let's look at what the goal is, and then we'll talk about each one of these individually. The goal is to enable the cheapest flash possible. Now, there are many vendors out there, many suppliers, that have fit-for-purpose flash enclosures. Some are ideal for a capacity play: they use QLC, and you won't get as good a performance, but you can put a lot of data on them.
And then there are enclosures that use TLC -- you can get better performance, but they're more expensive. So, I think the goal is: how can we enable the cheapest flash and optimize it so that it can be used in really all applications and provide that cost-effectiveness?
09:46 AW: In order to do that, you have to be able to use as much of that flash as possible. You have to employ techniques to get as many writes as possible, which means you have to get the lowest possible write amplification, and you have to use techniques to get the lowest latency possible for real workloads. So, we're going to look at each one of these in detail.
First of all, enabling QLC for more applications. At IBM, we announced our FlashSystem 9200 in February of this year. It's really a family of enterprise all-flash arrays used in data center applications. And we were able to ship QLC across that entire product range -- it's only QLC. We were able to enable QLC to be used regardless of the application and still achieve all of the things that we just talked about in terms of endurance and performance.
10:50 AW: And, so, enabling QLC allows for awesome capacity. In our most recent FlashCore Module, released in February, we ship up to 38.4 terabytes in one 2.5-inch U.2 SSD. That's an incredible amount -- it's the densest of any 2.5-inch NVMe drive that I'm aware of -- and we have other capacities as well.
Now, using QLC allows you to achieve low cost, but the other thing that we've been able to do, and that can really help, is to have compression built in as well. Compression in a 38.4 terabyte drive allows us to store up to 88 terabytes in 2.5 inches. That's just incredible. We're able to do 1.7 petabytes in 2U that way. Now, having compression in an SSD does require the stack -- in our case, the Spectrum Virtualize stack -- to be able to support out-of-space handling, monitoring for space, and giving alerts if it looks like you're running out of space. But the data path is completely hardware in-line, and nothing outside the SSD has to do anything for the compression -- it's built in. And, so, we get excellent performance, and most importantly, we are able to get extremely high capacity after the compression.
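The out-of-space monitoring just mentioned follows from simple arithmetic: the logical capacity is only real if the data keeps compressing, so the stack has to watch physical usage. A minimal sketch, assuming an 85% alert threshold of my own choosing (the 38.4 TB and roughly 2.3:1 ratio are implied by the talk's 38.4 TB to 88 TB figures):

```python
def physical_used_tb(logical_tb: float, achieved_ratio: float) -> float:
    """Physical flash consumed when data compresses at achieved_ratio:1."""
    return logical_tb / achieved_ratio

def space_alert(logical_tb: float, achieved_ratio: float,
                physical_tb: float = 38.4, threshold: float = 0.85) -> bool:
    """The kind of out-of-space check the upper stack must run with in-drive
    compression: alert when physical usage crosses a chosen threshold.
    The 85% threshold is illustrative, not a product value."""
    return physical_used_tb(logical_tb, achieved_ratio) >= threshold * physical_tb

print(space_alert(70.0, 2.3))   # False: ~30.4 TB of 38.4 TB physical used
print(space_alert(80.0, 2.3))   # True: ~34.8 TB used crosses the 85% mark
```

The point of the sketch: the same 80 TB of logical data is fine at 2.3:1 but an emergency at 2:1, which is why the stack, not just the drive, has to monitor and alert.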
12:31 AW: And so, holy moly, 88 terabytes in 2.5 inches. Now, how do you get to use as much of that SSD as possible? What's the key to using as much of that flash as possible? Well, the key is to keep the write amplification low. This is really the goal of any flash translation layer: to keep that write amplification as low as possible. The ideal would be to have it at one, which means you are not doing any additional writes, and you can get outstanding endurance.
But it's not just endurance these days. The higher the write amplification, the more the drive has to deal with garbage collection, which can get in the way of host accesses and hurt performance. It can cause performance spikes, and it can mean that your latency is not as consistent as we want. So, how do we use as much of that flash as possible and keep that write amplification low? What we do at IBM is a technique called health binning and heat segregation.
13:44 AW: Health binning and heat segregation also help the basic endurance, but heat segregation is really good at keeping the write amplification low. If you group data together by access heat -- by write heat -- then data that changes a lot will be grouped together, will change and be overwritten a lot, and you'll get a very low write amplification. That's been extremely effective for us. What we found is that most applications truly are skewed, meaning that they follow some sort of Zipfian distribution -- 80/20, 90/10 -- so that 20% of the data is accessed 80% of the time, or something like that.
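The 80/20 skew just described is easy to reproduce. A small sketch (page counts, sample size, and the exponent s=1 are all illustrative assumptions) draws accesses from a Zipf-like popularity distribution and measures what share of the traffic the hottest fifth of pages receives -- it comes out near 80%, which is exactly why segregating by write heat pays off:

```python
import random
from collections import Counter

def zipf_access_counts(n_pages=1000, n_accesses=100_000, s=1.0, seed=1):
    """Draw accesses from a Zipf-like popularity distribution (the page
    ranked r gets weight 1/r^s) and count hits per page."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** s for rank in range(n_pages)]
    hits = rng.choices(range(n_pages), weights=weights, k=n_accesses)
    return Counter(hits)

counts = zipf_access_counts()
by_heat = sorted(counts.values(), reverse=True)
top20_share = sum(by_heat[:200]) / sum(by_heat)
print(f"hottest 20% of pages take {top20_share:.0%} of the accesses")
```

Grouping that hot 20% into its own erase blocks means those blocks fill almost entirely with data that will soon be invalid anyway, so garbage-collecting them relocates very little.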
14:36 AW: In addition to heat segregation, it's very important that the garbage collection be very efficient at achieving a low write amplification, but do so with a minimum of metadata. I'll talk a little bit more about that in a moment. Obviously, lowering the error correction code's overhead lets you eke out a little more of the flash for data. One thing that has become available in 2020 is Zoned Namespaces. This is something that many operating systems, applications, hyperscalers -- many folks -- are taking advantage of, so that the flash is not having to do a lot of that garbage collection. It's now up to the operating system, the application or the driver to write sequentially and then to reset and start over. And, so, the drive can be broken up into zones that have an ideal number of erase blocks in them, and by using multiple of these zones, you can get a very low write amplification. In fact, you can use much, if not all, of the flash that way. Now, the upper-level stack is responsible for the garbage collection.
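The zone behavior described above can be sketched in a few lines. This is a simplified model of a Zoned Namespace zone, not the full NVMe ZNS state machine (zone states, zone append, and open-resource limits are omitted): writes must land at the write pointer, and the only way to reclaim space is to reset the whole zone -- which is precisely why garbage collection moves up into the host stack.

```python
class Zone:
    """Minimal model of a ZNS zone: strictly sequential writes at the
    write pointer; space is reclaimed only by resetting the entire zone."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.write_pointer = 0

    def append(self, n_blocks: int) -> int:
        """Sequential write; returns the offset the data landed at."""
        if self.write_pointer + n_blocks > self.capacity:
            raise IOError("zone full: reset it or open another zone")
        start = self.write_pointer
        self.write_pointer += n_blocks
        return start

    def reset(self) -> None:
        """Host-driven reclaim: everything in the zone becomes invalid."""
        self.write_pointer = 0

z = Zone(capacity_blocks=4)
assert z.append(2) == 0 and z.append(2) == 2
z.reset()                  # the upper stack decided nothing here is live
assert z.append(1) == 0    # the zone is writable from the start again
```

Because the drive never has to relocate valid pages behind the host's back, it can map zones cleanly onto erase blocks and run with little or no internal overprovisioning.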
16:13 AW: IBM has an example of this called Salsa. It is extremely good at doing a log-structured array. You can see here, it does compression and deduplication, it looks for recurring patterns, and it does thin provisioning -- all of that. And it can take advantage of very inexpensive flash as ZNS, as a matter of fact, and provide a high level of efficient garbage collection and very low write amplification.
Now, not all applications can make use of ZNS, so how can the SSD itself get as many writes as possible and keep that write amplification down? The heat segregation that I mentioned first is extremely good at keeping write amplification down, but you know what else it does? It allows you to get a high degree of endurance, because we combine it with the ability to project block health. If I can tell the health of a flash block early on, then I can put the data which is hottest -- which changes the most -- on the blocks which are healthiest. And then I can take the data which changes least often and put it on the less healthy blocks. That combination allows you to have flash whose endurance is going to be determined by the average health, not the tails of the distribution.
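The pairing just described -- hottest data onto healthiest blocks, coldest onto the weakest -- is at its core a sorted matching. A minimal sketch under my own assumptions (the heat and health metrics, and the one-to-one matching, are illustrative; the real health-binning machinery is far richer):

```python
def match_heat_to_health(stream_heats, block_healths):
    """Pair the hottest write streams with the healthiest flash blocks and
    the coldest streams with the weakest blocks, so wear tracks the
    *average* block health rather than the weak tail of the distribution."""
    hot_first = sorted(range(len(stream_heats)),
                       key=lambda s: stream_heats[s], reverse=True)
    healthy_first = sorted(range(len(block_healths)),
                           key=lambda b: block_healths[b], reverse=True)
    return dict(zip(hot_first, healthy_first))

# heats: overwrites per hour per stream; healths: projected P/E headroom (0..1)
placement = match_heat_to_health([500, 10, 90], [0.2, 0.9, 0.5])
print(placement)  # {0: 1, 2: 2, 1: 0} -- hottest stream on healthiest block
```

The weak blocks then see mostly cold, rarely rewritten data, so they accumulate program/erase cycles slowly and stop being the thing that determines the drive's lifetime.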
18:02 AW: We've been extremely successful at IBM at doing characterization and determining block health early on for the flash that we use from Micron. I talked a moment ago about the importance of the ECC having low overhead, but at the same time, you need an ECC that can correct a very high bit error rate. Those two things oppose each other, and so we picked one with a fairly good coding rate that can still correct up to a 1% bit error rate. That eats up some bits, but it allows us to get extremely good write endurance by extending the life, combined with these other techniques that we have developed. It's also extremely important that one not just use the default voltage thresholds that come with a NAND device -- we have developed read-level shifting, optimizing these threshold values over the life of the product so that you are getting the lowest bit error rate you can as you progress over time.
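Read-level shifting, in miniature, is a calibration loop: periodically measure the bit error rate at a few candidate read-voltage offsets and adopt the best one for that block's current wear and retention state. A hedged sketch -- the offsets and BER numbers below are hypothetical calibration data, not measurements, and the real per-block machinery is certainly more involved:

```python
def best_read_offset(ber_at_offset):
    """Given bit error rates measured at candidate read-voltage offsets
    (hypothetical calibration reads), pick the offset that minimizes BER."""
    return min(ber_at_offset, key=ber_at_offset.get)

# As cells wear and charge leaks, the optimum drifts away from the default 0
calibration = {-2: 0.0008, -1: 0.0005, 0: 0.0012, +1: 0.0030}
print(best_read_offset(calibration))  # -1
```

Keeping the raw BER low this way leaves more of the ECC's 1% correction budget as margin, which is what extends usable life.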
19:18 AW: So, all of this together allows for getting as many writes as possible -- good write endurance -- while still not requiring high overprovisioning. And, again, remember the objectives: We want as low a cost as possible, we want to be able to use as much of that flash as possible, and we want endurance that will allow us to use it for a good many years. Now, on keeping the write amplification and the read amplification low: I talked a little bit about that -- Salsa does a very good job at it, but the SSD itself can also do a good job. Inside the FlashCore Module that we use in the 9200, we've been able to achieve very, very good write amplification. We take advantage of the fact that workloads are skewed, and we use a technique that our research team calls N-bin plus heat segregation.
20:21 AW: We break the blocks into bins. So, instead of having to do a lot of record-keeping -- keeping track of the exact invalidity and then sorting it, which takes a lot of metadata -- we group things together by heat and then by bins, and select from these bins for the garbage collection. That allows for very low write amplification, effectively and efficiently. Obviously, compression also helps: if I write less data, I'm going to have a better write amplification.
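The binning idea can be sketched as follows. IBM's actual N-bin algorithm isn't spelled out in the talk, so this is only an illustrative reading of it: rather than keeping every block exactly sorted by invalid-page count, each block lands in one of N coarse buckets, and the garbage collector takes any block from the most-invalid non-empty bucket -- near-greedy victim selection with almost no metadata.

```python
def pick_gc_victim(valid_pages_by_block, n_bins=8, pages_per_block=128):
    """N-bin style victim selection: bucket blocks by coarse invalidity
    instead of sorting them exactly, then take a block from the
    most-invalid non-empty bucket."""
    bins = [[] for _ in range(n_bins)]
    for blk, valid in valid_pages_by_block.items():
        invalid = pages_per_block - valid
        idx = min(invalid * n_bins // pages_per_block, n_bins - 1)
        bins[idx].append(blk)
    for bucket in reversed(bins):       # scan from the most-invalid bucket down
        if bucket:
            return bucket[0]
    return None

# block 1 is ~92% invalid, so it sits in the top bucket and gets chosen
print(pick_gc_victim({0: 128, 1: 10, 2: 60}))  # 1
```

The trade is deliberate: within a bucket the choice is arbitrary, so the victim may not be the absolute best block, but the bookkeeping is a handful of short lists instead of a fully sorted index.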
Now, how do you get decent performance after all of that? After trying to get the cheapest flash you can, how do you get decent performance in addition, and keep the write amplification down? We've developed a few techniques with our FlashCore Module, one of which is what you see on the left here: efficient data straddling on QLC.
21:29 AW: So, even if I want to read quite a bit of data, I make sure that it's spread across multiple lanes so I get parallelism. I have multiple lanes trying to get that data for me, and that allows me to optimize the latency, because I am doing it in parallel across many lanes. Of course, having compression also helps, because there's less data for me to read.
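The latency benefit of straddling falls out of a back-of-envelope model: each lane transfers only its share of the data, concurrently, so the transfer term divides by the lane count. All numbers here (per-lane throughput, fixed overhead) are illustrative assumptions, not FlashCore figures:

```python
def read_latency_us(total_kib, lanes, lane_mb_per_s=400.0, overhead_us=60.0):
    """Rough latency for a large read straddled across NAND lanes: a fixed
    per-command overhead plus the transfer time of one lane's share."""
    kib_per_us = lane_mb_per_s * 1024 / 1_000_000   # MB/s -> KiB/us (approx)
    return overhead_us + (total_kib / lanes) / kib_per_us

one_lane = read_latency_us(256, lanes=1)
eight_lanes = read_latency_us(256, lanes=8)
print(f"{one_lane:.0f} us on one lane vs {eight_lanes:.0f} us across eight")
```

The fixed overhead doesn't shrink with more lanes, which is why straddling pays off most on large reads; compression compounds the win by shrinking `total_kib` in the first place.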
Another technique that we've implemented extremely effectively is read heat segregation. So, I talked about write heat segregation; read heat segregation is also important, because QLC NAND flash gives you the capability of putting some blocks in SLC mode. In addition, some NAND flash manufacturers have an asymmetrical access pattern across the different pages; in other words, some pages are faster than others. Having faster pages and SLC pages means that if I can keep track of the read heat, then I can put data that is read most often either in SLC or on the fastest pages. It also means, though, developing an algorithm to determine this read heat very simply and with as little metadata as possible. One thing that we've actually done is to put in place a hinting architecture, so that the upper-level stack can tell us the heat -- can actually say, "This is metadata; this is going to be read quite a bit," or possibly, "This is just log data that is never going to be read again."
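The placement decision described here -- combining measured read heat with upper-stack hints -- can be sketched as a small policy function. The tier names, hint strings, and heat thresholds below are all made up for illustration; the talk doesn't specify the actual interface:

```python
def choose_tier(read_heat, hint=None):
    """Placement sketch for read heat segregation plus hints: metadata and
    hot data go to SLC-mode blocks or the fast QLC pages, write-once log
    data to the slowest pages. Thresholds are illustrative only."""
    if hint == "metadata":
        return "slc"                   # "this is going to be read quite a bit"
    if hint == "log":
        return "qlc_slow_page"         # "probably never read again"
    if read_heat >= 100:               # reads per interval -- arbitrary cutoffs
        return "slc"
    if read_heat >= 10:
        return "qlc_fast_page"
    return "qlc_slow_page"

print(choose_tier(3, hint="metadata"))   # slc, regardless of measured heat
print(choose_tier(250))                  # slc
print(choose_tier(40))                   # qlc_fast_page
```

Hints short-circuit the measurement entirely, which is the appeal: the stack already knows that metadata is hot and that log data is write-once, so the drive doesn't have to burn metadata learning it.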
23:18 AW: So, it can tell us that as well, so that we can put the data on the right tier in the flash. You can see that there has been a lot of progress in optimizing NVMe SSDs. Obviously, I've given you an IBM perspective, but this is true for other NAND flash manufacturers and SSDs as well. They can be extremely dense. The key is: how do I make use of as much of that space as possible, and how can I use the cheapest flash, like QLC? We found that these drives can be cost-effective by using built-in compression along with the cheapest NAND. We've also found that they can provide plenty of endurance, even though it's QLC, and also provide consistent low latency and high performance in real workloads.
Again, I hope that you have enjoyed this session. I hope that you are getting a lot out of the Flash Memory Summit, and I thank you for joining me, even if it is virtual.