Breakthroughs Enabling Enterprise QLC SSDs
This keynote presentation focuses on innovations that allow all-flash arrays with QLC to have the same endurance and performance as TLC-based arrays.
00:05 Alistair Symon: Hello, and welcome to the keynote session that I'm going to give on QLC everywhere.
First, let me introduce myself. I'm Alistair Symon. I'm the vice president of storage development for IBM. So, all of IBM's storage hardware and software product development reports in to me. And, so, I'm going to be talking today about QLC everywhere, even, as you'll see, in primary storage.
So, first of all, let us take a look at the flash industry as a whole with storage. And if you look at the enterprise external storage market, that's mostly all-flash arrays now. Flash is also commonly used in hyper-converged infrastructure configurations. And clearly, also, it's very, very ubiquitous in cloud for primary and high-end performance.
00:54 AS: And if we look at the chart below, you can see that according to IDC, in 2020, the AFAs are now out-selling the hybrid flash arrays for the first time. And by 2023, they expect the AFAs to be out-selling the combination of the hybrid flash arrays and the HDD-based ones. But, obviously, HDDs are still very common because they absolutely have a place with a slow tier and many of the low-end applications because we've got the nearline drives that are very low cost and still can provide very attractive cost and performance for those low-end requirements, as well as being perfect for media and back-up appliances and any data that needs to be kept -- that's cold for object storage or archival.
01:45 AS: So, given that we've got this industry dynamic where we've got all-flash arrays continuing to grow in this primary storage space, but we've still got the disk drives being used for the slower workloads, which will continue for some time, what's further going to accelerate flash? Well, clearly, NVMe is definitely something with storage technologies that is doing that. It allows the true performance of the flash devices to show through to the applications, both at the drive level and as a network interface to the servers. And, obviously, as applications demand more performance, that's going to drive more adoption of flash.
02:28 AS: So, what we're going to talk about here is how QLC can drive more adoption of flash. Clearly, QLC is the lowest-cost flash that exists today within the industry. And the lower cost we can drive flash, the more applications and use cases it can serve. And should we also dare say that PLC at some point will be driving that adoption as well? So, you might at this point be thinking, "So, is this really going to be really boring? What am I going to learn here? I mean, clearly, flash is ubiquitous. What's so special about an extra bit in QLC anyway?" And we've got this end-to-end NVMe I mentioned and it's common and growing. So, what new is there to learn here? Well, I'm going to tell you, no, it's absolutely not going to be boring. In fact, the excitement is palpable. So, what we're going to talk about here is how we in IBM have actually enabled QLC for primary storage.
03:29 AS: So, if you look at most adoptions of QLC today within the industry, they're implemented as tiered models because QLC doesn't have the endurance that TLC does as NAND flash. And, typically, that would be implemented by most vendors as a tier of storage where TLC is tier 1 for the high performance and a tier 2 with QLC for lower, colder storage which is more read-intensive than write-intensive. And, of course, while that architecture works, it has inherent complexities for the users because their architecture has to support multiple tiers; they've got to have software and architectures in the controller that automatically move data from one tier to another.
04:20 AS: And as good as those technologies are -- and we deploy them a lot with TLC, flash and disk drives today, of course -- you've got to worry about how dynamic that tiering is. And certain workloads where it changes its workload pattern quickly, there could be an occasion where it takes that learning algorithm some time to get it right, and that could lead to a performance issue. So, it's much easier if we can just have one tier of flash that does everything.
And so, that's really what we've enabled and what we're going to talk about here today. It's QLC that has great performance, like TLC, and endurance just like TLC. So, let's talk about the problems first, associated with this because QLC for enterprise storage is definitely not for the faint of heart.
05:17 AS: There's a number of issues that we have to overcome if we're to even attempt something like this. First of all, the programming times for the flash are longer; they're 3x those of TLC. Read latency gets longer on the flash, 2x to 3x that of TLC. The read retention is something that becomes more problematic with QLC versus TLC, and making that read retention last as long as TLC is an issue. QLC is more susceptible to cell-to-cell interference. And all this, combined with the read retention and interference, makes the flash inherently less endurant. And for enterprise workloads, obviously, we have to be able to continue to write at a relatively high rate for very long periods of time over the lifetime of the product. So, trying to overcome these obstacles is certainly not for the faint of heart.
06:16 AS: So, it is very hard work. And can this actually be done? Well, it just seemed so hard that we had to find a way to do it. We wanted the challenge. And we were like, "Yes, we can do this." And indeed, we have now done it. If we look at the product that we introduced in the first quarter of this year, we have an NVMe drive, an NVMe U.2 dual-port flash drive that we call our FlashCore Module. It's our FlashCore Module 2, because it's the second generation of the FlashCore Module itself. And we've used a number of aspects of the intellectual property that paved the way to this in prior generations, both in our FlashSystem 900 product, and then into the FlashCore Module 1, that really we've used here as the building blocks to get around some of those obstacles we talked about in the previous chart.
07:16 AS: We're using, in this drive, next-generation 96-layer Micron QLC, and with that we still achieve, now, two drive writes per day. That's like 20x what most people would expect to see from a QLC drive. And that's because of all the things that we've done from an IP perspective in the drive itself. So, we've had, for some time, the ability to do health binning and heat segregation within the design of our flash drives, first off in the FS 900 and then with our FlashCore Module version 1. And that allows us to algorithmically spot the workloads that are being driven the fastest, the I/Os that are being . . . The pieces of data that are getting the most I/Os, and we have a way then of telling where to put those I/Os onto the flash that has the best health and the lowest heat, so that we can actually make sure that the flash that we have, even with TLC, can be elongated to very high longevity.
08:27 AS: And we combine that with the very popular, for us, FCM-based compression. So, we've got that capability within the FlashCore Module to compress data at line speed as it comes into the drive, and it will not impact performance of the above workload whatsoever. And this allows us to do very, very high-performance data reduction. And it also helps because with compression inside the FlashCore Module, we actually have to write less data to the back-end flash itself, so that helps with longevity of flash.
09:10 AS: But what we're doing in our FCM 2 design, which is based wholly on QLC flash technology, is we're using some extra features, like being able to switch pages from QLC to SLC and back again, using these techniques of the health binning and heat segregation, together with a smart data placement capability and read heat assessment to really . . . It's like an AI within the drive. It's going to work out exactly what type of page is going to be needed for each I/O as it's coming in. Does it need to be SLC? Does it need to be QLC? The drive will work this out. And it will allocate the pages appropriately, so we get the hottest data, in terms of I/O access, on the fastest flash and the healthiest flash. And it's this algorithmic capability that's in this drive that really lets us do some really amazing things and be able to get the longevity and write cycles of two drive writes per day -- that's the equivalent of TLC, even though this is entirely standard QLC flash that we use within the drive.
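The heat-based page allocation described here can be illustrated with a minimal sketch. This is not IBM's algorithm: the function name `place_pages`, the access log, and the fixed `slc_slots` budget are all assumptions made for illustration, and a real drive would also weigh block health and switch pages between modes over time.

```python
from collections import Counter

def place_pages(access_log, slc_slots):
    """Assign the most frequently accessed logical blocks to SLC-mode pages
    and everything else to QLC-mode pages (hypothetical policy)."""
    heat = Counter(access_log)  # access count per logical block
    # Hottest first; ties are broken by block id so the result is deterministic.
    ranked = sorted(heat, key=lambda blk: (-heat[blk], blk))
    return {blk: ("SLC" if i < slc_slots else "QLC")
            for i, blk in enumerate(ranked)}

# Hypothetical, skewed access trace: blocks 7 and 3 receive most of the I/O.
log = [7, 7, 7, 3, 3, 9, 7, 3, 1]
placement = place_pages(log, slc_slots=2)
```

With this trace the two hottest blocks (7 and 3) land in SLC mode and the rarely touched blocks (1 and 9) stay in QLC mode, which is the skew the talk relies on.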
10:28 AS: And the other features we've got in there, is we've done a lot of work with our garbage collection algorithms to really ensure that we have low write amplification. So, we've got all these tools that we're bringing to bear inside the drive itself to enable QLC to last the lifetime of a TLC drive. And so, if we look at the chart on the bottom right, what you're actually seeing there is, on the X-axis you're looking at the program-erase cycles that the drive can do, and on the Y-axis is the raw bit error rate.
11:05 AS: Now, within the drive that we have, we already have a very, very robust error correction algorithm that's capable of reading data at about an order of magnitude higher bit error rate than standard SSDs. This was developed in partnership with our IBM research teams. And so that already gives us a head start in terms of the bit error rate that we can actually work with, within the drive, as the flash wears over time. And you can see that the bit error rate we can manage to is marked by that red asterisk. And a typical QLC drive would be down at about the three mark -- 3,000 program-erase cycles -- that you can see here.
But with the IP that we've built into this drive, our program-erase cycles and the bit error rate follows the line that you can see here along the dark blue line. And if you project that blue line out from our measured numbers, and indeed we've even measured this projected line now to the full lifetime of the flash. But if you project that line out, you can see the point where the bit error rate becomes problematic is about at that 15,000, 16,000 erase cycles point. And that's what lets us get to the same duration of life with the QLC flash as we have with TLC. And then the green line just shows that that's really quite conservative because the compression will further reduce the number of writes that we have to do. And so, technically, we could even go beyond the endurance that we're showing here.
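To relate program-erase cycles to drive writes per day, a common back-of-the-envelope formula divides the cycle budget by the write amplification and the service life in days. The sketch below uses the 3,000 and 15,000 cycle figures from the chart discussion; the 5-year life, the write amplification of 4, and the 2:1 compression ratio are placeholder assumptions, not IBM figures, and overprovisioning is ignored. It is not meant to reproduce the drive's rated endurance.

```python
def drive_writes_per_day(pe_cycles, write_amplification, years,
                         compression_ratio=1.0):
    """Host-visible full-drive writes per day the flash can sustain.

    pe_cycles: program-erase cycles each block tolerates
    write_amplification: flash bytes written per byte arriving at the flash layer
    compression_ratio: host bytes per stored byte (1.0 means no compression)
    """
    days = years * 365
    return pe_cycles * compression_ratio / (write_amplification * days)

# Hypothetical inputs: 5-year service life, write amplification of 4.
typical_qlc = drive_writes_per_day(3_000, 4.0, 5)
extended_qlc = drive_writes_per_day(15_000, 4.0, 5)
with_compression = drive_writes_per_day(15_000, 4.0, 5, compression_ratio=2.0)
```

Under fixed assumptions the result scales linearly with the cycle budget, so moving from 3,000 to 15,000 cycles gives five times the sustainable write rate, and any compression multiplies it again, which is the effect the green line is described as showing.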
12:45 AS: So, we've made this absolutely as endurant as TLC flash, and the use of the SLC pages and some other techniques we'll talk about means that the performance is equally as good as TLC. So, what about the capacities that we can do with this? Well, we supply these modules in four different capacity sizes -- from 4.8 terabytes, 9.6, 19.2, up to an industry-leading capacity of 38.4 terabytes. That's the densest flash drive that exists within the industry, and it's NVMe and it's QLC, and it performs and endures like TLC. And that lets us build a maximum system capacity, in the 2U drawer that we have with our FS 9000 system, with 24 of these drives, of 757 terabytes of usable capacity, and that will deliver up to 1.7 petabytes of effective capacity with the in-built compression in the drive. So, huge density, lowest cost flash, high performance. So, really, it was well worth the effort that we put into it; we were glad we took on the challenge.
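The capacity figures quoted here can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers from the talk and assumes decimal units (1 PB = 1,000 TB); the variable names are illustrative, and the talk does not say what accounts for the gap between raw and usable capacity.

```python
drives = 24
drive_tb = 38.4            # largest module size quoted in the talk
usable_tb = 757            # usable capacity quoted in the talk
effective_tb = 1_700       # 1.7 PB effective, assuming 1 PB = 1,000 TB

raw_tb = drives * drive_tb                  # total raw flash in the drawer
usable_fraction = usable_tb / raw_tb        # share of raw capacity that is usable
implied_ratio = effective_tb / usable_tb    # data reduction implied by the two figures
```

The raw total comes to about 921.6 TB, so roughly 82% of it is presented as usable, and the effective capacity implies a data reduction ratio of about 2.2:1.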
13:57 AS: So, as I've talked about, the FCM QLC performs well. It actually, at a system level, has performance that beats TLC. So, if we want to see the graph on the measured performance of this, and look in the bottom right corner that we've got down here, what we've got here is a chart that shows the IOPS that we've measured on the X-axis, the response time on the Y-axis. And you can see here that we've got a very, very low, flat curve, very low latency, flat latency as the IOPS increase, and the response time stays below a millisecond all the way to the knee of the curve, and that's achieved with QLC, less than half a millisecond of response time. QLC, with a mixed workload of 70/30 read/write, 50% cache hit at 16 KB.
So, that's a very, very good result with an all-QLC drive. And if we look at some of the numbers that we've achieved here with FCM performance, read throughput is increased by 44%, write throughput is increased by 33%, read latency is cut by 40%, write latency is cut by 30%, and these are compared to TLC, remember, and IOPS are increased by 10% to 20%. These are comparing our FCM 2 QLC to our FCM 1 TLC.
15:31 AS: And, so, what have we done? This is our third-generation computation platform that we've done this with, and we've used our inline compression to do this, but we've used these techniques that I was talking about earlier, dynamic SLC allocation being one, to always put the highest accessed data onto the highest performance and healthiest flash.
And we can take advantage there of some of the things that we've got in the top right box that overcome some of these inhibitors, taking advantage of the fact that most workloads are skewed, that only a small amount of the data is ever accessed with real high intensity. And the algorithms within the drive will always make sure that's placed on the SLC pages. But we go beyond just what's the capability of intelligence within the drive; the drive can allow a hint to be given in terms of which blocks should be put into certain performing pages. The controller or the software above that's talking to the drive can give a hint as to whether maybe a piece of data is very highly read-intensive, and that would give you a hint then in the drive as to where to place that data. So, it's intelligent in the drive, but able to take hints from outside, and the smart data placement then is able to tell exactly where to put that particular piece of data on the right page of flash -- SLC, QLC -- how it should be used from a health perspective.
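One way to picture how an external hint could combine with the drive's own read-heat assessment is the sketch below. The `Hint` values, the `choose_page_mode` function, and the rule that an explicit hint overrides the observed heat are all hypothetical; the talk does not describe the actual hint interface or its precedence.

```python
from enum import Enum

class Hint(Enum):
    """Hypothetical hint values a controller might pass with a write."""
    NONE = 0
    READ_HOT = 1
    COLD = 2

def choose_page_mode(observed_reads, hot_threshold, hint=Hint.NONE):
    """Pick a page mode from an explicit hint if one is given,
    otherwise fall back to the drive's own observed read count."""
    if hint is Hint.READ_HOT:
        return "SLC"
    if hint is Hint.COLD:
        return "QLC"
    return "SLC" if observed_reads >= hot_threshold else "QLC"

# Data the drive has barely seen yet: only the hint marks it as hot.
unhinted = choose_page_mode(observed_reads=2, hot_threshold=10)
hinted = choose_page_mode(observed_reads=2, hot_threshold=10, hint=Hint.READ_HOT)
```

The point of the design is that a hint lets new data reach the right page type before the drive has gathered enough access history to decide on its own.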
16:58 AS: And one of the other things that we have to do is we have to overcome the long write duration of the program operation. So, how do we go and do that? Well, one of the techniques that we've built in here now is what we call a program suspend. So, what we're worried about when we're doing one of these really long write programs is holding off the read I/O. So, the way that we deal with that in this drive is we'll suspend write I/O at various points in time during that write cycle and allow reads to come in. So, it's like doing a . . . effectively a time slice of the write so that we don't hold off the read I/O and we don't suspend application I/O coming down to the drive. So, we overcome that inherent difficulty that was there with QLC. So, we've done all these different techniques that really have made QLC measurably, as you can see here, the same as TLC or better in this case, compared to our FCM 1 drive. Quite a result, I think you'll agree.
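The effect of time-slicing a long program operation can be shown with a toy model. The timings below are made-up values in microseconds, `read_wait_us` is a hypothetical helper, and the model ignores the overhead of suspending and resuming as well as the extra time the write itself takes once reads are interleaved.

```python
def read_wait_us(arrival_us, program_us, slice_us=None):
    """How long a read arriving at arrival_us waits behind a program
    operation that started at t=0 and needs program_us of flash time.

    slice_us=None models no suspend: the read waits for the whole program.
    Otherwise the program can be suspended at every slice_us boundary.
    """
    if arrival_us >= program_us:
        return 0  # program already finished, nothing to wait for
    if slice_us is None:
        return program_us - arrival_us
    # Next suspend point after the read arrives (or the end of the program).
    next_boundary = min(program_us, (arrival_us // slice_us + 1) * slice_us)
    return next_boundary - arrival_us

# Hypothetical numbers: 2,000 us program, read arrives at 250 us.
without_suspend = read_wait_us(250, 2_000)
with_suspend = read_wait_us(250, 2_000, slice_us=100)
```

In this model the read waits 1,750 us without suspend but only 50 us with 100 us slices, so the worst-case read delay is bounded by the slice length rather than by the full program time.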
18:01 AS: So, having talked about the QLC aspects of this drive, let's talk about some other things that we can do with the architecture that we've got here. Well, one of the things is that computational storage is also starting to come to life, and that's just as exciting. Computational storage is the idea that we can actually offload the processes in the storage system or the application server that today is doing certain tasks on the system processor or controller processor, and pass that task down into the SSD device because it may be appropriate for that task to be more efficiently done directly or very close to the data that's involved. And, so, it's the idea of offloading a task from the controller or application processor into the drive itself. And you can see effectively that what we've built into our FlashCore Module, we've got our inline compression, that's a great example, really, of a computational aid that we've put inside the SSD because it's doing the compression that previously in other products -- right -- is done by the controller CPU or the application processor CPU; it's done in this case in the SSD itself.
19:16 AS: And things like hints that that drive can do like the FCM, that can also be a very appropriate architecture for helping tasks that we would offload into the compute capability within the drive. So, this really does have amazing potential for the future. And if we look at what we mean by computational storage and what I talked about there, it's really a fancy name for accelerators for applications, filters that are made by applications or particular AI workloads you may want to have done by the SSD itself to offload the main processor or any other assist we might want to put in there. And the real goal is always to reduce how much data goes through the processors above the SSD and get the SSD to do that work itself. And when we have a design, like our FlashCore Module, for example, where we've got FPGAs that are built onto the SSD, we can use cores in these FPGAs to effectively build those offloads quite efficiently and easily.
20:21 AS: And, so, is this just hype? Is anything real here? Yes, absolutely, it's real. SNIA has a working group that's been established to deal with this particular activity within SSDs now, and we have offerings that are already available from a number of vendors that are doing this type of computational storage. And so, there is a huge potential going forward here to build out a suite of different assists for AI workloads. I mentioned that anything that might want to do . . . filtering of huge data sets, absolutely, that's things that could be used with this type of technology going forward. And I gave you a real-life workload example of that that we've done with our FCM compression.
21:08 AS: And I'm going to talk now just a little bit about one other thing that's coming along within the industry, and obviously this is persistent memory. So, over on the right we've got the hierarchical pyramid of storage and, being from IBM, we always want to start that pyramid on the base layer with tape.
Tape is still a vibrant part of our storage needs, certainly driven by hyperscalers today who have vast amounts of data they need to keep for a long amount of time. And, obviously, above that we have our nearline HDDs. Above that we've got QLC flash. Being from IBM, that's all we have now because all of our flash is QLC-based, we're not using TLC anymore. And then we have our SCM SSDs, the storage class memory, the persistent memory, I'm going to talk about here but in SSD form factor. And then we have persistent memory itself in between SSD, the storage class memory SSDs and DRAM.
22:08 AS: So, why would this have a place in the pyramid? Well, DRAM is expensive, as everybody knows, but at the same time we've got this need, an ever-growing need for a larger amount of memory as databases get bigger, as the amount of data we've got to process gets larger, and we've got to do more caching and more metadata associated with it. All this means that we need to increase the memory capacity of our servers and our storage devices. And persistent memory is a way where we can do read caching for example, at a lot less dollars per gigabyte than we can with DRAM. So, it's a great technology to give near-DRAM performance in terms of caching, but a lower cost while doing that. So, it's a great technology that enables some of these larger data sets to be processed.
22:57 AS: And, obviously, the persistency of this memory gives us the Holy Grail of storage controller design, which is to get rid of the yucky batteries. And I think anybody who's listening to this, who works in the storage controller design industry knows, dealing with batteries is just nothing but headaches. You've got to worry about making sure that they charge regularly, you've got to worry about making sure you replace them at the right time, you've got to make sure you do the right thing when they fail, you've got to make sure that you do the right thing with them when the power goes away from the storage system to save the data. There's just a whole set of headaches that go away if you can get rid of the batteries.
And the persistent nature of this memory, combined with its lower cost than DRAM and its high performance, means it's got a real place in storage controller architecture going forward. And it has a place in the tiering with the flash devices, be it QLC or TLC, because you can use caching with persistent memory as a read/write cache above your flash devices, or you can have that persistent memory in an NVMe-attached SSD itself in a tiered environment with QLC or TLC flash.
24:14 AS: And, so, if we look at the way that would work, persistent memory either in the controller device or in the server, or in the NVMe storage device in the server or storage controller, it can be used for caching metadata and checkpointing in a persistent way -- it can be used for really large read caches. But it can also be used in its form factor with NVMe, with smart tiering. And the SSDs in that environment would act as a separate storage tier working with QLC and TLC with the higher performance being automatically migrated, or the metadata being migrated to the storage class memory device. So, you can either use it as cache, you can use it as smart tiering, you can use it in both cases to increase the performance of your in-memory databases or your storage workloads for these databases with the NVMe-attached device. And, in the future going forward, we're going to have the CXL bus that will allow this technology to be connected to the processors in a very, very low-latency way that's going to lead to even more architectural possibilities with composable storage.
25:33 AS: So, that was a very quick tour through the storage flash industry, and some very exciting things in QLC, computational storage and persistent memory, so thanks for listening.