Guest Post

Using Software to Improve the Performance and Endurance of High-Capacity SSDs

NAND flash technologies with more levels per cell deliver larger SSDs at much lower per-bit cost, but endurance and performance can suffer. Here, you will find insights on how you can use machine learning and AI to enhance the endurance of QLC.

00:01 Andy Mills: Hi. My name is Andy Mills. I'm the CEO and co-founder of Enmotus, and I'm going to be talking to you today about using AI to improve the performance and endurance of high-capacity SSDs, in particular, QLC and PLC, or penta-level cell. A couple of things here we're going to be touching on during this session. Machine learning software, we're going to be showing how that's been able to dramatically improve performance and endurance of the QLC device. And, obviously, looking forward towards PLC, the same kind of benefits. Also, a little bit about how you can use QLC a little differently than how people are using it today. Same basic NAND devices that are available in high volume today, but making far more use of the pseudo-SLC mode that's available in QLC devices and PLC going forward.

00:58 AM: And then the last thing we're going to touch on is how the host can be used to control these different segments of QLC programmed as SLC versus QLC, and how that can help benefit applications and some of the benefits you get out of exposing direct host access to some of these elements. Very quickly, the solid-state disk market has been growing very rapidly. Here's some numbers. As you look around the web and internet, you'll find a lot of numbers here, so I won't dwell on them. But you've got 47 billion growing to 87 billion, 350 million SSDs, and a very rapid shift towards NVMe to take advantage, of course, of the high bandwidths available with SSDs.

01:44 AM: And then SSDs themselves are getting much more cost-effective now and larger. It's a trend that we are all very happy to see, and PC OEMs have been replacing steadily all their hard drive-based SKUs in notebooks and desktops. Notebooks, of course, first, really led the charge to going pure SSD, but now you're seeing a lot more of that take place in desktops and, of course, servers and data centers. Lots of money being spent on 3D layer NAND and, of course, QLC and penta-level cell. PLC offers very significant capacity and cost gains. But there is a problem, and we want to touch on what those problems are here, and a lot of good solutions to those problems, but there's certainly things to be aware of.

02:29 AM: If we look at the way that NAND has evolved over the years, from single-level cell through to the proposed penta-level cell, there are a couple of different trends that are worth pointing out here. Of course, the obvious one is new generations are lower cost per bit. That's the whole benefit of having a 3D NAND technology: to lower the cost and increase the density, and that drives up your capacity. The two things that have not been going so well are generally the write performance, which is masked a lot today with SLC caching, but the actual raw media QLC performance, for example, is significantly worse. It could be like a hard drive in some cases, so it's been shown to be as low as 60, we measure as low as 60 megabytes per second from something that was operating at 3 gigabytes per second. That's basically the difference between hitting a cache inside the SSD versus hitting the raw QLC media.

03:32 AM: And the other thing, which has been hitting us more and more is the reduction in endurance. The benefit of 3D NAND is offset, in many cases, against the fact that you cannot read and write, or write to the same cells. The P/E, program/erase, cycles is coming down dramatically. You just cannot write to a single cell as much as you could. The good news is we're getting more of those cells, so we can spread the load. The bad news is the individual cells are becoming weaker, as it were, in terms of endurance.

04:04 AM: A couple of things that they're really hitting is reliability and performance, unfortunately, with NAND in particular, has been worsening, and you really have to start looking outside NAND if we want to improve that. But that's OK. We can work with this, and the industry has done some pretty cool things so far, SSD vendors, in terms of handling that with SSD caching on board the SSD controller itself, which has solved many of the immediate issues. But there is, as always, a better way -- we want to touch on that here.

04:37 AM: And the one thing you'll find, people have asked us, "Why don't you just use SLC?" Well, SLC is really expensive, that's the problem. Several hundreds of dollars for 64 gigabytes of SLC is not uncommon now, because it's only really used in much smaller volume and in specific verticals where that single-level cell endurance is absolutely required. Well, the good news is you can have the best of both worlds here, so we'll talk about that. And we can do that with a thing we call machine learning SmartSSDs. And there's various definitions of SmartSSDs; we're going to be using it here in terms of the consumer market definition of a SmartSSD as opposed to more the enterprise version, but the concepts, of course, can be extended into both markets.

05:26 AM: Obviously we've seen the opportunity already, TLC at some point is going to hit its limits, QLC is going to come into the forefront here, driven by pure economics. The second piece, though, is the machine learning SmartSSD itself is the new category of higher-endurance SSD, using smarts instead of the raw media and caching. And we're able to show that here, and we'll show a little bit more detail on how that's accomplished. But you are able to really achieve TLC-level or higher, MLC-level, endurance using QLC if you have the right algorithms in there. Those algorithms, however, have to adapt to the OS, the operating system, or the user workloads.

06:04 AM: Lastly, the SmartSSD benefits themselves are really down to one thing, and that is how can you avoid writing to the QLC? And we call that write attenuation, as opposed to write amplification, which we've seen in many cases. Any good algorithm really has to avoid amplifying the number of writes to the QLC, or even the SLC for that matter. And we're going to show here how we have demonstrated SmartSSDs can pretty much almost eliminate writes under very specific conditions to QLC, or at least dramatically reduce them.

06:40 AM: The other thing you get out of a SmartSSD here, we'll see, is by inherently splitting it into two different media types, and we'll show you how that's done in a second, using just QLC NAND. You've got workloads that are now able to be assigned by the zone on the device itself. You really want a virtualization layer to handle that because you don't want the user to get involved with that, but that's, again, a SmartSSD has that ability to zone or know where those zones are and take advantage of them. And the whole point of this is to allow QLC and future low-endurance NANDs, such as PLC, to be adopted in higher volume applications where you want to make the write an "I don't care."

07:22 AM: Little diversion into what pseudo-SLC is, for those of you interested. Basically, QLC is a quad-level cell and you have 16 voltage levels, each assignable to a binary value and that's how you get the density. A single cell with 16 different levels is capable of representing up to 16 different unique values, which associate or go out to a four-bit word. Whereas when you're in pseudo-SLC mode, you're going to be using the full swing, or at least a much greater variance, to be able to represent a single 1 and a single 0. It might seem wasteful at first, but there's several benefits to doing this, as long as you only use a small part of it for that.
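The relationship between voltage levels and bits per cell described here can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are made up for this example, and the capacity helper assumes a QLC block run as pseudo-SLC stores one bit per cell instead of four.

```python
import math

# Bits per cell for each NAND type named in the talk.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}


def voltage_levels(bits: int) -> int:
    """Number of distinct voltage levels a cell needs to store `bits` bits."""
    return 2 ** bits


def bits_per_cell(levels: int) -> int:
    """Number of bits a cell stores given its distinguishable voltage levels."""
    bits = math.log2(levels)
    if not bits.is_integer():
        raise ValueError("levels must be a power of two")
    return int(bits)


def pslc_capacity_gb(qlc_capacity_gb: float) -> float:
    """Capacity left when QLC cells are run as pseudo-SLC (1 bit instead of 4)."""
    return qlc_capacity_gb * CELL_TYPES["SLC"] / CELL_TYPES["QLC"]


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s) per cell, {voltage_levels(bits)} voltage levels")
```

The capacity helper shows why pseudo-SLC looks wasteful at first: cells that would hold 400 GB as QLC hold 100 GB as pseudo-SLC, which is why only part of the device is used that way.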

08:07 AM: SLC caching, for example, today makes use of this as like a write area where you can buffer all writes down to the QLC through SLC. And the reason for doing that is that you're going to be able to: a: write faster, and in some cases you can have a higher endurance region that's statically programmed, and we'll touch on that in a second, that allows you to write more to it and then slowly, lazily write off to the QLC area, which is going to take more time. It takes longer to write to a QLC. That's why you see those horrendous write cache drops with QLC going down to as low as tens of megabytes per second, when previously, when it's been going through the cache, it's been running at gigabytes per second. Very big difference in performance, first and foremost, between the two.

08:54 AM: Now, one thing that people may not be aware of is that there's two ways to program p-SLC, or pseudo-SLC. One is statically, with high endurance. You can't change it, it's a region that's fixed as SLC, and typically that might be 624 gigabytes or 10, 20, whatever you want to pick as an SSD architectural decision. And that can be programmed all the way up to 30,000 P/E cycles of endurance. The other is dynamically. And dynamically means that you can switch back and forth between SLC and QLC. And, typically, how that's utilized today is that you are able to start life out with a QLC device as an SLC, and then dynamically change the bits as you get past maybe x% fill; you can start then storing stuff in the QLC areas or modes of the device as you start to expand the capacity.

09:52 AM: There are two different modes you can take advantage of, and SmartSSDs have all three of these particular programming methods for the NAND itself. The key point is that statically programmed pseudo-SLC is capable of achieving up to 30,000 P/E cycles. But when you have it operate dynamically, it's typically only guaranteed to be the same endurance as a QLC cell because the vendor doesn't know it's been in QLC and has to only warrant it at that level, or probably several other reasons as well.

10:31 AM: Let's talk about the difference between how these are used. On the right-hand side, we have a standard QLC device, very typical today, where they might use the SLC as a buffer. You have a device that's an SSD controller with multiple NAND devices, they're all QLC in this case, and the two different color regions here really represent SLC and QLC. And these SLC regions can be programmed typically, for the yellow regions, as statically nailed up there. It's going to be a lot smaller because you don't want to use all your cells up this way, but it's very useful for buffering and caching. And, typically, while you're up to maybe 20% full . . . Sorry, for static, sorry, you're going to keep that fixed and that's there for write-back caching on the controller itself.

11:25 AM: Now, when you actually go to the blue region there, you're now operating in this hybrid mode where you're using dynamic SLC and QLC in its native form. And, typically, what happens there is you're going to write through the buffer through to the SLC portion of the device whilst it's operating that way in dynamic mode, and you might do that for the first 20% fill of the device. But as you start to fill up beyond that, you now have to start moving data into QLC portions in the housekeeping. What you generally see is, over time, a degradation in performance starts to occur naturally. You just can't avoid it because you're doing more housekeeping and you're starting to hit QLCs more. But you generally don't hit them directly, but you're filling up the write-back cache as it's having to manage more in the background and defer a lot of the operations.

12:23 AM: The key message then on the right-hand side is that that works, it works reasonably well, and that's indeed how most of the QLCs are built today. But you do pay a penalty once you start to copy large files or you're getting to the point where you're filling the device up. We have a test . . . We haven't been able to prepare in time for here, but we've just been testing in a lab where we've shown a game-load time using Final Fantasy XIV. And what we're able to show is that, as you fill up to 80%, it's almost like three times slower to load the game off a 85% plus full device than it is off an empty device. So, you get quite a significant slowdown, which is very noticeable. It's not a marginal thing, it's like a three times slower thing, and we have some data for that. Anyone interested, please reach out and we're happy to share that with you.

13:14 AM: On the left-hand side, now you've got a different method of allocation and that is we're using the SLC statically for a much bigger portion of the device. We're going to be sacrificing more of those QLC cells, so there is a tradeoff, you're going to take more for the SLC portion. But the difference is, you're going to be able to make those two regions independently accessible by the host. Let's talk about that a little bit more. Essentially what you've got is two . . . Think of these as two regions, even though they're all butted together and they're in the same physical SSD controller domain, the machine intelligence software is going to know exactly where that split is. We've been able to show, using this technique so far, that you can get almost a 5X improvement in endurance over TLC this way.

14:03 AM: The reason you can make that claim at all is that you do it by avoiding writing to the SLC-cached QLC portion altogether. You push all the data as much as you can onto that SLC and, of course, you're getting a ratio of going up from 1,600 P/E cycles to as much as 30,000 or even higher, depending on the grade of flash you're using. That's where that benefit comes from. It's down to how good is your algorithm in pushing data up to that higher tier, but it can certainly result in quite a nice little jump. Obviously, higher over a standard QLC, 25 times, but almost as close, if not higher, than a TLC device can be accomplished because you're using SLC.
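As a rough illustration of where that benefit comes from, the sketch below works through the two quantities mentioned: the ratio between the quoted P/E cycle figures, and how much QLC write traffic shrinks when the algorithm keeps a given fraction of host writes on the SLC. The function names and the simple `1 / (1 - fraction)` model are assumptions made for this example, not something taken from the presentation.

```python
BASELINE_PE_CYCLES = 1_600   # lower P/E figure quoted in the talk
PSLC_PE_CYCLES = 30_000      # static pseudo-SLC figure quoted in the talk


def pe_cycle_ratio(high: int = PSLC_PE_CYCLES, low: int = BASELINE_PE_CYCLES) -> float:
    """How many times more program/erase cycles the high-endurance region tolerates."""
    return high / low


def qlc_write_attenuation(slc_write_fraction: float) -> float:
    """Factor by which QLC writes shrink if this fraction of host writes stays on SLC.

    Simplified model (an assumption): every write not absorbed by SLC reaches QLC once.
    """
    if not 0.0 <= slc_write_fraction <= 1.0:
        raise ValueError("slc_write_fraction must be between 0 and 1")
    if slc_write_fraction == 1.0:
        return float("inf")  # no writes reach QLC at all
    return 1.0 / (1.0 - slc_write_fraction)


print(pe_cycle_ratio())  # 18.75
print(round(qlc_write_attenuation(0.8), 2))
```

With the two quoted figures the per-cell ratio comes out at 18.75; a higher grade of flash would raise it. The second function shows why the quality of the algorithm matters: the closer the SLC write fraction gets to 1, the fewer writes the QLC portion sees.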

14:46 AM: The first concept is you divide into the two pools, we talked about that. The second concept is that you add this machine intelligence software layer that has direct access to the two different kinds of flash, and it knows the differences between the two. And then by the time you put your user files on top of that device, through usually a virtualization layer, you're now able to put together a dynamic system rather than a statically defined system that's able to manipulate and move data around. So, any heavy traffic, especially write traffic, can now be prioritized onto SLC, not buffered, stored on the SLC portion, and then the light traffic can be diverted off to the QLC, and you can rebalance that over time as you're using it. Essentially, you're smart provisioning the SLC on the fly.

15:34 AM: And that's a little bit more evident as you go through this kind of exercise. There's a little bit of data and multiple things going on on this slide here, but the top section shows you your boot drive, where your system C drive and recovery is, for example, for a standard C Windows 10 boot drive here. You got your operating system, you got your application and its data, application two and its data, going through a fast remapping layer. The first thing to notice about what your AI engine or your machine learning is doing is it's watching the I/Os and where those I/Os are, statistically, and mapping them onto the SLC. And then the other section is going, "Hey, you've not used a whole lot of the OS," for example, and it's going to leave that on the QLC. So, you've now already started to separate out the heavy writes, heavy reads and put them in two different areas of that disk.

16:28 AM: Now, because the SmartSSD driver knows where those two regions are, it can successfully deploy or store those components quite safely on those two regions and know statically that they're not going to move or change from that without it getting involved in that movement. Extend the concept out and you can see, as data is warming up, the data for application one might be hot, but the application itself is not using a whole lot of I/Os, so it's going to stay on the QLC. And its data portion is going to end up on the SLC, and so on and so forth.
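The placement idea described over the last few paragraphs can be sketched as a simple greedy policy: rank regions by observed write activity per gigabyte and keep the hottest ones on the SLC until it is full. This is a minimal sketch under stated assumptions, not the FuzeDrive algorithm; the `Extent` type, the `writes_per_day` metric and the sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    """A region of the virtual disk with its observed activity (hypothetical metric)."""
    name: str
    size_gb: float
    writes_per_day: float


def place_extents(extents: list[Extent], slc_capacity_gb: float) -> dict[str, str]:
    """Greedy placement: hottest extents (writes per GB) go to SLC until it is full."""
    placement: dict[str, str] = {}
    free_gb = slc_capacity_gb
    by_heat = sorted(extents, key=lambda e: e.writes_per_day / e.size_gb, reverse=True)
    for extent in by_heat:
        if extent.size_gb <= free_gb:
            placement[extent.name] = "SLC"
            free_gb -= extent.size_gb
        else:
            placement[extent.name] = "QLC"
    return placement


# Illustrative numbers only, loosely following the slide's OS / application / data split.
extents = [
    Extent("os", size_gb=40, writes_per_day=5),
    Extent("app1", size_gb=20, writes_per_day=2),
    Extent("app1_data", size_gb=10, writes_per_day=500),
    Extent("app2_data", size_gb=30, writes_per_day=300),
]
print(place_extents(extents, slc_capacity_gb=50))
```

With these made-up numbers the two data regions land on the SLC and the rarely written OS and application binaries stay on the QLC. Re-running the placement as the activity counters change is one simple way to model the rebalancing over time that the talk describes.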

17:08 AM: Basically, you're ending up with something where the whole disk is being analyzed, the active data is migrated in the background as opposed to the foreground. Remember cache is a foreground task. You're doing this more as a background task, watching where things need to live, adjusting it after the fact, if needed, and in some cases adjusting it on the fly as you get more advanced AI technology out there with this kind of solution. And, again, it comes down to one basic goal you're trying to achieve, which is to attenuate the writes to the QLC portion of the device, or write avoidance, as I like to call it.

17:48 AM: Let's look at the performance results you get out of this. And we threw in for kicks something that's going to look better, but the only reason we did so was to show that a Gen 4 device -- very expensive Gen 4 device based on TLC for PCIe -- is actually not that far ahead in real-world applications. Again, there's a couple of different messages here. First and foremost, the FuzeDrive -- which is Enmotus' AI-based SSD that we have now produced and just started to ship -- if you look at the third-party independent tests that have been done on here, if you focus on the real-world benchmarks, you get a much better idea of how AI is going to work because raw benchmarks tend to just thrash on one piece with a fixed data workload, whereas the reason that doesn't work well in real life is because you really get a mix of things going on in real life. This is why we like the PCMark 10 quick storage test as well because it's going to be moving stuff, moving the data patterns around a lot more randomly as an end user would be.

18:57 AM: If you look at it from that standpoint, a Generation 3, Gen 3 device here, which is what our solution is for this first solution . . . And we are working on a Gen 4 to be absolutely clear. But even with Gen 3, Gen 4 gives you a little bit of a kick up here if you've got TLC and you've got maybe a higher end performing controller. But even when you use the same basic controller, these are both based on the same controller, the Enmotus, and the third one along here, for example, you see there's not much of a delta. No. 1, you're not going to get better performance out of this thing out of the gate. It's going to behave pretty much like a regular Gen 3 QLC device will behave. It does perform better in some cases. Again, that's really dependent on the controller itself, not so much the AI or the machine learning. You got a really nice . . . At least a nice confirmation that you're no worse then, certainly. And as we move towards Gen 4, you're going to be able to keep up and keep going in the general direction of keeping up with what the SSD would have done natively.

20:07 AM: Now, most of these tests are done with the devices around half-full because in TweakTown's review here, Jon is very good at making sure that you fill up the device because that's how most people are going to be using it, at least half-full. Then you're in that region, being in the dynamic SLC-cached QLC area of the device, rather than operating a nice, clean SLC-only region. That's the reason you see a lot of these are actually done with 50% full, but I encourage you to go look at that and just see the results of it. But that's what the thing looks like there.

20:47 AM: Where the benefits really come in, we have other data to show this even better now, but typically what you see is the FuzeDrive, for example, all this AI technology produces much more consistent delivery and performance when you get more workloads in play as you fill up. So, if you go again to 85% full, you're going to see pretty much the same performance as you saw when it was empty because what it's doing is taking the active components and putting them on the SLC, which is far more consistent. Whereas in a conventional TLC/QLC, you're going to see this drop off because it only has so many finite resources to play with. And the more you fill it up, the more the housekeeping is going to kick in. That's No. 1. You get much better consistent performance out of these devices simply because you're avoiding the QLC, that's the key concept, and you're using the smarts in the AI to do that.

21:45 AM: A little bit worried about endurance. We decided endurance was really complicated. First time we started to look at it, we shipped about 14 million different licenses out to different systems. And as we saw end users out there with our general SSD hard drive or SSD/SSD software, from which this is derived, we saw that users are pretty confused about what endurance really means. That's why most people say, "I don't care." We decided to introduce something based on the JEDEC standard -- there's nothing new here other than normalizing it to a single terabyte, or per gigabyte, so dividing by the capacity. And we found there was a very much simpler scale that you could adopt here, that's the gold, silver, bronze. It is based entirely on the TBW JESD210 spec, no changes there, but it's normalized to a standard capacity. And we found that was much easier to explain to people, the differences between the two at a glance. People could see it on a product, they could see in a brief that this SSD here is regular QLC, it's bronze-based, this is a machine intelligent SSD, it's gold-based, an MLC device is gold-based, for example. TLC kind of fits into the silver. So, it gives a nice little segmentation.

23:08 AM: And we're more than happy for people to take our icons, use them if they want to do their own. But we think it's important for the industry to get behind a common standard to make it easier for end users to understand because as SSDs get bigger, it's going to get very confusing as to why that QLC technology seems better the bigger it is when really nothing changed in terms of its endurance or reliability, you can just write more to it. Of course, you can because it's bigger, but you really want to give someone more of a normalized scale.
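The normalization described here, dividing the rated terabytes written by the capacity, is simple enough to sketch. The function name and the example ratings below are illustrative assumptions; the talk does not give the gold, silver and bronze cut-offs, so none are encoded.

```python
def normalized_tbw(rated_tbw_tb: float, capacity_tb: float) -> float:
    """Rated terabytes written per terabyte of capacity."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return rated_tbw_tb / capacity_tb


# Two hypothetical drives built from the same NAND: the larger one has a
# bigger raw TBW rating, but per terabyte of capacity they are identical.
small = normalized_tbw(rated_tbw_tb=200, capacity_tb=1)
large = normalized_tbw(rated_tbw_tb=400, capacity_tb=2)
print(small, large)  # 200.0 200.0
```

This is the confusion the normalized scale is meant to remove: the 2 TB drive's headline TBW is double, yet the endurance of the underlying technology is unchanged.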

23:42 AM: And talking about that scale, we took the JEDEC test in random . . . And please have a look at the endurance white paper we provide at the bottom. I apologize for the string there, but that's how it came out of my browser. But, essentially, you'll see, because of the SLC, what you now get is a managed SLC with much higher endurance, which is better than a cached SLC. Managed SLC with AI allows you to push data more predictably and more consistently to the SLC level. That's why you can see this additive stack-like approach to the solution here, where you say, "If you take a standard 100, 200 terabytes-written QLC," you are getting that same 100 on the QLC portion, or 200, but because you've added this bigger slab where you can put permanent storage or store data on there, you can pin it there, you can do all kinds of stuff with the SLC region, you're now going to get much better endurance out of that solution. So, you can see as high as 3.6 petabytes for a 2 terabyte device affected down to 1.6 or the 940 gig or the 900 gig that we have, we changed since this slide here a little bit, but you're seeing, you're getting solidly into that silver class of 800 or so terabytes written, using off-the-shelf QLC in that. That's the big benefit you get out of using machine smarts to manage your stuff instead.
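The additive, stack-like view of endurance can be approximated with a back-of-the-envelope model: the QLC portion keeps its own rating, and the static SLC region adds roughly its capacity multiplied by its P/E cycles. This is a sketch under loud assumptions: the 100 GB SLC size and the write amplification of 1.0 are hypothetical values chosen for illustration, not product figures, and real ratings come from JEDEC-style testing rather than this formula.

```python
def slc_region_tbw(slc_capacity_gb: float, pe_cycles: int,
                   write_amplification: float = 1.0) -> float:
    """Approximate terabytes the SLC region can absorb: capacity x cycles / WA."""
    if write_amplification <= 0:
        raise ValueError("write_amplification must be positive")
    return slc_capacity_gb * pe_cycles / write_amplification / 1000  # GB -> TB


def stacked_tbw(qlc_tbw_tb: float, slc_capacity_gb: float, pe_cycles: int,
                write_amplification: float = 1.0) -> float:
    """QLC rating plus the SLC region's contribution, stacked additively."""
    return qlc_tbw_tb + slc_region_tbw(slc_capacity_gb, pe_cycles, write_amplification)


# 200 TBW on the QLC portion (a figure from the talk) plus a hypothetical
# 100 GB static SLC region at 30,000 P/E cycles (cycle figure from the talk).
print(stacked_tbw(qlc_tbw_tb=200, slc_capacity_gb=100, pe_cycles=30_000))  # 3200.0
```

Even with these rough inputs the SLC slab dominates the total, which is consistent with the petabyte-scale figures quoted for the larger device.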

25:12 AM: Quick word on what we're providing here. It is a 900 gig on this slide, 900 gig, 1.6 terabyte. We developed this in partnership with Phison as the industry's first smart AI-powered SSD for consumer applications. And performance is pretty much SLC-class; capacity is obviously 60% more because we're basically able to now bring it down much closer to the price point of a 1 terabyte MLC-class device, for example, a higher end pro device, so you're getting more capacity and you're getting this gold-class endurance now. The product itself will soon be available on Amazon and other usual places, and through several system builders.

25:57 AM: Let me wrap up here by just saying, we've pretty much demonstrated for ourselves and now to a lot of the vendors outside that it is possible to use machine learning and AI to greatly enhance endurance of QLC and put it more into mainstream. It takes advantage of the p-SLC modes, of course. And the nice thing here is we're able to take off-the-shelf QLC NAND and controllers, and we just tweak the firmware, and we did that with the help of Phison, of course. And we're able to tweak it, so we present the SLC directly to the host driver rather than being hidden by the SSD controller itself, and it does allow for this new class of high-endurance smart consumer devices. OK?

Thank you. Appreciate you listening, and everyone stay safe. Thank you.
