Guest Post

Accelerating Flash for a Competitive Edge in the Cloud and Beyond

Explore what changes to NAND SSD data placement and management mean for real-life apps focusing on the most popular NoSQL databases which are largely used in the cloud, the different configurations, and more.


00:00 Speaker 1: Good morning, good afternoon, good evening, wherever you are and welcome to this session on "Accelerating Flash for a Competitive Edge in the Cloud and Beyond."

As you know, this presentation is part of a much larger track that is focusing on new storage technologies, and zoned namespaces -- ZNS for simplicity -- is a big part of it. So, this session is focusing on that part of the technology as well, but is a little bit different from the conventional ZNS topics because we are not going to dig into details on how ZNS works; there are other sessions that cover that. We have excellent speakers, and I really recommend you follow those presentations if you want to learn more, because what we want to do over here is more to step back and look at what a user . . . what type of benefits and advantages the user will see from this type of technology. Meaning, if I am an end user, I am not a storage expert. I'm running a real-life application to do real-life work because that's where my money goes. Why should I look at this? What value is there for me? And so we cover a little bit of that, but Micron is an SSD manufacturer, so we also focus specifically on the area where we can provide more contribution, and there is a new technology coming out, as you are aware, called QLC, quad-level cell, which gets some major advantages from ZNS, and that is what we'll cover here.

01:36 S1: So, looking from an agenda standpoint, we are not going through the ZNS technology, as I said, but we'll just mention some of the key tenets, just to understand what is good and what is bad, what is in the technology, what is needed to make QLC viable, and therefore why ZNS is going to help with that. After that high-level introduction, we'll really dig into the real meat of it, which is a look at applications. And for that, we picked two applications: one more modern workload, meaning more cloud-oriented, and another one more traditional, something that has been around for many more years and designed in a different way, and we try to see what the advantages are.

In the modern workload, we basically use Apache Cassandra -- this is a very well-known, popular NoSQL database used in a lot of cloud applications -- and we are running a standard benchmark, YCSB -- this is the Yahoo Cloud Serving Benchmark -- so a very typical cloud-oriented type of environment. So, we want to point out the value of ZNS and QLC in this space, and in the case of the traditional workload, we'll do something similar but using a more traditional setup with MongoDB and running the more traditional TPC-C, the Transaction Processing Performance Council type C benchmark. This has been around for what, 30 years probably, so it's very traditional. On top of that, we also added one piece of software that is a good companion to ZNS and is called HSE.

03:14 S1: HSE -- we'll see it a little later on -- is the Heterogeneous-Memory Storage Engine. It's open source software that Micron has contributed to the open source community, and because of its nature it is a very good fit for ZNS. And then we'll mention a little bit about future work. ZNS is new, QLC is new, everything is on the move. We'll point out some of the good and some of the bad, and most of all, some of the areas where some work is still needed to make further progress.

So, without much ado, let's move into this, and let's look at why we care. Why should the end user care about ZNS? Well, in the end, when you look at an SSD, there are two main aspects that people focus on: one is performance, meaning how fast I can run something, and the other is life, meaning how long my device will live, how long it will take before I have to replace it, and so on and so forth. In this presentation and in this work, we decided to focus more on life and not performance. The reason is that performance is a difficult topic. If we are really focusing on one specific use case, yes, absolutely, you can focus on performance and tuning, but when you look at a general technology, performance has contributions from everywhere, not only the SSD.

04:36 S1: It depends on the system it runs on, the number of processors, the type of processor, the RAM, the database, and so on and so forth. And if the storage is not 100% of the bottleneck, you only get diminishing returns. So, the idea is, performance is important, but we leave that to specific tuning because it is a very global type of tuning that impacts too many parameters. Life, instead, is what the SSD can really provide, and it is 100% consumed and 100% valuable for storage. And what we mean by that is that every SSD has an expected life, which is based on its own endurance, and depending on the workload, some devices may last longer or shorter, and this is becoming, as we'll see, a topic with QLC, and ZNS is really going to help over here.

Life in this space will be improved through this sort of media management, as we'll see in a little bit, and as an outcome it will give more flexibility in media selection for an end user. If I were an end user, I would try to get the cheapest device that meets my requirements, and QLC is definitely a contender, assuming that we can meet all the other parameters, and that is where ZNS will help a lot. So, let's look a little bit at what's happening and why this is a topic.

06:05 S1: Not all workloads are suitable for a conventional SSD with QLC. This is an interesting statement, and it is due to two different aspects. The first one is that new technology with higher bit densities suffers in endurance, meaning the life of the technology: the more bits you pack, the lower the margin, the shorter the life. This has always been true. SLC was longer lived than MLC, which has a longer life than TLC, and so on, but the interesting thing is that until now, SLC was replaced by MLC with lower margin, and TLC replaced MLC with even lower margin. QLC, in some cases, will have negative margin, meaning it cannot be an obvious replacement for every single workload; in some cases the margin is simply negative.

What does it mean, negative? It means it will live for less than the expected life of an SSD, and that is extremely problematic because of life management, because of cost, because it means that you have to replace it more often. So, this is really something that has to do with the technology, and all the main vendors are doing everything possible to improve it, but there is a limit in the physics there, so we can't go beyond some of the technology aspects.

07:18 S1: The other area, which is really interesting though, is how does an SSD manage data? Remember, SSDs today are still a derivative of fast hard drives; the software stack and the data location have not changed much in 30, 40 years, and what worked for a hard drive may not work for a normal conventional SSD today. One part of that has to do with the fact that the data come from the system in random order, and then the SSD has to write them sequentially. Somebody has to do that data management, and that data management today is done by the SSDs, and through that they do a lot of data movement and they create what is called write amplification.

07:58 S1: What is write amplification? It means that a certain amount of data comes into the SSD to be written, and a multiple of it is written to the NAND because you have to move the data around. You write data; the first time you write it, then you have to move it around to maintain the mapping, so you write it a second time, and then you move it again and write it a third time. The ratio between how much is written to the NAND and how much is written to the SSD is called write amplification, which is a key aspect over here because write amplification basically reduces the life of the SSD. What I mean by that is that if you have a NAND that is capable of, let's say, 1,000 writes and there is a write amplification of three, it means that one write comes in and three -- the one coming in plus two of management -- happen in the back end. So, the system actually only sees one-third of the writes. It's as if its life is reduced by a factor of three. This is a very big problem for QLC, and this is where new data layouts are needed to be able to avoid this incredibly large reduction in life.
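
As a rough illustration of the ratio just described (data written to the NAND divided by data written to the SSD by the host), the sketch below computes write amplification and the share of the NAND's rated writes that the host effectively gets to use. The function names and the 1,000-write rating are illustrative assumptions, not figures for any particular device.

```python
def write_amplification(nand_writes: float, host_writes: float) -> float:
    """Ratio of data written to the NAND to data the host wrote to the SSD."""
    if host_writes <= 0:
        raise ValueError("host_writes must be positive")
    return nand_writes / host_writes


def host_visible_writes(nand_rated_writes: float, waf: float) -> float:
    """Writes' worth of endurance left for the host once amplification is paid."""
    if waf < 1.0:
        raise ValueError("write amplification cannot be below 1.0")
    return nand_rated_writes / waf


# Hypothetical example: every host write ends up as three NAND writes.
waf = write_amplification(nand_writes=3.0, host_writes=1.0)
effective = host_visible_writes(nand_rated_writes=1000, waf=waf)
print(f"write amplification: {waf:.2f}")
print(f"host-visible writes out of 1,000 rated: {effective:.0f}")
```

With a ratio of 3, only about a third of the rated writes are available to the host, which is why lowering the ratio translates directly into longer device life.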

09:08 S1: Why does ZNS help with this? Well, ZNS helps because it transfers all this data management to the host, making it more efficient. So, the pattern coming to ZNS is really sequential, and this leverages the fact that the host will combine its own mapping with the physical device mapping, making a more efficient map, which results in reduced write amplification. That, in turn, means proportionately longer SSD endurance and life, better performance, even though it's a side effect, and most of all, it will enable these new technologies. So, this is why ZNS is key, because through write amplification reduction it can extend the life of a QLC SSD.
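
The talk leaves ZNS internals to other sessions, so the following is only a toy model of the sequential-write idea mentioned above: each zone accepts writes only at its current write pointer and is cleared as a whole. The class and method names are hypothetical and do not correspond to any NVMe or Linux API.

```python
class Zone:
    """Toy model of one zone: writes are only accepted at the write pointer."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity_blocks = capacity_blocks
        self.write_pointer = 0  # next block that may be written

    def write(self, start_block: int, block_count: int) -> None:
        if start_block != self.write_pointer:
            raise ValueError("write must start at the zone's write pointer")
        if self.write_pointer + block_count > self.capacity_blocks:
            raise ValueError("write would exceed the zone capacity")
        self.write_pointer += block_count

    def reset(self) -> None:
        """Discard the whole zone at once so it can be rewritten from the start."""
        self.write_pointer = 0


zone = Zone(capacity_blocks=8)
zone.write(0, 4)  # sequential write: accepted
zone.write(4, 2)  # continues at the write pointer: accepted
try:
    zone.write(1, 1)  # in-place overwrite: rejected
    in_place_rejected = False
except ValueError:
    in_place_rejected = True
pointer_before_reset = zone.write_pointer
zone.reset()
```

Because the device never has to relocate data behind an in-place overwrite, the bookkeeping of where live data sits falls to the host, which is the trade the speaker describes.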

09:56 S1: So, what did we do? We built a prototype, and the prototype that we built uses an existing SSD that Micron has been shipping for a while. This is called the 2300; interestingly enough, we picked a client-grade SSD, and somebody may raise an eyebrow at that, but there is a reason there; actually, there are two reasons. One, we are developing and working with new technologies; we started this when ZNS was not yet a standard, working with the NVMe consortium, working with partners, and we wanted to have a very stable platform. We wanted something that is stable as a conventional SSD, where we can just change the ZNS part and make sure it works, so that would be a stable SSD.

10:39 S1: The other reason is that we want to make sure that everything is the same, including the overprovisioning. Overprovisioning is a big contributor to write amplification, meaning the higher the OP, the lower the write amplification. And what ZNS has is pretty much zero overprovisioning, and in enterprise, zero overprovisioning doesn't exist. A client-grade SSD can do that, so we picked it so that we can really compare apples with apples, with zero OP and the same NAND, and see what happens.

11:14 S1: So, we built this prototype. It's built on TLC, but it will work exactly the same on QLC -- it's completely independent. We picked TLC for stability. It's 2 terabytes, it's about 5,000 zones of 400 megabytes each, and this is just to give some idea of the number of active zones and open zones for people that really care about the details of ZNS and how it works. It's not really relevant in the end, except for the performance aspect. And what we really want to do is run a comparison of the two setups -- conventional and ZNS -- for all the benchmarks.

11:48 S1: So . . . let's get into the benchmarks that we did. The first one is, as we said, Apache Cassandra. This is a layout of the stack. Basically, we use Cassandra. We picked Cassandra completely unmodified; there is not one line of code that has been changed, this is absolutely standard, and under that we use the Btrfs file system, quite popular, where we had to make changes to support zones because Btrfs is designed to support conventional namespaces and does not have the concept of zones.

12:27 S1: So, we modified it to be able to . . . to support zoned namespaces, to run the garbage collection and to do this back-end alteration. Interestingly enough, the only thing that we did here to associate files with zones is simply to pick the files that Cassandra writes on the same directory timestamp. So, we are simply saying all the files that you create within a specific directory go to a zone or a set of zones, and that's it. Nothing else. So, it's still open to a lot of improvement, but we want to show how much improvement comes just from this simple change.
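
The placement rule described here (files created in the same directory share a zone or a set of zones) can be sketched roughly as follows. This is not the modified Btrfs code, which the talk does not show; the class, its methods, and the example paths are all hypothetical, and only the 400 MB zone size and the zone count are taken from the talk.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List

ZONE_SIZE_MB = 400  # zone size quoted for the prototype


class DirectoryZoneAllocator:
    """Hypothetical allocator: files from one directory share a set of zones."""

    def __init__(self, zone_count: int) -> None:
        self.free_zones: List[int] = list(range(zone_count))
        self.zones_by_dir: Dict[str, List[int]] = defaultdict(list)
        self.fill_mb: Dict[int, int] = {}  # zone id -> MB already written

    def write_file(self, path: str, size_mb: int) -> List[int]:
        """Append a file sequentially to its directory's zones; return zones used."""
        zones = self.zones_by_dir[str(PurePosixPath(path).parent)]
        used: List[int] = []
        remaining = size_mb
        while remaining > 0:
            # Open a fresh zone when the directory has none or its last one is full.
            if not zones or self.fill_mb[zones[-1]] == ZONE_SIZE_MB:
                if not self.free_zones:
                    raise RuntimeError("no free zones; garbage collection needed")
                new_zone = self.free_zones.pop(0)
                zones.append(new_zone)
                self.fill_mb[new_zone] = 0
            zone = zones[-1]
            chunk = min(remaining, ZONE_SIZE_MB - self.fill_mb[zone])
            self.fill_mb[zone] += chunk
            remaining -= chunk
            if zone not in used:
                used.append(zone)
        return used


allocator = DirectoryZoneAllocator(zone_count=5000)
first = allocator.write_file("/data/keyspace/table_a/file1.db", 300)
other = allocator.write_file("/data/keyspace/table_b/file1.db", 100)
second = allocator.write_file("/data/keyspace/table_a/file2.db", 200)
print(first, other, second)
```

In this sketch the second file for `table_a` fills the remainder of that directory's zone before spilling into a new one, while `table_b` never shares a zone with it.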

Then there is another layer which deals with the zones themselves; we call it DM-ZNS. This is a variant of an existing device mapper, which is called dm-zoned, but when we started, that was not really available yet, so we made all the changes to make it work with ZNS and our prototype. So, this will be distributed through open source and will simply allow ZNS to run as a device mapper, like many others in Linux open source. Finally, we use our prototype for the data. One thing that is open for debate is the use of a conventional namespace for metadata.

13:42 S1: There is a big discussion ongoing in the industry about whether it is needed or not. We decided to use it because it makes implementation simpler on one side, but most of all, the number of changes for the end user is limited, so it limits the risk for the end user. It can be improved, but that's probably the safest option as of now. So, once we have that, let's get into the benchmark.

The first thing that we did is what Cassandra calls "write stress." This is not real life. This is just to see how things work, just to get familiar with that. And, basically, it writes about 300 GB. We have two different setups over here. The one labeled 2300 is a conventional Micron 2300 SSD. The one labeled ZNS is, of course, with ZNS. And there are two charts over here that show exactly the same type of profile. I tried to keep the scale consistent so you can appreciate the difference in size. And basically, they show, in green, how many writes go to the SSD; in blue, what gets written to the NAND. And you can see immediately in the conventional SSD how, in some places, like this over here, there are a lot more writes to the NAND than to the SSD, and this is because of the garbage collection . . .

15:00 S1: So, it is periodic, every so often. What happened here is basically that we wrote 300 GB to the SSD, but the NAND wrote 940. The write amplification, as we said, the ratio between what is written to the NAND and what is written to the SSD, is about 3, 3.09. ZNS? Well, you see that the two lines are pretty much the same. The green line and the blue line almost perfectly overlap and, actually, we wrote 294 GB in this case to the SSD and 334 went to the NAND.

Now keep in mind that this also includes all the mapping and garbage collection in the host. What results on the NAND is a 1.13 write amplification. So, the ratio between the two of them is about 2.7. What it means in practical use is that a device running ZNS, running this exact workload, will live 2.7 times longer than the one running as a conventional SSD. So, for example, say you have a QLC device that in this condition may live, let's say, two years instead of the five that you expect. Well, the moment that you switch to ZNS, all of a sudden, without changing anything else, just moving to ZNS, life gets extended by a factor of 2.7, so it will live five or six years.

16:23 S1: So, it can make a big difference, between failing to meet a requirement and absolutely meeting and sometimes exceeding the requirement. So, this is just the general setup.
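
The arithmetic behind the write-stress comparison can be reproduced from the figures quoted in the talk. In the sketch below, the conventional value is the 3.09 the speaker states, the ZNS value is recomputed from the quoted 294 GB and 334 GB, and the two-year starting life is the speaker's own hypothetical; the variable names are made up for this example.

```python
conventional_waf = 3.09        # quoted for the conventional 2300
zns_host_gb = 294              # written to the ZNS prototype by the host
zns_nand_gb = 334              # written to the NAND behind it

zns_waf = zns_nand_gb / zns_host_gb        # NAND writes per host write
life_gain = conventional_waf / zns_waf     # lower amplification -> longer life

conventional_life_years = 2.0              # speaker's hypothetical QLC case
zns_life_years = conventional_life_years * life_gain

print(f"ZNS write amplification: {zns_waf:.3f}")
print(f"life gain: {life_gain:.2f}x")
print(f"hypothetical QLC life with ZNS: {zns_life_years:.1f} years")
```

The result lands at roughly 2.7 times the life, which is how a two-year device ends up in the five-to-six-year range mentioned above.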

Let's look at a real-life type of implementation. This is YCSB. This is a third-party benchmark; we have no control over it. It's the Yahoo Cloud Serving Benchmark. And we see some of the same in terms of the difference between green and blue over here, but the distribution is more like real life, so you have a lot more traffic, there are probably a lot more overwrites, which cause a lot more write amplification. But if you look at the numbers, at the end of the day, things are not much different. Yes, in this case, 2.68 is a bit better. ZNS is pretty much independent of what happens in the workload. And you can see in the green and blue lines, big difference here, not much over here, and again, an improvement in life of about 2.4. So, with real-life YCSB running on Cassandra, ZNS extends the life of the potential SSD about 2.5 times, 2.4 times exactly. Which again, all of a sudden, may bring the life of this device into the range of viable applications.

17:47 S1: One thing that we found a bit disturbing is this 1.13. Good as it is, it is actually not as good as the 1.0 we would expect. So, what we did is that we looked at, "Hey. How are the zones mapped over here? How does Cassandra use these zones?"

And this is a zone map. The way this works is pretty simple. On the x axis, you have the zone number, so there are like 5,000 zones over here. On the y axis, this blue line represents how much each zone is used. So, a line that goes to the top means 100% use. A line which ends somewhere in between, like over here, means that these zones are probably used like 60%. The white is empty space. And the lines that are completely white are erased zones -- they are not used and are available to be used. So, even just with this very simple implementation, we see the . . . the utilization is 95.2%. So, it's extremely high by itself, but it should be better. We could improve this. So, 100% will probably be the target for this, or close to 100%, which would result in a write amplification of 1.0. So, we still have work to do over here to improve this . . . The basic utilization as is, is already 95%, a huge, huge advantage, but it can still be improved.
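
The talk does not say exactly how the 95.2% figure was derived from the zone map. One plausible reading, shown here purely as an assumption, is the amount of data held in the zones that are in use divided by the total capacity of those zones, with erased zones left out. The function name and the sample fill levels are invented for illustration.

```python
from typing import Sequence


def zone_utilization(used_mb_per_zone: Sequence[float],
                     zone_size_mb: float = 400.0) -> float:
    """Fraction of in-use zone capacity that holds data (erased zones ignored)."""
    in_use = [mb for mb in used_mb_per_zone if mb > 0]
    if not in_use:
        return 0.0
    return sum(in_use) / (len(in_use) * zone_size_mb)


# Made-up map: three full zones, one zone 60% used, one erased zone.
sample_zone_map = [400, 400, 240, 0, 400]
utilization = zone_utilization(sample_zone_map)
print(f"utilization: {utilization:.1%}")
```

Under this reading, partly filled zones are what keep the figure below 100%, so packing files into zones more tightly is the lever for improvement.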

19:15 S1: So, the first thing that comes out from this simple test is that, everything being equal -- same hardware, same NAND, same overprovisioning -- ZNS reduces write amplification by a factor of about 2.5. And this includes all the garbage collection and management operations. This is important because, again, this dispels the myth that moving the garbage collection to the host is just moving the problem somewhere else. Now, you move the map somewhere else and it becomes much more effective and therefore can be better used.

19:46 S1: So, basically, we can have a device that is 2.5 times more resilient, that may last 2.5 times longer, which means that we move from a possibly negative margin for QLC into ample margin and potentially longer life than people may expect. And this is on a new workload.

Now, one of the things that we were thinking is that, yes, but new workloads are already designed with new concepts in mind. So, they're already designed for this type of solution. How about the more traditional workloads? So, let's look at MongoDB.

20:24 S1: This is the other one, with HSE and TPC. Same type of slides; we'll do a slightly different flow. The first one was more introductory. This one we are talking about . . . We want to look at TPC on MongoDB and HSE. So, let's say a word about what HSE is. HSE is the Heterogeneous-Memory Storage Engine. This is open source code that Micron has released to the open source community -- it is available for download for the Linux community. And what it does is basically create a sort of tiered storage, meaning, given that the problem is having very large capacity devices which have low endurance, how about having one small, high-performance storage device for hot data and separating that traffic from the large-capacity cold data?

21:16 S1: That is what HSE does. HSE introduces this type of tiering, so it is independent from ZNS. You can run it on every type of device, and actually we will compare with a real-life device and see how this allows both separating the traffic workload and improving the life of the capacity tier, and adding zones to that will further extend the life over there.

21:43 S1: The setup is the same. We're still using the same 2300 and the same ZNS prototype. For the tiering, we added one more device when it's needed, which is called the 9100; this is a shipping device from Micron that is an enterprise high-performance type of SSD for hot data. The comparison again is exactly the same, with everything in hardware and overprovisioning being the same.

22:11 S1: So, let's look at what happened. This is a bit of a noisy chart with a lot of traffic. The reason that we want to show you this is the separation of traffic, just to have some background over here. The cold colors, green and blue, are the cold tier. The warm colors, orange and red, are for the hot tier. More specifically, blue is cold write, this one here. Green is cold read. Orange is hot write and red is hot read.

22:48 S1: In the first case, see this label here, we run only on the 2300. Here, there is no ZNS, no HSE. Then we run the same as above, only with HSE, so it is still the 2300, but with HSE, which uses the 9100 as the hot tier. And then we do exactly the same on the left, making a progression here, and replace the capacity tier with the ZNS device. The real message of this slide is what happens to the blue color. Why the blue color? Because blue, as we said, is the cold tier write. So, if we just use this workload, TPC-C on MongoDB, see, basically, in this case there is no hot and cold, so it's one color only.

23:32 S1: This red line is pretty active, at around, what is it, 100 megabytes per second. So, you can see there is traffic here. Now, with HSE by itself, see what happens to the blue line: it is reduced to almost zero . . . not zero. There is something over here, a little bit here. But sometimes there are huge peaks there. These huge peaks are because, being a tier, you have to swap data in and swap data out. And, so, you see what the impact is here. When you go to ZNS, well, if you compare to the left it is sort of the same, but there are no more peaks. All the peaks have been cut down, and cutting all these write peaks means much, much lower writes. And, therefore, much longer endurance.

24:12 S1: So, at a glance here, I can show you why the 2300 may be limited because of this workload, and why HSE helps as a first step, with ZNS making this really, really way better. So, let's look at the write amplification. Here we did it in a slightly different way than in the previous experiment with Cassandra, because write amplification is not a single number; it's actually a combination of numbers, and they change over time. So, we'll look a little bit more into how write amplification works over here.

24:43 S1: First of all, there are two different types of write amplification: what the device, meaning the SSD, creates -- this slide -- and what the host creates, which will be in the next slide. So, if you look at just the pure SSD, just the 2300, you see the write amplification moves this way. The average is about 2.2, 2.3, sorry, of write amplification. If we add HSE, the same device has a completely different type of behavior. It's just another application that sits in between. And it helps reduce it, though; it's about 1.9, meaning that the hot tier is filtering out a number of writes, but they are still scattered around. So, there is still a pretty high write amplification.

25:28 S1: When you go to ZNS, there is write amplification. You may not see it, but it is here, way down at the bottom, an absolutely flat line, as low as possible, 1.1. It should be 1.0, you know, same as in the previous discussion for Cassandra, and it can be further reduced, but you can see at a glance that the write amplification on the device gets a significant improvement, and therefore the life gets improved in a proportional way, and this is on the device. But given there is a host, and the garbage collection happens at the host now, what does it mean for the system itself?

26:06 S1: Well, let's look over here. This is the total write amplification, meaning what the application sees; this includes all the write amplification created by MongoDB, by HSE, by the SSD, everything. And it's interesting for two or three reasons. The first is, what is the number over here? You know, on a normal SSD you have a total write amplification of about 11, 11.1x. 11.1x means that for every kilobyte that you write, well, actually, 11.1 kilobytes are written to the NAND. Or, if you read it the other way around, it means that whatever amount of life you have on your SSD, less than one-tenth of it, one-eleventh of it, is actually available to the system, and that is what is killing a lot of QLC applications in real life. You know, you're cutting the life here by an entire order of magnitude.

27:00 S1: So, if you add HSE, things change a little bit but not so much, from 11 to 10. You know, it's a bit of a reduction. Why is there not a wider gap? Because here, as we said, there is a lot of data management that still has to be done by HSE. When you go to ZNS, there is still write amplification; there is mapping here, the entire map that was moved to HSE happens in HSE, and all of its own native write amplification is still there. So, it is 6.7. But still, if you compare, from 11.1 down to 6.7, well, that is a significant improvement in write amplification.
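
To put the five-hour figures side by side, the sketch below uses the device-level and total write amplification values quoted in the talk (with "from 11 to 10" taken as 10.0). It also derives a host-side factor under the assumption that the total is simply the host-side factor multiplied by the device factor; the talk does not state that relationship, and averages measured over time will not multiply exactly, so treat those derived numbers as indicative only.

```python
# Five-hour averages quoted in the talk; dictionary keys are labels for this example.
device_waf = {"2300": 2.3, "2300 + HSE": 1.9, "ZNS + HSE": 1.1}
total_waf = {"2300": 11.1, "2300 + HSE": 10.0, "ZNS + HSE": 6.7}

# Share of the NAND's endurance that actually reaches the application.
usable_life_fraction = {name: 1.0 / waf for name, waf in total_waf.items()}

# ASSUMPTION: total = host-side factor * device factor.
implied_host_factor = {
    name: total_waf[name] / device_waf[name] for name in total_waf
}

improvement = total_waf["2300"] / total_waf["ZNS + HSE"]

for name in total_waf:
    print(f"{name}: usable life {usable_life_fraction[name]:.1%}, "
          f"implied host-side factor {implied_host_factor[name]:.1f}")
print(f"total improvement, 2300 vs ZNS + HSE: {improvement:.2f}x")
```

Under that assumption the host-side factor rises with ZNS while the device factor drops, which fits the remark that the mapping work has moved into HSE while the overall total still falls.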

27:45 S1: We'll see the details a little bit later, but here the chart wants to show how much better this is. The other thing I want to bring up is that we ran this benchmark for about five hours, but if we look over here, the data keeps growing. So, five hours may not be enough, so we said, "Well, let's try to run it and see what happens if we run much longer."

So, if we look at a much longer run, this is what we get. We only use the native 2300 and the one with ZNS. We skipped the one in between because this benchmark takes like two or three days, you know, about 50 hours, two days, for each one over here.

28:24 S1: And one of the things that you notice is that yes, write amplification keeps growing until a little bit before half of that, probably for 20 hours, then starts settling down. Same thing for ZNS; after about 20 hours things settle down. So, these are probably more representative of the long term. An interesting thing is that ZNS is still six point something; it has not changed much. But this one here, the regular one, keeps growing and growing, up to 15 on average. So, really, if you want to run this type of application, ZNS, compared to the same device as a conventional SSD, and with HSE, creates a write amplification improvement that is about 2.3x. So, you know, 2.3 times longer life.
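
As a quick consistency check on the long-run figures, the sketch below back-solves the ZNS value from the two numbers that are stated precisely (about 15 for the conventional device and a 2.3x improvement). The exact ZNS figure is not given in the talk, so the result is only an implied value, and the variable names are invented for this example.

```python
conventional_long_run_waf = 15.0   # quoted long-run average for the regular 2300
quoted_improvement = 2.3           # quoted improvement with ZNS

# Implied long-run total write amplification for the ZNS setup.
implied_zns_waf = conventional_long_run_waf / quoted_improvement

print(f"implied ZNS write amplification: {implied_zns_waf:.1f}")
```

The implied value falls in the "six point something" range the speaker mentions, so the quoted figures are consistent with one another.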

29:09 S1: One other interesting thing here is, you know, you may want to say, "OK, what is HSE doing?" So, let's spend a little bit of time on that. Basically, HSE is separating, as we said, cold and hot data. So, we are again charting the 2300 over here, and then the same with 9100 plus 2300 and 9100 plus ZNS. The column on the left is how much of the media is used and the column on the right is how much data is really written there. So, when we run the benchmark for about five hours, just to give you an idea, the database uses a little bit more than 800 gigabytes and generally writes a little bit less than 5 terabytes to that.

29:49 S1: When you start HSE, what you expect is that you have a small amount of hot data, which is the red over here, which is smaller than the green cold data but absorbs most of the writes, so the number of writes is much bigger on the red than on the green, which is exactly what is happening over here. So, the hot tier is smaller and takes more writes, and just to give an idea, we have like 200 gigabytes of hot tier which takes, you know, 4,500, probably like 80% of the writes.

30:21 S1: Moving to ZNS, this is further reduced. We actually use an even smaller hot device, probably less than 150 gigabytes out of a terabyte, and this keeps absorbing the same amount of writes. This is why all the traffic gets reduced and why the life can really be extended to make this device really valuable.

30:43 S1: So, in summary, what this benchmark is doing is sort of confirming the same data that we have seen from the other benchmark. The life improvement was 2.5 with Cassandra; it is 2.3 over here, still about the same order of magnitude. Remember that the write amplification on the SSD is still 1.1, which means that the data mapping can still be improved, which means these numbers can still be improved. But, you know, basically, all of a sudden we are giving a 230%, 250% boost to the life of the SSD, and this will bring a lot of QLC to the deployment level for these types of devices. So, QLC really would get a very, very big benefit from this.

31:37 S1: So, this is just a summary, and it pretty much confirms what I just said for both the traditional and the modern workload, one addressing cloud, the other transactions. So, one designed with modern SSDs in mind, the other one more traditional, and all of them can benefit in sort of the same way from this type of solution.

32:05 S1: There are also other traditional file systems that will probably give much more advantage because of being traditional designs without any of this in mind. It is more work, but an even bigger advantage if we do that. So, there is a lot of further tuning we can do over here, either in the SSD or in the open source or in the standard. And, you know, I would invite everybody who has been listening to this presentation for so long, for these 30 minutes now, and who is interested, to contact me or contact your representative in the standards body or in the open source community, because there is a lot of potential cooperation we can further do over here to really make QLC as a technology, and ZNS as a data layout and software stack, provide a really modern stack that will allow the workloads of the 2020 decade to really benefit and extend their own capability with the new SSD technology.

33:11 S1: Thank you very much for listening to this whole presentation and, you know, please feel free to contact me with any question . . . any question you may have. Thanks a lot, have a good day.
