Making NVMe Drives Handle Everything from Archiving to QoS
Learn how the existing ZNS architecture for NVMe drives enables better host data placement and scheduling than traditional block I/O, and the tradeoffs involved. Javier Gonzalez also briefly touches on the Zone Random Write Area concept.
Download this presentation: Making NVMe Drives Handle Everything from Archiving to QoS
00:00 Javier Gonzalez: Hi and welcome. Thanks for joining us in this session. My name is Javier; for the next 30 minutes or so, we will be talking about the different tradeoffs that you can encounter when you want to use different types of ZNS devices and how these tradeoffs map to different use cases.
So, let me share my screen. The first question that we get when talking about ZNS is "Why do we even need a new interface in NVMe?" NVMe SSDs are already mainstream, they have great performance, they are easy to deploy, they scale very well through NVMe over Fabrics; however, we still have a number of challenges that we have not been able to tackle through block-based conventional NVMe solid-state drives.
The first one is reducing the cost gap with nearline hard drives. The second one is reducing the write amplification factor and the overprovisioning, which are so tightly coupled with the way we design block-based flash translation layers today, and the third one is how we provide an explicit mechanism to offer quality of service in multi-tenant environments.
01:31 JG: So, we will address these challenges. First of all, we will talk about the basic concepts of the Zoned Namespaces specification and its benefits. Then we will cover the three main use cases that we see for Zoned Namespaces, and then we will dive directly into the core of the talk, which is the tradeoffs when designing a host-based ZNS FTL. I will have a couple of backup slides which I will not cover here, but they will be available in the slide deck, so I hope they will help you if there's something that you don't understand. Please check those slides at the end of the talk.
02:16 JG: So, in terms of ZNS benefits, there are three main categories that we have identified. The first one is that ZNS allows us to reduce write amplification and overprovisioning. We can do this because, by using zones, we can directly map data structures and host objects to these zones and remove a number of redundancies in terms of mapping tables and in terms of data movement: we no longer need to move data in different layers, and we no longer need to reserve a portion of the NAND to maintain performance while that data movement happens.
03:00 JG: The second benefit is in terms of quality of service. We will see how there are different zone configurations that allow us to enforce strict QoS, and I hope that by the end of this talk you understand this and know which tradeoffs you need to make if you want to enable quality of service through ZNS.
The third main benefit is reducing the total cost of ownership, and it's kind of a consequence of many things. First, by reducing write amplification and overprovisioning, we are extending the lifetime of the SSD -- that's money. When we are enabling higher bit count cells, like QLC, that is money. And when we reduce the size of the mapping table -- and probably the number of mapping tables on the device, because we're working at a zone granularity instead of an LBA granularity -- we can reduce the DRAM on the SSD, and that is money.
04:09 JG: So, let's go directly into the basics of the ZNS NVMe command set. Now, there are many talks out there, also within the virtual FMS, that you probably want to watch before this one if you are not familiar with ZNS. I will give an overview, but I will not get into the details, so I'll refer you to those talks before we continue here. There are three things that you need to understand when you want to get a grasp on how ZNS works and how its operation maps to the benefits.
So, the first one is that the LBA address space is divided into zones; that's pretty obvious. These zones become our mapping unit, and this is important because that's how we're going to reset and how we're going to manage the SSD from the host.
The second concept is that of the write pointer. Each zone has a write pointer associated with it; whenever we write to the zone, we have to do it sequentially following the write pointer.
05:21 JG: Now, there are two ways we can write to zoned devices. The first one is through the normal write command. When we do that, we have the constraint that we have to do it at queue depth one. This is because NVMe does not guarantee ordering: even though we write sequentially to the device, even on the same NVMe queue, if we have several I/Os in flight there is no guaranteed ordering when they arrive at the device, so we might violate the write pointer. So, when we are using write, it's queue depth one.
06:00 JG: There's a different way of writing to a zoned device, and that is through the Append command, which is essentially a way of sending a nameless write to a particular zone: we do not specify the specific LBA within the zone, but rather let the device choose that LBA and return it to us in the completion path. This has its benefits and it also has its challenges, and we will cover that in a couple of slides, but for now, bear with me.
We have the concept of the write pointer, that's important to understand, and whether we use the write command or the Append command, we need to consider that the write pointer is there.
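The write-pointer rules described above can be sketched in a toy model: a plain write must land exactly on the write pointer (which is why plain writes are limited to queue depth one), while an append lets the "device" pick the LBA and report it back on completion. This is an illustrative simulation only, not a real NVMe/ZNS interface.

```python
# Toy model of a zone's write pointer (illustrative, not real NVMe/ZNS).
class Zone:
    def __init__(self, start_lba, capacity):
        self.start_lba = start_lba
        self.capacity = capacity
        self.write_pointer = start_lba  # next LBA that may be written

    def write(self, lba, nblocks):
        """Plain write: must target exactly the current write pointer."""
        if lba != self.write_pointer:
            raise ValueError("write pointer violation")
        if self.write_pointer + nblocks > self.start_lba + self.capacity:
            raise ValueError("zone full")
        self.write_pointer += nblocks
        return lba

    def append(self, nblocks):
        """Zone append: the device chooses the LBA and returns it on completion."""
        if self.write_pointer + nblocks > self.start_lba + self.capacity:
            raise ValueError("zone full")
        lba = self.write_pointer          # device-side placement decision
        self.write_pointer += nblocks
        return lba                        # host remaps its metadata to this LBA
```

With several plain writes in flight, any one of them may arrive when the write pointer has already moved past its target LBA, which is exactly the ordering problem the model's `ValueError` represents.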
The third thing to understand is that each zone has a zone state machine associated with it, and that state machine's transitions are primarily driven by the host.
06:53 JG: Now, I do not have time to get into the details of each state and each transition. I will focus on what for me is the most important transition of all: the one that puts a zone back into the empty state, which is when we reset the zone. This transition is what enables host garbage collection, which in turn allows us to reduce write amplification, reduce overprovisioning and virtually eliminate the data movement within the SSD that causes latencies and unpredictability in general. This deck is available on the NVM Express website -- it's publicly available, so please download it and take a look at it.
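The reset transition just described can be sketched with a heavily simplified state machine. The real ZNS specification has more states (implicitly/explicitly open, closed, read-only, offline); this sketch keeps only empty, open and full to show the host-driven reset back to empty.

```python
# Simplified sketch of a ZNS zone state machine (the real spec has more
# states and transitions); only the host-driven reset path is modeled.
EMPTY, OPEN, FULL = "empty", "open", "full"

class ZoneState:
    def __init__(self, capacity):
        self.capacity = capacity
        self.written = 0
        self.state = EMPTY

    def write_blocks(self, n):
        if self.state == FULL:
            raise ValueError("cannot write a full zone")
        self.written += n
        self.state = FULL if self.written >= self.capacity else OPEN

    def reset(self):
        # The key transition: back to empty. On the device this frees the
        # physical blocks; for the host it is what makes host GC possible.
        self.written = 0
        self.state = EMPTY
```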
07:45 JG: There are a couple of extensions that are very interesting for ZNS. One is Simple Copy, which is already ratified and publicly released; the other one is Zone Random Write Area, which we're still discussing in NVMe. If you are part of NVMe, I challenge you to go and be part of the discussion and try to make the Zone Random Write Area better. I will not be covering it in this talk due to time constraints -- I want to focus on other things -- but I do have one slide at the back that you can take a look at to get the concepts of the Zone Random Write Area and why it is of importance to the overall ZNS NVMe ecosystem.
09:16 JG: The second use case is what we're calling log I/O. As you're very well aware, we have a number of applications and file systems that are designed to be flash-friendly. This essentially means that they try to be append-only and use log-structured data structures as much as possible. These applications and file systems are the ones that are going to be able to adopt ZNS most easily and really shine, because they will be able to complement the benefits they already offer with even less write amplification, even less necessary overprovisioning, and better semantics for host-based garbage collection. This applies to general storage systems, so the ones using these log-structured paradigms -- file systems or databases -- will be able to use ZNS and take advantage of it very, very quickly. These can be large zones or smaller zones. Again, I will touch on that in a couple of slides.
10:29 JG: The third use case, which is probably the least obvious, is that of I/O predictability, and here we rely on a particular way of organizing zones on the device. If we choose to use smaller zones mapped to a particular set of physical resources, we can host several applications on the same SSD and still provide I/O predictability across these applications. So, in essence, we can explicitly address the noisy neighbor problem with zones, if zones are mapped in the right way. Again, I will touch on this when we cover the tradeoff on how we map zones on the device, but I wanted to give you an overview of these three main use cases, which can all be covered with zones. We just need a little bit more work.
11:28 JG: So, now, the core of the talk: if you want to adopt ZNS and write a ZNS host FTL, what is your design space? We're going to cover the options you have in terms of mapping zones, so that you can compare different ZNS devices and different ways of doing things; the write models you can use for different applications; and then the I/O submission model that you can choose.
12:01 JG: Now, this is going to be interesting because this is a never-ending tradeoff that you need to make, but we are going to speak to why this is especially interesting when we talk about ZNS. So, zone mapping, coming back to the use cases: archival maps very much to large zones; log I/O is in the middle between large zones and small zones; I/O predictability leans more toward small zones. When you go with large zones, these are zones that span either across the whole device or across a large number of dies on the ZNS device.
12:43 JG: This is very good for systems that are already deploying SMR hard drives, because the changes to the host ecosystem are minimal; you can adopt this very easily by making small changes at the driver level. If you use Linux, these are already available in mainline. Large zones also have the benefit that, since you have a single zone to take care of, you do not need to think about striping or about how you're going to map data to the zones on the device, at least for a single SSD.
13:27 JG: There are a couple of drawbacks when you use large zones. The first one is that you do not have visibility of the parallelism of the device -- and this might be OK, especially for archival. The second problem is that you will have to deal with very large zones, especially if you're using ZNS in combination with QLC. This might become a problem because, when zones can be several gigabytes and you do host garbage collection, you're going to introduce write amplification unless you have very large objects mapped there. This is kind of counterintuitive, and we do see people having trouble with it.
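A back-of-the-envelope calculation shows why large zones hurt here. When the host garbage-collects a zone that still holds a fraction `valid` of live data, that data must be rewritten elsewhere before the zone can be reset, so reclaiming the free space costs roughly 1 / (1 - valid) writes per user write. Multi-gigabyte zones tend to mix many small objects with different lifetimes, which keeps that valid fraction high at GC time. Illustrative arithmetic only, assuming uniform zones.

```python
# Rough host-GC write amplification when a zone holds `valid_fraction`
# of still-live data at reset time: the live data is rewritten, so
# WAF ~= 1 / (1 - valid_fraction). Illustrative model, not a measurement.
def gc_write_amplification(valid_fraction):
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)
```

A zone that is empty of live data at reset costs nothing extra (WAF 1.0); a zone that is still half full doubles the writes, and at 75% live data the host pays 4x.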
14:12 JG: The third drawback is that if you want redundancy on your ZNS SSD -- RAID, for example -- large zones are basically going to push that redundancy into the SSD. So, you are stuck with a particular type of redundancy, and you have to stick to the overprovisioning that is needed to support that redundancy, so you don't have that flexibility on the host side.
On the other end of the design space, you have small zones. Small zones are good if you want to control data placement and quality of service through LBA separation: if you have a way of knowing where your zones are mapped, you can play with that to assign them either to different applications or to different parts of your application, such as your user I/O threads and your garbage collection threads. There is also the benefit that if you want to do system-based redundancy, you do not need to pay for the extra overprovisioning that is needed to also do redundancy on the device. So, with smaller zones, we can get rid of that device RAID.
15:28 JG: The third benefit is that, at a system level, if you're using several SSDs and striping across them to get the bandwidth you require, you already have that logic and you have already accounted for the CPU utilization. We are just changing the granularity from a single-namespace block device to zones, so that extra work is not needed. Now, the counterargument is that, if you're using a single SSD, the drawback of small zones is that you will have to pay for the logic of doing the striping, and you will have to pay the CPU cycles of doing the striping. So, this is something you might want to think about. The other drawback of this solution is that, currently, we do not have a standard way of representing these memory units or isolation domains in ZNS.
16:32 JG: From our perspective, we see small zones as a better fit for the use cases we're hearing about for ZNS -- they give the flexibility that allows people to create a hierarchy of ZNS devices and target different use cases, including archival, without having to buy dedicated devices for dedicated parts of the system. We obviously need to do more work on this, specifically on how we bring this to the standards body, so stay tuned.
17:13 JG: The second tradeoff is in terms of the write model. Now, as we described before, when you write to a ZNS device, you can use write at queue depth one. This is very limited. We have heard of use cases where this is OK because the overall system uses a layer of persistent caching -- system memory, NVDIMMs -- that provides the performance to the user, with some sort of write-back where performance is not important at the ZNS layer level.
17:55 JG: This is probably a very specific use case, not for everybody, but I wanted to mention that it is possible to do it this way. If you're using large zones, Append is the way to go. The tradeoff is, basically, that you have these large zones because you want to simplify a part of your host stack, especially for a single SSD, but you don't want to give up performance -- so you can use the Append command and increase your queue depth. The only challenge is that you will have to change your completion path to remap the LBA, which is anonymous on submission and real on completion. There are some file systems or applications where this is fairly trivial to implement. There are other cases where this is more complicated. It is up to you to decide on which end you land.
18:56 JG: The third way of writing is zone striping. Now, if you're using smaller zones, you have the possibility to manage your bandwidth/latency tradeoff by selecting the number of zones you are writing to, based on the bandwidth that you need at that moment or the number of applications you want to run on a single SSD at that moment. As I mentioned before, this is not in the standard yet. If you decide to go this way -- and we do like this model; we think it addresses some of the real issues, especially around providing smaller zones -- talk to us.
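The striping idea above can be sketched as a simple round-robin layout: a sequential stream is cut into stripe-sized chunks and spread across N small zones, so the stream gets the aggregate bandwidth of all N zones instead of one. The geometry here is a hypothetical host-FTL choice, not anything mandated by ZNS.

```python
# Toy zone-striping layout: chunks of a logical stream are assigned
# round-robin across N small zones. Hypothetical host-FTL sketch.
def stripe_target(chunk_index, num_zones):
    """Which zone a given stripe chunk lands in (round-robin)."""
    return chunk_index % num_zones

def stripe_layout(num_chunks, num_zones):
    """Map every chunk of a stream to its zone, returning zone -> chunks."""
    layout = {z: [] for z in range(num_zones)}
    for c in range(num_chunks):
        layout[stripe_target(c, num_zones)].append(c)
    return layout
```

Widening or narrowing `num_zones` per stream is exactly the bandwidth/latency knob mentioned above: more zones per stripe means more parallel dies behind one stream, fewer means more isolated streams per SSD.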
19:54 JG: As I mentioned before, this has benefits in terms of providing QoS, but if you're thinking at a single-SSD level, you will have to deal with the striping, which might be extra work if you don't already do striping across several devices.
The third tradeoff is how you submit I/O, and I think this is a never-ending tradeoff. Some people prefer to use in-kernel I/O -- we have better and better I/O paths, lately io_uring, where you can match the performance of SPDK on the kernel path. You can also choose to use an NVMe driver in user space through SPDK. I'm not going to get into which one is better; I think everybody knows the benefits and the drawbacks of each solution. What's interesting here is that when you ask people to rewrite their applications -- and we know this from the Open-Channel times -- it's not a good reaction. And even when you set ZNS aside and just think of block I/O, rewriting an application to, for example, use io_uring today is something you really need to argue for within your organization.
21:28 JG: For ZNS, you already need to rewrite part of the application, and ZNS being part of NVMe makes a good argument for it. So, I can see that many people will be presented with this decision, and it's not always easy to say, "I want to go with one or the other," or "I want to do a POC here and see what the differences are." To tackle this problem, we are providing something that we call xNVMe.
Now, xNVMe is a library that presents a common I/O API for what we call NVMe-native applications; that is, applications that do think about latencies, do think about bandwidth and, today, do think about ZNS. The value proposition is that when you need to ship your application and choose an I/O back end, we give you the possibility to not make a bad decision. We provide a common API where you can choose, at runtime actually, through which I/O back end you want to submit the I/O. So your application, without changes, if it aligns with the xNVMe API, will be able to submit through io_uring or through SPDK. It will actually be able to run, if you for example use SPDK, on either Linux or FreeBSD, and you will also be able to run it on Windows.
23:07 JG: I don't have time to get into the specific details of xNVMe, but I want to mention that, through an extensive evaluation, we have come to the conclusion that its cost is around 15 nanoseconds -- that is, around a memory access -- and we can match the I/O back-end performance that you can see in both io_uring and SPDK, these 10 million IOPS per thread. I would encourage you to go to GitHub and get the latest version of xNVMe and take a look at it, and to go to xnvme.io, where you have a number of talks from Simon. In the latest one, I believe at SDC, USA, he goes into the specifics of xNVMe: how to use it, what its state is, and how you can get involved and contribute to it. We look forward to your contributions to xNVMe.
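The runtime back-end selection xNVMe provides can be illustrated with a toy dispatch layer. To be clear: the names below are hypothetical and this is not the real xNVMe API (see xnvme.io for that); the sketch only shows the pattern of coding against one submit interface while the back end is chosen by name at runtime.

```python
# Toy illustration of runtime I/O back-end selection, in the spirit of
# xNVMe. Hypothetical names -- NOT the actual xNVMe API.
class Backend:
    def __init__(self, name):
        self.name = name

    def submit(self, op, lba, nblocks):
        # A real back end would enqueue an NVMe command here; we just
        # record which path the I/O would have taken.
        return f"{self.name}:{op}@{lba}+{nblocks}"

# The application never hard-codes one of these; it picks by name.
BACKENDS = {name: Backend(name) for name in ("io_uring", "spdk", "posix")}

def open_device(backend_name):
    """Select the I/O back end at runtime; application code is unchanged."""
    return BACKENDS[backend_name]
```

The point of the pattern is that switching from, say, the kernel path to SPDK becomes a configuration change rather than an application rewrite, which is exactly the decision the talk says people struggle with.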
24:08 JG: Now, we've gone through the basic concepts of ZNS; the use cases -- archival, log I/O applications, I/O predictability; the different tradeoffs in designing a ZNS host-based FTL, from which kind of zone representation you need, large zones or small zones, to the way you need to write to the device; and, in the last part, how you actually submit the I/O, which I/O back end you use, and how xNVMe can help you. Now we put it all together in this RocksDB case study.
24:50 JG: We have done our own port of RocksDB based on xNVMe, and here we choose small zones. Applications like RocksDB use LSM trees, so the objects -- the SSTables -- are not always very large. In the beginning you might have fairly small SSTables, and having very large zones would create a lot of write amplification when we need to garbage collect those zones and do the compaction for them. We have chosen to use striping across these zones. We also use the Zone Random Write Area because, even though SSTables in RocksDB are fully sequential, there are configurations where you can spare some of the write amplification by doing in-place updates of the metadata at the end of the SSTable. So, we take advantage of the Zone Random Write Area to do that. Now, I know I have not touched on that; please look at the backup slides or other talks, like the one at SDC this year, to fully understand how Zone Random Write Area works.
26:18 JG: Putting it all together, we see that we can improve write amplification. Using the ZNS back end through xNVMe, we measure a write amplification of around one -- one, followed by five or six zeroes after the decimal point. This is because we have not yet done the work on the LSM tree to perfectly map the SSTables to zones; you can improve that, but realistically you will always have some level of lost LBAs at that level. The interesting part is that when we run the same RocksDB workload on this back end versus a back end that uses a file system with ZNS, or a back end that uses a file system without ZNS, we see that at the software level we can get a 2x reduction in write amplification. This tells a lot about how these redundant mappings influence write amplification -- and we're not even talking about the write amplification on the device itself.
27:34 JG: So, this is only at the software level. The interesting part is that, whatever the write amplification on your device is at any moment in time, the write amplification on the host is a factor that multiplies it. So, that combined number can be pretty large. I do not want to give you an absolute number for the write amplification, because I can find workloads where I can tell you there's a 20x improvement in write amplification, but that would not be representative. It depends very much on the workload, on the write amplification of the device from the vendor you use today, and on how you use the device. So, I would rather leave it to you: take this 2x reduction in write amplification at the host level, which can be measured and reproduced by anybody, and multiply it by the write amplification that you have on your device -- the benefits you see come from that. So, I think that takes all the time that we have.
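The multiplication argument above is simple arithmetic worth making explicit: end-to-end write amplification is roughly the product of host-level and device-level write amplification, so halving the host WAF halves the total regardless of what the device contributes. The numbers below are illustrative, not measurements from the talk.

```python
# End-to-end WAF ~= host WAF * device WAF, so a 2x host-level reduction
# scales with whatever WAF the device exhibits. Illustrative numbers only.
def end_to_end_waf(host_waf, device_waf):
    return host_waf * device_waf
```

For example, cutting host WAF from 2.0 to 1.0 on a device that amplifies writes 3x cuts the end-to-end figure from 6x to 3x of user writes.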
28:39 JG: Some conclusion points. There are three main benefits of ZNS: reduced write amplification and overprovisioning; better QoS, with explicit mechanisms to provide it; and reduced total cost of ownership. ZNS has a lot of use cases. It's not only meant for archival; it can also be used to provide I/O determinism, and it can amplify the benefits of today's flash-friendly databases, applications and file systems. And we are actively working on extending ZNS to facilitate the transition of Open-Channel SSD users to ZNS so that they can leverage all the benefits of NVMe.
29:26 JG: We have covered different tradeoffs in how to design a zone FTL. You can choose one or another; it depends very much on your application, on the work that you've done previously, on the work you want to put into it, and also on the different types of SSDs you're choosing -- the same vendor will have different types of SSDs, and across vendors we have different types of ZNS SSDs. We have presented xNVMe, which, as we see from early adopters of ZNS, is becoming a fundamental part of how people design the transition and the rewrite of applications for ZNS. There's actually a very solid ecosystem in Linux with a lot of vendors contributing. Kudos to Western Digital, who have done an amazing job from the early times of the zoned device abstraction in Linux to the adoption of ZNS today. I also have a slide in the backup where you can see an overview of the whole ecosystem: the different tools, NVMe-CLI, blkzone, plugins for fio, the whole kernel architecture, QEMU, etcetera.
30:53 JG: So, again, awesome that we have such a good ecosystem and that people are starting to contribute actively there. And then, finally, a call to action: if you're interested in ZNS and some of the use cases or tradeoffs we've talked about resonate with you, please reach out to us and we'll be happy to talk about it.
31:17 JG: So, thank you very much for your time. I hope you enjoyed the virtual Flash Memory Summit. It is a pity not to be there and be able to go grab a beer with you guys, but please reach out and we can always have a virtual beer, which is like super-popular nowadays. So, thanks a lot and enjoy.