Download the presentation: How Zoned Namespaces Improve SSD Lifetime, Throughput and Latency
00:00 Matias: Hi. Today, I'm going to talk about how Zoned Namespaces improve SSD lifetime, throughput and latency. My name is Matias, I'm a director of emerging systems architectures within Western Digital. So, let's get started.
First, I want to talk about the block interface. What's the block interface? It's the read/write interface through which you access your storage devices, where you read logical blocks (LBAs) or you write logical blocks. Initially, this was for hard drives and later it was applied to SSDs. The idea with the interface is that it hides the specific details, how to manage media and so forth, and presents this very general, efficient storage interface up to the host, on top of which the host can build its applications.
00:55 Matias: It has worked and continues to work really well for a multitude of storage systems, but for flash SSDs, the cost of supporting this block interface has become quite high and has grown over time. And the reason for that is there's a fundamental mismatch between the operations allowed, read/write, and the underlying flash technology's operations: read, program and erase. So, whereas you can read or write a single LBA with the block interface, with the flash interface, you usually read at a certain granularity, you program at a certain granularity and you erase at a certain granularity. Furthermore, you can only do this a specific number of times.
01:46 Matias: To kick this off, I want to point you to the graph on the right, where we have compared . . . what we're going to talk about, Zoned Namespace SSDs, and general . . . SSDs. We're depicting the write throughput here, and we do a full drive write two times. For the conventional SSD, we have what's known as 7% overprovisioning and 28% overprovisioning -- I'll get back to how that relates to this -- and then we have the ZNS SSD, which has 0% overprovisioning. When you write to an SSD the first time, you usually get really good performance. But when you get to the second drive write, there's this garbage collection process that kicks in within the SSD, depending on the implementation of the software within the SSD and depending on how much overprovisioning you have and so on. For conventional SSDs, if you have 28% overprovisioning, you have 50% lower throughput when you get to the second write, and when you only have 7% overprovisioning, you actually get 75% lower throughput. Whereas for ZNS SSDs, which we're going to talk about today, we see no drop in throughput.
03:01 Matias: You can have consistent performance from when you begin to write to the drive, and it keeps going on and on. These results are based on a pre-production ZNS SSD and conventional SSDs. It's the same platform for all three here -- the only difference is the firmware between the three.
03:28 Matias: All right, so to get you a little bit into what an SSD is, I'll give you a little bit of background. An SSD bundles like 10 to 100 of these NAND chips -- that's the flash technology -- and an SSD controller together. The SSD controller manages the NAND chips' characteristics and exposes the storage through a host storage interface, such as NVMe, which typically implements the block interface. Inside of the controller runs this management software, which is often called the flash translation layer. And I'm going to take us through why the throughput drops we saw before happen.
So, for data placement on conventional SSDs, take this example to the right, where you have four files, A through D. They're written into a file system, and when they go into a conventional SSD, it typically has an erase unit that it writes into. An erase unit is effectively a set of flash blocks which you . . . must write sequentially to. They wear out; you can only write to them a certain number of times, and they must be erased before you can rewrite them. It's the erase part that says how many, known as the . . . how many times you can erase. So, one of the responsibilities of this flash translation layer is to map this very simple interface, the block interface, reading and writing logical blocks, onto the physical addresses of the medium.
05:02 Matias: And what typically happens with these four files, when you write them down, is that they don't have the same lifetime. One of the files might be a temporary file, so it might only be used for five minutes, while the other data that you wrote at the same time might stay there for days, weeks, months and years. In that case, what happens is that at some point the erase unit needs to be garbage collected to free up new space. And since we can only write sequentially within this erase unit or set of flash blocks, we have to clean up another flash block to enable that. That's called the garbage collection process. And when it has done its thing on a single erase unit, then you can write to it again.
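To make the lifetime mismatch concrete, here is a minimal Python sketch of garbage collecting one erase unit. The 8-page unit, the interleaved layout of files A through D, and the function name `garbage_collect` are all illustrative assumptions, not details from the talk or from any real firmware.

```python
def garbage_collect(erase_unit, live_files):
    """Return (pages_to_copy, pages_reclaimed) for one erase unit.

    erase_unit: list of file names, one per flash page, in write order.
    live_files: set of file names whose data is still valid.
    """
    valid = [page for page in erase_unit if page in live_files]
    # Valid pages must be rewritten elsewhere before the unit can be erased;
    # only the invalid pages turn into newly usable space.
    return len(valid), len(erase_unit) - len(valid)


# Hypothetical layout: files A-D were written at the same time, so their
# pages ended up interleaved in the same erase unit.
unit = ["A", "B", "C", "D", "A", "B", "C", "D"]

# File B was a short-lived temporary file and has since been deleted.
copied, reclaimed = garbage_collect(unit, live_files={"A", "C", "D"})
print(f"pages copied: {copied}, pages reclaimed: {reclaimed}")
```

In this toy case the drive has to copy six still-valid pages just to reclaim two, which is the extra internal writing the talk goes on to describe.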
05:58 Matias: So, what that means is that within the SSD, the conventional SSD, it rewrites data. So, data you wrote once from the host point of view gets into the SSD and gets rewritten multiple times. That is known as the write amplification factor. The best write amplification factor you can have is 1x, where you write the data once from the host to the drive and it's only written once. But for conventional SSDs, very often the write amplification factor is much higher. And what that leads to is lower write throughput, higher read latencies because you're writing more and that impacts read latencies, and it also has a higher dollar-per-gigabyte cost due to the overhead of the flash translation layer in general. And this write amplification factor occurs due to this mismatch between the block interface and the SSD's media interface.
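The write amplification factor described here can be written down as a small Python sketch. The helper names and the 1000 MB/s media bandwidth are hypothetical, and the throughput model (host throughput equals media program bandwidth divided by the write amplification factor) is a deliberate simplification that ignores reads, caching and parallelism.

```python
def write_amplification(host_bytes_written, nand_bytes_written):
    """Write amplification factor: media writes per host write."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


def host_throughput(nand_program_mb_s, waf):
    """Simplified model: the media bandwidth is shared between host data
    and garbage-collection rewrites, so the host sees bandwidth / WAF."""
    return nand_program_mb_s / waf


# Best case from the talk: data is written to the media exactly once.
print(write_amplification(100, 100))

# Host wrote 100 GB but the media absorbed 350 GB (WAF of 3.5, the order
# of magnitude the talk quotes for 7% overprovisioning).
waf = write_amplification(100, 350)
print(waf)

# With a hypothetical 1000 MB/s of raw program bandwidth:
print(round(host_throughput(1000, waf), 1))
```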
06:57 Matias: So, when you look at the cost of this from the flash translation layer point of view, there are two things. One is the media overprovisioning, which improves performance but increases dollars per gigabyte, and it's directly related. If you have very little overprovisioning, like 7%, you can see out here to the right, for this particular workload we've shown . . . to be an overwrite workload, that at 7% you have a write amplification factor of over three and a half. If you have 28% overprovisioning, you have slightly above two, and if you have more than twice the media, you always have the opportunity to find a flash block without any valid data in it, and you get this write amplification factor. But the key is that you actually have to pay for that media to get that excellent performance, and you don't get to store your own user data on it.
The garbage collection process within the SSD solely uses it for its own purposes. The other thing is the mapping table that you manage, the logical-block-to-physical-media mapping table, and it usually uses 1 gigabyte per 1 terabyte of media. Typically, for enterprise drives, this is stored in DRAM to get the best performance. It improves the performance, but it also increases dollars per gigabyte. And it is such that the media and DRAM are often the bulk of the cost of an SSD.
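The "1 gigabyte per 1 terabyte" rule of thumb can be reproduced with a short calculation. The 4 KiB mapping unit and 4-byte table entry below are common assumptions, not figures given in the talk, and `mapping_table_bytes` is a hypothetical helper.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def mapping_table_bytes(capacity_bytes, mapping_unit_bytes=4096, entry_bytes=4):
    """Size of a flat logical-to-physical table with one entry per mapping unit.

    mapping_unit_bytes and entry_bytes are assumed values, not from the talk.
    """
    entries = capacity_bytes // mapping_unit_bytes
    return entries * entry_bytes


# 1 TiB of media mapped at 4 KiB granularity with 4-byte entries.
table = mapping_table_bytes(TIB)
print(table / GIB, "GiB of mapping table per TiB of media")
```

Under these assumptions the table comes out at exactly 1 GiB per TiB, which is why it is usually held in DRAM on drives that map at this granularity.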
08:29 Matias: So, the idea of ZNS, Zoned Namespace SSDs, is: can overprovisioning and DRAM be significantly reduced while we improve the throughput, latency and dollars per gigabyte? That's the question. And that's what it does. Zoned Namespace (ZNS) SSDs enable the host and the storage device to collaborate on data placement. The host software stack naturally has to write sequentially within an SSD's erase unit. What this does is eliminate the write amplification of the SSD's garbage collection process. It eliminates the conventional overprovisioning. It needs less DRAM due to a smaller mapping table. Since you're writing sequentially, you don't need to have what's known as a fully associative mapping table -- you can have a more coarse-grained mapping table. And also, while doing that, you can have higher throughput and lower latency.
"And how much?" you're saying. Well, as we saw early on, the conventional SSD with 7% overprovisioning maxes out at 236 megabytes per second, where the ZNS SSD continues to be able to write at a higher throughput. So, we see over here that where a conventional SSD would drop off at 236 megabytes per second, a ZNS SSD is actually able to sustain 4.6 times more writes per second than the conventional SSD.
10:05 Matias: So suddenly you can write much faster to an SSD, and you get consistent write performance. But how does this impact our latencies? For the latencies we have this example. We saw before that the conventional SSD was only able to sustain around 236 megabytes per second, so let's see what the latency impact is if, while doing writes, we also measure the read latency of a single 4K random read at queue depth 1.
So, at that point, what we do is start out with no writes going on, just random reads, and the first thing we want to note is that the pre-production ZNS SSD has slightly lower performance on latency, whereas the conventional SSD is slightly higher, and that's only due to the firmware maturity of the pre-production ZNS SSD. A production ZNS SSD would have the same average read latency. And then as we add writes, like 25 megabytes per second, let's say 50 megabytes per second, like a 95/5 random read/write workload, what we see when we compare the ZNS SSD, which goes down here, with the conventional SSD is that we actually have 12% better average latency than the conventional SSD.
11:30 Matias: Similarly, if we go to a 90/10 random read/write workload, we have 29% better average latency, and when we go to a typical 80/20 random read/write workload, we actually see 50% improvement in average latency. So, when there are some writes going on, which is typical in an SSD, you will see significantly improved average latency for your workload.
12:00 Matias: All right, so there's a lot of benefits to this, and to make all this happen we have this new specification. The Zoned Namespace Command Set specification, the new command set that's been added in the NVM Express organization, was released back in June. What it does is introduce the Zoned Storage Model, the introduction of zones and their associated behavior; the Zoned Namespace, a namespace type inside of NVMe which supports this model; and a Zoned Namespace Command Set, where namespaces that support it expose these new commands. There are the Zone Management Receive and Zone Management Send commands, while you can still do read/write and so on. Before this specification there was a big movement in the industry to have this collaborative data placement, and with this new specification, we now have a standardized interface to perform collaborative data placement between the host and the device for NVMe devices.
13:19 Matias: All right, so the Zoned Storage Model, just to give you an overview . . . We said that in an erase unit, you had to write sequentially. You can think of a zone as mapping to an erase unit within the SSD. Zones are laid out sequentially in an NVMe namespace. You have zone 0, zone 1 and so on, and then there's a set of LBAs within each zone. The zone size is fixed, which means each zone has the same number of LBAs, and it applies to all zones in the namespace, for example, a 512 megabyte zone size.
13:52 Matias: Since we are in SSDs, and particularly for the flash media, the media doesn't always line up with a zone size that is, for example, a power of 2, so what we added into ZNS is the zone writable capacity, which defines, maybe . . . the zone size is 512 megabytes and the writable area is 500 megabytes. And then, around these zones there are rules for how you can read them and write them.
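The zone layout just described (fixed zone size, smaller writable capacity) can be sketched as simple address arithmetic. The `ZoneGeometry` class and the 4 KiB logical block size are hypothetical, and "megabyte" is treated as MiB for convenience; the 512/500 figures are the example from the talk.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ZoneGeometry:
    zone_size_lbas: int  # fixed for every zone in the namespace
    zone_cap_lbas: int   # writable LBAs per zone, at most zone_size_lbas

    def zone_of(self, lba):
        """Index of the zone that contains this LBA."""
        return lba // self.zone_size_lbas

    def zone_start_lba(self, zone_index):
        """First LBA of the given zone; zones are laid out back to back."""
        return zone_index * self.zone_size_lbas

    def is_writable(self, lba):
        """True if the LBA falls inside its zone's writable capacity."""
        return lba % self.zone_size_lbas < self.zone_cap_lbas


LBA_BYTES = 4096  # assumed logical block size, not from the talk
geo = ZoneGeometry(
    zone_size_lbas=512 * MIB // LBA_BYTES,  # 512 MiB zone size
    zone_cap_lbas=500 * MIB // LBA_BYTES,   # 500 MiB writable capacity
)

print(geo.zone_start_lba(1))                         # first LBA of zone 1
print(geo.is_writable(geo.zone_cap_lbas - 1))        # last writable LBA of zone 0
print(geo.is_writable(geo.zone_cap_lbas))            # gap between capacity and size
```

Keeping the zone size a power of 2 makes the LBA-to-zone arithmetic cheap for the host, while the capacity records how much of each zone the media can actually hold.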
The great part about the ZNS Command Set is that it builds upon or inherits the commands from the NVMe command set, so you can still use your flush command, your read command, write command and so on. It's just that when you write to the media, there are certain rules that you have to respect. When you write to a zone, it must be written sequentially . . . It's called sequential write required. And if you want to write to it again, you have to reset it . . . issue a zone reset action, which is done through the Zone Management Send command. Each zone has a set of associated attributes: for example, the write pointer, which points to the next place to write sequentially; the zone starting LBA, which points to the first LBA; the zone capacity, which is the writable area within the zone; and the zone state, which defines the read/write access rules.
15:14 Matias: So, the Zoned Storage Model as it is for ZNS namespaces is very similar to how it is for SMR drives. One of the things that we did when we wrote the specification within the NVMe organization was to make sure that we aligned with the SMR interface specifications, ZAC and ZBC in T10 and T13. We were very particular about making sure that those two or three specifications had a similar Zoned Storage Model, which means that you very likely can use the same software stack for SMR hard drives as you would for ZNS SSDs. With a small difference: ZNS SSDs have the zone capacity, where SMR hard drives do not. You can always write to the full zone size within an SMR drive. It might not be the same for a ZNS SSD.
16:14 Matias: Then, as I said . . . there's this zone state, which controls what you can do. There's a state machine at work when you write to a zone. When it's empty . . . there are no writes to the zone, and it's ready to be written to. When you write to it, it goes into an implicitly opened state, and if you explicitly open it using a Zone Management Send command, it goes into explicitly opened, but let's keep it simple. And then as you write to it . . . you can read from it; you can read data before the write pointer. Beyond it, obviously, you haven't written the data yet. If you try to read beyond the write pointer, you will get either a read error or just predefined data back. It depends on how the error recovery feature within NVMe is configured. It's orthogonal to how ZNS works. Then, when you've written all the writable LBAs within a zone, it transitions to full; and if you want to transition it from full to empty, you issue this zone reset action using the Zone Management Send command, and you go back to empty. That's it.
17:22 Matias: Right, so for reading from a zone: whereas writes are required to be sequential within a zone, reads may be issued to any LBA within a zone and in any order. There is one caveat, and that is that if the zone is in the offline state, which means that it's basically offline, it's not addressable and you can neither write to it nor read from it, then that would result in an error. But zones that are in any other zone state you can read in any order that you like. So, you only have to think about ZNS when you're writing to it. When you read, you can do it in any order and size you like.
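The write rules, state transitions and read rules covered so far can be modeled in a few lines of Python. This is a simplified host-side model for illustration only, not the behavior of any real device or library: the `Zone` and `ZoneError` names are hypothetical, it covers only the states named in the talk, it allows a reset from any state except offline, and it returns a fixed byte pattern for unwritten LBAs where a real drive might instead return an error, depending on configuration.

```python
from enum import Enum


class ZoneState(Enum):
    EMPTY = "empty"
    IMPLICITLY_OPENED = "implicitly opened"
    EXPLICITLY_OPENED = "explicitly opened"
    FULL = "full"
    OFFLINE = "offline"


class ZoneError(Exception):
    """Raised when an operation breaks the zone's access rules."""


class Zone:
    PREDEFINED = b"\x00"  # stand-in for the data returned for unwritten LBAs

    def __init__(self, start_lba, capacity):
        self.start_lba = start_lba      # zone starting LBA
        self.capacity = capacity        # writable LBAs in this zone
        self.write_pointer = start_lba  # next LBA that may be written
        self.state = ZoneState.EMPTY
        self.blocks = {}

    def open(self):
        """Explicit open (a Zone Management Send action on a real device)."""
        if self.state not in (ZoneState.EMPTY, ZoneState.IMPLICITLY_OPENED):
            raise ZoneError(f"cannot open a zone in state {self.state.value}")
        self.state = ZoneState.EXPLICITLY_OPENED

    def write(self, lba, block):
        if self.state in (ZoneState.FULL, ZoneState.OFFLINE):
            raise ZoneError(f"cannot write a zone in state {self.state.value}")
        if lba != self.write_pointer:
            raise ZoneError("sequential write required: write at the write pointer")
        self.blocks[lba] = block
        self.write_pointer += 1
        if self.state is ZoneState.EMPTY:
            self.state = ZoneState.IMPLICITLY_OPENED
        if self.write_pointer == self.start_lba + self.capacity:
            self.state = ZoneState.FULL

    def read(self, lba):
        """Reads may target any LBA of the zone, in any order."""
        if self.state is ZoneState.OFFLINE:
            raise ZoneError("zone is offline")
        if not self.start_lba <= lba < self.start_lba + self.capacity:
            raise ZoneError("LBA outside the zone's writable capacity")
        # Unwritten LBAs: modeled as predefined data rather than an error.
        return self.blocks.get(lba, self.PREDEFINED)

    def reset(self):
        """Zone reset action: discard the data and return to empty."""
        if self.state is ZoneState.OFFLINE:
            raise ZoneError("zone is offline")
        self.blocks.clear()
        self.write_pointer = self.start_lba
        self.state = ZoneState.EMPTY


zone = Zone(start_lba=0, capacity=2)
zone.write(0, b"a")
zone.write(1, b"b")
print(zone.state.value)  # both writable LBAs are written
zone.reset()
print(zone.state.value)  # back to the initial state
```

Writing to any LBA other than the write pointer raises `ZoneError` in this model, which mirrors the sequential write required rule; reads carry no such ordering constraint.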
18:10 Matias: All right, so I talked a little about SMR hard drives. They implement the ZAC and ZBC specifications, and you have the ZNS SSDs, which implement the NVMe Zoned Namespace Command Set specification. We worked really hard in the standards group to align with the ZAC and ZBC specifications to allow interoperability, which means we have a single unified software stack that supports both device types. Some of the work that we've been doing is to utilize the already mature Linux storage stack that's built for SMR hard drives.
So, when SMR hard drives came to the market . . . I mean, when I talk about SMR hard drives, what I mean is host-managed SMR hard drives, which require host software to manage them. For that, we've had support for quite a while, since Linux Kernel 4.10. Even before that, you could use pass-through commands. Since the stack for SMR hard drives has been built up over many years, it's mature, it's robust and it's used by many of the biggest global companies today. It's already been battle-tested; it's there. So, for ZNS SSDs, what we had to do was just add on these extra attributes, for example, the writable zone capacity, and then we effectively already had a mature storage stack.
19:42 Matias: So, we did that and we also added support into f2fs. We're working on . . . FS. Damien Le Moal has developed zonefs, which allows you to expose zones as files up to the host system. We added support to dm-zoned, dm-linear, dm-flakey and so on. Furthermore, the Linux Kernel is one data path for how you get to a ZNS drive. SPDK also recently got support in the 20.10 release, so now you can also use SPDK with ZNS drives. It has support for the Zone Management Send and Receive commands, and support for identifying, reporting zones, zone reset and appends; all of that is supported today. Which means you not only have the Linux Kernel, the newest Kernel storage stack, you also have the SPDK storage stack if you use that as your storage back end . . . to communicate with your devices.
We also worked a lot on tooling. So, fio has support today, as does nvme-cli, and libzbd . . . Libzbd is a continuation of a library called libzbc, which was specifically for SMR hard drives. Libzbc used the pass-through commands of . . . from before we had Kernel 4.10, and libzbd is now the library you can use when you develop zoned storage applications, applications that want to implement support for those devices. We have blktests, where you can test zoned block devices, and we have added support to utilities as well.
21:27 Matias: There's this blkzone utility, so when you want to report zones, there's already a blkzone command inside most Linux distributions with which you can just report zones, reset zones, and so on. Then there are applications -- we have RocksDB, which we are currently in the process of enabling, where we have ZenFS as the new storage back end for RocksDB that Hans Holmberg has been working on. You can see his SDC presentation below. We are planning MySQL support. HSE is a Micron project, which is also planning ZNS support. We have Ceph, which we're also planning; we have put up some patches and are still working on it. And the same for HCFS: we're also planning ZNS support and, in general, zoned block device support. So, even if I say ZNS, it means zoned storage support in general, so this works both for SMR hard drives and for ZNS SSDs. It works together.
22:31 Matias: All right, so to make all this easier to adopt and use, we've launched zonedstorage.io. It's a Kernel community site, which documents the Linux Kernel features and from which version they're available, system compliance tests, Linux distribution support, applications and libraries, benchmarking, how you get started, how you get going -- I mean, everything you need to get going with zoned storage devices.
So, the summary: with ZNS SSDs you collaborate on data placement, you improve your write throughput by over 4x and your read latency by 57% or more in an 80/20 random read/write workload. You have lower dollars per gigabyte because we reduce the overprovisioning and the DRAM requirements of the SSD. It's fully standardized in the NVM Express organization, right now it's available as a standalone specification, and it's easily adopted through the existing Linux software storage stack.
23:36 Matias: For the storage stack, we already have quite a bit: both the base, with the newest Kernel and SPDK, and the tooling, and there are even more applications, with more being worked on and enabled by everyone in the community. Thank you and have a great day.