Evolution of Ethernet-Attached NVMe-oF Devices and Platforms
As the NVMe-oF ecosystem continues to mature, storage systems now have design choices for the type of external and internal fabrics to use, as well as the various attach points. Learn about Ethernet-attached storage, specifically NVMe over Fabrics devices and platforms, and what is required for a successful deployment.
Download the presentation: Evolution of Ethernet-attached NVMe-oF Devices and Platforms
00:02 Ihab Hamadi: Hello, and welcome to this FMS talk. This is Session B7. My name is Ihab Hamadi, I'm a fellow and senior director at Western Digital Corporation. My focus is systems architecture. This presentation is about Ethernet-attached storage, more specifically NVMe over Fabrics devices and platforms.
00:24 IH: NVMe over Fabrics has been around for a few years, and NVMe has been around even longer. The ecosystem of NVMe over Fabrics continues to mature, which has in turn enabled a number of design choices, and that list continues to grow. We now also have choices for the type of external and internal fabrics to use, as well as different attach point possibilities. All of this inevitably also presents tradeoffs for the different designs.
01:00 IH: Now, starting with some background. NVMe as a storage protocol was developed to be a modern protocol, in many cases replacing some of the older storage protocols that had some inherent limitations and could not provide the attributes needed for today's data center. Parallelism is one attribute that's increasingly critical for today's data center. NVMe over Fabrics can provide better parallelism. Usually parallelism is needed because modern CPUs have multiple cores, and modern applications have multiple threads that actually run on those cores. Those threads running in parallel can perform I/O to storage and not have to serialize those I/Os because the storage only has one queue, for example, or a few limited queues. So NVMe removes this bottleneck and maybe attempts to remove some cost components by reusing hardware building blocks. This can vary from one design to another.
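The queueing point above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how any NVMe driver or device is implemented: the function names `submit` and `drain_ticks` are hypothetical, and the model assumes every queue completes exactly one command per tick, independently of the others.

```python
from collections import deque


def submit(num_threads, ios_per_thread, num_queues):
    """Place each thread's I/Os on a queue; thread t uses queue t % num_queues.

    Hypothetical helper for illustration only.
    """
    queues = [deque() for _ in range(num_queues)]
    for t in range(num_threads):
        for i in range(ios_per_thread):
            queues[t % num_queues].append((t, i))
    return queues


def drain_ticks(queues):
    """Ticks to drain all queues, assuming one completion per queue per tick."""
    return max((len(q) for q in queues), default=0)


# Eight threads, 100 I/Os each.
single_queue = drain_ticks(submit(8, 100, 1))     # all I/Os serialize: 800 ticks
queue_per_core = drain_ticks(submit(8, 100, 8))   # one queue per thread: 100 ticks
print(single_queue, queue_per_core)
```

In this model the single shared queue is the bottleneck the talk describes, and giving each thread its own queue removes it. Real devices have other limits (media bandwidth, controller resources), so the ratio here is illustrative rather than a performance claim.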
02:00 IH: NVMe is an efficient protocol that requires minimal processing on the host side within the application or within the storage stack, going to the operating system drivers and so on. NVMe, again, can generally enable high bandwidth due to its efficiency.
02:20 IH: All right, let's move our attention to the evolution of the NVMe over Fabrics controller. Starting from the left side of the slide, going back to the early days of NVMe over Fabrics, some of the demos at the time used your standard off-the-shelf server: your CPU, DRAM, RNICs. The first demos of NVMe over Fabrics were NVMe over RoCE. This made a lot of sense at the time; it's good for proofs of concept, it's good for . . . when you didn't have much of a standard, and things were pretty fluid at the time.
02:51 IH: And as we moved on as an industry, we went through a shrink process to the second wave of NVMe over Fabrics controllers, which has two threads. The first thread was embodied by what was called SmartNICs at the time; much of this is evolving today into DPUs and similar things. And, at the time, SmartNICs were largely mini servers. So, you've moved on from mostly x86-based servers to perhaps ARM-based servers with some amount of DRAM and a front end consisting of an RNIC, all within one product with some varying amount of offload. So, this constituted one path.
03:36 IH: Another path was FPGA-based. If you are looking to do very, very specific things and you are looking for perhaps speed, the speed of hardware instead of doing things in software like you would do on the SmartNICs, then that was a good option for that.
03:55 IH: And then the last wave that we're sort of seeing now is one where we're focused on streamlining the controller, removing some cost components of the controller and making the process easier. So, we're seeing a lot of ASIC-based bridges to convert between NVMe over Fabrics and PCIe. Those bridges tend to be low cost, so a fraction of the cost of your SmartNICs. However, they do not support or provide any of the other data services or data processing capabilities that you would see with the SmartNICs. So, great for overall connectivity to NVMe over Fabrics, great for entry into that ecosystem. If you're looking to do more, then perhaps the second wave is still a good choice for that.
05:02 IH: Let's move on to the NVMe over Fabrics SSD implementation. Again, starting from the left-hand side of the slide where you see in a typical SSD today is an SSD controller. Of course, you see a bunch of flash, you see some DRAM and you see miscellaneous other smaller components. So, one way to get to NVMe over Fabrics drives is by slapping an interposer on top of that, simply converting between PCIe on one side, Ethernet and NVMe over Fabrics on the other side.
05:35 IH: Now, as you go deeper into the integration of NVMe over Fabrics controllers, you can actually put the same interposer within the drive itself, so now what you see externally is just an NVMe over Fabrics interface. However, we're talking about a shallow level of integration here, so you still pretty much have the same components, but you put them all in one package. And then, ultimately, the deeper level of integration is where NVMe over Fabrics becomes part of the controller -- you perhaps eliminate some of the unneeded pieces there around PCIe conversion and so on. And this is really when you start to see the cost of this going back down to the levels where this becomes more economically feasible for the industry.
06:27 IH: OK, let's move on to the evolution of the NVMe over Fabrics platform. Starting with the left-hand side of the slide, the diagram represents the type of designs prevalent today. The diagram shows an NVMe over Fabrics controller storage platform that is highly available with two I/O modules; a non-HA design would have just one I/O module. Now, let's unpack the I/O module and take a look at the main components within that I/O module.
Starting from the fabric side, what you'll see is either SmartNICs or ASICs providing the interface to the NVMe over Fabrics domain. In the case of SmartNICs, you may see a certain level of advanced virtualization features, for example, or perhaps some sort of services. In the case of ASICs, you will see cost savings, you will see lower latency, you'll see more consistent latency overall and almost no virtualization. Both SmartNICs and ASICs have PCIe on their back end, and this sort of connects you to a PCIe fabric and eventually to your NVMe PCIe-based drives within that platform.
07:48 IH: Now, let's shift our attention to the right column, or the right-hand side of the slide, and the process that we go through is an integrate and shuffle, which is another common process that we see in the industry here. With this process, the drives themselves change their interface from PCIe to Ethernet, like we were talking about in the previous slides. And with this wave of change, we'll see other simplifications as well, so we'll see PCIe switches now replaced with plain Ethernet switches, and what you'll see is a small CPU that's needed to run the control plane on the Ethernet switch.
So, this change effectively moves the attach point for NVMe over Fabrics all the way to the drive, and this will make even more sense as you scale the number of drives within the platform, so now you can scale it much beyond what you were able to do with the paradigm on the left-hand side. We're starting to see some early products in the category from the right-hand side, and we're starting to see more solutions to some of the pain points that exist there.
09:10 IH: Now, let's move our attention to the NVMe over Fabrics controller landscape. On the X axis, we have the degree of functionality, starting with transports and interfaces only and going all the way to data processing units, represented in today's DPUs and so on. On the Y axis, we see the endpoint scale, starting from 1-1 scale, which is prevalent in the interposer model, all the way to one-to-N scale, where a certain controller -- in some cases an ASIC controller or a SmartNIC controller -- could cover actually any number of drives, and the typical range that we see is anywhere from four to 16 drives per controller. In both cases, you scale by adding more instances of the 1-1 controller or more instances of the one-to-N controllers, depending on the degree of functionality you're looking for. What you see on the left-hand side is mostly just access to this NVMe over Fabrics domain. What you see on the right-hand side is more advanced features, so now we're starting to see interesting things like computational storage being a possibility there; we're starting to see things like various types of acceleration being a possibility there, and so on.
10:35 IH: OK, let's go over some of the remaining pain points that the industry and customers have been discussing. The first one is management, and this is everything from addressing to configuration of the device. Some of what we have today is around open specifications to address this pain point, so OCAPI is now part of the OCP project. Redfish has new features that are addressing NVMe over Fabrics and so on. Interoperability is a critical topic. Some of the early versions of NVMe over RoCE worked reasonably well when you're working with a single vendor. How do we move past that and allow multiple vendors to work very well together? And some of the interesting pieces that we see there are rate limiting, even some of the congestion management; there are a few congestion management protocols and frameworks that have emerged over the past few years. There's also a . . . that you do or out remapping and so on. And for that, we do have open composable lab that works in collaboration with UNH InterOperability Laboratory, all trying to address all the interoperability issues and interoperability needs of this industry to be able to use NVMe over Fabrics more broadly.
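The endpoint-scale axis described above comes down to simple arithmetic: how many controller instances a platform needs for a given drive count. A minimal sketch follows, assuming only what the talk states (1-1 controllers in the interposer model, and typically four to 16 drives per one-to-N controller). The function name `controllers_needed` and the 24-drive platform are hypothetical and chosen only for illustration.

```python
import math


def controllers_needed(num_drives, drives_per_controller=1):
    """Controller instances required to attach num_drives drives.

    drives_per_controller=1 models the 1-1 (interposer) case; larger values
    model a one-to-N controller. Hypothetical helper for illustration.
    """
    if num_drives < 0 or drives_per_controller < 1:
        raise ValueError("need num_drives >= 0 and drives_per_controller >= 1")
    return math.ceil(num_drives / drives_per_controller)


# A hypothetical 24-drive platform:
print(controllers_needed(24))      # 1-1 controllers: 24
print(controllers_needed(24, 4))   # one-to-N with N = 4: 6
print(controllers_needed(24, 16))  # one-to-N with N = 16: 2
```

The count says nothing about functionality, which is the other axis of the landscape: a handful of one-to-N controllers may or may not offer the data services that a given design needs.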
12:08 IH: And then the third area is around security, and with that we're seeing more and more products beginning to offer features around TLS encryption for the data path and also the control path. We're also seeing more features around supporting ACLs.
And then the last one is telemetry. What do you do when you run into a problem? Port mirroring, for example, comes into play in some of these scenarios. There are also some other attempts from the industry to standardize how we do telemetry.
12:46 IH: Now, let's talk about what all of this evolution means in the areas of open composability. With NVMe over Fabrics controllers, devices, platforms and all the specs evolving, open composability is now becoming more of a reality. So, for the first time, we have all the ingredients needed for true open composability and true open composable infrastructure in the data center. NVMe over Fabrics is considered the backbone for the interconnect in this case, so the disaggregated elements will include compute as one type, represented by servers with architectures really of your choice.
13:30 IH: And then we've got shared accelerated storage that is flash-based with ZNS for control over data placement to multitenant isolation, all the way down to the low-level flash constructs -- noisy neighbor issues hopefully going away in many of these cases -- all accessible within a single Ethernet-based storage fabric. And then we also see slightly cooler tiers with less stringent SLAs, for example, perhaps backed by disk-based storage, and another cooler tier for secondary storage, now with things like legacy tape or perhaps new platforms that are emerging and beginning to replace some of the legacy tape in that case. And what binds all of this together is some of the open composability API, so OCAPI, as I mentioned, is now part of an OCP project. And this API defines ways to compose virtual bare-metal systems out of all of these elements.
OK, and this will conclude my talk today. Thank you all for your time and attention.