00:01 Rob Davis: Good morning everybody, and welcome to the Flash Memory Summit panel session on Ethernet SSDs. My name is Rob Davis, and I'm the vice president of storage technology for Nvidia Networking. On the panel with me today is Matt Hallberg from Kioxia, Ihab Hamadi from WDC, and Zachi Binshtock from Nvidia. To kick off the panel discussion, they are going to start off with a very quick overview of their personal and company views of Ethernet SSDs and how they are approaching this potentially new way to connect SSDs to a network. But first, I'm going to start off with a quick overview of just what Ethernet SSDs are and where they fit into the overall storage networking picture.
01:03 RD: For any of you who attended the session on Ethernet SSDs last year at Flash Memory Summit, this slide will look familiar. On the left side is the current way NVMe over Fabrics storage networking solutions are implemented today. The servers connect over the network to a JBOF, just a bunch of flash, with RoCE -- RDMA over Converged Ethernet protocol -- or NVMe over Fabrics on TCP. These JBOFs are very much like standard dual-socket servers, except they have PCI switches to enable connecting lots of PCI-based NVMe SSDs, and everything is double wired for high availability. But all those CPUs are doing in a JBOF is converting NVMe over PCIe from the SSD to NVMe over Fabrics on Ethernet. Enter the EBOF, an Ethernet-attached bunch of flash, the picture on the right. Instead of a dual-socket server, it's basically an Ethernet top-of-rack switch, but with SSD slots. And those SSDs have Ethernet ports on them instead of PCIe. And now I'm going to turn it over to the panel, starting with Matt, to explain this further.
02:28 Matt Hallberg: Hi, everyone. As Rob said, my name is Matt Hallberg, I'm the senior product manager at Kioxia, managing the enterprise NVMe and Ethernet-attached SSDs. I've been in the industry for a long time, through test and measurement, and I've been working on SSDs since my time at SandForce, as you'll see in my background there. So, I've been around for a while, as have most of us in the storage industry. Next slide please, Rob.
So, why Ethernet SSDs, and why EBOFs? These are my top three reasons. First one is cost savings. With an EBOF, you are cutting out unnecessary components when you're looking to just add storage to a network. So, typically today with a server you have to buy CPU, DRAM, everything else, and with an EBOF, you're simply buying an EBOF and you're buying the Ethernet SSDs. You don't need a CPU, you don't need DRAM, you don't need HBAs or a NIC. And you save money on power and cooling: if your Ethernet SSDs are around 20 watts and you've got 24 of them in an array, you're at 480 watts. Cooling 480 watts is a lot less than having to cool a system with the CPU and DRAM and everything else.
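Matt's power argument can be sketched as a quick back-of-the-envelope calculation. The 20 watts per drive and 24-drive count are the figures from his example; the JBOF overhead numbers below are purely illustrative assumptions, not figures from the talk:

```python
# Back-of-the-envelope power comparison from Matt's example.
# DRIVE_POWER_W and DRIVE_COUNT come from the talk; the JBOF
# overhead figures are illustrative assumptions only.

DRIVE_POWER_W = 20   # watts per Ethernet SSD (from the talk)
DRIVE_COUNT = 24     # drives per enclosure (from the talk)

ebof_drives_w = DRIVE_POWER_W * DRIVE_COUNT  # the 480 W Matt cites

# Hypothetical JBOF server overhead: two CPUs, DRAM, NICs, PCIe switches.
jbof_overhead_w = 2 * 150 + 50 + 50 + 40     # illustrative only
jbof_total_w = ebof_drives_w + jbof_overhead_w

print(f"EBOF drive power: {ebof_drives_w} W")
print(f"JBOF total (illustrative): {jbof_total_w} W")
```

The point is not the exact overhead numbers but that the JBOF's CPUs, DRAM, and NICs add power and cooling load on top of the same set of drives.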
03:44 MH: Performance is another big reason. We're utilizing existing RoCEv2 networks with fast performance and low latency with NVMe over Fabrics. We're avoiding the bounce buffer of having to wait for the CPU and DRAM to service I/O. We're avoiding stranded performance over a single HBA or a NIC, if you're using direct connect. And with that, why would you waste money on NVMe SSDs in the system if you can't really pull the performance out of the drives through a single NIC or a single HBA?
Also, software-defined storage for things like GPUDirect and Magnum IO will, in the future, allow direct GPU access to NICs, or from NICs to Ethernet SSDs. And, lastly, there's flexibility. It's a lot easier to just add an EBOF to an existing RoCEv2 switch than to go provision a server, get everything set up through your administration, and make the drives available. All right?
04:45 RD: Ihab, go ahead, please.
04:47 Ihab Hamadi: Good morning everyone, my name is Ihab Hamadi. I'm an engineering fellow at Western Digital where I lead systems architecture. And today, we'll try to take a holistic view on NVMe over Fabrics and NVMe over Fabrics SSDs. Next slide, please.
As we look at NVMe over Fabrics, the beginnings and early POCs took place about six years ago. Today, we have a 1.1 spec for NVMe over Fabrics with products out there actually conforming to that spec. For transports, we originally started with NVMe over RoCE. Today, in addition to NVMe over RoCE, we also have NVMe over TCP and NVMe over Fibre Channel. For features, we have things like encryption with TLS, we have advanced discovery, all part of the specification. So, as we look at things like interoperability, that has also come a very long way. In the early days, we didn't have much interoperability at all.
05:42 IH: Today, we have both Open Composable Compatibility Lab and UNH InterOperability Laboratory, offering plugfests, automated compatibility suites and compatibility lists for customers to buy products that have been tested with each other. As we look at the workloads of today, they're also very different from a few years ago, many driven by AI/ML. And our ability to match these workloads with the right infrastructure is one of the ways that will continue to drive performance into the new decade. All of this is fueling a shift to more composable infrastructure and is making legacy storage architecture, and even protocols, less desirable, and frankly, in most cases, less relevant to the data center. There are different approaches to building NVMe over Fabrics storage systems with interesting tradeoffs and design points, and we're hoping to get into some of those discussions within this panel.
06:38 RD: Great, thank you. Zachi?
06:42 Zachi Binshtock: Hi everyone. I'm Zachi Binshtock. I'm leading the software and system architecture at Nvidia Networking. I'm responsible for switch system, operating systems and management systems for their switch business line. And next slide please, Rob, thanks.
So, in this slide, we see a breakdown of the operating system of a switch, the different components and software stack, which is essentially the same for the top-of-rack and the EBOF. And on top of that, you add the capabilities to manage and monitor the flash drives. And a note about the planes. There is the data plane, that is, the RoCE or TCP transport towards the drives, and there is the management plane, usually towards the switch CPU, which manages the drives and the operating system of the switch, and that runs the Redfish protocol or others in order to manage the entire EBOF system together. That's it.
07:53 RD: Thanks Zachi. And with that, we're going to switch over to the panel discussion and get questions from the audience -- it looks like we have a couple. So, first off, I'll ask this one to, let's see, how about you Ihab? What is the anatomy of an Ethernet SSD and when do you think we're going to start seeing more of them in the market?
08:27 IH: Yeah, it's interesting, Rob. As we look at that space today, there are new and exciting classes of Ethernet SSD devices, generally. Some of those devices may internally be composed of multiple smaller units, and we actually see some of those products out there in the market. And that has been going through an evolution. So, what started with a bunch of, say, SSDs with maybe a SmartNIC on the front end is now moving towards more and more bridges, and specifically-built silicon that is driving some of the economics for that in the right direction. And then, as we look at the mainstream SSD form factors, we're adding Ethernet to them, even form factors like U.2, U.3 and EDSFF.
09:22 IH: If you look at an SSD today, you'll see some flash, you'll see an SSD controller silicon, some DRAM and miscellaneous capacitors and other things. That's also been going through an evolution to add an NVMe over Fabrics Ethernet interface to it. The first iteration of that is really a dongle, a bridge you slap in front of a drive, and it converts between NVMe over Fabrics and PCIe and NVMe. And that's also going to continue to go through more integration to where that eventually gets integrated into the controller of SSDs, again continuing to drive some of the economics there. In terms of timing, some of these units are already available from various vendors, and the economics are all getting in the right direction. The more integration we see of NVMe over Fabrics controllers within that main SSD controller, the better economics we will have. And we expect to see some of those hit the market within the next couple of years.
10:28 RD: Matt, do you have anything to add to that?
10:31 MH: Yeah, that was a pretty thorough response. Really the anatomy, as Ihab said, comes down to a couple of different methods, just where the bridge is. Is the bridge behind the SSD itself? Is it on the SSD, or is it located somewhere else within the EBOF?
10:51 RD: Zachi, anything to add there?
10:56 ZB: No, Rob.
10:56 RD: All right, how about a question for you then, Zachi? How different is an EBOF with Ethernet SSDs from the switch perspective than a top-of-rack switch?
11:12 ZB: So, in my opinion, they are not that different. You want to have the entire set of capabilities that you have on the top-of-rack, meaning that you can provision quality of service and handle RoCE traffic, and you can run Layer 2 or Layer 3 protocols to interconnect into the fabric, and you want to support IPv4 and IPv6 and all of that. But there are some differences, and I think the differences are around telemetry. Once you combine the two systems together, the drives and the top-of-rack together, you want to be able to pinpoint problems, trying to understand where they are. Is it at the drive level or at the switch level? So, you need to have better telemetry for the system.
12:03 ZB: The second thing, I think, is security. So, you need to really think about security. Top-of-rack switches currently do not own the data, so they can get away with some security nuances that you now need to take care of together as a system. It starts from the hardware itself, the hardware root of trust of the system: how you run the flash drives, how you authenticate them, and how you attest the software. All of that together needs better security as a system as a whole. This is one difference from the top-of-rack, and another one is that on the top-of-rack, you can get away with oversubscription. You cannot do that with an EBOF. You need to avoid any oversubscription, otherwise you will get bottlenecks. So, essentially, it's the entire set of operating system capabilities that you have on the top-of-rack, with some subtle nuances for the EBOF as a system.
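The oversubscription constraint Zachi describes is just a ratio of drive-facing bandwidth to uplink bandwidth. A minimal sketch of the check, where the port counts and speeds (24 drives at 25GbE, six 100GbE uplinks) are illustrative assumptions rather than figures from the talk:

```python
# Oversubscription check for an EBOF's internal switch fabric.
# Port counts and speeds are illustrative assumptions.

drive_bw_gbps = 24 * 25    # 24 Ethernet SSDs at 25GbE each
uplink_bw_gbps = 6 * 100   # six 100GbE uplinks to the fabric

ratio = drive_bw_gbps / uplink_bw_gbps
print(f"oversubscription ratio: {ratio:.2f}:1")

# A ratio above 1.0 means the uplinks can bottleneck aggregate
# drive traffic, which is what an EBOF design has to avoid.
assert ratio <= 1.0, "EBOF uplinks are oversubscribed"
```

A top-of-rack switch can tolerate a ratio above 1:1 because host traffic is bursty; an EBOF serving sustained block I/O from every drive at once cannot.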
13:14 RD: Great, thank you. Ihab, any additional comments?
13:19 IH: Yeah, the biggest one that really comes to mind is the ability of the switch to actually serve and receive blocks, your data essentially, with minimal drops, preferably zero drops. So, as Zachi was talking about, things around oversubscription, you want to try and eliminate that as much as possible. If you're using NVMe over RoCE, then the ability to actually configure the switch in terms of the various queues, and even things like flow control and congestion management, all become critical to minimize, and hopefully eliminate, any packet drops there.
And then the other thing that comes to mind is the control plane of the switch: the ability to actually automate that as much as possible in terms of overall configuration and overall orchestration. That's also a trend for top-of-rack switches, but it becomes even more pronounced and more critical for things like EBOFs.
14:32 RD: Cool. Matt, any input there?
14:36 MH: No, that was pretty thorough from both guys.
14:38 RD: All right. Well, here's one for you then, Matt. God, we're getting lots of questions. This is great. Why are Ethernet SSDs and EBOFs not taking off if there are so many advantages, as you talked about earlier?
14:53 MH: Well, I think what we're seeing with EBOFs and Ethernet-attached SSDs is the same thing that we saw when we moved from PCIe-based SSDs to NVMe-based SSDs. Today, it's a lack of infrastructure. There have been a couple of systems and proof-of-concept systems announced from several different companies like Foxconn, and WD, and us, and Samsung. There's a lot of guys who are starting to wade into the pool of this kind of storage paradigm, and we're not there yet. So, the third-party software support is coming around. There's lots of standards activity going on, but I wouldn't say that we're in a fully baked position yet where someone's ready to dive all in.
15:50 RD: OK. Ihab, any additional comments?
15:53 IH: Yeah, I touched on interoperability a little bit in my introduction. So, we originally used to see some interoperability issues that are not uncommon for new technologies, and those interoperability issues are, by and large, behind us at this point. So, we're definitely heading in the right direction there. And the availability of a rich ecosystem from various vendors is definitely something that would help propel this further in the right direction.
In the beginning, we had one or two vendors; now we're beginning to see more and more vendors for drives, for platforms, and even on the host side we're seeing a lot of innovative ways to actually present NVMe over Fabrics storage to servers and to hosts, in many cases without even needing to configure whole stacks of RoCE, or even TCP for that matter. You can still do that, but there are some innovative solutions out there that will make fabric storage look exactly like local storage. So, I think all of these features and new capabilities that we're adding to the ecosystem will help with the adoption cycle.
17:09 MH: Yeah, if you don't mind me adding, really, this market just needs a couple of guys to pick it up and show it in the real world. All of us vendors are telling everybody the positives of this technology. We're all out there seeding and doing proofs of concept, and as soon as someone is able to say, "Hey, in my workload I'm seeing crazy benefits. Everything that these guys are telling me is true," that's when this thing goes from super bleeding edge to early adoption and into the mainstream adoption cycle curve.
17:43 RD: Zachi, anything to add?
17:47 ZB: Nope.
17:48 RD: All right. No problem here. Question for you though, Zachi. So, what are the EBOF features which can actually be provided by the operating system of the EBOF or the switch operating system?
18:09 ZB: So, I think the vendor tools, the flash drive vendor tools, can be added to the operating system. So, this is one addition to the switch operating system, and the other thing is management interfaces like Swordfish or Redfish that can be added, which are traditional not in networking gear but more in storage gear. So, these are additions that can be made, and technically it's fairly easy now, with a modern operating system, to add those kinds of capabilities using Docker and things like that. So, those are the main additions, and on top of that, you can add, as I said, automation and telemetry. So, it's fairly easy to take it from there and add more logic to the switch to make life easier to manage and operate these kinds of systems.
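To make the Redfish point concrete, here is a minimal client-side sketch of pulling drive inventory from an EBOF's management plane. The endpoint URL is hypothetical, and the payload below is a hand-written example shaped like a Redfish Drive resource, not output from any real system:

```python
import json
from urllib import request

# Minimal sketch of inventorying drives via Redfish.
# The endpoint and path are hypothetical; real systems expose drives
# under resource paths defined by the Redfish/Swordfish schemas.

def drive_summary(drive_resource: dict) -> dict:
    """Extract the fields an operator typically wants from a Drive resource."""
    return {
        "id": drive_resource.get("Id"),
        "capacity_bytes": drive_resource.get("CapacityBytes"),
        "health": drive_resource.get("Status", {}).get("Health"),
    }

def fetch_drive(url: str) -> dict:
    """GET one Redfish Drive resource from a (hypothetical) EBOF endpoint."""
    with request.urlopen(url) as resp:
        return drive_summary(json.load(resp))

# Hand-written example payload in the shape of a Redfish Drive resource:
sample = {
    "Id": "EthernetSSD-07",
    "CapacityBytes": 3_840_755_982_336,
    "Status": {"State": "Enabled", "Health": "OK"},
}
print(drive_summary(sample))
```

Because Redfish is plain HTTPS plus JSON, this kind of tooling can run in a container on the switch itself, which is the "using Docker" point Zachi makes.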
19:12 RD: Cool. Ihab, anything to add? Or Matt?
19:17 IH: I think that covered it.
19:21 MH: Ihab, I'll defer to you if you want to chime in.
19:28 RD: We're good on that one?
19:29 IH: Yeah.
19:31 RD: OK. So, speaking about management, Ihab, there's a question about the management of the Ethernet SSDs, the kind of management issues Zachi was talking about. Can you elaborate on that at all? Or Matt, one of you guys?
19:51 IH: Sure, maybe I can start on that one. Management really involves a few pieces. The first and very obvious one is the discovery and configuration of these drives, everything from assigning IP addresses, or maybe custom MAC addresses in some cases, and then whatever depends on the transport protocol. If you're using RoCE, for example, you have to configure things like flow control, and you've got to configure PFC and ETS and those sorts of things. Both RoCE and TCP will also require some congestion management. That's also getting to the point where interoperability is getting easier, but we're also seeing more and more advanced congestion management emerging, particularly within the past three, four years. So, that all falls under the umbrella of management. Then you get things like speed settings, and then you get various capabilities that in some cases might be vendor-specific, even for this type of storage.
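The provisioning step Ihab describes can be sketched as a small script that assigns each drive an address and its RoCE-related link settings. Everything here, the subnet, the PFC priority value, and the ETS weight, is an illustrative assumption, not a recommendation from the talk:

```python
import ipaddress

# Sketch of per-drive provisioning for an EBOF: assign each Ethernet
# SSD an IP address plus the lossless-Ethernet settings RoCE needs.
# Subnet, PFC priority, and ETS weight below are illustrative only.

def provision_drives(subnet: str, count: int,
                     pfc_priority: int = 3, ets_weight: int = 50):
    """Return one config record per drive slot."""
    hosts = ipaddress.ip_network(subnet).hosts()
    return [
        {
            "slot": slot,
            "ip": str(next(hosts)),
            "pfc_priority": pfc_priority,  # traffic class made lossless via PFC
            "ets_weight": ets_weight,      # ETS bandwidth share (percent)
        }
        for slot in range(count)
    ]

configs = provision_drives("192.0.2.0/27", 24)
print(configs[0])
```

In practice this kind of logic would sit in the EBOF's control plane or an orchestration tool, so that plugging in a drive triggers addressing and lossless-class setup automatically, which is the automation trend Ihab mentions.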
21:03 RD: How about loading firmware and that sort of thing?
21:08 IH: Absolutely, the ability to update firmware, and the ability to, in some cases, even verify that this is the right firmware, the firmware that you think you're loading. So, it gets into some security aspects there as well.
21:23 RD: Matt, anything to add?
21:24 MH: Yeah, I'll add that there's a lot of encouraging work going on. So, there's an alliance between SNIA and DMTF, and work on Redfish and Swordfish specifically, for the management of the Ethernet SSDs. So, there's quite a bit of collaboration. That's not the only management software that's out there, but there's a lot going on. There's NVMe-MI over BMC, there's quite a few things for managing these SSDs, and as the guys said, the fun parts like firmware updates, et cetera, are on their way. These are all things that are in process, and we've got a lot of brilliant specification guys having weekly meetings and things like that. So, progress is coming, management software is there, there are multiple suites; again, Redfish is the one that I have been acutely aware of, and there's a lot of us working specifically on that. If you look at some of the developers of Swordfish and Redfish, ourselves and WD are on that list. I believe Nvidia may be on the list as well, but I know WD and Kioxia are definitely working on it.
22:45 RD: Zachi, it seems like one major difference between a top-of-rack switch and an EBOF switch is that you don't know what SSD is going to be plugged in. So, how do you handle the different management layer needed for whatever vendor's SSD is plugged into your EBOF?
23:10 ZB: That's a fair question. I think the idea is to standardize the management interface; that's one way to go. Something I'm not clear on the answer to is how you present the system: you can separate the management of the switch from the flash drives using a BMC, or you can combine the two. So, this is something I would actually ask the panel: what is your point of view here? Because switches usually have an out-of-band interface, and through that you manage the entire system. Do we need to have a separate interface to manage the drives? I'm not certain that this is the right direction, but this is an open question. How do you manage the entire system? Is it together, through a single interface, or is it a separate interface to the flash drives and to the network? And the network is the switch.
24:18 RD: So, basically, the different flash drive vendors would have their own piece of software maybe running on a switch that manages their drive.
24:28 ZB: Yeah. Exactly, yeah.
24:30 RD: What do you guys think on the panel?
24:35 MH: Well, I mentioned SMBus previously. So, with NVMe SSDs, or PCIe SSDs, starting with, I want to say, NVMe 1.2, NVMe-MI, the management interface, came about and got to be pretty popular. We've seen it on reference designs from the Intel platform group; it's on Dell, HP, Lenovo, all the major OEMs, and the ODMs too, all have BMCs. In general, I can't speak for all drive manufacturers, but the majority of us support this sideband, out-of-band protocol over I²C, and within NVMe, starting in 1.2, or actually 1.3 now, you have the concept of NVMe-MI over MCTP, and MCTP over PCIe. So, those are the management transfer protocols.
25:34 MH: So, there are ways that are aligned to how I believe things are done at the system level. I also wanted to point out that, in terms of interoperability and having a litany of different kinds of drives behind the host, one of the things that we are doing -- when I say we, I mean we as the entire Ethernet SSD and EBOF ecosystem -- is working with guys like UNH-IOL for interoperability. We're going to have Ethernet SSD compliance testing, where, if any of the engineers out there on the call have been to some of these plugfests, essentially you're going from room to room or from table to table, plugging into everybody's different implementations and making sure that drive behavior, host behavior, software behavior is as vanilla as it comes. You plug it in, it works; you unplug it, you hot-plug it, it works. You're basically ensuring that all the components that are part of a system are doing the same thing consistently, across multiple devices, multiple hosts, et cetera.
26:46 RD: Great, Matt. I think we're probably out of time, unfortunately. This has been a great discussion. So, I want to thank everyone on the panel, and also we as a group want to thank the Flash Memory Summit for giving us this opportunity to explain the new Ethernet SSDs and where they are and what the advantages are to using them. Thank you very much.
27:17 MH: Thank you, everybody.
27:18 IH: Thank you, everybody.