Guest Post

Benefits of Native NVMe-oF SSDs

Deploying NVMe-oF natively at the SSD level is an approach that can yield greater scalability, performance, reliability, availability and serviceability (RAS) while also helping to contain compute and memory investment costs.

Download the presentation: Benefits of Native NVMe-oF SSDs

00:01 Matt Hallberg: Hello, my name is Matt Hallberg and I'm the senior product marketing manager over at Kioxia for our Ethernet and enterprise NVMe SSDs. And today, I will be going over Ethernet SSDs leading to higher performing storage. Contact information is there just in case.

So, just a couple of quick notes about me. As I mentioned, I'm at Kioxia America -- Kioxia, formerly known as Toshiba Memory America. I've been in the market for about 18 years related to storage, starting off with test measurement related to SAS and SATA, PCIe, InfiniBand, Fibre Channel, Bluetooth, USB, you name it. I've generally been around it and I've been involved with it. And, more recently, I've been involved on the SSD side with companies like SandForce, LSI, Seagate and Kioxia.

01:09 MH: So, today, as we're going through Ethernet SSDs, I wanted to talk to you a little bit about the way the infrastructure works. I'm going to be introducing a concept called EBOF, or Ethernet bunch of flash. So, let's do a comparison of JBOF and EBOF.

I'm sure everyone here is familiar with what a JBOF is, but essentially it's a general-purpose server. You have a CPU, DRAM, NIC, HBA, generally a PCIe switch if you're not directly connected to the CPU, and you just add more and more and more drives. I think today's servers can handle 8 to 12 drives, maybe some can handle more. But, in general, if you wanted to add flash or add storage to your data center, you had to buy an entire server with drives inside of it. Also, with JBOFs, when you're moving data to and fro, everything has to go through the CPU and DRAM, and inside of that box, you've got to cool everything: CPU, DRAM, NIC, drives, etcetera.

02:20 MH: So, there's a new architecture called Ethernet bunch of flash. Essentially, it's a bare-metal system. You don't have to buy CPU or DRAM; there's a switch inside of the box. There are no additional components needed outside of that. It utilizes RDMA over Converged Ethernet, or RoCE v2, infrastructure. It takes advantage of NVMe over Fabrics with no middleman to add latency or CPU overhead, meaning you don't have to worry about the data coming in through the NIC, through the CPU and DRAM to the HBA, to the switch, to the SSDs, and back and forth along that path. There is a potential of adding latency and CPU overhead if your CPU is doing anything else or if your DRAM is busy with other activities within the server. So, an EBOF is simply just a box: you can buy a box, you can fill it up with 24 drives, you can fill it up with 12 drives, however many drives you want, and as you want to add storage, you're merely adding these Ethernet-attached drives to your existing RoCE v2 network. So, it's a lot easier to add storage, and again, you're not having to deal with all the components on the inside of the box.

03:37 MH: So, this is part of a larger move towards disaggregation. So, today's server storage model, mostly server model, is what I would call a converged infrastructure. That means you buy a server, you use a server for one workload, you use a different server for another workload, another server for an ML workload, another server as a web server. And, so, essentially what's happening is you're buying all these systems and some workloads are storage-centric, some workloads are CPU-centric, some workloads need a lot more NIC activity and some don't. And so what ends up happening is on all these servers, if they're all being used for different things, you have what's called underutilized resources or siloed resources, essentially, where you paid a bunch of money for a box for one purpose and you're not really using the other thing and so you sort of overinvested.

04:36 MH: There's been a big movement, and this has been a movement for a long time, towards composable disaggregated infrastructure. So, what that essentially means is now you have a system that's dedicated to storage, or composable for storage, and you have a system that's composable for compute, meaning you have a composable fabric and composable software that essentially acts as a composer. So, here's my storage, here's my CPU. I have these tasks that need the CPU; well, I'm going to assign this CPU usage or compute usage to the user set that needs it. This user set needs storage, so I'm going to assign storage to this user set. So, everything is manageable from a top-end level; that's very efficient resource allocation.
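
Editor's illustration: the composer idea described above can be sketched as a small resource-pool model. This is a minimal sketch under stated assumptions, not any vendor's composer software; the `Pool` class, the pool sizes and the user-set names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A disaggregated pool of one resource type (hypothetical model)."""
    name: str
    capacity: int                      # total units, e.g. CPU cores or terabytes
    allocations: dict = field(default_factory=dict)

    @property
    def free(self) -> int:
        """Units not currently assigned to any user set."""
        return self.capacity - sum(self.allocations.values())

    def assign(self, user: str, units: int) -> bool:
        """Give `units` to `user` if the pool has room; report success."""
        if units > self.free:
            return False
        self.allocations[user] = self.allocations.get(user, 0) + units
        return True

    def release(self, user: str) -> None:
        """Return everything `user` holds to the pool."""
        self.allocations.pop(user, None)


# Hypothetical pools: compute and storage are sized and assigned independently.
compute = Pool("compute-cores", capacity=128)
storage = Pool("storage-tb", capacity=96)

compute.assign("ml-training", 96)   # a compute-heavy user set
storage.assign("ml-training", 16)
storage.assign("web-tier", 8)       # a user set that needs storage only

print(compute.free, storage.free)   # 32 72
```

The point of the sketch is that a storage-only request draws from the storage pool without stranding any compute, which is the underutilization problem the converged model has.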

05:24 MH: EBOF actually builds on composable disaggregated infrastructure by reducing the cost and cooling that you need, and I'll get into that in a minute; by improving the performance and latency, because you're operating these drives at 25-gig Ethernet behind NVMe over Fabrics and you're bypassing having to send everything through the CPU and DRAM, at least from the switch to the SSD; and by increasing the scalability of adding more drives to the network. Again, simply just buying a box, throwing some drives in, no CPU, no DRAM, etcetera. So, for me, disaggregation is a fantastic thing that, again, I've watched come along for a while, and I've been very excited with the advancements in today's technology. PCIe Gen 4 is now making our NICs and everything else faster. You can get to 200 Gigabit Ethernet NICs -- 400 Gig is on the way with PCIe Gen 5. You have things like NVMe over Fabrics. Where low-latency, high-throughput interconnects and networks used to mean InfiniBand, now, with RoCE v2, lots and lots of people are seeing the benefits of deploying over an existing Ethernet network.

06:49 MH: So, I'm going to go over a couple of different use cases where I believe Ethernet SSDs and EBOFs provide a great amount of benefits. The first one is data centers. So, for me, I believe that the No. 1 thing that you're getting out of NVMe over Fabrics is performance and latency, and that has been proven. A couple of different universities I saw have actually shown that 100 Gb Ethernet NVMe over Fabrics performs very, very similarly, with not a lot of drop-off, versus some of the more recent InfiniBand speeds and interconnects. There's a lot of efficiency that's being gained; there's no latency overhead from the CPU and DRAM, or any of the other components. I talked about that earlier.

Supply-wise, in a data center, you always want to make sure you're not dealing with long lead times, and the last few years have shown us that there have been long lead times on things like CPUs, there have been DRAM shortages, and there have been availability issues with PCIe Gen 4 components and systems.

07:54 MH: You're bypassing all of that by switching to an EBOF. On the cost side, you're not having to buy a CPU, you're not having to buy a DRAM or any of the other components inside of what you would call a general-purpose server. And then you look at other things like system cooling.

If you think about it, inside of an EBOF, you can have up to 24 drives. If those drives are pulling 20 watts, that's 24 times 20, which is 480 watts. There are some other components in the system, such as the IOM, which don't draw a lot of power. So, effectively, you're cooling 24 drives and a minimal set of components versus having to cool X amount of drives in the system, your GPU or your CPU, your DRAM and everything else. As an example, today, if you were to go buy some servers, the minimum general-purpose server has a PSU that's rated for 750 watts, and generally you'll have to buy two of those. So, we're talking about 1,600 watts for a server versus maybe 500 watts for one of these EBOFs. If you're thinking of an average cost of a dollar per watt, that's a big difference -- even if it's 750 versus 500, that's 250 watts that you're having to pay for and cool over what you have with an EBOF.
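
Editor's illustration: the power comparison above is simple arithmetic, sketched here using only the figures quoted in the talk. Note that two 750-watt PSUs add up to 1,500 watts of rated capacity, slightly under the rounded 1,600-watt figure quoted, and that a PSU rating is a ceiling rather than a measured draw. The variable names are hypothetical.

```python
# Figures quoted in the talk.
DRIVES_PER_EBOF = 24
WATTS_PER_DRIVE = 20
PSU_RATED_WATTS = 750
PSUS_PER_SERVER = 2
EBOF_BUDGET_WATTS = 500        # the talk's "maybe 500 watts" for a full EBOF
DOLLARS_PER_WATT = 1.0         # the talk's "average cost of a dollar per watt"

drive_watts = DRIVES_PER_EBOF * WATTS_PER_DRIVE          # power for the drives alone
server_rated_watts = PSU_RATED_WATTS * PSUS_PER_SERVER   # rated capacity, not measured draw

# The talk's conservative case compares a single 750 W PSU against the EBOF.
conservative_gap_watts = PSU_RATED_WATTS - EBOF_BUDGET_WATTS
two_psu_gap_watts = server_rated_watts - EBOF_BUDGET_WATTS

print(f"drives alone: {drive_watts} W")
print(f"two PSUs rated: {server_rated_watts} W")
print(f"conservative gap: {conservative_gap_watts} W "
      f"(${conservative_gap_watts * DOLLARS_PER_WATT:.0f} at $1/W)")
print(f"two-PSU gap: {two_psu_gap_watts} W "
      f"(${two_psu_gap_watts * DOLLARS_PER_WATT:.0f} at $1/W)")
```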

09:17 MH: And, lastly, inside of the data center, scalability is one of the most key items you can have, and with an EBOF, it's scalable disaggregated storage. You're literally just adding an EBOF to an existing RoCE v2 network. For high-performance computing, compute nodes are very resource-intensive. And inside of these compute nodes, essentially all of the PCIe lanes are dedicated to GPUs. So, if your chipset has 80 lanes, the majority of those lanes are going to be used to put in as many GPUs as possible because you want your box to do lots and lots of transactions, lots of processing.

So, generally, you'll find systems that have a lot of GPUs and just a few drives, or a few lanes set aside for local storage. One of the primary applications in the HPC sector is this concept called burst buffering. And that's essentially a DRAM offload to storage, for checkpointing or for snapshots, or -- here's what I've been working on -- I want to make sure nothing gets corrupted and so I can restart at the snapshot without having to load the entire amount of data that I've already processed so far. Similar for checkpointing.

10:37 MH: And then within high-performance computing you have this thing called an I/O phase, which comes after a computation phase. The computation phase is essentially your GPUs chugging and chugging and chugging through datasets, or things like that, and the I/O phase is when they want to offload the data, generally to DRAM or local storage, and you want to load new data in.

So, while the data is being offloaded and loaded onto the GPUs, your GPUs are just sort of sitting around waiting for something to happen. So, for me, if I went and spent a lot of money on a compute node, I want to make sure that my GPUs are as busy as possible all of the time. So, with an EBOF, you're able to take advantage of the high performance and low latency in NVMe over Fabrics to quickly offload and load data to the GPUs, which allows GPUs to reduce idle wait states. And, again, this is very scalable, so you don't have to buy multiple systems to add lots and lots of . . . You don't have to buy a server, is what I'm saying, to buy more storage to offload from your compute nodes.
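
Editor's illustration: the idle-GPU argument above can be put as a back-of-the-envelope model in which each cycle is a computation phase followed by an I/O phase. This is a sketch only; the function names, checkpoint size, phase length and throughput values are hypothetical and are not measurements from the talk.

```python
def io_phase_seconds(checkpoint_gb: float, throughput_gb_per_s: float) -> float:
    """Time to offload (or load) `checkpoint_gb` at a given storage throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return checkpoint_gb / throughput_gb_per_s


def gpu_utilization(compute_s: float, io_s: float) -> float:
    """Fraction of each compute + I/O cycle that the GPUs spend computing."""
    if compute_s < 0 or io_s < 0 or compute_s + io_s == 0:
        raise ValueError("phase times must be non-negative and not both zero")
    return compute_s / (compute_s + io_s)


# Hypothetical workload: 300 s of computation, then a 500 GB checkpoint.
COMPUTE_S = 300.0
CHECKPOINT_GB = 500.0

slow = gpu_utilization(COMPUTE_S, io_phase_seconds(CHECKPOINT_GB, 5.0))    # 5 GB/s
fast = gpu_utilization(COMPUTE_S, io_phase_seconds(CHECKPOINT_GB, 50.0))   # 50 GB/s

print(f"slow storage: {slow:.0%} busy, fast storage: {fast:.0%} busy")
```

In this model, shortening the I/O phase is the only lever that raises GPU utilization without changing the computation itself, which is the case being made for fast fabric-attached storage.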

11:51 MH: We've got a cool chart in the upper right-hand corner. This was using . . . We have KumoScale NVMe over Fabrics software, and there was a system that had 8 GPUs and storage located elsewhere on the NVMe over Fabrics network, and then 8 GPUs with local storage. And what this chart is showing you is that, generally, the performance of your offload, or your transactions from your GPUs to storage, is not impacted by going over NVMe over Fabrics. So, you're using the fabric to do offload and onload; there's literally no latency, or negligible latency and performance difference, between that and local storage. So, that's pretty compelling to me because you'd think, "Oh, I've got to store my information to the network. That's going to be a lot slower than writing locally." Well, in this case, we were able to demonstrate that that wasn't actually what was going on.

13:04 MH: My next use case is something that came up quite a bit. Nvidia showed this off at their recent trade show or, not trade show, informational show about GPUDirect and Magnum IO storage. Essentially, inside of a compute node you have a lot of GPUs, and you don't have a lot of lanes for SSDs. And while your GPUs are trying to offload data or get data loaded in, everything has to go through the CPU and DRAM. So, what's happening inside of the system is, in some cases, if your CPU and DRAM are already working on other things or they're overloaded, you have this phenomenon called a bounce buffer, which heavily impacts the performance that you're getting -- your transactions per minute and your throughput, megabytes per second or however else you want to measure it. And I saw a cool demo where they first had data going through the CPU and DRAM, and then, using GPUDirect with Magnum IO, they were able to bypass the CPU and DRAM when sending the data to the SSDs.

14:16 MH: Basically, they were now directly sending and receiving data from the SSDs, and there is also access to go right to the NIC that's inside of the system. So, by bypassing the CPU and the DRAM, you avoid this bounce buffer, and you avoid the delays that you'll see when you encounter bounce buffer situations. And, again, just to keep touching on this, you want to keep your GPUs full of data all the time to accelerate your training phases. Any time your GPUs are waiting for something else, you're wasting money, you're wasting time. So, that's why I think EBOFs are a great tool, because not only are you going to see really good performance and little latency talking to the drives via 25-gig Ethernet, these same EBOFs could be part of your data pool.

15:07 MH: And, so, instead of having a much slower storage in the data pool now, as you're starting to tier up data from maybe the cold section of data, you're moving it to warm, you're moving it to hot tier, and you're preparing it to send it off to a compute node, you're able to do that in the background, on a RoCE v2 network and really accelerate everything so that when the GPUs are ready, the data's ready and the transfers happen quickly.

The last use case that I've looked at is storage arrays. So, inside of a storage array, you'll see HBAs, you'll see NICs, sometimes you'll see HCAs, which are host channel adapters for InfiniBand. One of the cool things that's really come about, as I mentioned earlier, is that RoCE v2 has become a big-time, I wouldn't say replacement, but complement to InfiniBand. There are not a lot of tradeoffs between InfiniBand and RoCE v2. As I mentioned before, I've seen some university studies where the latency and performance difference is less than 5%.

16:15 MH: And instead of having to have InfiniBand as your interconnect, which could be more expensive than a RoCE v2 deployment, you're able to use your existing RoCE v2 environment, and supposedly it's cheaper and a lot easier to set up, but I'm not an administrator. Also, within the storage boxes, there's a lot of redundancy that's required for high availability, so each of the SSDs inside of a storage box is dual-port. The neat thing about an EBOF is that generally there are two I/O modules, and you are baking redundancy into the system. The system has got two 25-gig Ethernet ports per drive, so server A, server B or whatever else is on the network will find a path to talk to the drive, either port one of the drive or port two of the drive. Your redundancy is already baked into the system. And really what it is at the end of it is just adding an additional Ethernet switch as you add more EBOFs to your storage array.
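
Editor's illustration: the dual-port redundancy described above amounts to a host choosing whichever of a drive's two ports is still reachable. The sketch below models only that selection logic; the `DrivePort` class, the I/O module labels and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and real multipath handling in an operating system is considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DrivePort:
    """One 25 GbE port on a dual-port Ethernet SSD (hypothetical model)."""
    address: str        # placeholder IP address for this port
    io_module: str      # which of the EBOF's two I/O modules the port sits behind
    link_up: bool = True


def pick_path(ports: Sequence[DrivePort]) -> Optional[DrivePort]:
    """Return the first port whose link is up, or None if the drive is unreachable."""
    for port in ports:
        if port.link_up:
            return port
    return None


drive = [
    DrivePort("192.0.2.10", io_module="IOM-A"),
    DrivePort("192.0.2.11", io_module="IOM-B"),
]

print(pick_path(drive).io_module)   # IOM-A

drive[0].link_up = False            # simulate losing the first I/O module
print(pick_path(drive).io_module)   # IOM-B
```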

17:16 MH: OK. So, this is what we're doing in the market as Kioxia. We introduced the world's first native NVMe over Fabrics Ethernet-attached SSD. It's an Ethernet-based derivative of our PCI Express Gen4, CM6 and CE6 series SSDs, which were introduced back in February earlier this year. We see this as a great fit for expansion storage of AFAs or software-defined storage with EBOFs. And we've been working very closely with Marvell, and we feature Marvell's Fabrico Ethernet SSD controller converters, meaning it's Ethernet on one side and then it converts it to PCIe to the drive. And that bridge chip is built into the SSDs; there's no external T-card or bridge PCB that interrupts the connection. One of the key features of the drive itself is, again, that it supports single- or dual-port 25-gig Ethernet via RoCE v2.

18:18 MH: It's NVMe over Fabrics 1.1, and it's compliant with NVMe 1.4. It's a 2.5-inch, 15-millimeter z-height form factor drive, with 2-terabyte, 4-terabyte and 8-terabyte capacities supported. Some of the other neat features are that it supports up to two-die failure recovery and other reliability features. It supports Redfish, and we're working very closely with SNIA on getting Ethernet SSDs baked into Redfish management software. And it also supports NVMe-MI, the management interface, among its storage management specifications, with support for IPv4 and IPv6. Lastly, just to summarize, we feel Ethernet bunch of flash is going to be a game-changer for data centers and storage arrays, just about anywhere it can fit in with lots and lots of applications, and we're hoping that you're going to ride the wave with Kioxia.

19:14 MH: Just to retouch on some of the multiple benefits of deploying EBOFs again: applications in HPC, AI, ML and CDI, which is composable disaggregated infrastructure, all benefit from using NVMe over Fabrics in EBOFs. You're reducing system cost, network complexity and power budget with Kioxia's Ethernet SSDs, and the EBOFs are available in the market. And, lastly, we're now sampling with ecosystem members Marvell, Accton and Foxconn-Ingrasys. Accton has an EBOF, Foxconn-Ingrasys has an EBOF. And these are some of the high-level system benefits: it's a simplified EBOF design allowing the Ethernet SSD to connect directly to an embedded Ethernet switch. It's available as a 2U system, which can handle up to 24 drives with up to a total of 600 gigabits per second of storage throughput.

20:08 MH: Each system supports 2.4 terabits per second of connectivity, which can be split between network connectivity and daisy-chaining additional EBOFs. We're seeing very high performance with the drives, about 840k IOPS per drive, and at the system level we're actually seeing around 20 million IOPS per 24-bay EBOF with 4K random reads, which is pretty amazing, 20 million IOPS. And this also runs Marvell's EBOF SDK, leveraging the SONiC network operating system and enabling advanced discovery and management functions.
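
Editor's illustration: the system-level figures quoted above follow from the per-drive figures, as this short check shows. It uses only numbers from the talk and assumes one 25 GbE port per drive for the 600 Gb/s total; the variable names are hypothetical.

```python
# Figures quoted in the talk.
DRIVES_PER_EBOF = 24
PORT_GBPS = 25                 # one 25 GbE port per drive assumed for this total
IOPS_PER_DRIVE = 840_000       # 4K random reads
SYSTEM_CONNECTIVITY_GBPS = 2400

storage_gbps = DRIVES_PER_EBOF * PORT_GBPS
system_iops = DRIVES_PER_EBOF * IOPS_PER_DRIVE
connectivity_ratio = SYSTEM_CONNECTIVITY_GBPS / storage_gbps

print(f"{storage_gbps} Gb/s storage throughput")          # 600 Gb/s storage throughput
print(f"{system_iops / 1e6:.2f} million IOPS")            # 20.16 million IOPS
print(f"{connectivity_ratio:.0f}x switch connectivity")   # 4x switch connectivity
```

The 2.4 Tb/s of switch connectivity is four times the single-port storage total, which is consistent with the talk's point that the remainder can be split between network uplinks and daisy-chaining additional EBOFs.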

So, that's all I had for you guys today. I really appreciate your time. I know sitting around and watching videos on the web is difficult. And, yeah, it's been a pleasure and have a great rest of your FMS sessions. Bye bye.
