NVMe in Cloud Applications
Large-scale operators and vendors such as Facebook, Nvidia, Western Digital, and Microsoft choose NVMe technology for their cloud storage. Learn about NVMe technology at scale, deploying NVMe flash at Facebook, and much more.
00:00 Mark Carlson: OK, so the way this is going to work is if you have a question, please raise your hand, and then I'll unmute you and allow you to talk at that point. So, why don't we get . . . Yeah, if you can come into the other . . . That's great. Otherwise, we'll just go with what we have.
00:21 MC: So, I'm Mark Carlson, I work for Kioxia, I'm also co-chair of the SNIA Technical Council, and I'm active in the NVMe community, working on various technical projects there as well. And today we have a panel of four different speakers: Lee Prewitt, Kamaljit Singh, John Kim and Wei Zhang. And, so, we'll hear a brief set of five-minute slides from each participant, and then we'll open it up for Q&A. Again, raise your hand and we can unmute you, or you can type in the Zoom chat.
So, first up is NVMe technology at scale -- Lee Prewitt's going to talk. And then Kamaljit's going to talk about NVMe technology and flash SSDs in cloud applications. And then John's going to talk about Nvidia's NVMe technology in the cloud, and Wei Zhang is going to talk about deploying NVMe flash on Facebook. So, Lee, take it away, just tell me when to change the slides.
01:30 Lee Prewitt: Great. So, I just want to talk about, oh, the places your data will go. So, here's Microsoft's mission statement. Next slide.
Obviously, Azure is a global-scale cloud provider, millions of servers around the world with tens of millions of NVMe devices. Next slide.
So, I've talked about this before, maybe some of you have heard this in previous FMS sessions: we put together a big tube of steel, filled it full of servers and sank it in the ocean, and we called that Project Natick. The idea is, how can we move actual server capacity closer to users, since a good portion of the world's population lives within 10 to 20 miles of an ocean or a body of water? And we took our standard, stock equipment, our Gen 3 servers, put them into this and let it run. Next slide.
02:48 LP: So, after about two years of runtime at about 117 feet of water, we pulled the system back up. As you can see, they're spray-cleaning off all the barnacles. We opened it up and took a look at all the equipment, and over that time we saw basically one-eighth the failure rate we would get in an equivalent data center on land. There are several factors that might be in play there: a very stable temperature environment, an environment where nobody was coming in and out, bumping into things or pulling out the wrong piece of equipment, and also a dry nitrogen atmosphere inside, so there was less oxidation and things like that. With this sort of data center, those are great sorts of TCO benefits. So, next slide.
03:50 LP: What does that mean for the SSDs inside that tube of steel? Really, what we need is remote debuggability, because we're not going to send a diver out there with a JTAG to open it up, go inside and figure out what's wrong with the device. So, we need a very fine-grained ability to debug issues at scale. And there are many pieces of the NVMe protocol that can help with this: telemetry, device self-test, some things we've added in our own OCP specification around error injection and cooperative recovery, and out-of-band debugging via the SMBus. What this all means is no vendor-unique commands and no vendor-unique tools -- everything is either NVMe- or OCP-based debugging. And that helps reduce costs, because we can get through a debug session without having to send anybody out to a data center. Next.
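To make the remote-debuggability point concrete, here is a minimal sketch (not from the talk) of kicking off an NVMe short device self-test from a Linux host, using only the standard kernel admin passthrough ioctl -- no vendor tool required. The device path is an illustrative assumption.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    /* Illustrative path; use your controller's character device. */
    int fd = open("/dev/nvme0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = 0x14;        /* Device Self-test admin opcode */
    cmd.nsid   = 0xffffffff;  /* apply to all namespaces */
    cmd.cdw10  = 0x1;         /* Self-test Code 1h = short self-test */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0)
        perror("NVME_IOCTL_ADMIN_CMD");
    else
        printf("short device self-test started\n");

    close(fd);
    return 0;
}
```

The results can then be pulled back the same way, via the standard self-test log page, which is what makes scripted, fleet-wide health checks possible without vendor-unique commands.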
04:55 LP: So, as everybody knows, this is the Flash Memory Summit. What would it be without a little bit of HDD bashing? HDDs have one good pro: they're inexpensive. Their cons, though, are pretty much everything else. And with SSDs, it's pretty much the reverse. So, next slide.
So, how can we start to take advantage of all the good things about flash while reducing the cost per bit of those devices? With that, we have now ratified Zoned Namespaces (ZNS) within NVMe. This allows a radical reduction in DRAM use on the device, because instead of a large LBA-based indirection table you only need a much smaller zone-level mapping. With ZNS, with sequential writes that are very kind to the flash, you can get rid of overprovisioning, so we're actually storing customer data on all the bits. And we basically treat the QLC exactly how it likes to be treated, so the WAF gets reduced to very close to 1, which allows for efficient use of QLC media. And all of this, of course, reduces costs. Next.
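As a rough illustration of the DRAM point, here is a back-of-the-envelope sketch. The geometry is assumed for illustration (8 TiB drive, 4 KiB mapping unit, 1 GiB zones, ~4 bytes per map entry), not figures from the talk:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative geometry -- not figures from the talk. */
    unsigned long long capacity  = 8ULL << 40;  /* 8 TiB drive        */
    unsigned long long lba_unit  = 4096;        /* 4 KiB mapping unit */
    unsigned long long zone_size = 1ULL << 30;  /* 1 GiB zones        */
    unsigned long long entry     = 4;           /* ~4 B per map entry */

    /* Conventional FTL: one entry per 4 KiB logical block. */
    unsigned long long ftl_dram = capacity / lba_unit * entry;
    /* ZNS: one entry per zone. */
    unsigned long long zns_dram = capacity / zone_size * entry;

    printf("page-level FTL map: ~%llu MiB DRAM\n", ftl_dram >> 20);
    printf("zone-level map:     ~%llu KiB DRAM\n", zns_dram >> 10);
    return 0;
}
```

Under these assumptions the per-LBA map needs on the order of 8 GiB of DRAM, while the zone-level map fits in tens of kilobytes -- which is the "radical reduction" being described.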
06:18 LP: With that, we now have the OCP specification for NVMe SSDs. What this does is: Facebook and Microsoft got together, put together their requirement specifications and made them common. This builds on all the great work being done in NVMe and the NVMe protocol, and basically defines a common base of requirements so there can be a single source of truth for the firmware. This makes it so SSDs can be built with one firmware and tested across multiple vendors and multiple customers, through test houses and so on, so at the end of the day we end up with a very robust product for hyperscale data centers, which of course reduces costs. Yeah. I'm done. Thank you very much. You're on mute, Mark.
07:34 MC: All right, speaking now to everybody. Kamaljit, you want to take it from here? Kamaljit . . . Oh. Now you can talk, you must've come in again.
07:50 Kamaljit Singh: Can you hear me?
07:51 MC: Yes.
07:53 KS: Thank you. Yeah, I don't know what's going on. I've tried several times. I just can't get the video.
08:00 MC: Don't worry about it. Just speak. [chuckle]
08:02 KS: Thank you everyone for your patience. So, I'd like to discuss some of the key trends that we are seeing around data center SSDs as they relate to the cloud space and cloud applications. First off, we have been seeing continued strength in cloud deployments over the last few years, and we're projecting very nice growth for the next several years, close to 40%.
Secondly, the NVMe technology itself, the standard is quite mature and it's allowing more mature products to be deployed that support different types of workloads, from performance to mainstream to capacity types of SSDs.
Thirdly, the NAND technology itself is also becoming more mature and more dense, from 3-bits-per-cell TLC to 4-bits-per-cell QLC coming onto the market soon. In addition, just as Lee said, the Zoned Namespaces technology is going to be a key QLC enabler, supporting streaming data services and read-intensive workloads like AI in cloud deployments.
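For readers who want to poke at a zoned namespace themselves, here is a minimal sketch (an editorial addition, not from the talk) using the Linux zoned-block-device ioctl to report the first few zones and their write pointers. The device path is an assumption; a ZNS namespace shows up as a zoned block device under recent kernels.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

#define NR_ZONES 4

int main(void)
{
    /* Illustrative path to a zoned NVMe namespace. */
    int fd = open("/dev/nvme0n2", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + NR_ZONES * sizeof(struct blk_zone));
    if (!rep) return 1;
    rep->sector = 0;            /* start reporting from the first zone */
    rep->nr_zones = NR_ZONES;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
        perror("BLKREPORTZONE");
        return 1;
    }
    for (unsigned i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp,
               rep->zones[i].cond);

    free(rep);
    close(fd);
    return 0;
}
```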
09:43 KS: Fourthly, the form factor for SSDs has been changing quite a bit, and the latest standard -- EDSFF-based E1.L and E1.S -- is coming onto the scene. It's going to take over from U.2 and M.2, and it enables more densely packed storage servers. And lastly, the NVMe over Fabrics specification, built on top of NVMe, allows disaggregated infrastructure, which helps us with scaling storage. Next slide, please.
So, regarding the TAM data that we are projecting, we're seeing that the worldwide cloud flash TAM is continuing to grow, and we are expecting it to grow at about a 43% rate for the next several years. Deployment of NVMe standard-based SSDs is also continuing to grow at a very high rate and is expected, within the next three years, to almost completely take over from SAS and SATA.
And thirdly, the composable disaggregated infrastructure TAM is expected to grow to over $3 billion over the next three years or so. So, this bodes pretty well for SSD deployments.
11:15 KS: Next slide, please. Now, the NVMe technology standard gives us some key benefits for cloud applications. In the cloud, a lot of applications are very highly QoS-driven, and NVMe helps immensely there with very low latencies. In addition, the Zoned Namespaces standard, which was recently ratified, is going to enable reduced write amplification and is going to reduce cost by removing the need for as much DRAM. So, it improves efficiency and reduces cost, which is really key for cloud deployments.
And thirdly, QLC, which is coming to the scene soon, enables very densely packed storage. So overall, NVMe and flash are enabling the cloud quite well. Next slide, please. I think you should press a couple more times. Yeah.
So, I'd like to present a case study of the Western Digital Ultrastar SN640 NVMe SSD as an example of what the capabilities are and what we are seeing in real life.
12:57 KS: I won't go through all the details written here, but basically what we're trying to show is that it's available in many capacities and it has a very, very low coefficient of variation. So, it's very predictable, basically. Really low latencies are enabled by this particular drive, and it's pretty much at the top of the competitive drives available in the market. And this drive is highly optimized for various types of cloud workloads, like web search engines, data warehouses, AI and such. Next slide, please. One more time, please. Thanks.
The left graph is trying to highlight the fact that not only is the coefficient of variation so low, but across the different capacities of the same drive you see a lot of consistency. So, that's the key takeaway from the graph on the left. The graph on the right highlights that the Western Digital SN640 QoS for random mixed I/Os is pretty low and very comparable to the competition, especially for cloud applications. Next slide, please.
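A quick aside on the metric: the coefficient of variation is just the standard deviation of the latency samples divided by their mean, so lower values mean more predictable latency. A minimal sketch, with made-up sample values (not SN640 data):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical latency samples in microseconds -- illustrative only. */
    double lat[] = { 92.0, 95.0, 90.0, 94.0, 91.0, 93.0 };
    int n = sizeof(lat) / sizeof(lat[0]);

    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += lat[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (lat[i] - mean) * (lat[i] - mean);
    var /= n;

    /* CV = stddev / mean; dimensionless, comparable across capacities. */
    printf("coefficient of variation: %.4f\n", sqrt(var) / mean);
    return 0;
}
```

Being dimensionless is what makes the metric useful for the comparison on the left graph: drives of different capacities, with different absolute latencies, can still be compared on predictability.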
And lastly, NVMe over Fabrics is a key differentiator that allows scaling of I/O, especially for cloud deployments, and enables very high-performance sharing of the data that's available. Yeah, I think that's about it. I think Lee already touched on several of these topics. Thank you.
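To make the fabrics point concrete: on Linux, connecting a host to an NVMe over Fabrics target is essentially one option string written to /dev/nvme-fabrics, which is what the nvme-cli `connect` command does under the hood. Here is a minimal sketch; the transport, address, and NQN are placeholder assumptions, and it presumes the nvme-tcp module is loaded.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder target details -- substitute your own. */
    const char *opts =
        "transport=tcp,traddr=192.0.2.10,trsvcid=4420,"
        "nqn=nqn.2020-08.org.example:disaggregated-pool";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    /* The kernel parses the options and creates a new controller. */
    if (write(fd, opts, strlen(opts)) < 0)
        perror("connect");
    else
        printf("controller created; see /sys/class/nvme for the device\n");

    close(fd);
    return 0;
}
```

Once connected, the remote namespaces appear as ordinary /dev/nvmeXnY block devices, which is what lets disaggregated flash pools look local to applications.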
15:14 MC: All right, thanks Kamaljit. John?
15:18 John Kim: Hello, Mark. Thank you. So, I want to talk a little bit about Nvidia with NVMe technology in the cloud. It turns out, of course, that a major part of Nvidia's focus is AI, machine learning and big data analysis, so there's the question of how you use NVMe to support this in the cloud. What we see is that on the private cloud, a lot of our customers will take these servers that we build, or that our partners build to similar specs -- the Nvidia EGX, HGX or DGX -- and deploy them in a private or hybrid cloud, and these servers today tend to use internal NVMe SSDs. The larger systems will use something we call GPUDirect Storage in order to get really fast access to storage over the network. That could be with NVMe over Fabrics or with other protocols, but it tends to be used so these servers, which have GPUs and CPUs, get really fast access to flash storage, giving you a faster AI-in-the-cloud experience.
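For context, the point of GPUDirect Storage is that a read lands directly in GPU memory via DMA instead of bouncing through a host buffer. Here is a minimal host-side sketch against the cuFile API; the file path and transfer size are illustrative assumptions, and error handling is elided for brevity.

```c
#define _GNU_SOURCE              /* for O_DIRECT */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const char *path = "/data/training-shard.bin"; /* illustrative path */
    size_t size = 1 << 20;                         /* 1 MiB for the sketch */

    cuFileDriverOpen();

    int fd = open(path, O_RDONLY | O_DIRECT);
    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void *devPtr;                     /* destination buffer in GPU memory */
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);

    /* DMA straight from NVMe into GPU memory -- no host bounce buffer. */
    ssize_t n = cuFileRead(handle, devPtr, size, /*file_offset=*/0,
                           /*devPtr_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    cudaFree(devPtr);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```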
16:37 JK: And when it comes to public cloud, the GPUs and the AI capability tend to be available from the larger cloud service providers. So, for example, we have something called the Nvidia Quadro Virtual Workstation, or vWS, and this is available from Google Cloud and AWS, and it gives workers remote access to a GPU-powered workstation for engineering, graphics and creative applications. So, this allows them to basically do AI and get GPU power through a public cloud service, and these servers will tend to use flash, though what type of flash they use and how they connect it is up to the service provider. Actually, if you could . . . Sorry, thanks. All right. Also, if we can go back, Mark.
Another case of AI in the cloud is GPU compute instances -- that's the upper-right example. vWS is a virtual workstation that individual people access through the cloud; GPU compute instances are online compute instances with containers that you spin up in the cloud with GPU capability, so you can do graphics or, typically, AI and machine learning. Examples are the AWS EC2 P3 or G4 instances and Google Cloud GPUs on Compute Engine, and GPU solutions are also available through IBM, Alibaba, Microsoft Azure, Oracle and other cloud vendors.
18:15 JK: And again, these vendors tend to use flash, some use NVMe, some don't. I think the trend, as Kamaljit talked about, is for them to move towards NVMe as time goes by or all NVMe SSDs. But each of them, again, decides based on their own criteria and their own requirements what type of flash to use and how to connect it. But the key is that they are getting access to AI power with GPUs and flash storage in the public cloud.
So, a little bit more about the servers. I mentioned the DGX, and the latest one is the DGX A100, which is an appliance that Nvidia sells. It's used in private cloud, and customers can buy it from Nvidia, or they can buy similar versions of this server from our partners, like Lenovo, Dell EMC, Penguin, Supermicro, HPE and others. This gives them the power to, again, use GPUs, AI and flash storage in their own private cloud. If you could go to the next slide.
19:24 JK: So, as mentioned on the previous slide, every DGX A100 system actually has six NVMe SSDs. Two of them are M.2 SSDs for boot, and four of them -- you can see them on this diagram -- are U.2 form factor SSDs for actual data and caching for the AI process. These systems can also access NVMe over Fabrics and other types of networked flash storage: each one has two 200 gigabit-per-second connections, either Ethernet or InfiniBand, dedicated to storage access, and that's in addition to the six 200-gigabit connections the GPUs use to talk to each other.
So, to sum up, this is one way that Nvidia delivers the power of AI with GPUs and flash storage. Customers can deploy this appliance in their private cloud or hybrid cloud, or, as mentioned, they can go into the public cloud and get access there too, but then the NVMe or flash storage architecture choices are based on how the cloud service provider wants to provide that storage. All right, Mark, thank you. That's the end of my section.
20:44 MC: Thanks, John. Wei Zhang.
20:48 Wei Zhang: Can you guys hear me?
20:52 MC: Yes.
20:53 WZ: OK, all right, thank you, Mark. And good afternoon everyone, and good morning if you are in Asia.
So, today, I'm going to talk about the journey of deploying NVMe flash at Facebook, and how NVMe technology enables us to better connect the 3 billion users around the world. Facebook began its flash journey about 10 years ago. In the beginning, the need for flash was driven by some database applications. These applications require consistently low latency and high IOPS. When deploying them to hard drive media, we realized we were buying capacity in order to satisfy the IOPS and latency requirements. In other words, we were basically stranding hard drive capacity due to the performance demand, and that is not really cost-effective.
21:44 WZ: So, we made the move to a flash solution. At that time, there were many options to choose from for a flash deployment. One of the typical ways to deploy flash was a PCIe add-in card (AIC). Even though the form factors of these cards look similar, the protocols for accessing the flash were quite different -- proprietary technologies using vendor-defined hardware interfaces and closed-source software stacks. These cards offered superb performance, but at a very high cost, and maintenance was also a headache: we had to maintain proprietary drivers, keep patching them for bug fixes and feature additions, and build production tooling around such a software stack. Next slide, please.
As time went on, more and more performance-demanding applications in our fleet needed to be deployed on flash media. We could not really deploy all of them on expensive AICs, nor did all of them need the kind of performance an AIC can offer.
22:55 WZ: We therefore turned to a cheaper alternative: SATA SSDs. In this case, we reused the OCP Knox chassis; Knox was designed for HDDs but could be used for SATA SSDs as well. SATA SSDs use a standard hardware interface and leverage the inbox upstream Linux SCSI stack. However, the SATA protocol and SCSI storage stack are really mismatched to the performance NAND flash can offer.
For example, the Knox chassis design has a SAS expander, which limits IOPS to around 400K, and we knew at the time that this solution was going to be short-lived. Next slide, please.
So, at this point, we fully realized the potential of flash, and we wanted to deploy them at even greater scale. So, the way to unleash the potential of the NAND flash is really to build a hardware system and software stack that is designed for it.
24:00 WZ: At the center of this architecture is NVM Express. So, we designed Lightning, an end-to-end PCIe-based JBOF solution which allowed us to create a flash resource pool, and built on top of Lightning is a concept called flash disaggregation, in order to achieve better compute-to-flash ratios, with the goal of not stranding flash capacity. Lightning allowed us to enable NVMe flash at much larger scale. Along the way, Lightning solved many system-level challenges to fulfill the disaggregation goal, and some of the solutions came at the cost of increased system complexity.
For example, we needed enough PCIe lanes to connect enough drives to form a flash resource pool. We also needed to support PCIe surprise hot plug to facilitate drive repairs. Both of these required a PCIe fanout switch to be designed into the architecture. We also needed to design a carrier card for the SSDs in order to support the M.2 form factor. Next slide, please.
25:12 WZ: So, over the past few years, since Lightning was designed and implemented in production, the flash and compute industries have evolved greatly. First, newer generations of CPUs offer more PCIe lanes to connect more end devices, and furthermore, the CPU root ports have built-in support for PCIe surprise hot plug. On the other hand, NAND density has grown significantly over the years, enabling higher-capacity drives: we went from 1 TB to 2 TB to 4 TB, and now 8 TB SSDs. The industry has also come up with a newer SSD form factor, EDSFF, which allows higher power, better thermals and front loading. With all these advancements, we were able to greatly simplify the flash server over time, as we show here. This is an OCP-based flash server with EDSFF SSDs. Next slide, please.
26:22 WZ: So, the NVMe-based approach allowed us to deploy flash without breaking the bank. NVMe is an industry standard, which gave us choices when selecting SSD vendors. And because it's a standard, it is supported by a single NVMe driver in Linux and the nvme-cli management tool. This greatly reduced our deployment complexity. Without NVMe, it would have been mission impossible for us to source, qualify and maintain a flash fleet at such scale. Next slide, please.
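The single-driver point is easy to see from userspace: the same standard admin path works against any vendor's drive. Here is a minimal sketch (an editorial addition) issuing an Identify Controller command through the Linux NVMe passthrough ioctl; the device path is an assumption, and `nvme id-ctrl /dev/nvme0` from nvme-cli does the same thing.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    unsigned char data[4096];
    int fd = open("/dev/nvme0", O_RDONLY);  /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x06;                    /* Identify admin opcode */
    cmd.addr     = (uintptr_t)data;
    cmd.data_len = sizeof(data);
    cmd.cdw10    = 1;                       /* CNS 1h = Identify Controller */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
        perror("identify");
        return 1;
    }
    /* Model number occupies bytes 24-63 of the Identify Controller data. */
    printf("model: %.40s\n", data + 24);
    close(fd);
    return 0;
}
```

Because this works identically across vendors, qualification and fleet tooling can be written once, which is the deployment-complexity saving being described.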
While benefiting from the NVMe industry, Facebook is also contributing back. Facebook has been an advocate for the EDSFF form factor; we are using E1.S. We are also working with other cloud players such as Microsoft -- as Lee mentioned in his slides -- to build an OCP cloud SSD spec, so that we can share knowledge and share implementations. These topics are covered in more depth in other FMS sessions, and if you are interested, I would certainly recommend dialing in to some of them for more information. With that, that's all I have. Thank you very much.
27:43 MC: Thanks, Wei. OK, we have one minute left. If anybody wants to raise your hand or type a question in the chat bar, do so now; otherwise I'll just go ahead and end the session. All right, thank you for attending. Go check out the other FMS presentations out there. There was one this morning that Lee and Ross did that was really informative, I think. All right, thank you very much.
28:20 WZ: Thank you.
28:20 JK: Thank you, Mark. Thank you everyone.
28:23 MC: All right. Bye.
28:23 KS: Thanks, everyone.