Flexible Computational Storage Solutions
00:07 Neil Werdmuller: Hello, and welcome to the Flash Memory Summit virtually this year. Today, we're going to talk to you about flexible computational storage solutions and I'm here, I'm Neil Werdmuller. I'm the director of storage solutions with Arm and I'm based in Cambridge in the U.K. Hello, everyone.
00:27 Jason Molgaard: Hello, my name is Jason Molgaard, and I'm a storage solutions architect with Arm. And I work very closely with Neil on computational storage and other storage-related architectures. It's nice to meet you.
00:40 NW: Great. Well, thank you all for joining us today. A very brief view of the agenda: to introduce this, we need to get into what's driving computational storage. Then we need to look at some of the different controller architecture options, and then I think we can delve into what's really driving Linux on the drive, and what the key workloads for computational storage are and how that plays along with Linux. And then we'll touch on the conclusions.
So, diving straight in. So, yeah, what's driving computational storage? Here, basically, as an overview: computational storage is all about generating insight where the data is stored. So, that's on the storage drive itself.
01:28 NW: So, in the traditional model on the left here, we have basically compute. So, that's probably servers, a host of some description, and, typically, what that will do is make a request to the storage. The storage will go off and retrieve it from the NAND or the hard disk platters, move it into the memory, wrap it up in protocols -- so NVMe, PCIe -- and shift it back to the compute. Only now can that compute actually start processing it. And once it's done that, perhaps it's going to move results back to the storage. But, obviously, there's quite a bit of time while the server here is waiting for that data to be retrieved and then moved across whatever fabric it may be. And this may be directly attached over PCIe; it may be going over Ethernet and the internet over a very long distance. So, in the computational storage case, we have these servers, the host that wants to do this, and that can make an operation, but then the compute can happen directly on the data that's stored.
02:28 NW: So, it's moved from NAND or from the disk platter into the DRAM on the device. Now, it can be computed on, and then you can perhaps return results if that's being driven. And we basically see two main variants of this, one of which we term autonomous computational storage, and this is potentially where the drive is really just managing itself. It has workloads installed on it, and perhaps, as soon as data is stored on the drive, it's picking up that data and doing some form of processing on it. An example might be that if this host happens to be a medical scanner, it's storing images to a drive, great, but this drive itself can be doing machine learning on those images in the background, maybe doing detection for cancer in all of those images. And then it can return results to the server, but it's not moving huge images around in the system; it can do that compute directly here.
03:25 NW: The other topic, which we'll talk a little bit more about later, is host-managed computational storage. And this is more where we see micro-operations being sent to the drive, and it performs some functions -- maybe it's searching for a particular item in a database on the drive and it'll return the result. But often, there's a lot more communication in this host-managed system. But really, what this all boils down to is that it's going to be very much more energy efficient than moving data backwards and forwards. So much of this data these days is images and video stored on the internet and in the data centers -- those are huge and it requires energy to move them. It also takes time to move those images from the storage to the compute. If you can just move it from the NAND with really high bandwidth inside these controller devices, so from the NAND to the DRAM, and process it there, it's much, much faster.
04:16 NW: And, again, if you're not moving this data around, you get better security. Obviously, though, if you've got some errors, this could open up bigger security issues, and we'll talk about that later. Obviously, security is very, very important. But it's all about data-centric workloads. So, really, I think in order for computational storage to take off, we all accept there needs to be a good standard behind it. And SNIA, with the Computational Storage Technical Work Group, is driving this forward very rapidly. There is a draft standard available today, but it's really about enabling big customers to be able to multi-source drives that would be interoperable from different vendors. And again, here, with this diagram, there are two particular ways, spaces really, where the compute can happen.
05:04 NW: So, here, you can have compute that's separate from the drive, that may be over multiple drives, but you can also have the compute in the drive itself. So, obviously, drives have already had a fair amount of compute in them, but if you add some computational storage here, this becomes very, very scalable. The more drives you have, the more compute you have. So, again, we think that's an interesting solution. Obviously, I think there are many ways of adding compute on the drive. You could use FPGAs -- a lot of early proofs of concept and products have been using FPGAs, but those are a challenge to program and to update. They're also a challenge in terms of power and cost, but they can have very high performance. So, FPGA is certainly a solution here, but we're certainly seeing a move more towards ASIC solutions, with custom compute that's able to move that forward.
06:01 NW: Obviously, a lot of companies are involved in this technical working group through SNIA, a huge number of people individually contributing, but it's a great team effort from all of these different companies to make this a reality. So, really, what is the technical working group doing? It's looking to define NVMe extensions that enable computational storage services. So, to enable discovery of drives that have computational storage, to configure them and to actually use the compute that's on those drives. And, obviously, to see how that's going to happen, this is a simple diagram of an existing SSD. So, you've got a PCIe interface, you've got NVMe commands coming over that PCIe. Today, they're handled by the front-end processor, and it manages retrieving or writing data to and from the NAND and moving it to the DRAM.
06:55 NW: So, the next idea that we see is very interesting: obviously, we're seeing a big movement towards NVMe over Fabrics, and certainly TCP/IP over Ethernet is a really simple example of this that I think we can all understand. So, here, basically, all it is, is that these NVMe commands, instead of coming over PCIe, are now being carried in Ethernet packets and they arrive at these drives. But again, it will be the same extensions to NVMe that are going to be delivered and processed.
07:25 NW: So, what we think is very exciting, and what solves a lot of the challenges that are brought up by these systems, is where we now add Linux to this solution. So, we've now got this Ethernet interface and we've got our standard NVMe over Fabrics packets arriving. If it's a standard NVMe packet, it can be processed by the front end; if it's a computational storage packet in NVMe, it can be processed by this Linux. And there's another advantage here: because you've got a TCP/IP link to this computational storage drive, it basically is a mini server. Anything that you would normally do over Ethernet to connect to a server, you can do here; you don't have to, but it's an interesting option that adds flexibility and, we believe, time-to-market benefits, to make this all happen very quickly. But the beauty of this is it all just connects up to any standard fabric in the data center. You can run any standard Linux distribution here -- a very mini, cut-down version with just the things you want, or a full-blown version of Linux.
08:34 NW: The workloads can be deployed using some of the standards that are being developed for NVMe, but you could also just use things like Docker running with Kubernetes to download workloads directly, as you would with any other server box in the system. The key feature here, and Jason will talk more about this in detail later, is that Linux is able to understand the file system. A standard SSD just gets blocks of data to store; it breaks them into little pages, puts them into NAND in the best possible places, and on a read request, it'll get back those pages, assemble them in the DRAM and then send them back. But it doesn't know that those four blocks happen to make up a JPEG.
09:16 NW: If you've got Linux on the drive, now you can mount that file system and you understand what that data is, and now you can start doing this autonomous compute on that data. You know that it's a JPEG image or an MPEG video; you can now do some ML on that to look for anomalies or look for alerts. So, that's where we think there is really big value. But again, it's really about just moving data from the NAND to the RAM and then being able to process it.
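To make the block-view versus file-view distinction concrete, here is a toy Python sketch. Every class and method name is invented for illustration -- this is not any real drive or kernel API -- but it shows why a block device alone cannot know that a set of blocks is a JPEG, while a mounted file system can:

```python
# Toy model: a block device stores fixed-size pages with no notion of
# files; a mounted file system adds the mapping from names to block lists.
BLOCK_SIZE = 4096

class BlockDevice:
    """A traditional SSD's view: anonymous blocks only."""
    def __init__(self):
        self.blocks = {}

    def write_block(self, lba, data):
        self.blocks[lba] = data

    def read_block(self, lba):
        return self.blocks[lba]

class MountedFs:
    """On-drive Linux view: knows which blocks make up which file."""
    def __init__(self, dev):
        self.dev = dev
        self.inodes = {}      # file name -> ordered list of LBAs
        self.next_lba = 0

    def write_file(self, name, payload):
        lbas = []
        for off in range(0, len(payload), BLOCK_SIZE):
            self.dev.write_block(self.next_lba, payload[off:off + BLOCK_SIZE])
            lbas.append(self.next_lba)
            self.next_lba += 1
        self.inodes[name] = lbas

    def read_file(self, name):
        return b"".join(self.dev.read_block(lba) for lba in self.inodes[name])

fs = MountedFs(BlockDevice())
fs.write_file("scan01.jpg", b"\xff\xd8\xff" + b"\x00" * 9000)  # JPEG magic + body

# The drive itself can now recognize a file type and act on the whole file:
is_jpeg = fs.read_file("scan01.jpg").startswith(b"\xff\xd8\xff")
```

The block device on its own sees three anonymous 4 KiB pages; only the file-system layer knows they together form `scan01.jpg`.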
09:47 NW: One thing that's very key here is that the bandwidth in this system is much, much higher than over a PCIe interface or even over Ethernet, typically. So, you can get very high bandwidth in this system and that enables very high performance inside here. As I say, you can manage it using any of the tools that you use with any other server, and of course, it adopts all the standard security systems that already exist. You don't need to reinvent any wheels here; they already exist and you can just adopt them.
10:21 NW: So, again, eBPF, the extended Berkeley Packet Filter, is a way of basically having a virtual machine that can run on the drive. Now, this enables, in our view, more of the host-managed computational storage. So, this is where you can download small little sets of code -- they could be larger, but typically, we're expecting them to be reasonably small -- and then they can be invoked by the host. So, we're expecting quite a lot of communication in this system, but again, it enables the host to really manage this computational storage drive, and it can do whatever compute it's being asked to do by the host. eBPF, though, is based on running on top of the Linux kernel, so, we believe, the simplest way to get this up and running is to have Linux on the drive; then it's all there and it just works, as well as building the other options on top. So, again, this is a very simple way of making it happen, and I think that it's a very interesting approach. So, now I'll hand over to Jason to talk about the next section.
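The host-managed model described above -- install a small operation on the drive, invoke it, get back only the result -- can be sketched in a few lines of Python. This is an analogy, not real eBPF: the class and method names are invented, and a real implementation would verify and sandbox the downloaded code the way the eBPF verifier does:

```python
# Analogy for host-managed computational storage: the host downloads a
# small "program" to the drive and invokes it by name; only the result
# ever crosses the fabric back to the host.
class ComputationalDrive:
    def __init__(self, data):
        self.data = data          # bytes already resident on the drive
        self.programs = {}

    def load_program(self, name, fn):
        """Install a small micro-operation on the drive (hypothetical API)."""
        self.programs[name] = fn

    def invoke(self, name, *args):
        """Run the operation next to the data; return only the result."""
        return self.programs[name](self.data, *args)

drive = ComputationalDrive(b"alpha,beta,gamma,beta")

# Host side: push a tiny operation to the drive, then invoke it.
drive.load_program("count", lambda data, token: data.count(token))
hits = drive.invoke("count", b"beta")   # only an integer comes back
```

The point of the sketch is the communication pattern: the 21-byte payload never leaves the drive; the host receives a single integer.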
11:30 JM: Great, thank you, Neil. So, let's take a look at the controller architecture options and how we would go about building a computational storage drive. So, first, taking a look at a traditional SSD, as shown on this diagram over on the right-hand side, there's already a lot of compute built into the SSD controllers of today. It's often broken up into a front end and a back end, where the front end manages the host interface and the FTL, the flash translation layer, and the back end is the flash management. In the front end, these are typically Cortex-R and, in some cases, Cortex-A applications processors, and in the back end, these are typically Cortex-R real-time and Cortex-M processors. There certainly are a lot of hardware accelerators in an SSD: encryption, LDPC, compression. There could be Arm Neon, the SIMD instructions that accelerate machine learning-type workloads; it could even be built on an FPGA. But one thing is for certain: all these SSD controllers require a significant amount of DRAM for storing the flash translation tables, to the tune of 1 gigabyte per terabyte of NAND.
12:44 JM: So, for a typical SSD, that's 16, 32 or even 64 terabytes, that requires 16, 32 or 64 gigabytes of DRAM -- certainly a significant amount of DRAM. And as Neil mentioned, a lot of drives today are based upon PCIe, some moving to NVMe over Fabrics over TCP, but there's still a little bit of SATA and SAS out there as well.
13:12 JM: So, one of the things Neil mentioned is the autonomous processing on the drive. The compute that's already there today does a significant amount of processing autonomously. In particular, the host sends LBAs to the drive; logical block addresses are then translated into physical addresses for where the data is actually stored in the media. And the controller does all this mapping, all this translation, autonomously. So, as Neil mentioned, if we now have the ability to have Linux on the drive that can recognize file systems and mount the file system, the drive now is capable of knowing which LBAs comprise a file and then, subsequently, which physical addresses contain the data for those LBAs. And the drive can now perform any kind of operation on those files that the host was previously doing. This could include processing either in place, after the files have been stored, or as the data is streaming in. But the bottom line is that any workload that was running on the host can now run on the drive; any processing that was done on the host can be done on the drive.
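The LBA-to-physical translation Jason describes can be sketched as a minimal flash translation layer. This is a deliberately simplified model (page numbering, allocation policy and names are all invented; real FTLs add wear leveling and garbage collection), but it shows the two facts the talk relies on: the host only ever sees logical addresses, and flash is written out of place:

```python
# Minimal FTL sketch: host addresses logical blocks (LBAs); the
# controller maps them autonomously to physical NAND pages (PPAs).
class SimpleFtl:
    def __init__(self):
        self.l2p = {}             # LBA -> physical page address
        self.nand = {}            # physical page address -> data
        self.next_free_page = 0

    def write(self, lba, data):
        # Flash is written out of place: take the next free page,
        # then update the logical-to-physical mapping.
        ppa = self.next_free_page
        self.next_free_page += 1
        self.nand[ppa] = data
        self.l2p[lba] = ppa

    def read(self, lba):
        # The host never sees the physical address.
        return self.nand[self.l2p[lba]]

ftl = SimpleFtl()
ftl.write(100, b"v1")
ftl.write(100, b"v2")   # a rewrite lands on a new physical page
```

After the rewrite, the host still reads LBA 100 and gets the new data, while the stale copy sits on the old page awaiting garbage collection -- exactly the kind of bookkeeping the controller already does on its own.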
14:31 JM: So, what are the options to add Linux to the drive? Well, if we take a look at these pictures over on the right-hand side, we've got our traditional SSD at the top that's connected to an interposer card, or daughter card, that has an applications processor. So, it's using a traditional SSD, connecting an interposer card with that applications processor to the front of it. This is a great proof of concept to enable getting an applications processor for running Linux into the drive data path, but it has some challenges. Certainly, the big one is the back-to-back PCIe interfaces. You still have to package up the user data into NVMe, PCIe, move it across that interface and unpackage it on the other side, and then actually perform the compute in that applications processor. So, we haven't saved much of the bandwidth or latency out of the drive. We've certainly saved it going to the host, but we're still moving a significant amount of data.
15:35 JM: Of course, this applications processor requires its own non-volatile storage for the Linux and its own DRAM. So, the next logical step is, well, let's bring that applications processor into the controller itself, and that's what's represented by the diagram in the middle. We've got the same essential SSD overall architecture, except we have now brought the applications processor in. We've eliminated the challenges mentioned previously with the back-to-back PCIe interfaces. This applications processor is now able to share DRAM, it's able to use the non-volatile NAND storage for the Linux installation and just solves all those challenges, making a more streamlined architecture.
16:23 JM: Well, the next logical progression then, shown in the bottom diagram, is: maybe we can have the applications and front-end processors combined. And this is especially true since, as I mentioned on a previous slide, some folks are already using an applications processor for managing the FTL in the front end of their SSD controller. So, why not combine all of these together into one cluster of cores? And this will provide a lot of benefits in terms of area savings, power reduction and new flexibility that we'll get into in just a couple of slides.
16:57 JM: But first, I want to mention, if you look over on the left-hand side of the slide, we've got an example of a fairly recent version of Debian 9 Linux, and what its system requirements are. It needs a minimum of 128 megabytes, with 512 megabytes preferred, of DRAM, and 2 gigabytes of hard drive capacity for the installation. So, if we think about that DRAM, 512 megabytes is less than 5% of the 16 gigabytes of DRAM in a 16 terabyte SSD. So, certainly not trivial, but not a large amount either. It may be possible to just use that 16 gigabytes or potentially add just a small amount more to the drive, and certainly with 16 terabytes, using 2 gigabytes for storage is inconsequential compared to the overall capacity.
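The DRAM budget argument on this slide is simple arithmetic, and it's worth checking. A short sketch (the 1 GB-per-TB figure and Debian's 512 MB are the numbers quoted in the talk; the function name is ours):

```python
# Back-of-the-envelope check of the slide's DRAM claim:
# ~1 GB of FTL DRAM per 1 TB of NAND, versus Debian's preferred 512 MB.
def linux_dram_fraction(capacity_tb, linux_mb=512):
    ftl_dram_mb = capacity_tb * 1024   # 1 GB per TB rule of thumb
    return linux_mb / ftl_dram_mb

frac = linux_dram_fraction(16)         # 16 TB drive -> 16 GB of DRAM
```

For a 16 terabyte drive, 512 MB / 16384 MB is 3.125%, comfortably under the "less than 5%" figure on the slide.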
17:52 JM: So, I'd like to introduce to you the Cortex-R82. This is a new processor from Arm; it is a real-time CPU with a very large address space. It's a 64-bit processor with a 40-bit physical address that enables addressing large address maps. And this is especially useful in storage devices where, as I just mentioned, 16 gigabytes of DRAM needs to be addressed, along with all of the other peripherals, devices and accelerators in the system. And the Cortex-R82 allows that to be done with a flat address map. But it is still a classic real-time CPU from Arm, with low latencies and consistent performance to achieve the best possible IOPS and lowest latency performance through the drive.
18:43 JM: But this real-time processor also contains an optional memory management unit, and this enables running Linux on these cores as well. And it can be Linux in addition to real-time, or combinations, based upon whatever is required, and we'll get into a little bit more of that in the next few slides. But the Cortex-R82 also enables coherency, either between additional clusters of Cortex-R82, or between Cortex-R82 and clusters of Cortex-A, so you can build up a more sophisticated system with additional Cortex-A cores. You can even have coherency between other clusters, or even across CXL or CCIX. And the R82 contains the Neon SIMD instructions to accelerate machine learning workloads, and machine learning is a great use for computational storage.
19:42 JM: So, let's talk a little bit more about the flexibility that the Cortex-R82 provides. If we look at this classic enterprise drive diagram over on the left-hand side of the slide, there are four cores that have all been allocated for real-time, and this allows the drive to achieve the best possible IOPS, the best throughput and the lowest latency. Well, we can take that exact same controller and potentially repurpose it for a different product targeted at computational storage, where, in this example, we reallocate half of the cores for running Linux, a rich operating system. No changes were needed to the controller; we just reboot those cores into Linux instead of running real-time. And this enables that computational storage drive to be created without another tape-out, and that saves a lot of development cost, a lot of mask set costs, and allows the product to be delivered very quickly.
20:46 JM: So, we can extend this to having additional flexibility in the balance of the workload. With that same controller, it's now possible to have the drive change its characteristics as the workload changes. So, maybe during busy times of the day, there's a tremendous amount of data being read from or written to the drive, and we need to allocate three or maybe all of the cores to running real-time in order to give the best possible performance. But then with that same controller, later on in the day, we can say, "Well, the drive is not as busy. Let's move to having more of the cores dedicated towards computation." And now you can reboot those cores into Linux and begin doing compute on the data that's been stored on that drive.
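As a rough sketch of the load-based reallocation just described, here is a hypothetical policy in Python. The core count, thresholds and function name are all invented for illustration -- the talk only says the split can change with activity, not what the policy looks like:

```python
# Hypothetical policy: split a fixed pool of Cortex-R82 cores between
# real-time I/O and Linux compute based on current drive load.
TOTAL_CORES = 4

def core_split(io_load):
    """io_load in [0.0, 1.0]; returns (realtime_cores, linux_cores)."""
    if io_load > 0.75:
        rt = TOTAL_CORES          # peak hours: all cores serve I/O
    elif io_load > 0.25:
        rt = TOTAL_CORES // 2     # mixed load: half and half
    else:
        rt = 1                    # quiet hours: mostly compute
    return rt, TOTAL_CORES - rt

busy = core_split(0.9)            # all four cores on real-time
quiet = core_split(0.1)           # one real-time core, three for Linux
```

In a real controller, "reallocating" a core means rebooting it from the real-time image into Linux or back, as described above; the sketch only captures the decision, not the reboot mechanics.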
21:37 JM: So, as Neil had mentioned earlier, essentially, if we step back and take a look at it, we have an edge server. If you look at the diagram on the right-hand side, labelled "classic edge server," essentially what we've drawn here is an edge server with an SSD connected to it. So, the edge server has a CPU, DRAM, PCIe and a network interface. And connected to that PCIe interface is an SSD with a PCIe interface, a CPU, DRAM and flash. Well, because we can now run any workload on that computational storage drive, we essentially have the diagram on the left. It's an edge server with a CPU, DRAM and flash, and in place of the PCIe, we can put a network interface on it, as Neil had mentioned earlier, running TCP/IP and NVMe over Fabrics for connecting and communicating with that drive natively using the Ethernet protocol. And you can even power that drive, potentially, using Power over Ethernet. All right, I'm going to hand it back to Neil to discuss Linux and the key workloads.
22:49 NW: All right, thank you, Jason. So, yeah, in this section, we're going to talk about what we believe is really driving Linux and what the key workloads are that are really pushing it forward. So, I think one of the fundamental parts here is that, basically, if you have Linux on the drive, any of the workloads that run on Linux can now move next to the data. They don't need to be recompiled; you can take advantage of that huge cloud-native software ecosystem that's been developed. That really speeds development and accelerates innovation. We're not reinventing anything here; we're just able to move things directly onto the storage to take advantage of not having to move those huge data sets around across a bigger fabric.
23:39 NW: Obviously, I think it pulls with it all of the standard tools and development systems that people are used to in developing for Linux; those all exist, and you can now use them to put that workload directly onto the drive. All of the major Linux distributions run on Arm, and there's a huge number of applications, databases and other things as well -- it all just comes. And again, the huge ecosystem of Linux developers know these tools. Now they can actually do this development directly on the drive. So, I think we're really going to see some innovation and some exciting new applications.
24:17 NW: Obviously, just to give you a very quick idea of the ecosystem here: we've got networking, we've got operating systems, all of the different container systems and virtualization, all the languages, the workloads -- it all just comes, and now you can run it directly on the data. In terms of which workloads make sense to move next to the data, we see a great deal of interest from many, many of our partners. Obviously, offload -- doing basically low-level functions like compression, encryption and erasure coding directly on that data -- some of those are very obvious. Instead of having to move that data to a host, do compression and move it all the way back again to store it, which is just a huge waste of energy and bandwidth and adds a lot of latency, just compress it directly on the drive. Deduplicate. There are many applications here that are really interesting.
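The compression-offload argument can be made concrete with a small sketch. We use Python's zlib purely as a stand-in for a drive's compression engine; the byte-counting is a toy model of fabric traffic, not a measurement:

```python
# Sketch of the offload argument: compressing on the drive means the raw
# bytes never cross the fabric at all.
import zlib

payload = b"sensor reading 42\n" * 1000   # highly compressible data

# Traditional path: raw bytes travel drive -> host (compress) -> drive.
bytes_moved_traditional = 2 * len(payload)

# Computational storage path: compress in place; nothing crosses the fabric.
stored = zlib.compress(payload)
bytes_moved_on_drive = 0

savings = bytes_moved_traditional - bytes_moved_on_drive
```

The round trip is lossless (`zlib.decompress(stored)` recovers the original), and in this toy accounting the fabric traffic for the compression step drops from twice the payload size to zero.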
25:15 NW: Obviously, databases -- again, I think those are a really good example, where typically you're moving huge amounts of data around a system. But actually, if you just want to do a search, why not just send the search to the drive where that data resides and return just the matching data? Machine learning, again -- machine learning is really about the data. Typically, the neural networks boil down to quite a few MAC, multiply-accumulate, instructions at the low level. And, basically, things like Neon in the Arm processors do a reasonable job at that. Obviously, you could add neural processing units, you can add much more capability. But just having that basic level, when you're able to do it directly on that data where it resides, can be really beneficial. You don't necessarily need huge performance because you can do it in the background. Potentially, every time a file is stored, you can do your categorization, your . . .
26:10 NW: Those kinds of things. But, yeah, there's a lot of applications. And some of these others: content delivery networks; doing offload from the SmartNIC to the computational storage drive -- it's offload from the offload. We see that as really interesting. Obviously, Jason touched on edge computing, but with image and video, clearly, that's where we're seeing an awful lot of call from our partners and their customers, so there are some really exciting things to happen. In transportation as well -- in automotive but also in avionics -- people are looking at things like using ML to look for anomalies in telematics data, and doing it as the data is being stored. So, there are some very interesting things happening there as well. And custom workloads -- obviously, these are just a few ideas; there's an infinite number of potential custom workloads that people can develop. And the idea with the Linux is that it can be the end customer, either the hyperscaler or the cloud company, that moves their own application directly onto the drive, onto that data, with whichever Linux distribution they want, and has complete control of that on the drive itself, where the data resides. And that's what we see as really exciting -- it really enables people to work on that data.
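The database example above is essentially predicate pushdown: send the search to the data instead of the data to the search. A toy Python sketch (the table, the function name and the "drive" are all stand-ins, not a real query interface):

```python
# Sketch of query pushdown: instead of shipping the whole table to the
# host, send the predicate to the drive and ship back only matching rows.
TABLE_ON_DRIVE = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "alert"},
    {"id": 3, "status": "ok"},
    {"id": 4, "status": "alert"},
]

def drive_side_search(predicate):
    """Runs next to the data; only matching rows cross the fabric."""
    return [row for row in TABLE_ON_DRIVE if predicate(row)]

alerts = drive_side_search(lambda row: row["status"] == "alert")
rows_shipped = len(alerts)        # 2 rows cross the fabric, not 4
```

With a four-row table the saving is trivial, but the same pattern applied to terabytes of on-NAND data is the bandwidth and energy argument the talk keeps returning to.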
27:30 NW: So, obviously, how do you get workloads onto the drive? Well, you can use the computational storage standardized protocols that are being developed via the SNIA Technical Work Group, which end up in the NVMe protocol standards, and use those over NVMe or NVMe over Fabrics. But there's also the containerization that you can do: because you've got a TCP/IP link to the drive, you can basically use any of the standard Kubernetes, Docker . . . all of those good things to enable you to deploy whatever you want into that Linux system.
28:10 NW: Obviously, the extended Berkeley Packet Filters are interesting as well, and yes, it is possible to create a stripped-down version of this that can run on a bare-bones, cut-down Linux system that allows you to run the eBPF virtual code that's been delivered, but that's pretty complex. If you just run Linux on the machine, that can all happen without needing any new secret sauce; it just makes it very easy to do. So, there are many different variations here, and obviously we're very keen on the standardization. I think the middle box here is still standardized; it's just looking at it from a slightly different angle. And, obviously, once you've got that, then some of these other standard things like eBPF, or other systems that are available on Linux, all become available. So, I'll hand back to Jason for the conclusion. Thank you.
29:05 JM: All right. Thank you, Neil. So, to wrap things up today: we've discussed that there are a large number of very diverse workloads for running on computational storage -- lots of different use cases and applications. And we think that they're best enabled through on-drive Linux. The Linux community has a huge developer base, with lots of applications ported and optimized to run well on Linux and on Arm, and it just enables everything to become available very quickly and easily, and accelerates development. The controllers themselves require flexibility: they need to handle different products and different workloads without having different mask sets. We want to be able to tape out one controller and develop different products, all in one, and we want to be able to dynamically change that based upon the workload over the course of the day, having real-time and Linux go back and forth as needed for different workflows or different levels of activity on the drive. And, of course, in order to support these high-performing drives and increasing data rates, we still need to have a real-time core that's capable of providing that real-time support and low-latency, fast access.
30:23 JM: So, with that, thank you very much for joining us today and we hope you enjoyed the presentation.
30:30 NW: Thank you very much, everyone. Enjoy the rest of your virtual FMS. Thank you.
30:35 JM: Thank you.