Using PM and Software-Defined Architectures to Optimize AI/ML Workloads
Explore how you can close AI, machine learning and analytics workload performance gaps using persistent memory and software-defined architectures.
Download this presentation: Using Persistent Memory and Software-Defined Architectures to Optimize AI/ML and Analytics Workloads
00:00 Kevin Tubbs: Hi, my name is Kevin Tubbs, and I'm the senior vice president of the Strategic Solutions Group at Penguin Computing. Today I'll be talking to you about using persistent memory and software-defined architectures to optimize AI, ML and analytics workloads.
Before we get started, I'm going to give you a brief introduction to Penguin Computing. We are a wholly owned subsidiary of Smart Global Holdings, and we are part of a global ecosystem of technology providers, two of which are Smart Embedded Computing and Smart Wireless Computing. Our overall goal and structure there, and why we're talking to you today with Intel, is how do we do compute anywhere? How do we deliver advanced technology from the edge all the way out to the core?
And before we get to that, we're going to talk about market trends, things that we're seeing in the technology space. We see a hardware abstraction and a movement to workload-driven architectures. What we mean by that is there's a lot of cutting-edge technology, a lot of acceleration of real-time workloads that require new technologies to meet those demands, and as users get more sophisticated, they begin to use and want to accelerate those data-driven workloads while the core underlying hardware is being abstracted away.
01:27 KT: And if you look at it in terms of how most people consume the cloud today, they want to be able to consume workloads and business insights the same way. So, as we see that technology transition, we've looked at how you achieve that and take the core technology, say from Intel as well as other software-defined partners, and put it into one end-to-end solution.
A key driver of that is software-defined architectures. We believe that combining the underlying hardware technology inside of a Penguin server with complete middleware and software, all the way up through the abstraction to the user space, is what it takes to deliver this cutting-edge technology. And some of the key drivers for that are real-time workloads, artificial intelligence and advanced analytics; these are very data-driven technologies, which drives the need for us to use persistent memory.
And finally, the key market trend is the ability to deliver that core technology or that core workload in a completely portable way. Think about things being data-driven; data happens everywhere from the edge to the core, and you may need to be able to compute on it from the edge to the core. So, to enable that, we have to make sure that we can do workload portability and ensure that everything is cloud native.
02:58 KT: So, Penguin is focused on cloud and data center scale computing, while Smart Embedded Computing is focused on the near edge and medium edge, and Smart Wireless Computing is out at the far edge and IoT. We find that this combination allows us to use AI, ML, advanced analytics, as well as real-time workloads to create a complete portfolio. And we'll tie that into why persistent memory is important.
So, one of the biggest challenges that we have in these data-driven workloads is what we consider our big memory challenge. The movement of these workloads to be more data-centric, more driven towards digital transformation, requires us to compute and deliver on them faster. And this generally happens in a very memory-centric part of the workload: it's hot data, it's very warm data, and it usually lives in DRAM. On the other side, what's happening right now is an explosion in the amount of data that's required and stored across all of the different workloads and all the different business units. Just because of the sheer size of that data, it has to be stored in colder tiers, anything from NAND SSDs all the way down to hard drives and tape.
04:34 KT: So, what that creates is this performance gap for access. How do I close that gap? That's why we're here today, partnering with Intel to talk about their technologies, specifically their Optane technologies and how they're used to close the gap, both from a memory perspective with Intel Optane memory DIMMs and from a storage perspective with Intel Optane SSDs. In particular, we're focused here on persistent memory, and we'll talk more about the capacity and persistence and what effect that has on our real-time workloads.
So, the main reason that we're very excited about it is there is a large trend right now to move towards very data-driven workloads, and in that space we consider them to be in-memory databases, advanced analytics including AI and ML, cloud workloads, and HCI or hyper-converged infrastructure and compute. One of the most exciting things about what we're doing is that those are the areas where we've found the most synergy and the most traction right now, but we also believe that persistent memory plays an important role in how we accelerate time to insight. So there are a lot of workloads that are yet to be discovered; we won't get into that here, but we're definitely interested in how this affects high-performance computing as well.
06:02 KT: So, we partner with Intel and a software-defined architecture company called MemVerge, working to speed time to value and insights, specifically geared towards data-intensive workloads.
So, next, we're going to talk about some of the challenges in software-defined architectures, or how software-defined architectures can be solutions to some of the value propositions that customers are looking for. Intel Optane memory DIMMs provide a persistent memory layer that allows us to extend DRAM to much larger capacities, and the persistence allows us to add data services on top. So, this is a very key feature for enabling new technologies. One of the challenges is that this new technology has different ways it can affect the application space. There are levels of modification effort involved in the application, and there are different ways in which the application can use the Intel Optane persistent memory capabilities. So, you have Memory Mode, you have storage mode, and you can go up to App Direct mode, but there are even variants in there. So, how will a new researcher or scientist who wants to focus on their workload use that more quickly or adopt that technology faster?
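To make those modes a little more concrete, here is a minimal sketch of what App Direct usage can look like from an application's point of view, assuming persistent memory has been provisioned as an fsdax device and mounted at /mnt/pmem0 (the path and file name are illustrative assumptions, not part of this talk). In Memory Mode, by contrast, none of this code is needed, because DRAM simply acts as a cache in front of the Optane DIMMs.

```python
# A minimal sketch of consuming Optane persistent memory in App Direct mode
# from user space via a DAX-enabled filesystem. The mount point and file name
# are assumptions made for illustration only.
import mmap
import os

PMEM_FILE = "/mnt/pmem0/appdirect_demo.bin"   # assumed fsdax mount point
REGION_SIZE = 1 << 30                          # 1 GiB region for the example

# Create (or reopen) a file that lives on the persistent-memory device.
fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, REGION_SIZE)

# Map it into the address space: loads and stores become byte-addressable
# accesses to persistent media instead of block I/O.
region = mmap.mmap(fd, REGION_SIZE)
region[0:11] = b"hello world"   # ordinary store into persistent memory
region.flush()                  # msync-style flush toward the persistence domain

region.close()
os.close(fd)
```

Going further than this, libraries such as Intel's PMDK add finer-grained flushing and transactional updates, which is exactly the kind of per-application modification effort that the software-defined layer discussed next aims to remove.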
07:37 KT: We want to look for optimized software or a software-defined architecture to do so, and that brings us to MemVerge. MemVerge is a software-defined architecture which gives us virtualized, byte-addressable persistent memory without code changes, and no code changes is a very key point. It also enhances the traditional non-volatile memory space by allowing us to have data services for high availability and durability, but at memory speeds, and this allows us to supercharge performance while giving us flexibility in the memory space.
So, what do we get from that? The overall goal is that we want to bring mission-critical apps to big memory, and the key things that enable that and get the technology adopted faster are plug and play: I don't want to have to rewrite my applications in order to adopt this technology. The second is how do I add data services such as quick crash recovery, and also how do I scale that memory out? The solution to that is MemVerge Memory Machine. It's a software-defined architecture; it virtualizes both the DRAM and the Optane persistent memory and allows you to use them without having to program them yourself. It also allows low-latency communication between multiple nodes, so I can scale it out across the cluster.
09:03 KT: And by putting all that together, I'm able to leverage memory data services that were not available from pure non-volatile memory alone, including zero-I/O snapshots, replication and tiering. So, now we're going to transition into the use cases, which in this case are real-time workloads -- think latency-sensitive transactional workloads like trading, real-time big data analytics in financial services, as well as AI, ML and inferencing, like fraud detection and social media. Specifically, we're going to look at a case study where Penguin worked with our partners to leverage some key technology in production-scale inferencing.
So, for motivation, we talked about why it's important to use persistent memory and how we can leverage Intel Optane memory DIMMs, and now we want to look at a specific case. In this case, we're going to look at Facebook's open source deep learning recommendation model (DLRM) and how persistent memory can enhance it. If we look at the specific issue and its requirements, we're seeing large growth in model and embedding table sizes. This means I have a specific model and I can store that model information in dense vectors and compute on it, but I also have a lot of sparsity inside these models, and that is stored in what we call embedding tables.
10:34 KT: So, I have a large and growing amount of model data, which goes into the gigabyte range, but my embedding table size scales to terabytes. How do I balance those two different things, and how do I achieve the overall goal of fast online inferencing? One approach is to put both the models and the embedding tables into DRAM. If we try that, the limitations are high TCO, because we need a large amount of memory, and I'm constrained by limited DRAM space and limited to a single server. And then I have the added issue that that memory is volatile, which presents some challenges for production inferencing.
So, to address these limitations, we believe the overall goal should be to use both DRAM and Intel Optane persistent memory in conjunction with each other. We want to look at putting models into DRAM and embedding tables into Optane persistent memory. In this case, we're looking at a case study of Facebook's deep learning recommendation model for personalization and recommendation systems. Here, we're looking at one of our traditional AI and ML customers, and the key thing to note when you look at the Facebook DLRM is that it presents a variety of technological challenges.
12:16 KT: The top part of it is compute-dominated, there's a section that's communication-dominated, and then there's a section that's dominated by memory bandwidth as well as memory capacity. So, this presents different challenges along the entire workload. Specifically, what we're looking at today is how we leverage the performance capabilities of using both DRAM and Intel Optane persistent memory.
So, in this case, we're looking at what happens when the customer or the workload has data that is greater than what can fit in memory. Once the data grows larger than memory, I trigger the need to start using swap space, and the swap space is on NVMe. That's shown here in orange, and what happens is, because the data is greater than memory, I'm forced to use my swap and write to NVMe. NVMe is a great and performant technology, but we still have latency penalties, so that presents a challenge for delivering production-scale inferencing. The solution is to put the model data into DRAM and put the embedding tables into Intel Optane persistent memory.
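As a rough illustration of that placement, and not the production code from this case study, the sketch below keeps the small, hot dense-model parameters in ordinary DRAM-backed arrays while memory-mapping the large, sparsely accessed embedding table onto a file on an assumed persistent-memory mount (/mnt/pmem0); the path, sizes and toy forward pass are all illustrative assumptions.

```python
# Illustrative placement only: dense model weights in DRAM, a large embedding
# table backed by a file on an assumed fsdax persistent-memory mount.
import numpy as np

PMEM_DIR = "/mnt/pmem0"   # assumed DAX-enabled persistent-memory mount

# Dense MLP parameters: gigabyte-scale, hot and latency-critical -> DRAM.
mlp_weights = np.random.rand(1024, 512).astype(np.float32)

# Embedding table: can grow toward terabytes -> persistent memory via memmap.
NUM_ROWS, EMB_DIM = 10_000_000, 64   # toy sizes for the sketch
emb_table = np.memmap(f"{PMEM_DIR}/dlrm_embeddings.dat",
                      dtype=np.float32, mode="w+",
                      shape=(NUM_ROWS, EMB_DIM))

def inference_step(sparse_ids, dense_features):
    """Toy forward pass: gather sparse embeddings, project dense features."""
    gathered = emb_table[sparse_ids]          # lookups served from persistent memory
    dense_out = dense_features @ mlp_weights  # compute against DRAM-resident weights
    return np.concatenate([gathered.mean(axis=0), dense_out.ravel()[:EMB_DIM]])

# Example: one inference request with a few sparse feature IDs.
result = inference_step(np.array([3, 42, 9_999_999]),
                        np.random.rand(1, 1024).astype(np.float32))
```

Because the lookups remain byte-addressable reads from persistent memory, the table can grow past DRAM capacity without ever falling back to the swap-to-NVMe path described above.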
13:47 KT: And when we do that, we're actually able to perform the inferencing runs orders of magnitude faster, and we're also able to scale as the size of the models consumed in DRAM grows and as the number of models we put inside grows. So, the main takeaway here is that by using Intel Optane persistent memory, we're able to drastically improve the overall runtime without having to go to the NVMe layer. What does the customer receive as a result? We have a single platform that's able to work for multiple different models and scale to larger models, and in the future, if we were going to grow beyond a node, we're able to scale both inside of a single server and out to more nodes. So, now I have a flexible software-defined platform that allows us to use persistent memory for AI and ML.
So, one of the last key challenges with the AI and ML production-inferencing workload is, "How do I improve fault tolerance with new model publishing?" One of the key things about delivering AI and ML workloads in production is the idea of continuous learning. I'm always going to be looking for a new model, but moving to a new model can be risky, and it is also very hard to roll back. So, how do I move back and forth and roll back models when it takes a large amount of time due to slow I/O?
15:34 KT: So, we look at leveraging Intel Optane persistent memory, and persistent memory technology in general, to let us take a snapshot of the model being served in the application. Then we can restore and reload from that in a continuous fashion, and that solution allows us to deliver instantaneous snapshots without interrupting the online inferencing. We can have an instantaneous rollback without long publishing times, and we can take a snapshot, roll back and recover all within one second. So, overall, we're now adding data-level services to memory-speed applications and inferencing that we did not have without persistent memory.
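As a purely conceptual sketch of that publish-and-rollback flow (this is not MemVerge's API; the snapshot mechanism here is a hypothetical stand-in, simulated with an in-process copy), the idea is that the currently served model is captured before a new one is published, so rollback becomes an instant swap rather than a slow reload from storage.

```python
# Conceptual control flow only: with a persistent-memory snapshot service the
# deepcopy below would be a near-instant, zero-I/O, crash-consistent snapshot.
import copy

class ModelServer:
    def __init__(self, model):
        self.model = model          # the model currently serving inference
        self._snapshots = {}

    def snapshot(self, tag):
        self._snapshots[tag] = copy.deepcopy(self.model)   # stand-in for a PMem snapshot

    def publish(self, new_model):
        self.snapshot("pre-publish")   # capture the known-good model first
        self.model = new_model         # start serving the new model immediately

    def rollback(self, tag="pre-publish"):
        self.model = self._snapshots[tag]   # instant restore, no reload from disk

def health_check_passes(model):
    # Hypothetical validation hook; always approves in this sketch.
    return True

# Usage: publish a new model version, then roll back if a health check fails.
server = ModelServer({"version": 1})
server.publish({"version": 2})
if not health_check_passes(server.model):
    server.rollback()   # back to the prior model within the same process
```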
So, that brings us to the end of our talk for today. In summary, we talked about some of the challenges with delivering data-driven and real-time workloads, and how persistent memory along with software-defined architectures can address them, and we went into a deeper dive on how we can leverage Intel persistent memory DIMMs to accelerate production-scale inferencing, and also talked about how to deliver that from a data services perspective.
So, that concludes my talk for today. Thank you for listening. If you have questions, please do not hesitate to reach out to Penguin Computing, and we will have more in the future on how we're leveraging Intel persistent memory DIMMs to deliver cutting-edge technology.