IDC Real-World Applications and Solutions for Persistent Memory
Progress in using persistent memory has been quite slow because of the lack of standards, suitable interfaces and systems software. Here is a look at persistent memory in the real world.
00:20 Ashish Nadkarni: Hello, and welcome to this panel on the adoption of storage-class memory in the enterprise. I trust you all are doing well. My name is Ashish Nadkarni, and I'm a group vice president within IDC's infrastructure practice. I lead a team of analysts who deliver qualitative and quantitative research on compute, storage, cloud and edge infrastructure worldwide.
We have four industry panelists today. I'll let them introduce themselves, their roles and how flash has changed in their organizations. Gentlemen, thank you for your time and insights today. Let's start with a quick round of introductions. Eric, do you want to go ahead and introduce yourself?
01:00 Eric Karpman: Thank you. My name is Eric, and I've spent about 26 years now in the financial services industry, where I've led various technology efforts within major banks, asset management organizations and hedge funds, redesigning technology to support business goals. I've specifically supported trading technologies, wealth management technologies, asset management technologies and risk management technologies. Those are the roles I've held. So obviously, memory and data have been the most important assets of those institutions, and I've had to deal with them quite extensively. Thank you.
01:49 AN: It's great. Thanks, Eric. Tony?
01:51 Tony He: Hi, my name is Tony. I work for a financial services company located in downtown New York. I have been with that company for over 13 years now, and my role is helping our front office transform legacy applications and re-platform them to use the latest technologies like flash, in this case Optane memory, to drastically speed up their performance without actually changing code. Because any time we do a code change, we have to go through extensive regression testing, using hardware to improve performance is the most reliable way of getting that edge.
02:39 AN: Indeed, thank you, Tony. Ernst?
02:42 Ernst Goldman: Hello, my name is Ernst Goldman. I'm at a global financial services institution, a household name, and I've been in the industry for over 20 years. Currently, I run the data engineering and data platform organization. I have under my oversight a large-scale compute and storage platform aligned with the capital markets line of business of the firm. What we are grappling with, of course, is escalating raw data, escalating regulatory oversight demands and the need for a consistent operational model to deliver infrastructure services and capacity into the business at hyperscale . . . Aiming to build hyperscale at a palatable TCO is pretty much driving most of our architectural decisions and their actualization.
03:38 AN: Thank you, Ernst. And then my IDC colleague, Eric Burgener.
03:43 Eric Burgener: Yeah, hi everyone. I'm Eric Burgener, I'm a research VP in the infrastructure systems group, and I basically cover solid-state storage technologies in the enterprise. I've got about 32 years working in high tech. Last seven, I've been at IDC.
03:58 AN: Thank you, Eric. So there's a lot of interest around storage-class memory, so I think this panel will be very informative for all of you listening in. Before I hand the mic over to my panelists and ask them questions, a quick view on how this will work. So I'm going to start with a question, then I'll go around the room and have each of our guests answer the question, provide their perspective, and then finally, Eric will chime in and provide his knowledge of the industry, what he's seeing. Each of these questions is set up to highlight the benefits and issues around the use of storage-class memory.
And so, what we really want to bring out in this panel is how IT practitioners are benefiting from it, what some of the challenges they face are, and how they can expand that footprint. So, let me start with the first question: What type of persistent memory and/or storage-class memory products are you using? We are assuming these will be 3D XPoint-based devices, but they don't have to be. What we'd like to hear includes the device types: PCIe cards in servers, devices that go into storage systems and what kind of storage systems, and how they are being used, for system storage or cache. You don't have to mention any vendors if you don't want to. Let's start with, say, Ernst. What has the adoption been, and how are you using it?
05:32 EG: Yeah, well, definitely, in the search for improvements in performance and latencies, the adoption of NAND types of memory has already been underway for years. We do have PCI Express cards with the NVMe protocol and four-lane connectivity, but I think the distinction I have to make is that we are aiming for scaling out, so the debate of scale up versus scale out was settled for us several years ago. So, by and large, our motto is to build architecture that scales out, and most of the building block components serve toward that goal. Early on in the adoption stage, we sourced the IO Scale 2 from IBM; we have since pivoted to Supermicro as our provider for a lot of components, but the pods that we are building have Cisco or Arista switches, depending on whether it's on-prem or cloud. And memory drives we source primarily from Intel.
07:02 AN: Tony, how about you?
07:04 TH: Yeah. We are predominantly an Intel shop. We have been using second-generation Intel Xeon Scalable processors in our data center, and the only type of PM compatible with those CPUs is Intel's own Optane memory. We use it for basically two use cases. We started off with the first use case, which is to use it for our Java applications. Basically, we specify the heap size to be 100% of the PM. This instantly improved our application performance, but later on we encountered another bottleneck, because our Java applications don't work on their own. They need to write to our structured database, in this case Oracle. What ended up happening was that, over time, we found there was actually a bottleneck with the Oracle tier. We then reconfigured our Oracle servers as well. Now, front to back, the whole application is using persistent memory.
And so, for the storage, we started off with NAS. It's our legacy platform, but in the end, NAS couldn't keep up; it has a limitation with the Gigabit Ethernet card. So, we ended up upgrading that to the more costly SAN storage over a Fibre Channel network, and only after that upgrade did we see dramatic performance improvement. So overall, this journey has been very costly, but I guess the performance gains justified the overall TCO.
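Editor's note: For readers who want to see what pointing a Java heap at persistent memory with no application code changes can look like, here is a minimal sketch assuming a Linux host with an Optane persistent memory module configured in App Direct mode and exposed as a DAX file system. The device names, mount point, heap size and JAR name are illustrative assumptions, not details from the panel.

# Expose the persistent memory region as a DAX-capable block device
ndctl create-namespace --mode=fsdax --region=region0
# Build a file system on it and mount it with DAX enabled
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
# Launch the JVM with its heap backed by the DAX mount (JDK 10 or later)
java -Xms96g -Xmx96g -XX:AllocateHeapAt=/mnt/pmem -jar trading-app.jar

The -XX:AllocateHeapAt flag maps the Java object heap onto a file in the given directory, which is one common way to back a heap with persistent memory without touching application code.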
09:09 AN: Great. Thank you, Tony.
09:10 TH: Thank you.
09:11 AN: Eric, how about you?
09:13 EK: Yeah, it's all driven, in our case, by the business needs and the type of business application that's using that type of storage or memory. Since I oversee a lot of trading applications and trading environments, we obviously use in-memory computing quite a lot, where we've actually built clusters of servers and we utilize their memory components in an aggregated format. But we also persist a lot; obviously, we have lots of storage for regulatory and other needs, for which we have flash arrays and SSD drives and SANs as well, and archive storage and lots of fiber-attached storage.
So, it all really depends on the business use and the type of latency and performance required for a particular application, as well as the type of data utilized. For example, I know Tony mentioned this: We process lots of small data elements, like market data, which are quite structured, and so we require smaller block sizes to handle them. Again, that may also drive which type of storage we utilize, whichever is best optimized for processing that type of data element.
10:42 AN: Great. I'd like to stay with you for a minute here. A couple of you talked about the workloads. So, jumping into the workload, were you initially deploying persistent memory or storage-class memory on existing workloads or were they greenfield workloads? And if they were existing workloads, how were you able to optimize them or what changes did you have to make in order to really make use of the new technologies that you were using?
11:13 EK: It was primarily driven by the new workloads. Basically, we started with the existing workloads, and we saw some of the challenges with performance, with disaster recovery, with backups, with synchronization and lots of other issues related to that. And when we started the process of rewriting a lot of applications, we also had to change the data and storage paradigm supporting those applications. So, I would say it primarily started as a new way of accommodating new applications, but also supporting the existing applications with the existing data workloads.
12:01 AN: Great, thank you. And Ernst what about you? Was it existing or new?
12:06 EG: Yeah. Saying that it's new might not be completely, technically accurate. The nature of those workloads might extend beyond the history of this storage class. However, there was a business mandate to really step up in the data field that we are in, which is ultra-low-latency trading and price arbitrage, market-making types of endeavors, which required a completely different class of performance that traditional storage, even an optimized SAN, could not really support and sustain. So, in that way, you could say it's an application modernization class . . . If we were to say a completely new workload came into the picture, I would say that came more from the big data and AI/ML initiatives. Those were about building real-time streaming events and platforms. Those were the only new workloads. So, in a way, it's a mix. It's modernization of legacy business applications that were always mission-critical, alongside the adoption of a new class of workloads that frankly probably could not exist on a traditional storage tier.
13:47 AN: Great. So, Eric Burgener, I see you nodding your head. So, I just wanted to see . . . You've been hearing Tony and Eric talk about their deployment. What are you seeing? What perspectives would you have to offer?
14:01 EB: Well, it's very common that we see people first begin to experiment with PCIe cards that have some kind of Optane storage capability on them. And then, as those applications grow, they really want to be able to share that high-performance storage across a network for better capacity utilization, the ability to grow, things of that nature. None of the panelists really commented, "We started that way and then we went to more networked options." But certainly, when you move to the networked options, that's when NVMe over Fabrics becomes more important, because that's how you deliver the latency that you can create on an Optane-optimized storage system all the way to the application server side. So that's another wrinkle in all of this. I also think it's very interesting that each of the panelists commented that it's really workload capabilities or requirements that are driving the need for this, and that's clearly what we've seen. Some of the workloads that they've mentioned are ones I commonly hear: in-memory databases and other sorts of transactional databases where revenues can be increased if the performance increases. That's very common.
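Editor's note: As a rough illustration of the network piece Eric Burgener describes, here is a minimal sketch of attaching a host to an NVMe over Fabrics target with the standard nvme-cli tool. The transport, addresses and subsystem name are placeholder assumptions, not details from any panelist's environment.

# Discover the subsystems a remote NVMe-oF target is exposing (address and port are examples)
nvme discover -t rdma -a 192.168.10.20 -s 4420
# Connect to one of the discovered subsystems; the NQN below is a placeholder
nvme connect -t rdma -a 192.168.10.20 -s 4420 -n nqn.2020-01.com.example:optane-pool
# The remote namespace now shows up as a local block device, e.g., /dev/nvme1n1
nvme list

Because the fabric transport (RDMA, Fibre Channel or TCP) carries the NVMe command set end to end, the overhead added on top of the local device stays small, which is what makes it practical to extend that low latency across the network.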
15:10 EB: Also, the analytics workload. So, one of the things we've noticed in the enterprise, particularly as organizations undergo digital transformation, is we see more of a real-time orientation with these types of big data analytics workloads. And that's another area where commonly you see Optane-based storage devices brought in to enable higher degrees of concurrency and better throughput that get you closer to that real-time orientation. So, I think what we're hearing from the panelists is certainly something that we're seeing.
One thing that we're also seeing that they didn't mention, however, relates to your question about greenfield versus brownfield deployments. One of the areas where we've seen Optane storage in enterprise storage systems, so SAN-attached devices, is to enable higher infrastructure density so that people can consolidate workloads without concerns about things like noisy neighbor problems. So that's an area that wasn't commented on, and I'd actually be interested in whether any of you three are using the performance of Optane in that consolidation manner as well, or is it purely to drive lower latency for particular workloads?
16:22 AN: Yeah. That's a great question, and I'd like to add to that and, in the interest of time, maybe combine two questions. So, number one is what Eric asked, and number two would be: What benefits did you expect to achieve by deploying persistent or storage-class memory, and did the products meet your expectations? Maybe we'll start with Tony. Tony, why don't you tell us how the performance side has been addressed and whether the products delivered the benefits you expected from them.
16:56 TH: So, before we started the migration, migrating meaning adopting persistent memory, we actually did a very thorough KPI measurement. Basically, we did extensive profiling of our Java applications and Oracle performance to measure the read/write speeds and to work out which parts of the code are doing load-only operations versus which parts are doing a lot of store operations. We needed to separate the two so that later on, after we installed persistent memory, we could work out how to optimize: The loads should go to the persistent memory, and the stores should go to the usual memory. So, these other factors came into it . . . We actually did a lot of homework, because in order to show a drastic performance improvement, we need to do the before-and-after comparison.
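Editor's note: One way to do the kind of load-versus-store profiling Tony describes on Linux is with the perf mem subcommand, sketched below. The command under test is a placeholder, and the events available depend on the CPU, so treat this as a starting point rather than the panelist's actual methodology.

# Sample memory load operations while the workload runs (the JAR name is a placeholder)
perf mem -t load record -- java -jar pricing-engine.jar
perf mem report
# Repeat with '-t store' to see which code paths are store-heavy

Comparing the two reports shows which code paths are read-dominated and are therefore better candidates to sit on the persistent memory tier.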
17:56 AN: That's right, that's right. Yeah, that's a great point. What about you, Eric Karpman? What would you say in terms of the performance, the benefits, the expectations?
18:09 EK: Yeah, like I said, in our business, in my business, these are results-oriented projects. All of our expenses, everything, is billed back to the business units and their business contracts. So, everybody has very strict SLAs, and we have to show clear benefits of why we're implementing a certain technology. So obviously, we do lots of testing. And Eric mentioned whether you look at the benefits in latency versus performance and things like that, but really it's both together. So we look from the beginning of our application or the processing of our data through to the end of it. And then we split it into each component and try to analyze that component. Is it best optimized? Where do bottlenecks exist? Are they in the latency, in the performance, or on the transactional side, the application side, the data side, the database? So, we go step by step in that detailed analysis to really find the most optimal plan.
19:23 AN: Great, thank you. Ernst?
19:25 EG: Yeah. I would like to calibrate . . . When we talk about performance, it's not quite straightforward. It might be a given that we get a significant performance improvement, or the possibility of a performance improvement, by adopting this architecture. However, the path to realizing that improvement is not straightforward. Again, as other panelists mentioned, application specifics and profile really play a significant role: what type of reads and writes. For example, for many big data workloads with sequential writes, there is no performance benefit from this particular storage type. That's why we have a pod and we have convergence. You call it density; we would call it converged architecture, where we usually have different classes of storage converged in those pods, because a different class, a different tier, might be better suited for each particular function within the application or each particular workload asset.
20:51 EG: So, the gains in performance are found, literally, in maintaining competitive edge where microseconds really matter: market making, low-latency trading, high-frequency trading. The other is, as Eric mentioned, that in our business time is money, so when we run a training test in the AI/ML segment, the return of the results can be compressed from weeks to days. So, on one hand, some classes of application benefit in the sense that they can really be valid market participants and maintain a competitive edge. On the other, it's just operational improvement in terms of the efficiency and return of our data science and AI/ML programs, which can really be measured in terms of results compressed into shorter timelines. So, I would say that's the impact of adopting this, at large.
22:05 AN: Great, thank you. So, we're almost out of time. I'd like to ask you all this sort of big-picture, future question: What are your plans, or your organization's plans, for persistent and storage-class memory going forward? Do you think you'll be investing more in it or less, and why or why not? If you could just give us a quick answer on where, directionally, your organization plans to go with it. Maybe we'll start with Eric Karpman.
22:36 EK: Yeah, it's definitely the direction of our company, because it seems to be fast enough to accommodate most of our business needs. There are still going to be different types of storage and memory usage for different, specific applications. But I think, in general terms, we can say that's probably the direction, as it's proven to be enterprise-ready, meaning it works with distributed data centers like the ones we have. There are lots of enterprise features which we value, and I think that's going to be the direction, even though there will still be various needs for other specific types of storage.
23:30 AN: Great, thank you, Eric. Tony?
23:33 TH: Definitely for I/O-intensive applications we will stick with persistent memory, but for CPU-intensive apps, we are having second thoughts about sticking with Intel, because AMD has been releasing far more cores per CPU in the last 12 months, and we are looking at potentially migrating some CPU-intensive workloads to the AMD EPYC platform to take advantage of those extra cores. So for us, we have to decide; it's not a one-size-fits-all solution. Again, it comes down to application profiling and use cases.
24:20 AN: Thank you, Tony. Ernst?
24:23 EG: So, the operational model is just to build on and stay with that architecture and scale out. One constraining factor in all of the scale-up versus scale-out debate, one constraining factor in scale-out, is that it's actually not easy to extricate yourself if you want to make drastic changes. So any change would be tempered. I think we'll stay the course; it will grow linearly, but it's not going to gain more share of the overall footprint. I would say we would stay with about 20% SSD in our footprint, but it will definitely grow linearly.
25:13 AN: Thank you. Eric, having heard this, what do you think? Where's storage class or persistent memory going? You get the last word.
25:21 EB: Well, actually, I wanted to highlight two things I think we've heard from the panelists. Number one is that they're clearly using this technology with their mission-critical workloads, so there doesn't seem to be a concern about whether or not the technology is ready for prime time.
And the other comment that I heard from each of the panelists, although in different ways, is that there is a TCO argument to be made for this. Even though the media type and the devices based on it are expensive, if you put them into a workload where you can drive bottom-line business benefits, higher commission revenues, more volume, et cetera, there is a TCO argument that can be made. So, I think that's interesting.
Certainly, looking forward, as the companies building these types of products ramp to volume, prices will come down, and I think that will allow this type of storage media to be used in a broader set of workloads. We've certainly seen that happen with regular NAND flash as we've moved from MLC to TLC to QLC; even now, we're starting to see some of those technologies deployed for secondary storage workloads that really are not latency-sensitive. And while I don't think that will happen anytime soon for 3D XPoint-based media, I do see that over time, as the dollar per gigabyte comes down, we will see this technology used for a broader set of workloads.
26:44 AN: Great, thank you, Eric, that was really insightful. And with that, we are out of time, but I'd like to thank all of you, Ernst, Tony, Eric and Eric, for your time; I really appreciate your insights. This is a very active space, and we at IDC are certainly going to track it closely. Thank you again for your time and insights. Have a great day.
27:11 EK: Thank you.
27:12 TH: Thank you.
27:13 EG: Thanks.
27:13 EB: Thank you.