00:10 Jia Shi: Welcome to Flash Memory Summit 2020. My name is Shi, I work at Oracle. I build the Exadata database machine. I'm super thrilled today to be given the opportunity to take you on a tour to experience an Exadata database transaction, under the hood. I'd like to show you some of the secrets that we have in how we harness the power of persistent memory.
So, before we get started on our tour, I'd like you to meet one of our Exadatas. As you can see on this slide, we have an Exadata rack on the left side. Within this rack, we have a bunch of database servers, either two-socket or eight-socket database servers, where our Oracle database software runs. Then, at the bottom, we have the storage servers, which are where our data is stored, and we have built a bunch of very cool storage features in the storage server for our smart I/O processing.
01:14 Shi: And connecting these two, we have the super-fast 100 gigabits per second RoCE, which stands for RDMA over Converged Ethernet, and we use the RoCE network for both our database-to-database communication, as well as database-to-storage communication. Last but not least, we have persistent memory: 1.5 terabytes of persistent memory in each one of our storage servers. So, in our storage servers, we can either have disks plus flash, where we have flash as a caching tier on top of the disks, and now we put in persistent memory and build another tiered cache on top of flash, which we call the persistent memory cache. Or you can have persistent memory in our extreme flash storage servers, where you have all-flash storage -- the persistent memory again serves as a caching tier on top of that.
02:13 Shi: So, now that you've met an Exadata, let me take you under the hood of an OLTP transaction. OLTP, as we all know, stands for online transaction processing. So, what is an OLTP transaction? I'd like you to meet Ben. Ben is a very typical user who, let's say, in this case, wants to deposit $1,000 to his bank account. So, let me break down how a database transaction happens in this case. First, the application is going to send the SQL to the database server. Upon receiving the SQL, the database server will parse the SQL and create a cursor, and this is a purely in-memory, on-CPU operation, so it's super-fast. Then, as part of executing the cursor, we traverse an index tree trying to locate Ben's account. In this case, perhaps we use Ben's account ID as a primary key to look up his account. And, typically, because those index trees are very frequently accessed by different applications, they're usually cached in memory. So, again, we're happy. We're on the CPU, we're hitting the memory or the CPU cache, the performance is great.
03:28 Shi: Next, we found Ben's account, but where is the block? Where is the actual data block that contains his information? Oh no, in this case, the data is a miss, it's a miss in the buffer cache. We don't have the data in our memory. So, what do we have to do in this case? Uh oh, we have to go to the storage. I may be a bit dramatic here in calling it falling off an I/O cliff, but in a second, you will know why I call it that. Because in order to get to Ben's data, which is stored in the storage server, the database software at that point has to pause all the processing that was super-fast on the CPU and from memory, go do the I/O, and wait for that block to be returned, and this is a big gap in the transaction processing. So, here comes our OLTP challenge No. 1: What is the I/O cliff for the random data reads?
04:23 Shi: So, let's take a closer look at the random data read. In this case, the database identified the row of Ben's account, but it's a miss in the buffer cache. So, naturally, it has to issue the read to the storage server. And, in this case, let's say my storage server has a bunch of hard disks in there, and I have a flash cache sitting in front of them, which we use to host the hot data in flash so that it can accelerate this kind of OLTP random read. So, in this case, let's say I'm lucky -- Ben's data is actually in my flash. So, I go ahead and perform a read from the flash cache and I send the data back to the database, and in this case, the database gets a buffer back, puts it in its buffer cache and can continue the OLTP transaction processing.
05:13 Shi: So, how long did we wait for this random block read? Well, let's take a closer look. In this case, I have broken down the various layers that are involved in doing an I/O. As you can see, not only do we have to traverse a network -- take a full round trip -- we also have to go through the database application, through the kernel, and through the network driver, both on the database side as well as on the storage server side, so that's quite heavy.
First, let's take a look at how long it takes to do an 8K read, which is a typical data block size, on local flash. Well, it takes probably less than 100 microseconds in the latest NVMe flash generation. So, with that, you'll wonder, "Hey, what is the end-to-end I/O read latency?" What we have is about 200 microseconds. As you have guessed, 100 microseconds comes from the flash cache read itself, and there's another 100 microseconds that comes from the network round trip, the I/O software stack processing, the network I/O interrupt processing, the context switches and all the other software overhead.
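As a back-of-the-envelope check, the round numbers above add up like this (a sketch using the approximate figures from the talk, not measured values):

```c
#include <assert.h>

/* Approximate latency figures from the talk, in microseconds. */
#define FLASH_8K_READ_US  100  /* 8K read from local NVMe flash               */
#define SOFTWARE_STACK_US 100  /* network round trip, I/O stack, interrupts,
                                  context switches on both sides              */

/* End-to-end latency of one remote random block read over the I/O path:
 * the flash media read plus all the software and network overhead. */
int end_to_end_read_us(void) {
    return FLASH_8K_READ_US + SOFTWARE_STACK_US;
}
```

The point of splitting it this way is that making the media faster only attacks the first term; the second 100 microseconds of software overhead stays unless the I/O path itself changes.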
06:26 Shi: OK, so now that we know about the 200-microsecond I/O cliff, how do we conquer it, right? Persistent memory. I haven't talked about PMem for a while and now here it is. This is the first secret sauce that we have for solving the I/O cliff problem. So, PMem, just a quick recap, is a new silicon technology. Reads are really fast, almost at memory speed, much faster than flash. Writes are a bit slower than DRAM, but still much faster than flash, and the good thing about them is that they actually survive power failure, unlike DRAM. The only thing about using PMem that you need to be careful with is that it does require quite sophisticated algorithms to make sure that the writes to PMem are indeed persistent across power failures. That involves special instructions you need to call if you do a core access from a CPU core to write to PMem, or if you write to it through a PCIe device, such as a NIC performing RDMA.
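The CPU-side discipline is: store first, then explicitly flush and fence before treating the data as committed. Here is a minimal portable sketch of that pattern; on real hardware the flush would be per-cache-line CLWB instructions followed by SFENCE (or a library call like PMDK's pmem_persist), and the simulated persist() and the media array below are stand-ins I've invented so the sketch can run anywhere:

```c
#include <assert.h>
#include <string.h>

static char pmem[4096];   /* stand-in for a mapped PMem region        */
static char media[4096];  /* stand-in for what survives power failure */

/* Simulated flush-and-fence: move the CPU-cache view down to durable
 * media. On real hardware: for each dirty cache line, CLWB; then SFENCE. */
void persist(const void *addr, size_t len) {
    size_t off = (size_t)((const char *)addr - pmem);
    memcpy(media + off, addr, len);
}

/* Write a record durably: store, then flush and fence. Without the
 * persist() step, the store could sit in the volatile CPU cache and
 * be lost on power failure, even though the target is PMem. */
void durable_write(size_t off, const char *buf, size_t len) {
    memcpy(pmem + off, buf, len);  /* normal cacheable store */
    persist(pmem + off, len);      /* make it power-fail safe */
}
```

The ordering matters: a reader (or a crash) must never observe the record as committed before the flush and fence have completed.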
07:31 Shi: All right, so let's take a look at RDMA. RDMA stands for remote direct memory access. So, what does that mean? That means that one server can access a memory region on a different server directly through the network card, and this direct access does not require any software processing on the storage server side. So, that's really fast. Now, let's see how we can combine the power of persistent memory and RDMA to help us conquer the I/O cliff here.
So, first, you may think, "OK, a naive solution: you tell me that PMem is much faster than flash, so perhaps I'll replace my flash cache with a persistent memory cache." How does that sound? I say, "That sounds great." If I have a PMem cache in my storage server and I do a local read -- nothing else changes, just a local read -- I can get about 1.6 microseconds if I access from a local core, and 2.4 microseconds if I access from a remote core across a QPI link. So, that's pretty fast, right?
08:42 Shi: What about the end-to-end latency? What does my I/O cliff look like? Still about 100 microseconds, because, you'll remember, there's still the 100-microsecond I/O software overhead plus network round trip in my equation. So, even though I mostly eliminated the 100-microsecond latency from the flash read, my end-to-end latency is great, but it's not awesome. How about I choose a different way to do my read? How about I get rid of all the I/O overhead and perform an RDMA over my 100 gigabit RoCE? What does that look like? If I were to do that, I'm actually able to achieve sub-19-microsecond latency on an Exadata system, and this is a 10x faster random read. So, let's take a closer look at how we actually accomplish that.
So, if we have a miss in the buffer cache, instead of issuing a read request to the storage server, the database simply issues an RDMA read directly against the memory region on the storage server. In this case, let's say Ben's account sits in the PMem cache. The database is able to locate it and perform a direct RDMA read to fetch that bank balance all the way back into its buffer cache. This 8K data block read involves no software on the storage server side, so it's very, very fast, and that is how we get the 10x performance improvement.
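To make the "no software on the storage server side" point concrete, here is a toy model, not the Exadata code: the storage server only exposes a memory region, and the database side copies the block straight out of it, never calling into any server-side handler. In a real system the copy would be an RDMA read work request posted to the NIC (e.g. an ibverbs IBV_WR_RDMA_READ); the names and sizes below are illustrative:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 8192  /* typical database data block size */

/* Storage-server side: nothing but a registered memory region.
 * There is deliberately no handle_read_request() function here --
 * the absence of server-side software is the whole point. */
char pmem_cache[4][BLOCK_SIZE];

/* Database side: a buffer-cache slot to receive the block. */
char buffer_cache[BLOCK_SIZE];

/* Model of an RDMA read: the client fetches bytes directly out of
 * the server's memory region. In real code the NIC does this copy;
 * no storage-server CPU cycles are spent. */
void rdma_read(char *dst, const char *remote_addr, size_t len) {
    memcpy(dst, remote_addr, len);
}

/* Cache-hit path: the database already knows where Ben's block lives
 * in the remote PMem cache, so one RDMA read fetches it. */
void read_block(int slot) {
    rdma_read(buffer_cache, pmem_cache[slot], BLOCK_SIZE);
}
```

The database still has to learn where the block lives (a hash lookup against the cache directory), but once it has the remote address, the fetch itself bypasses the entire server-side I/O stack.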
10:20 Shi: And now let's take a look back at Ben's transaction. Remember the 200-microsecond I/O cliff it fell off? So, what happens now? If I do an RDMA, my database process just sits on the CPU, spins, and polls for the completion of the RDMA, and that takes less than 19 microseconds. This is faster -- 10x lower latency -- and at the same time, it's also significantly cheaper than yielding the CPU and getting scheduled back, because I avoid all the cost of a context switch. So, with that, I'm able to get the data back in my memory and update Ben's account balance from $2,000 to $3,000, so I'm really happy, right? My performance is back on track.
11:07 Shi: So, now what happens next? If you're familiar with database transactions, you will know that there are the ACID properties, and the D in ACID means durability. What that means is that we have to make sure the change is persisted on durable media. So, how do we do that? We have to commit the transaction, and to commit the transaction, we have to go write it to the storage server.
Oh no, here's my I/O cliff again. All right, so can lightning strike the same place twice? What is this I/O cliff for the redo log writes? Let's take a closer look. What happens with a redo log is that the database has to send the redo log writes to the storage to be persisted. In our case on Exadata, we send the redo log writes simultaneously to both the persistent DRAM cache in front of our hard disk controller, as well as to our flash log. What this means is that if either device is temporarily glitchy, it's OK: as soon as either write finishes, we can send an ACK back to the database. We call this the flash log feature.
12:27 Shi: And this itself is able to give us a low-latency write, but it still involves the whole messaging protocol and the I/O to the actual flash or DRAM cache. How long does it take? Yes, 200 microseconds -- very similar to the random read latency we experienced before. So, can PMem and RDMA come to the rescue again here? Remember the I/O cliff? Let's see how we can use PMem and RDMA to avoid falling off it.
So, in this case, if we were to do an RDMA log write to the PMem in the storage, how is that going to work? What we can do is send the RDMA containing the log write to the storage server. On the storage server, we create a shared receive queue that has a ton of PMem buffers sitting there waiting for the redo to land. We deposit the redo into one of those available PMem log buffers, and because persistent memory is persistent, that redo is considered committed right away. So, a NIC-to-NIC ACK is sent back to the database server as soon as this write to PMem has made it into the ADR safe domain, and there's no software handling involved here. That's extremely low latency. In the background, our software on the storage server destages those redos from the PMem log buffers to the backing store, which frees up the buffers to be reused for future redo writes.
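The buffer-recycling scheme just described can be sketched as a small pool of log buffers: a redo record is committed the moment it lands in a free buffer, and a background destager later moves it to the backing store and recycles the buffer. This is a toy model with invented names and sizes, not Oracle's code:

```c
#include <assert.h>
#include <string.h>

#define NBUF  4
#define BUFSZ 512

struct pmem_log_buf {
    int  in_use;       /* 1 = holds redo that is not yet destaged */
    char redo[BUFSZ];
};

struct pmem_log_buf pool[NBUF];

/* Incoming RDMA log write: deposit redo into the first free buffer.
 * Because the buffer is in PMem, the redo is durable -- and therefore
 * committed -- as soon as it lands. Returns the buffer index, or -1
 * if all buffers are busy (the writer would then fall back to the
 * slower I/O path). */
int rdma_log_write(const char *redo, size_t len) {
    for (int i = 0; i < NBUF; i++) {
        if (!pool[i].in_use) {
            memcpy(pool[i].redo, redo, len);  /* lands in PMem: durable */
            pool[i].in_use = 1;               /* committed right here   */
            return i;
        }
    }
    return -1;
}

/* Background destage: copy the redo to the backing store and free
 * the buffer for reuse by future redo writes. */
void destage(int i, char *backing_store) {
    memcpy(backing_store, pool[i].redo, BUFSZ);
    pool[i].in_use = 0;
}
```

The key property is that the commit latency is decoupled from the backing store: the foreground path only has to find a free PMem buffer, while the slow destage happens off the critical path.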
14:11 Shi: So, you'll say, "Shi, that looks so great. What happens if I have a power outage? What happens if I just wrote my redo to PMem and the server crashed? Is my redo safe?" As I mentioned earlier, yes, it is. There is a P in PMem, which stands for persistent, and the way we issue the PCIe write from the NIC to the PMem, we make sure the PCIe write bypasses the volatile CPU cache. The data never sits only in the CPU cache; it lands in the ADR safe zone of the PMem directly, and that's how we guarantee this write will be persistent across a power failure.
So, now, let's take another look at Ben's transaction. Remember the I/O cliff for the redo log writes? What can we do here? Yes, we go ahead and issue an RDMA write to the PMem log, and that gives us 8x faster log writes and allows us to avoid the I/O cliff again. So, you will ask me, "Shi, OK, this is all great. What's the big deal about those 200 microseconds? Why is that special?"
I'd say it is actually quite special, because even though the bank deposit is a simple, silly example, in the real world there are a ton of critical OLTP applications that will really benefit from the low latency and high throughput of the PMem acceleration.
15:40 Shi: For example, we have fraud detection in the real world, or real-time analytics that needs to be performed when the user clicks on the screen or moves around. All of those are very high-throughput, low-latency applications that will greatly benefit from the PMem acceleration. And there we go. Remember the RDMA? We do the RDMA read to accelerate random read I/O, and we do the PMem RDMA write to accelerate the commits.
Now, this brings me to the end of my talk. With OLTP on Exadata, we can have our cake and eat it too. We dedicate more than 99% of our PMem space to the PMem cache and less than 1% to our PMem log, so we're able to build the largest cache possible to hold as much hot data as we can, and we get 10x better transaction I/O latency, 8x faster log writes and 16 million read IOPS on a full-rack database machine.
So, I'm extremely proud of what we have accomplished with persistent memory, and I would like you to think about how you can harness the power of PMem in your own application. If you have any ideas or you have any questions, you want to continue our discussion, please feel free to drop me an email. I am all yours. Thank you.
17:10 Yao Yue: Hi, everybody. My name is Yue. I am a senior staff software engineer at Twitter. Today, I want to talk about caching on PMem: how Twitter used an iterative approach to mesh our in-house caching solution with this new storage technology. I want to divide this short talk into four parts: first, an overview of what caching looks like at Twitter; second, how we started with Intel's help and Intel's equipment; then, how we reproduced the same testing in Twitter's production environment; and, eventually, how we came up with a new design to better take advantage of PMem.
17:55 Yue: So, first, let's look at caching at Twitter. Here, what I mean by caching is in-memory distributed cache, mostly located within a single zone or single data center, and Twitter has lots of them. We have over 300 cache clusters in production. They are scheduled as smallish fixed-size jobs that are stacked on top of each other, so we have tens of thousands of instances running on many thousands of machines. The largest caches we have hit a max QPS of roughly 50 million, and we have a published SLO of 5 milliseconds at the 99.9th percentile latency, and we try to uphold that. And because we have a reasonably sizable operation, for any change we introduce -- PMem is not an exception -- we want to do it in the same codebase so we can retain the same high-level API, and we want configuration options to turn any feature on and off and still achieve predictable performance. If anybody is interested in a detailed explanation of how caches work and behave at Twitter, we have a new paper published at OSDI 2020, and you can click on the link.
19:19 Yue: So, here are some quick cache facts for those who have not worked with cache extensively. For those who have, I'm sure these things are obvious. First, because we are using memory to store a lot of data, caches are, more often than not, quite memory hungry. The second thing, which sort of counters that fact: despite using a lot of memory, the performance bottleneck of a cache is very, very rarely, if ever, memory bandwidth or memory speed. Mostly, we see throughput bottlenecks in the host networking stack.
So, this gives us a hint: we can potentially take advantage of PMem as denser storage, as long as we don't shift the bottleneck from the network to memory. Also, cache tends to be quite important in production, and this is definitely true for Twitter, where any cache incident can easily translate into a site outage, so cache availability is very important. And because we use a lot of resources to run caches, any improvement in efficiency can lead to cost reduction, which is desirable for a business. Finally, from an operational perspective, since we run so many instances across different clusters, it's always nice if we can achieve faster restarts, so we can upgrade and maintain our clusters very efficiently.
20:58 Yue: So, why did we start considering PMem? The first obvious thing is that PMem is denser storage. For the same cost, the same footprint, or the same number of hosts, we can simply store more data. This has an apparent TCO benefit for any cache that's memory bound, which is the majority of them. And even for those that aren't, if we can store more data and serve more hits out of cache as compared to misses, there are secondary benefits, like avoiding slower, more expensive queries going to other services such as databases.
The second aspect is taking advantage of the persistence of PMem to achieve graceful shutdown and faster reboot. Again, this has availability benefits, but because we haven't done enough work in this regard, I'm going to skip this second aspect in today's talk.
22:03 Yue: So, we mostly focused on the density benefit of PMem. Here's the master plan for exploring PMem. We want to use a modular framework, due to the constraints I was talking about for running cache at Twitter, and this framework is called Pelikan. It's an open-source, modular cache. It can behave like memcached, or it can behave like Redis or any other protocol you may want to introduce, and it has very similar runtime behavior across different interfaces. When it comes to testing, we started on Intel's equipment, but then reproduced everything on Twitter's equipment. And finally, our observations and insights led us to come up with a new design that is much better tailored for PMem. Here's the test setup. We have a packed, stacked host that mimics the production configuration at Twitter.
23:07 Yue: In terms of object size, we go from very small to moderate, but generally on the smaller side. Data size is one thing we try to explore, trying everything from very small -- 4 gigabytes, which is what we do most of the time with DRAM instances -- to 32 gigabytes, which reflects the density difference between PMem and DRAM. In terms of the number of connections, we try to be realistic, so it's going to be either 100 or 1,000 connections per instance; that's relatively high. We want to focus primarily on latency, because it's easier to scale out throughput by adding more instances, but latency is much harder to change. And in terms of PMem versus DRAM, we want to see: Do we have degraded performance with PMem? If so, what is the reason? We want to compare Memory Mode and App Direct Mode, because that seems obvious. And we want to understand: if more data is served out of PMem, does that change the performance characteristics? And if it does, how can we identify and understand the bottleneck?
24:33 Yue: So, we started with a lot of help from Intel. This was about two years ago, so at the time, the easiest way of getting access to PMem at all was using Intel's lab. Intel actually helped us conduct most of the tests I'm reporting here, and they have really beefy configurations: fast CPUs, fully populated PMem channels. This is a loaded configuration, and we ran a large number of jobs with a typical read/write ratio in Intel's lab. The first test is in Memory Mode, because that doesn't require any code modification, and the results reported here, as you can see, sort of check all the boxes.
On the latency side, everything is under 2 milliseconds, so that is well within the SLO we were targeting. In terms of throughput, you can see the percentage of data that should reside on PMem actually has very little impact. For different value sizes, the throughput appears roughly the same, and the only big dip we saw comes when the value size is larger than a single MTU. So, this reaffirms that the bottleneck is primarily the network.
26:12 Yue: And then we switched to App Direct Mode. Here, we actually need to modify Pelikan, because otherwise it won't recognize PMem as a separate device. By taking advantage of PMDK, we were able to add a new abstraction called the data pool, and modify the rest of the codebase to use the data pool, in about 300 lines of C. So, this is not a big change, and it lets us choose DRAM or PMem with just a configuration switch. So, we went to test App Direct Mode. This time, we skipped the largest value size, because we know it would trigger the network bottleneck. For the rest, again, we see that at about 1.3 to 1.4 million QPS, the latency looks excellent -- it's under half a millisecond -- and the data set size does not affect throughput or latency very much at all. So, for the second time, we have confirmed that the bottleneck is most likely where it used to be.
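Pelikan's actual data-pool abstraction is built on PMDK and lives in the linked source; the sketch below is only a guess at the general shape of such an abstraction: one open call that is backed either by DRAM (malloc) or by a file-backed mapping (as a DAX-mounted PMem file would be), chosen by configuration, so the rest of the cache never cares which one it got. All names here are hypothetical:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* A "data pool": the cache asks the pool for memory and does not care
 * whether it came from DRAM or from a mapped PMem file. */
struct datapool {
    void  *base;
    size_t size;
    int    is_pmem;  /* config switch: 0 = DRAM, 1 = file-backed (PMem) */
};

/* Open a pool. path == NULL selects DRAM; otherwise the pool is a
 * memory-mapped file (on a real deployment, a file on a DAX mount). */
int datapool_open(struct datapool *p, size_t size, const char *path) {
    p->size = size;
    if (path == NULL) {              /* DRAM mode */
        p->is_pmem = 0;
        p->base = malloc(size);
        return p->base ? 0 : -1;
    }
    p->is_pmem = 1;                  /* PMem mode: map a file */
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return -1; }
    p->base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* mapping stays valid after close */
    return p->base == MAP_FAILED ? -1 : 0;
}

void datapool_close(struct datapool *p) {
    if (p->is_pmem) munmap(p->base, p->size);
    else free(p->base);
}
```

Confining the DRAM-versus-PMem decision to one small module like this is what makes a ~300-line change plausible: everything above the pool keeps allocating and addressing memory exactly as before.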
27:36 Yue: So, these were all really great results from Intel. With this, we finally got some samples at Twitter, and we wanted to reproduce their experiments on our platform, and we were very optimistic going into this. We first reproduced the Memory Mode test, focusing on whether we could uphold the SLO at the target throughput we set, which is about 1 million QPS. Note this target throughput is actually lower than what Intel's lab was able to produce, and yet the results were much worse.
So, if you look at this chart, when we have a smaller data set, like 4 gigabytes -- again, on the lower end of data set sizes -- things are good. But as a significant portion of the data comes to be served out of PMem, when we have a large number of keys in a large heap, our latency very quickly starts to breach the 5-millisecond SLO. This was a little surprising initially, but as soon as we started inspecting the differences between our configurations of the platform, we had a very obvious explanation. You can see that while we have about as much PMem capacity, Twitter's configuration actually has far fewer PMem channels per socket.
29:11 Yue: So, this loss of PMem bandwidth, I think, is the primary reason we saw latency degrade so quickly. The other, secondary reason is probably that we also used more connections than Intel's lab configuration, because we felt that was necessary to validate it for our production use. But not all hope is lost -- we still have App Direct Mode. How does it do? This time, it's much better. We have the same data set sizes in different stages, from 4 gigabytes to over 70 gigabytes of data per instance. This time, by treating PMem explicitly as a different device, with the code change that Intel introduced into Pelikan, we were able to stay well within the SLO.
30:06 Yue: So, there is a way to hit our throughput target while having a much larger data set, if we stick to App Direct Mode. What we observed in our own PMem tests hints that we are not that far from turning PMem into a bottleneck, even though, for the time being, we were able to avoid it. But what if we want to push the throughput further, maybe with some advancement in networking technology or kernel improvements? What if we want to start adopting more data structures that actually need to do memory access? Are we going to hit the PMem bottleneck? I think we will, especially with this few channels per socket. So, we started thinking of a new design that actually respects and takes into consideration what works for PMem. And I got a lot of help: this is work that was primarily done by Juncheng Yang, a Ph.D. student at CMU. We started to think, "What are the desirable properties for accessing PMem?"
31:23 Yue: So, we set some goals: we really want to minimize random reads and random writes, because PMem works much better if we can keep most reads and writes sequential. And there's another goal that is neutral to PMem -- it's not inspired by PMem -- but we noticed that cache objects are actually quite small, so we want to minimize metadata overhead. We were inspired by some designs targeting SSDs, because PMem behaves, in many ways, more like SSD than DRAM.
32:04 Yue: So, if we can mimic that behavior, I think we will get good results. So, we started organizing items not by size, as traditionally done in memcached -- that has been unchanged for the past 15 years -- but by TTL. The nice thing is that, with that change, we stop needing to do storage bookkeeping at the item level. We can do most bookkeeping at the segment level, with large, chunky, mostly sequential reads and writes, and that's great. The other primary data structure that needed an upgrade is the hash table. Previously, traversing the hash table's linked list required a lot of random reads. So, Juncheng came up with this very clever design, which is a multi-occupancy bucket that is chained -- so that's blockchain in one sentence. [chuckle] This turned most of the read and write accesses from random into sequential, and that significantly improves the throughput we can get from PMem.
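The TTL-segment idea above can be sketched as follows. This is an illustrative toy in the spirit of the design described, not the Pelikan code; names and sizes are invented. Items with similar TTLs share a segment, writes are appended sequentially with a bump pointer, and expiration is one segment-level check that reclaims everything at once, with no per-item metadata updates:

```c
#include <assert.h>
#include <string.h>

#define SEG_SIZE  4096
#define NSEGMENTS 8

struct segment {
    int  expire_at;     /* one expiry time for the whole segment */
    int  used;          /* bump pointer: sequential writes only  */
    char data[SEG_SIZE];
};

struct segment segs[NSEGMENTS];

/* Append an item to the segment for its TTL class: a sequential write.
 * Returns the item's offset within the segment, or -1 if full. */
int seg_append(int seg, const char *item, int len) {
    struct segment *s = &segs[seg];
    if (s->used + len > SEG_SIZE) return -1;
    memcpy(s->data + s->used, item, len);
    int off = s->used;
    s->used += len;
    return off;
}

/* Expire a whole segment at once: every item in it goes together.
 * Returns 1 if the segment was reclaimed, 0 if not yet due. */
int seg_expire(int seg, int now) {
    struct segment *s = &segs[seg];
    if (now < s->expire_at) return 0;
    s->used = 0;        /* whole segment reclaimed in O(1) */
    return 1;
}
```

Compared to per-item expiry in a classic slab design, this trades a little TTL precision (items in a segment expire together) for sequential-only writes and constant-time bulk reclamation, which is exactly the access pattern PMem rewards.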
33:19 Yue: So, with these changes, we focused on testing the difference between the traditional slab storage and the new PMem-aware storage. As you can see, the benefit is quite noticeable, especially for writes. In some cases, we achieve more than 2x the throughput, because we designed for PMem instead of using a design that was meant for DRAM and just hoping it would work. So, that is what we have done so far. Going forward, this new segment-based cache is most likely what we would use for both DRAM and PMem, because it has such wonderful performance. But there are some other takeaways. The first is that we understood where the bottleneck of a cache is.
34:09 Yue: So, the goal was primarily always to avoid turning PMem into the new bottleneck, and in that we succeeded by sticking with App Direct Mode, which is a clear winner in my opinion. But Memory Mode served its purpose: we were able to hit the ground running very fast, without code changes, because we could test in Memory Mode, and it gave us a ballpark idea of what we should expect. Also, it's important to do due diligence. We didn't want to rely on results from a different platform, and it turned out the differences actually mattered. So, we were glad we tested on Twitter's platform and optimized accordingly to make it work.
And finally, I want to say that we didn't come into this project saying, "We are going to design the best cache for PMem." We only did it when it became necessary, based on experiment results and a deeper understanding of the characteristics of the media. So, I would like to keep that attitude going forward: taking an incremental approach when we want to design new features.
35:34 Yue: So, for the future, what I'm really interested in is the persistence aspect of PMem, which we have overlooked for the most part. I think there are going to be some very interesting results if we continue to invest in this area. So, with that, I conclude the talk. I have provided a couple of links; if you want to read the source code or learn more about the Pelikan project, you can click the links, and feel free to ask me questions.