Download the presentation: Using Persistent Memory to Accelerate Tweet Search
00:04 Andy Wilcox: Hello. I'm Andy Wilcox, a senior staff site reliability engineer at Twitter, where I work mostly on performance. I'm joined today by Matt Singer, our senior staff hardware engineer, and we're going to talk about using persistent memory to accelerate tweet search at Twitter. We'll talk a little bit about how we organize our search infrastructure, how it works, how we integrated persistent memory, and the results we saw in our experiments.
00:33 Matt Singer: Thanks a lot, Andy. To follow up on what Andy said, Twitter has been investigating using persistent memory to accelerate certain applications, including this search service that we run based on Apache Lucene.
Today we're going to discuss the benefits we observed from using persistent memory on this type of service. So, why do we think technologies like persistent memory might be relevant to a search service? The basic problem with search is that you obviously have a lot of data to search, and it's not financially feasible to put all of that data in DRAM, so you typically store a lot of the search data on a NAND SSD. What you really want for your service is to improve query latency, to improve how many requests per second one machine can service, or ideally both, so you can buy fewer servers and have a faster service. But if you have fewer servers supporting the RPS, each server needs more storage, and DRAM becomes infeasible for capacity reasons in that situation. Andy, can you talk more about the way search works?
01:53 AW: Sure. So, we have a lot of tweets: several thousand per second over a long stretch of the company's history, several trillion in total, I believe. So it's an interesting problem: How do we search them all?
Well, we try to act like good computer scientists and divide and conquer, decomposing the problem into date tiers. We've been in operation since 2006, so you might imagine we put all tweets from 2006 in one tier and all tweets from 2020 in a different tier. That's not exactly how it works; as you can imagine, there weren't as many tweets in 2006 as there were in 2020, so we bookend those tiers with start and stop dates that give us approximately equal-sized tiers. But for the purposes of what we're going to discuss today, you can imagine it on a yearly basis; it makes the problem a little bit simpler.
02:50 AW: So even with date-range tiers, each tier holds a lot of data, so we break it down further into partitions, each serving a sub date range. We do this for two reasons: each partition holds less data, and it covers a smaller date range. This makes the searches faster and also lets the data actually fit on a single piece of hardware.
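The tier-selection idea above can be sketched in a few lines of Python. The tier names and boundary dates here are made up for illustration; the real boundaries are tuned so each tier holds roughly the same number of tweets.

```python
import datetime

# Illustrative date tiers (hypothetical boundaries, not Twitter's real ones).
TIERS = [
    ("tier-2006-2010", datetime.date(2006, 1, 1), datetime.date(2010, 12, 31)),
    ("tier-2011-2015", datetime.date(2011, 1, 1), datetime.date(2015, 12, 31)),
    ("tier-2016-2020", datetime.date(2016, 1, 1), datetime.date(2020, 12, 31)),
]

def tiers_for(start, end):
    """Return the tiers whose date range overlaps the query's date range."""
    return [name for name, lo, hi in TIERS if start <= hi and end >= lo]

# A query spanning mid-2014 to mid-2017 fans out to exactly two tiers:
print(tiers_for(datetime.date(2014, 6, 1), datetime.date(2017, 6, 1)))
# ['tier-2011-2015', 'tier-2016-2020']
```

A query whose range falls entirely inside one tier touches only that tier, which is part of what keeps searches fast.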
We also need reliability, so we add replicas. Of course, you need at least two replicas for basic reliability, one for a deploy and one as a backup. The replicas also help us scale out under load: looking at this diagram, if host 13 is down, host 6 can pick up the load, and if both are operating, we can round-robin between them to spread out the load.
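That replica-selection behavior can be sketched as follows; the host names and health check are illustrative stand-ins, not Twitter's actual setup.

```python
from itertools import cycle

# Hedged sketch: round-robin requests across the healthy replicas of a
# partition. If one replica is down, the remaining ones absorb its load.
def make_picker(replicas, is_healthy):
    live = [r for r in replicas if is_healthy(r)]
    if not live:
        raise RuntimeError("no healthy replica in partition")
    rr = cycle(live)
    return lambda: next(rr)  # each call returns the next live replica

pick = make_picker(["host-a", "host-b"], is_healthy=lambda r: True)
print([pick() for _ in range(4)])  # ['host-a', 'host-b', 'host-a', 'host-b']
```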
The request flow down to the tiers is kind of interesting. Requests come in through Twitter's front door, called the Twitter front end. From there, the requests are directed to a pool of search roots, which are in charge of some caching and of the scatter-gather of queries and results. For a large part of searches, we want all the best matches, so the request goes out to every tier; the roots collect all the results and then apply some machine learning to get you the best list of results.
04:19 AW: Inside each tier, we're again doing round-robin between the partitions, as I mentioned. Now, a little bit about Lucene internals, so you can understand where the time and the memory are spent in this application. When the Lucene application starts up, it reads in a bunch of documents to build a dictionary of where terms occur. This happens in a couple of phases: each document is tokenized, meaning it's divided into individual words, which are added to a dictionary. Different filters are applied to this data, such as stemming, stop-word filtering, normalization and synonym expansion. Stemming is where we try to get down to the root of a word; for example, hugs, hugging and hugged all share the root hug, so we just store that root in the dictionary. We also filter out simple words -- in, the -- that aren't particularly useful; they're not going to assist in any search results, so we get rid of those.
05:25 AW: Normalization is removing accents, punctuation and the like, and in some cases we might want to add synonyms. For example, if a tweet mentions fast, we might put the word quick in there next to it; they mean substantially the same thing, and we might pick up some matches that would otherwise be missed just because somebody didn't use the exact word. So, as we scan through all the documents, we build these dictionaries and posting lists.
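A toy Python sketch of that pipeline is below. The stemmer, stop-word list and synonym table are simplified stand-ins for Lucene's real analyzers, just to show the shape of the processing.

```python
# Hypothetical, simplified analyzers -- not Lucene's actual implementations.
STOP_WORDS = {"a", "an", "in", "the"}
SYNONYMS = {"fast": ["quick"]}

def stem(word):
    # Toy stemmer: hugs / hugging / hugged all reduce to "hug".
    for suffix in ("ging", "ged", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def index(docs):
    postings = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():        # tokenize
            token = token.strip(".,!?")           # normalize punctuation
            if token in STOP_WORDS:
                continue                          # stop-word filtering
            terms = [stem(token)] + SYNONYMS.get(token, [])  # synonym expansion
            for term in terms:
                postings.setdefault(term, set()).add(doc_id)
    return postings

p = index({1: "hugging the dog", 2: "a big dog!", 3: "big hugs", 4: "a fast car"})
print(sorted(p["big"]))    # [2, 3]
print(sorted(p["hug"]))    # [1, 3]
print(sorted(p["quick"]))  # [4] -- synonym expansion catches "fast"
```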
So, most of these structures wind up in memory, but some parts stay on disk. When we did the design, there were trade-offs to make to balance how much is in memory against how much speed we want, and how long we're willing to spend waiting on disk. Each term in the dictionary points to a posting list, which is just a pointer back to the documents that contain the term. So if you're searching for the word big, for example, in our little toy data set, you'll see that it occurs in documents two and three, and we would return those two documents. Even though parts of the dictionary and posting lists are in memory, there's still a fair amount of disk I/O, and it wouldn't be feasible to store the whole terabyte-range data set in DRAM to eliminate that; it's just too much data.
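The lookup side can be sketched the same way; the posting lists below mirror the toy example above, where big occurs in documents two and three, and a multi-term query just intersects the lists.

```python
# Toy posting lists mirroring the example above (hypothetical data).
postings = {
    "big": [2, 3],
    "dog": [1, 2],
    "hug": [1, 3],
}

def lookup(*terms):
    """Intersect the posting lists of all query terms."""
    result = None
    for term in terms:
        docs = set(postings.get(term, []))
        result = docs if result is None else result & docs
    return sorted(result or [])

print(lookup("big"))         # [2, 3]
print(lookup("big", "dog"))  # [2]
```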
06:43 AW: As I just mentioned, with DRAM out of the picture for the whole data set, persistent memory offers us a way forward to reach the terabyte range we'd need for this particular application, and it's quite a bit faster than NVMe as well: we trade the microsecond latencies of NVMe for the nanosecond latencies of PMEM.
So, persistent memory has several interesting modes we could choose from for this experiment. Memory mode is essentially DRAM emulation; it's great for quick testing of large memory footprints. You basically put the system in that mode, use it, and get the benefit of a large memory footprint. App Direct requires you to use the native libraries on GitHub under the PMDK, the Persistent Memory Development Kit, I believe. It's a very interesting library that provides sophisticated primitives you can make use of, for example persistent linked lists and so on, in PMEM. So, if you're really trying to get everything you can out of an app, you could use those tools.
07:53 AW: Finally, there is Storage over App Direct: with something called DAX, direct access, you can mount a file system directly on the PMEM, and a special driver removes the page cache from the I/O path and establishes direct mappings to the persistent memory addresses. This means you have direct, RAM-like access to the PMEM with zero copy, and this is what we ended up using in our testing.
In practice, how did we set this all up? We used an existing platform and added four 512 GB Optane modules split across the memory controllers, combined into one region per memory controller. As I just mentioned, DAX mode allows us to use the PMEM directly for our storage. In the right-hand column here, you can see how we ended up configuring it: it's as simple as making a file system (DAX is supported on XFS and ext4) and then mounting that file system with the DAX option. We changed our app config to use the file system, and we were ready to go. Since our app already used mmap to access its data, this all seemed way too easy, but it worked just fine. It was super easy to get going.
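The access pattern that makes this "too easy" is just ordinary memory mapping, which Lucene already does for its index files. A minimal sketch, assuming a Python stand-in for that mapping; a temp file is used here so the sketch runs anywhere, whereas on the real system the file would live under the DAX mount (a path like /mnt/pmem0 is hypothetical):

```python
import mmap
import os
import tempfile

# On a DAX-mounted filesystem, the loads through this mapping would resolve
# directly to persistent memory with no page-cache copy in between.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"lucene segment bytes")
    path = f.name

fd = os.open(path, os.O_RDONLY)
size = os.fstat(fd).st_size
buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)
header = bytes(buf[:6])   # a plain memory load, no read() syscall per access
print(header)             # b'lucene'
buf.close()
os.close(fd)
os.unlink(path)
```

Because the application never changes its access pattern, switching it to PMEM really is just a matter of pointing its data directory at the DAX mount.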
09:24 AW: So, here's a different view of how the layers stack up. The NVDIMMs are combined into regions using the BIOS tool, then you use the ndctl tool to define namespaces. The /dev/pmem0 and /dev/pmem1 devices are created automatically when you create the namespaces, if you put them in DAX mode. For our use, we kept a direct correspondence of one namespace per region, although more sophisticated arrangements are of course possible.
And then on top of that, you see the DAX file system we used. It's possible to do striping or concatenation using one of the device mapper facilities in Linux, if you need to. But in our test, the data set fit on one device, so we just used /dev/pmem0 directly. Over to Matt to discuss the experiment and our results.
10:17 MS: Thanks, Andy. Before we talk about the experiment, I want to talk about some limitations we had. As Andy mentioned, we were limited to testing on a previously qualified system in our data center that couldn't run a 2-2-2 config of persistent memory, that is, a config where every memory controller channel has both DRAM and persistent memory. Because we could only insert four of these persistent memory DIMMs, we decided to use a smaller data set for this test, about 600 gigs, and we wanted to compare apples to apples, so even when we ran our NVMe testing, we used that same 600-gig data set. Of course, this application actually becomes a lot more interesting at the terabyte scale, and Twitter would be looking at somewhere above four terabytes in a practical use case, so we made some adjustments to the test process.
11:25 MS: So this experiment attempts to mimic the ratio of page cache available to data set size that we'd have if we scaled the system up to a much larger data set, and we'll reflect on that a little more in some of the later slides. There are also a lot of variables changing as you scale up the data set size and the requests per second a host might take, and in how that may affect your redundancy and the replicas that Andy was talking about.
12:10 MS: We only have time to show a subset of the experiments that we ran, so here we're going to focus on the NVMe system and the persistent memory system, both receiving about 8,000 requests per second. On this slide, let's look at some of the top-line metrics. On the NVMe system, we have about 20 gigs of system memory being used for page cache against about 600 gigs of data, and the I/O wait and the system time are both about 6%. Compare that to the PMEM system serving the same data set: the I/O wait goes from 6% to zero, and the system time drops from 6% to 2%. At the same time, the user time goes up from 71% to 91%. Overall, we're using a little bit more CPU, and that seems to be reflected in the slightly higher-quality search results that we'll see later.
13:14 MS: Here's a way to look at how the DAX driver is working. The obvious difference is that the NVMe system is experiencing a total of about 400,000 combined page faults per second, and about 94,000 of these are major page faults; that's a number we'll come back to when we look at the NVMe stats later. On the PMEM system, we see only about 4,000 total page faults per second, and only 2,000 major faults per second.
And as we mentioned, the big difference between having this data on the memory bus versus having to page it in is a tremendous difference in latency. The NVMe system shows a pretty low overall bandwidth, which has a lot to do with the fact that a substantial number of accesses are hits in the page cache, but the latency we see from the drives is consistently near the drive spec, at about 80 microseconds. The PMEM system shows higher read throughput, and the persistent memory read latency is about 445 nanoseconds. So, about half a microsecond versus 80 microseconds is roughly 160 times lower latency to access this data.
15:00 MS: On our NVMe system, here's where that 94,000 major faults per second number I asked you to remember comes back: we're seeing about 80,000 IOPS to the two disks in the system, and that's where most of our delays come from. We see a much more consistent read rate through the DCPMM, at about three gigabytes per second. From all of these metrics, we see around 90% lower total throughput to the NVMe disks than we see when using the DCPMM, and from that we can infer that there's probably about a 90% page cache hit rate for queries on the NVMe configuration. So, let's talk about what this meant for the service metrics, now that accessing this data has 160 times lower latency. Andy?
16:08 AW: Thanks, Matt. The two things we're concerned about in this service, as in most services, are quality and latency. Though the extreme tail doesn't occur that often by definition, you still need to understand what happens at the tail: on a busy system, you can have very long latencies, which can gum up a bunch of things along your request path. If you look at this first graph, you can see the incredible reduction in the maximum latency. On the NVMe system it spikes sometimes to 20 or 30 seconds (the graph is in milliseconds), which is quite high; I guess a disk just gets behind. Whereas on the PMEM system, we reach a maximum of about 600 to 700 milliseconds. It's a dramatic difference.
At the lower tail latencies, though, the four-nines and three-nines still show quite a good gain; they're both about 30% better. As you can see in the graph in the middle, those percentiles for NVMe are the red lines, with the corresponding PMEM ones in blue and orange. Below the three-nines, we didn't see much difference in the lower-order latencies. This is probably due to the page cache, which is being hit most of the time, we believe; at the P90 and the P50, there's hardly any difference between the systems.
17:36 AW: The other thing we're concerned about is early termination. The way Lucene search works is that after it's looked for results for a certain period of time, if it hasn't exhausted the data it has to search, it'll stop anyway in the interest of latency. That means you're not getting quite the results you otherwise would, because it obviously stopped searching. With the PMEM system, we got about a 25% decrease in early terminations; if you multiply that back into a request rate of about 8K, we're getting higher-quality results for about 2% to 3% of queries, which is pretty significant.
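The early-termination behavior can be sketched as follows. Lucene implements this with timeout-aware result collection; this simplified stand-in just checks a deadline while scanning and reports whether it stopped early.

```python
import time

# Hedged sketch (not Lucene's actual collector): scan documents until the
# time budget runs out, then return partial results plus an early-exit flag.
def search_with_budget(doc_ids, matches, budget_s):
    deadline = time.monotonic() + budget_s
    hits = []
    for doc in doc_ids:
        if time.monotonic() > deadline:
            return hits, True          # early termination: partial results
        if matches(doc):
            hits.append(doc)
    return hits, False                 # exhausted the data within budget

hits, early = search_with_budget(range(1000), lambda d: d % 100 == 0, budget_s=1.0)
print(len(hits), early)  # 10 False -- a generous budget avoids early termination
```

Faster storage shifts more queries into the "exhausted the data" case, which is exactly the 25% drop in early terminations described above.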
18:21 AW: So, as Matt mentioned earlier in the presentation, we had a lot of knobs to turn in this experiment, and we did turn a lot of them. For this app to run well, there's a bunch of things to consider. The old gen is going to scale with data set size: as I mentioned previously, the dictionaries and posting lists are stored there, so the more data you have on disk, the more you'll have in these internal data structures; it scales pretty linearly. The new gen scales with RPS: you create a lot of garbage as you service these requests, so we have to size it to make sure we're keeping up and not adding latency in garbage collection.
19:05 AW: On the NVMe config, the page cache has to scale with the data set size, as we've already alluded to, and I'll discuss that a little more in just a second. On the DRAM size, obviously we want to maximize use: it needs to be big enough to hold the old gen, the new gen and the page cache, so we have to balance it across all of those. And on the PMEM side, we also want to maximize use; we want to get the biggest data set possible onto that system.
19:37 AW: So, for NVMe instead of PMEM, you need much more system DRAM for the page cache than the 3.5% we tested here. We did other testing which indicates that to get the I/O wait down to something not so terrible, around a few percent, we need page cache of about 10% of the data set size.
So, in the table below we have some hypothetical configurations. Skipping to the bottom row, with an eight-terabyte data set in PMEM, we can get by with a terabyte of DRAM, and we don't need to account for any extra memory for page cache. If instead we put this on an NVMe system with the same configuration, we'd need to add nearly another 800 gigabytes of DRAM just to get the page cache hit rate we need for the system to perform at all.
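The arithmetic behind that row is just the 10% page cache rule of thumb from the earlier testing; a small sketch (using decimal terabytes to match the talk's round numbers, and remembering the PMEM configs need none of this extra DRAM):

```python
# Back-of-the-envelope for the hypothetical configurations table: on NVMe,
# page cache of roughly 10% of the data set size is needed to keep I/O wait
# down to a few percent.
def extra_page_cache_gb(dataset_gb, frac=0.10):
    return dataset_gb * frac

for dataset_tb in (1, 2, 4, 8):
    gb = extra_page_cache_gb(dataset_tb * 1000)
    print(f"{dataset_tb} TB on NVMe -> ~{gb:.0f} GB extra DRAM for page cache")
```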
20:43 AW: So, overall thoughts on this: with NVMe storage, DRAM has to scale with the data set, as we mentioned, and with DCPMM, it doesn't. The DAX driver was incredibly convenient to use, and bypassing the page cache turned out to be one of the biggest wins in this test. And finally, in this test we didn't need more bandwidth than one of the DCPMMs provided; we'd obviously need more modules for capacity reasons, but we're not close to stressing the bandwidth of the PMEM. Over to Matt.
21:28 MS: So, search is an application where these metrics lead us to conclude that we can benefit from really low-latency access to storage, but we didn't need high bandwidth. As we look to the future, CXL is a really appealing model for attaching persistent memory, because it will allow a lot more flexibility in how much storage is attached, since there won't be the limitation of the DIMM form factor. On the other hand, apps like Memcache that run much lighter-weight queries at a much higher request rate will still find memory-bus-attached storage very appealing.
22:02 MS: There are some other aspects of persistent memory that we didn't touch on here, and we're doing more work on how we can natively leverage persistent memory through App Direct mode in things like Memcache; that's going to let us speed up cache hydration across restarts of the cache software or the system. And, as we're finding, when you make these fundamentally large changes in how big the data set can be or how many requests per second you can process, there's a lot of iteration toward the optimal config. I really appreciate your time listening to our presentation today, and a big thank you to my co-presenter, Andy.