Handling Slow Disks in Heterogeneous SSD Deployments
You'll learn how disk performance varies across a heterogeneous SSD fleet and how those performance variations can be managed.
00:02 Chaitanya Solapurkar: Hello everyone. Welcome to our session on Handling Slow Disks in Heterogeneous SSD Deployments. My name is Chaitanya, and I'll be co-presenting this talk with my teammate, Alejandro. We both work at Amazon CloudFront, which is a service under AWS.
So, we are here today to talk about how CloudFront manages a fleet of hardware, looking specifically at disk performance. Here's what we're going to cover today. We'll start with the motivation for why we even look at these issues in CloudFront, give a quick overview of what CloudFront does, and then talk through some disk performance specifics in terms of the variations we notice. We'll then look at how we handle these performance outliers in the interest of maintaining a good viewer experience. And then, finally, we'll wrap up with a conclusion.
01:29 CS: So, performance matters because it can help with a few things. It can lead to more pageviews, better customer experience, higher conversion rates, and it can help a website rank higher in search engines. Multiple studies have shown that even delays on the order of 100 milliseconds or half a second can affect whether users convert or finish a process they started, such as online shopping or a search they initiated. If pages take too long to load, many users will simply abandon them. This is a very important reason why we look closely at performance.
02:34 CS: So, to understand a little more about it, let me start by giving a quick overview of what CloudFront is. CloudFront is a content delivery network that securely delivers content to a global audience with low latency and high transfer speeds. Now, the top priority for CloudFront and AWS is security and availability, but performance is also a key aspect of what we do. We also provide support for accelerating the various workloads that you can see on the screen, whether that's full site delivery or streaming. Not only this, but there is also customizability in terms of adding custom logic that executes independently from the origin server, which helps you customize the experience for viewers. And there are built-in tools that provide ways to manage and configure these CloudFront distributions and integrate with other AWS services.
04:01 CS: So, in terms of the physical footprint of CloudFront, CloudFront is present globally in over 216 locations. This number has been growing at a fast rate, and it has only grown since the time we prepared this presentation. These points of presence are located close to viewers and have good connectivity in terms of peering and transit to end-user networks. Not only this, there is also a private AWS backbone network that CloudFront uses to connect with other AWS services. The great part about this is that, since it's a private network, we monitor, scale and evaluate its performance very closely. So, not only do we have these PoPs close to viewers, but we also have a midtier caching layer called the regional edge cache, which reduces origin load by reducing the number of requests that go to the origin. This also helps improve performance on the origin side by maintaining persistent connections.
05:27 CS: Now that there are a number of such edge locations, there are two key parts to this architecture: one determines how a viewer request even reaches a PoP or edge location, and the other determines what happens within that edge location. To start with the first one, reaching a PoP: CloudFront selects the best PoP for a viewer based on a number of criteria, such as the measured latency of different viewer networks to edge locations and whether there is capacity to serve the traffic there. It also takes into account the customer configuration and preference in terms of where they want to be routed to and, last but not least, cache reusability, in terms of having some stickiness of where requests for a CloudFront distribution land.
06:33 CS: This part is really important because in many areas we have PoPs which are very close by and share some network characteristics. We want to ensure that requests for a CloudFront distribution go to the same PoP with higher probability so that there's better cache reuse. And once these requests reach the PoP, we have multiple cache servers which actually serve out the content, so load balancing across these cache servers is also important: it gives us balanced usage of the resources there and, in terms of SSDs, balanced wearout or aging of the hardware.
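To make the stickiness idea concrete, here is a minimal, purely illustrative Python sketch of consistent hashing, a common way to keep requests for the same key on the same server while remapping only a small slice of keys when a server is added or removed. It is not CloudFront's actual routing or load-balancing logic, and the node names, replica count and cache key below are hypothetical.

```python
import bisect
import hashlib

# Minimal consistent-hash ring: requests for the same key (e.g., a distribution
# plus object path) land on the same node with high probability, and adding or
# removing nodes only remaps a small fraction of the keys.
class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas   # virtual nodes per server, smooths the load
        self._keys = []            # sorted hash positions on the ring
        self._ring = {}            # hash position -> node name
        for node in nodes:
            self.add(node)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}:{i}")
            bisect.insort(self._keys, h)
            self._ring[h] = node

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)
        return self._ring[self._keys[idx]]

# Hypothetical cache servers within a PoP and a distribution-scoped cache key.
ring = HashRing(["cache-01", "cache-02", "cache-03", "cache-04"])
print(ring.get("E2EXAMPLEDISTID:/images/logo.png"))
```

Because only the keys between a removed node and its predecessor on the ring get remapped, taking one cache server out of service leaves most of the cached working set where it already is.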
07:27 CS: A couple of other factors also play an important part. One is which cache layer stores the content for the different objects that are requested. Different cache layers are meant for different types of workloads, whether that's popular content or long-tail content. So placing objects on the right cache layer, as well as what algorithm runs at the different cache layers, influences how much work is put on the hardware at that location, and specifically on the disks, because CloudFront uses a disk-based cache. To talk more about that, let me hand it over to Alejandro.
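As a generic illustration of how a caching algorithm can change the amount of work the disks see (this is not CloudFront's policy, and the class and key names are made up), here is a sketch of a "cache on second hit" admission rule, which avoids spending a disk write on objects that are only ever requested once:

```python
from collections import Counter

# Generic "cache on second hit" admission policy: only write an object to the
# disk cache once it has been requested more than once, so one-off long-tail
# requests never cost a disk write at all.
class SecondHitAdmission:
    def __init__(self):
        self.seen = Counter()   # request counts for not-yet-cached objects
        self.disk_cache = {}    # stand-in for the on-disk cache

    def handle(self, key, fetch_from_origin):
        if key in self.disk_cache:
            return self.disk_cache[key]        # disk hit, no extra write
        body = fetch_from_origin(key)
        self.seen[key] += 1
        if self.seen[key] >= 2:                # popular enough: admit to disk
            self.disk_cache[key] = body
        return body

cache = SecondHitAdmission()
cache.handle("/video/chunk-17", lambda k: b"...")   # first hit: served, not cached
cache.handle("/video/chunk-17", lambda k: b"...")   # second hit: now written to disk
```

In practice the request counters would be bounded and aged out, but the point stands: objects that never come back never get written to disk.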
08:20 Alejandro Proano: Thank you, Chaitanya. In the case of our cache fleet, we have a mix of hardware types depending on the PoP generation. Within each PoP, we also mix the hardware depending on the function of the host. For example, a cache host will have more storage than a DNS host. PoPs also experience different traffic profiles depending on the location and the season. For example, during a sports season, we expect our PoPs to serve more video traffic, while during the shopping season we expect them to serve more small objects. We noticed that the frequency at which PoPs are maintained or upgraded also differs across these workloads.
09:00 AP: Our caching application, as Chaitanya explained, depends on SSD storage. Since most of our content is served out of disk, the latency introduced by the hardware has a direct impact on how long it takes for the viewer to receive the content. Moreover, some workloads require more writes to the disk, which also has an impact on the life of the SSD. Our SSD fleet consists of multiple vendors and models, even within the same PoP generation. We noticed that even though these SSDs have similar specifications, their firmware implementations lead to very different profiles of performance and reliability.
09:43 AP: For example, consider three SSDs labeled A, B and C. Each of them belongs to a different vendor, and they are part of the same PoP generation. We collected data for several days from PoPs with similar traffic profiles and plotted the probability distribution for each of them. We look at two metrics: await, which measures the average time for an I/O operation to complete, and the media wearout indicator, MWI, which is used to estimate the percentage of life that an SSD has left.
In the case of latency, we notice that B shows the best performance, followed by C and A. In the case of the MWI, B shows the best value, followed by A and then C. Moreover, note that because we collected the data over the same period of time, the width of the probability distribution of the MWI shows us how fast the life of the SSD is degrading. In that respect, C is by far the worst of the three, and A is the best.
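As a rough sketch of how such metrics might be sampled on a Linux host (assuming the sysstat and smartmontools packages are installed; column names and SMART attribute labels vary by tool version and SSD vendor, so treat this as illustrative rather than a drop-in collector):

```python
import subprocess

def sample_await(device="sda"):
    """Read per-device read/write await (ms) from `iostat -dx`.

    Column layout differs across sysstat versions; here we assume the
    header line contains r_await and w_await columns.
    """
    out = subprocess.run(["iostat", "-dx", device],
                         capture_output=True, text=True).stdout
    lines = [l.split() for l in out.splitlines() if l.strip()]
    header = next(l for l in lines if "r_await" in l)
    row = next(l for l in lines if l[0] == device)
    return float(row[header.index("r_await")]), float(row[header.index("w_await")])

def sample_mwi(device="/dev/sda"):
    """Pull the Media_Wearout_Indicator SMART attribute, if the drive exposes it."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:
            return int(line.split()[3])  # normalized value: 100 = new, drops with wear
    return None

if __name__ == "__main__":
    print("await (r, w):", sample_await())
    print("MWI:", sample_mwi())
```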
11:00 AP: One of the factors that affects the performance and the life of an SSD is the implementation of garbage collection. As a refresher, remember that for an SSD to write a page, that page must be empty. SSDs write and read in pages, but when they need to delete something, it has to be done in blocks, a block being a set of multiple pages. So garbage collection moves the valid pages to a new block so that the stale pages in the old block can be deleted. In our simple example, we have six pages in a block of size nine; then a file is updated, so three of those pages become stale and three new pages are written. Garbage collection will move the six valid pages to a new block and then delete the old block. Given that garbage collection needs to do this extra writing to the SSD, the phenomenon is called write amplification.
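A tiny, purely illustrative simulation of that example, counting physical versus logical page writes to show where the write amplification factor comes from (the block size and page counts are just those from the example):

```python
# Simulate the block example: a 9-page block with 6 valid pages, a file update
# that invalidates 3 pages and writes 3 new ones, then garbage collection.
BLOCK_SIZE = 9

host_writes = 0   # pages the host actually asked to write (logical)
nand_writes = 0   # pages physically programmed into flash (physical)

# Initial state: one block with 6 valid pages and 3 free pages.
block = ["valid"] * 6 + ["free"] * 3

# File update: 3 pages become stale, 3 new pages are written into the block.
for i in range(3):
    block[i] = "stale"
for i in range(6, 9):
    block[i] = "valid"
host_writes += 3
nand_writes += 3

# Garbage collection: copy the 6 valid pages into a fresh block, erase the old one.
valid_pages = block.count("valid")
nand_writes += valid_pages   # GC rewrites data the host never asked to write again
block = ["valid"] * valid_pages + ["free"] * (BLOCK_SIZE - valid_pages)

waf = nand_writes / host_writes
print(f"host writes: {host_writes} pages, NAND writes: {nand_writes} pages, WAF = {waf:.1f}")
# -> host writes: 3 pages, NAND writes: 9 pages, WAF = 3.0
```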
12:13 AP: For the three SSDs that we mentioned before, we kept track of latency and the MWI to measure write amplification over time. The time series of the latency shows that the latency grows as time passes in the case of C, but in the case of A and B the latency stays within the same bounds. Similarly, for the MWI, we can see that the rate at which A and B decrease is much lower than that of C, which tells us that write amplification is impacting the wearing of the disk.
12:56 AP: One more problem that we need to highlight is that the SSD and the operating system might have different views of the data that is stored on the disk, and this can make garbage collection less efficient. For example, let's assume that three pages are deleted from the disk by the operating system. When the SSD later runs garbage collection, since the SSD might not know that these three pages are stale, they are moved to a new block in order for the old block to be deleted. This is very inefficient. So, the Trim command was introduced to address this problem. It enables the operating system to notify the SSD which blocks or pages are no longer valid. Trim is aimed at reducing write amplification and making garbage collection more efficient.
14:00 AP: There are two implementations of Trim in Linux. One is on-demand, using the fstrim command, which is basically a one-time run that you can pair with a cron job so that it runs periodically, every day, every hour, every week, at whatever frequency you want. You can also run Trim continuously, in which case, every time you delete a file, the operating system informs the disk which pages are being declared stale. This is enabled by the discard option when you mount your disk.
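As a minimal sketch of the periodic approach (the wrapper script and the cron schedule in the comment are hypothetical; the continuous alternative needs no script at all, just the discard mount option, e.g. in /etc/fstab):

```python
import subprocess
import sys

# fstrim's -a flag trims all mounted filesystems that support it; -v prints how
# many bytes were trimmed per mount point. Typically needs root privileges.
def run_periodic_trim():
    result = subprocess.run(["fstrim", "-av"], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"fstrim failed: {result.stderr.strip()}", file=sys.stderr)
    else:
        # e.g. "/cache: 412.3 GiB (442688372736 bytes) trimmed"
        print(result.stdout.strip())

if __name__ == "__main__":
    # Intended to be driven by a scheduler (e.g. a hypothetical cron entry such
    # as "0 9 * * * /usr/local/bin/periodic_trim.py") during a low-traffic window.
    run_periodic_trim()
```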
14:38 AP: To evaluate periodic and continuous Trim, we ran a couple of experiments. In the first one, we ran Trim daily, during the lowest-traffic period that our PoPs experience, for a whole week. We collected data for three days before the experiment; during the seven days of the experiment, excluding the time when fstrim was running; during only the time when fstrim was running; and then for three days after the experiment, without Trim enabled.
15:16 AP: The latency graph shows us that there was a slight improvement in the week that fstrim was running, and we noted that the improvement remained three days afterwards. However, while fstrim was actually running, the latencies were very high for SSDs A and C, but lower for SSD B. This made us curious, so we started looking at other metrics. We noted that the kilobytes per second written to the disk while fstrim was running were much higher, which tells us that the filesystem was writing intensively while the command was running. We can also note that the write workload was reduced in the three days after we ran fstrim, which is another benefit in terms of the write amplification factor. So, after the experiment, the benefits of fstrim remained.
16:24 AP: In the case of continuous Trim, we ran the experiment for a full day. We collected data the day before the experiment, the day of the experiment and the day after. In the latency graph, we see an improvement for A and B, but for C we actually see a regression. We also see that, the day after, the latency was almost the same as before the experiment, with a little bit of improvement. When we look at the amount of data written per second, we see something similar to before: a lot more is written while Trim is enabled, and less is written the day after Trim is disabled. So, as we can see, in our case periodic Trim is the best strategy. On top of that, we have systems that assist with performance and capacity management of our SSD fleet, which Chaitanya will explain next.
17:40 CS: So, given that there are multiple factors that influence viewer-observed performance when it comes to disks, we've evaluated Trim, as Alejandro mentioned. There's also overprovisioning, which gives garbage collection more headroom when looking for additional pages to write to. In terms of constantly monitoring and evaluating the latency of disks, we've built services that do this measurement in a statistical manner to detect any deviations from the norm. When such slow disks are detected, we check with the service whether it is OK to take the disk out of service; this is basically a mechanism to prevent a sudden loss of disk capacity. Some of these latency spikes can happen due to changes in workload, and if the same behavior is observed across a larger number of hosts, then it's probably related to the workload rather than anything to do with an individual host.
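As a loose illustration of that kind of statistical check (not CloudFront's actual detection service), here is a sketch that flags disks whose await sits far above the fleet norm and backs off when too many hosts look slow at once, since that points to a workload change rather than bad hardware. The threshold values and sample data are made up.

```python
import statistics

def find_slow_disks(await_by_disk, deviation_threshold=3.0, max_flagged_fraction=0.5):
    """Flag disks whose await is far above the fleet norm.

    await_by_disk: mapping of disk id -> recent average await in ms.
    Uses the median and median absolute deviation (MAD) so a few bad
    disks don't drag the baseline with them.
    """
    values = list(await_by_disk.values())
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values) or 1e-9

    flagged = [disk for disk, v in await_by_disk.items()
               if (v - median) / mad > deviation_threshold]

    # If a large share of the fleet looks "slow" at once, it's more likely a
    # workload shift than bad hardware, so don't pull anything out of service.
    if len(flagged) > max_flagged_fraction * len(await_by_disk):
        return []
    return flagged

# Hypothetical fleet snapshot (await in ms).
sample = {"pop1-cache3-sda": 2.1, "pop1-cache3-sdb": 2.4,
          "pop1-cache7-sda": 2.2, "pop1-cache9-sdc": 14.8}
print(find_slow_disks(sample))   # -> ['pop1-cache9-sdc']
```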
19:07 CS: On taking disks out of service, we then run corrective actions such as Trim and, in some cases, we might need to reformat or wipe the partitions clean. Not only do these metrics play a part in ongoing operations, but the MWI is also monitored to track the remaining life of disks. What we've observed is that with lower media wearout indicator values, there is a higher chance of errors occurring on disks, whether transient or nonrecoverable, and we also see the latency profiles increasing, which is the more common case. Overall, this feeds back into our capacity planning and forecasting system, where we prioritize PoPs with lower MWI, which benefit from a sooner refresh.
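A back-of-the-envelope sketch of how MWI trends could feed that kind of prioritization (purely illustrative; the sample data and refresh logic are hypothetical): fit a linear wear rate to MWI samples and order PoPs by the estimated days of life left.

```python
def days_of_life_left(mwi_samples):
    """Estimate days until MWI hits 0 from (day, mwi) samples via a simple linear fit.

    Assumes the samples span more than one day.
    """
    n = len(mwi_samples)
    mean_x = sum(d for d, _ in mwi_samples) / n
    mean_y = sum(m for _, m in mwi_samples) / n
    slope = (sum((d - mean_x) * (m - mean_y) for d, m in mwi_samples)
             / sum((d - mean_x) ** 2 for d, _ in mwi_samples))  # MWI points lost per day
    if slope >= 0:
        return float("inf")                                     # not wearing measurably
    return mwi_samples[-1][1] / -slope

# Hypothetical per-PoP MWI samples: (day index, median MWI across the PoP's SSDs).
pops = {
    "pop-a": [(0, 62), (30, 61), (60, 60)],
    "pop-b": [(0, 55), (30, 50), (60, 44)],
}
# PoPs wearing out fastest come first in the refresh queue.
refresh_order = sorted(pops, key=lambda p: days_of_life_left(pops[p]))
print(refresh_order)   # -> ['pop-b', 'pop-a']
```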
20:19 CS: So, that concludes our talk today on managing slow disks in CloudFront. To recap, the workload you place on disks heavily influences the performance profile those disks exhibit; the two systems we spoke about that shape this workload are the routing and caching systems of CloudFront. Our disks come from different vendors, and while they can look similar in terms of specifications, their observed performance can vary due to different firmware implementations. Running mitigations such as Trim, or taking disks out of service when higher I/O latencies are observed, helps manage the performance we can expect from the disks. All of this depends on frequently collecting and evaluating disk metrics, which is important both for meeting the performance bar that customers expect and for longer-term capacity planning. With that, on behalf of Alejandro and myself, I would like to thank you for listening to our talk. Thank you.