SSD Reliability and Debug at Scale
IT pros at Facebook discuss flash reliability at scale in the Facebook infrastructure and how they are solving the problems they have run into.
00:00 Vineet Parekh: Good morning, folks. Today I'm going to be talking about a topic which is really important for Facebook's infrastructure: flash reliability at scale in the Facebook infrastructure. Some of you -- just like me prior to joining Facebook -- must be wondering what I mean when I say "at scale." Let's take a look at it.
Here's a snapshot of the number of users using Facebook-designed applications. Globally, there are more than 3.1 billion people using Facebook, WhatsApp, Instagram or Messenger each month. To meet the demands of this growing user base, we need hundreds of thousands of servers in our data centers. Inside these servers exist hundreds of thousands of storage devices. With billions of people using our applications, we have a really high bar for reliability and efficiency from our infrastructure. We build monitoring systems to measure system events from different hardware components in our infrastructure. Flash is a key example of that. Having reliable flash devices is extremely important for our infrastructure.
With that, I'm here to share with you how we debug flash devices at scale, what data we collect from these flash devices, the different remediation methods we apply at scale, examples of failures we see in production, how we break down these failures, the debugging challenges we face in our environment, and examples of improvements Facebook is driving to help raise the bar for debug.
01:32 VP: Let's start with an overview of debug. This is a 15,000-foot view of how we debug hardware components at Facebook. Once any hardware device is deployed in the fleet, there are monitoring systems designed to look at various events happening on these devices. We will talk about some of those events using flash as an example in the next slide. Alerts are designed for various system events -- for example, a flash device is not enumerated anymore. We also look at anomalies happening in our fleet at scale using the large sample of data which we collect from these hardware devices.
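To make one of those alert conditions concrete -- a flash device that is not enumerated anymore -- here is a minimal sketch of the kind of host-side check a monitoring daemon might run. This is not Facebook's tooling; the expected-device inventory and the alert output are purely illustrative, and it assumes a Linux host that exposes NVMe controllers under /sys/class/nvme.

```python
import os

# Hypothetical inventory of NVMe controllers this host is expected to have.
EXPECTED_NVME_DEVICES = {"nvme0", "nvme1"}

def enumerated_nvme_devices(sysfs_root="/sys/class/nvme"):
    """Return the set of NVMe controllers the kernel currently exposes."""
    try:
        return set(os.listdir(sysfs_root))
    except FileNotFoundError:
        # No NVMe subsystem visible at all on this host.
        return set()

def check_enumeration():
    """Flag any expected flash device that is no longer enumerated."""
    missing = EXPECTED_NVME_DEVICES - enumerated_nvme_devices()
    for dev in sorted(missing):
        # A real monitoring system would raise an alert here; we just print.
        print(f"ALERT: expected flash device {dev} is not enumerated")
    return missing

if __name__ == "__main__":
    check_enumeration()
```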
The next step here is triaging the failure: breaking it down into what failure type it is and what the impact to the application is, based on which the appropriate remediation action is taken. The last step is reviewing whether the remediation has actually fixed the issue or not, and that then feeds back into monitoring, alerting and anomaly detection. To efficiently maintain reliable hardware devices in any hyperscale data center, it is important to collect health information from every hardware device in the fleet for monitoring. Let us look at what we collect from each flash drive in the Facebook fleet. At-scale log collection happens periodically on every flash drive in the fleet.
02:57 VP: Here is a snapshot of what we capture from individual storage devices in the Facebook fleet: Smart attributes, dmesg logs from the systems these drives are connected to, and the Smart Cloud Health log as defined by the OCP NVMe Cloud SSD 1.0 specification. Some of the example counters in there are NAND statistics, PCIe statistics, health statistics, etcetera. Occasionally, and least preferred, we do collect telemetry and drive event logs from individual drives. The advantage of periodic log collection is that it is vendor- and model-independent and can be efficiently captured by our automation. The nonperiodic logs often come with custom methods depending on the SSD vendor you are using, and hence are very manual in nature, which is not a scalable solution.
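To make the periodic collection concrete, here is a minimal sketch of pulling those logs on a single host with the open-source nvme-cli and dmesg tools. The device path is assumed, the Smart Cloud Health data is assumed to be readable at log page 0xC0, and a real collector would ship the output to a fleet-wide store rather than print a summary.

```python
import subprocess

DEVICE = "/dev/nvme0"  # assumed device path, for illustration only

def run(cmd):
    """Run a command and return its stdout, raising if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def collect_periodic_logs(device=DEVICE):
    """Collect the per-drive logs described above: Smart, extended Smart, dmesg."""
    logs = {}
    # Standard NVMe SMART/health information, as JSON.
    logs["smart"] = run(["nvme", "smart-log", device, "-o", "json"])
    # Extended counters; 0xC0 is assumed here as the OCP Cloud SSD log page.
    logs["smart_cloud_health"] = run(
        ["nvme", "get-log", device, "--log-id=0xc0", "--log-len=512"]
    )
    # Kernel ring buffer from the host the drive is connected to.
    logs["dmesg"] = run(["dmesg"])
    return logs

if __name__ == "__main__":
    for name, payload in collect_periodic_logs().items():
        print(f"{name}: {len(payload)} characters collected")
```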
Before we dive into different types of failures, let's take a look at our health monitoring setup and try to understand how that impacts flash reliability at scale. Our infrastructure does failure detection by using a daemon on the machine called a machine checker. This machine checker tool does checks on hardware devices like memory, CPU, NIC, SSDs, etcetera. It also checks system event logs, dmesg logs, etcetera. This machine checker tool is capable of correlating events and making a smart decision on the health of a system. This tool runs periodically and collects output.
04:31 VP: If the machine fails these checks, an alert is created by the alert manager, which runs off the machine. Once the alert is created, it is fed to a failure remediation system called Facebook Auto-Remediation (FBAR). This block picks up hardware failures, processes logged information, and executes custom remediations accordingly. Some examples of these custom remediations include power cycles, reboots, etcetera. If FBAR fails to fix the machine, we try to do a low-level software fix called Cyborg. Cyborg tries to go ahead and do software fixes such as a firmware update and reimaging. If even this does not fix the machine, we go ahead and create a manual repair ticket. When the repair ticket is created, a DC technician tries to go ahead and carry out various hardware and software fixes. One example could be to go ahead and just swap the SSD. In addition, we log repair actions to understand the effectiveness of the remediation which was applied.
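As a simplified illustration of that escalation chain -- automated remediation, then a low-level software fix, then a manual repair ticket -- here is a sketch in Python. The function names and their behavior are placeholders, not the actual FBAR or Cyborg interfaces; recording which step resolved the alert mirrors the idea of logging repair actions to judge remediation effectiveness.

```python
def auto_remediate(alert: dict) -> bool:
    """Stand-in for FBAR-style automated remediation (e.g. power cycles, reboots)."""
    # A real implementation would run failure-specific remediation playbooks here.
    return False

def software_fix(alert: dict) -> bool:
    """Stand-in for Cyborg-style low-level software fixes (firmware update, reimage)."""
    return False

def open_repair_ticket(alert: dict) -> bool:
    """Stand-in for cutting a manual repair ticket for a data center technician."""
    print(f"repair ticket opened for {alert['device']}: {alert['failure']}")
    return True

def remediate(alert: dict) -> str:
    """Escalate through the chain and record which step resolved the alert."""
    for step in (auto_remediate, software_fix, open_repair_ticket):
        if step(alert):
            return step.__name__  # logged to measure remediation effectiveness
    return "unresolved"

if __name__ == "__main__":
    print(remediate({"device": "nvme0", "failure": "not enumerated"}))
```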
Now that we have gone through an overview of the health monitoring and remediation setup, let's take a look at the failure types which get triggered by the alert manager from the data collected by machine checker. This slide breaks down a few examples of the many failures we see in the fleet into two broad categories: failures we see from an application-level point of view, and failures which are caught by monitoring individual flash devices in the fleet.
06:05 VP: Some examples of the failures we see from an application point of view include I/Os getting stalled, an application seeing a reduction in its read/write bandwidth, a data corruption issue happening due to a flash device, or capacity getting disabled suddenly. From a fleet monitoring point of view on an individual SSD, failures get triggered for various reasons -- examples being the endurance limit of an SSD reaching its threshold, Smart attributes exceeding the thresholds set by Facebook for those individual attributes, a high amount of read and write errors, or protocol errors and various types of interface errors happening due to a flash device. Now, let's try to dive deeper into a few examples of these failure types. We will try to understand why these issues happen in an SSD and how the application sees them manifest. In this slide, I'm going to walk you through a few examples of the things we look for in the dmesg log which are related to flash.
07:14 VP: The first example here is that of an I/O error. An I/O error usually happens due to a fundamental firmware architectural flaw within an SSD, an SSD controller hang or the SSD going into a read-only mode. The application impact of these I/O errors could include possible file system corruption and drive aborts. The second example here is an I/O timeout. Usually, this happens due to workload susceptibility and firmware design issues within an SSD. The application impact of these issues could be I/O stalls, a complete service disruption and a huge performance impact.
The last one here is a critical medium error. Typical reasons for this failure are high media errors happening in an SSD due to its NAND, the SSD going into a read-only mode, etcetera. The application impact of this could be file system errors, the disk being completely unusable and several tasks getting disrupted. In the next slide, we will walk through a few of the Smart examples which trigger failures. A failure here is triggered when predefined thresholds are exceeded for Smart attributes.
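Before moving on to the Smart examples, here is a rough sketch of how those three dmesg signatures -- I/O error, I/O timeout and critical medium error -- might be bucketed by an automated checker. The regular expressions are loose approximations of common Linux block-layer and NVMe driver messages, not an exhaustive or Facebook-specific rule set.

```python
import re

# Loose approximations of kernel log signatures for the three failure types above.
DMESG_PATTERNS = {
    "io_error": re.compile(r"\bI/O error\b"),
    "io_timeout": re.compile(r"nvme\w*: I/O \d+ QID \d+ timeout"),
    "critical_medium_error": re.compile(r"\bcritical medium error\b"),
}

def classify_dmesg(lines):
    """Return a mapping of failure type -> matching dmesg lines."""
    hits = {name: [] for name in DMESG_PATTERNS}
    for line in lines:
        for name, pattern in DMESG_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits

if __name__ == "__main__":
    sample = [
        "blk_update_request: I/O error, dev nvme0n1, sector 123456",
        "nvme nvme0: I/O 42 QID 3 timeout, aborting",
        "blk_update_request: critical medium error, dev nvme1n1, sector 789",
    ]
    for failure, lines in classify_dmesg(sample).items():
        print(failure, len(lines))
```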
08:34 VP: A few examples here are a critical warning set by an individual flash device; a high amount of media errors happening due to a combination of bad NAND blocks, uncorrectable reads, several end-to-end errors, etcetera; and the percentage of free blocks falling below a set threshold. So, let's try to recap a bit here. We went through what type of data we collect, and what data we monitor and trigger failures on. Now that we've gone through these two, let's see, based on the failure type, what types of remediation methods we apply in our infrastructure.
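Before turning to remediation, here is an illustration of the kind of threshold checks just described: the sketch parses nvme-cli's JSON Smart output and flags the attributes mentioned above. The threshold values are invented for the example, and the JSON field names can vary across nvme-cli versions.

```python
import json
import subprocess

# Example thresholds, invented for illustration -- not Facebook's actual values.
THRESHOLDS = {
    "media_errors": 100,  # total media/data-integrity errors
    "percent_used": 90,   # endurance estimate reported by the drive
    "avail_spare": 10,    # flag when available spare drops below this percent
}

def smart_log(device="/dev/nvme0"):
    """Fetch the NVMe SMART/health log as a dict via nvme-cli."""
    out = subprocess.run(
        ["nvme", "smart-log", device, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def evaluate(smart):
    """Return the list of triggered failure reasons for one drive."""
    failures = []
    if smart.get("critical_warning", 0):
        failures.append("critical warning set by the drive")
    if smart.get("media_errors", 0) > THRESHOLDS["media_errors"]:
        failures.append("high media error count")
    if smart.get("percent_used", 0) > THRESHOLDS["percent_used"]:
        failures.append("endurance limit approaching")
    if smart.get("avail_spare", 100) < THRESHOLDS["avail_spare"]:
        failures.append("percentage of free blocks below threshold")
    return failures

if __name__ == "__main__":
    print(evaluate(smart_log()))
```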
Before we go to the individual types of remediation, a general rule applies to all of the remediations below: any type of remediation in a production environment is very disruptive to the application and the services. Let's walk through a few examples of these remediations. Drive swaps: a complete drive swap to the same or a different model. Firmware upgrades: rolling out firmware fixes for a specific drive model. The cost of doing this includes qualification of each new firmware revision, possible disruption to the service, and the high risk of changing firmware in production.
09:56 VP: Next is overprovisioning: overprovisioning drives to create extra spare space to work around performance issues. It is high cost to do something like this in production because it involves changes to the application layer, modifying the expectations of the host related to the performance of these drives, and a loss of planned capacity. The last one is modification of the deallocate command. These cases happen due to latent firmware issues in the drive in handling deallocate or Trim commands under heavy stress workloads. The cost of this is a performance penalty, high risk due to changes to the NVMe driver in the kernel, and, where overprovisioning is forced, a loss of capacity. To recap here a bit, we went through our health monitoring system setup, examples of the failure types which we trigger on, and examples of the remediations which we apply for those failures.
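One way to picture how failure types and remediations tie together is a simple policy table, sketched below. The mappings are illustrative examples only, not Facebook's production policy, which weighs application impact and the costs just described before choosing an action.

```python
# Illustrative mapping of failure type to a first remediation attempt.
# The choices here are examples only; real policy depends on application impact.
REMEDIATION_POLICY = {
    "io_timeout": "firmware_upgrade",
    "critical_medium_error": "drive_swap",
    "high_media_errors": "drive_swap",
    "latency_outlier": "overprovision",
    "deallocate_stall": "modify_deallocate_handling",
}

def choose_remediation(failure_type: str) -> str:
    """Pick a first remediation for a failure, defaulting to a manual repair ticket."""
    return REMEDIATION_POLICY.get(failure_type, "manual_repair_ticket")

if __name__ == "__main__":
    print(choose_remediation("critical_medium_error"))  # -> drive_swap
```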
10:55 VP: In the next few slides, I would like to walk you through some of the debug challenges which we face at Facebook and some of the steps we are taking at Facebook to make a change in that space. Let's walk through it.
First, Smart is not at all smart. Smart attributes are not enough to debug all the different types of problems happening due to an SSD in a data center. We all know that they barely provide any insight into the internal condition of the drive.
Second, I often hear that telemetry is a solution to all the major problems and debugging related to flash. I'm going to be making a little bit of a controversial statement here. I believe telemetry is extremely overrated. The current model of telemetry log collection just does not work at scale. Hyperscalers are left in the dark while vendors debug and root-cause these logs. These logs are collected in an encrypted fashion and sent to the vendor, which usually leads to a long turnaround time for just the first level of debug.
12:06 VP: I believe we need more human-readable or structured logs for at-scale debug. The better we get at understanding the usage as well as the debugging of these flash devices at scale, the better we will be in terms of reliability as well as designing the next generation of flash devices for a hyperscale environment. I would like to walk you through an example here. This solution would not only help us get extremely efficient at debugging these types of issues, but would also help hyperscalers apply prompt remediation at the source of the failure.
12:43 VP: One of the top flash-related issues we see in our environment is SSD-induced high-latency events. These latency events are caused when an SSD takes a high amount of time to complete read, write or Trim commands. The latency spikes here often range from milliseconds to, in the worst case, seconds. In its current form, Smart or telemetry just does not help us in debugging these latency issues or in collecting information right at the time when the event happens. There is currently no standard method to help us debug these latency events happening in a hyperscale environment. Facebook, as part of the OCP Cloud SSD specification, will be adding a latency monitoring feature. This feature will help us enable monitoring of flash latency at scale. It would also help us in debugging any SSD-caused outliers happening in a hyperscale environment, and it will help us establish whether issues happening in a hyperscale environment are SSD-related or not.
13:58 VP: Here's an example of how visibility into latency events would allow us to isolate the drive from production traffic as early as when the event happens, capture logs at the time of failure, and monitor latency trends which are happening at scale. This is just one example of how improving the logging around, and understanding of, SSDs could make such a drastic improvement and change in the debugging of SSDs in a production environment.
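To make that concrete, here is a rough sketch of how a host-side monitor might react to per-command latency visibility by capturing logs and isolating the drive at the time of the event. The thresholds, the isolate_from_traffic and capture_logs helpers, and the way latency samples arrive are all hypothetical; the actual OCP feature exposes latency statistics through the drive itself rather than through host timing.

```python
import time

# Hypothetical latency thresholds (seconds) for flagging an SSD-induced outlier.
WARN_LATENCY = 0.05    # 50 ms
ISOLATE_LATENCY = 1.0  # 1 s -- the worst-case spikes described in the talk

def isolate_from_traffic(device):
    """Placeholder: drain production traffic away from the suspect drive."""
    print(f"{device}: isolated from production traffic")

def capture_logs(device):
    """Placeholder: snapshot Smart/telemetry at the moment of the event."""
    print(f"{device}: logs captured at time of failure")

def on_io_completion(device, latency_seconds, history):
    """React to one I/O completion latency sample."""
    history.append((time.time(), latency_seconds))  # keep a trend for fleet analysis
    if latency_seconds >= ISOLATE_LATENCY:
        capture_logs(device)
        isolate_from_traffic(device)
    elif latency_seconds >= WARN_LATENCY:
        print(f"{device}: latency outlier {latency_seconds * 1000:.1f} ms")

if __name__ == "__main__":
    trend = []
    for sample in (0.002, 0.08, 1.7):  # fabricated latencies for the example
        on_io_completion("nvme0n1", sample, trend)
```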
With that, I would like to conclude my presentation here. In summary, at-scale debugging of SSDs is really challenging due to the lack of human-readable and structured debug information available about the internals of SSDs. The storage industry needs to evolve and provide better debug tools. Facebook welcomes the industry's ideas on how to improve debug at scale. I strongly believe that together, we can make debugging SSDs in production better. Thank you. Stay safe.