
Guest Post

Monitoring the Health of NVMe SSDs

Many capabilities have been built into NVMe technology, including enhanced error reporting, logging, management, debug and telemetry. Here are insights on how you can manage the status and health of NVMe SSDs, such as notifying users when an SSD failure occurs.

Download the presentation: Monitoring the Health of NVMe SSDs

00:03 Speaker 1: Hey, guys, this is Jonmichael Hands. I'm a product manager and strategic planner at Intel for our data center NVMe SSDs. I also co-chair the NVM Express marketing workgroup and the SNIA SSD special interest group. Today, I'm going to talk about monitoring the health of NVMe SSDs. This is a follow-up to a recent NVM Express webcast on the topic, which stemmed from a blog post I wrote that, in turn, came from a lot of questions we got from reviewers: how do SSDs fail, how are NVMe SSDs different, and what tools does NVMe have to monitor the health and SMART information on drives in real time to help diagnose and monitor NVMe SSDs?

So, without further ado, I'm going to walk you through some of the interesting things I see in monitoring the health of NVMe SSDs, just from experience. I used to run a validation team at Intel, so, obviously, I know a lot of the failure mechanisms of SSDs in general.

01:04 S1: Two, I've helped refine and define a lot of the features in NVMe today, based on lots of customer feedback and on all the partners that are part of NVM Express contributing to the overall specification to accomplish these goals.

So, the first thing is really about how SSDs fail. This is very important for monitoring the health of SSDs, because knowing the relative prevalence of failure modes tells you where to look for the likely cause of a failure. Most people think SSD failures come from endurance or hardware failures, but, actually, those are a very, very small percentage of actual failures. For one, endurance you can monitor with SMART -- I'll talk a little more about it when I show the SMART logs -- but you have something called percentage used in NVMe, which is basically the gas gauge that shows what percentage of the endurance you've used, and you also have available spare and available spare threshold; that's all part of the standard NVMe SMART log where you can monitor endurance.

02:04 S1: You can also project endurance in real time -- model what the endurance of a drive is going to look like over the five-year life of the drive based on what the workload looks like -- and it's very easy to model. I wrote the model at intel.com/endurance based on some Python scripts that basically do this: monitoring the write amplification and projecting the endurance. But, in actual use of enterprise drives, with 1 and 3 drive-writes-per-day class drives being the mainstream drives today, endurance failures are not very common, just because, one, they're understood very well, and two, most customers are not using that much endurance -- and we'll talk a little about that when I look at some case studies.
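
To make the idea concrete, here is a minimal shell sketch, not the intel.com/endurance model itself, that reads the endurance gas gauge and extrapolates it; it assumes nvme-cli is installed, the device is /dev/nvme0 and you know how many days the drive has been in service (both are placeholders to adjust):

    # Read the endurance gas gauge (percentage_used) from the SMART log
    used=$(sudo nvme smart-log /dev/nvme0 | awk -F: '/percentage_used/ {gsub(/[ \t%]/, "", $2); print $2}')
    # Rough projection: if the drive burned $used percent in this many days,
    # how many years until it reaches 100%? (Assumes a steady workload;
    # if percentage_used is still 0, the projection is effectively "decades".)
    days_in_service=365
    echo "scale=1; ($days_in_service / $used) * 100 / 365" | bc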

The other thing is hardware failure. Enterprise and data center NVMe SSDs take a long time to get to market -- generally a year-plus lifecycle -- and in that time there are quality and reliability tests, validation, hardware screening, SSD controller power-on; all of this stuff happens.

03:12 S1: And a lot of the hardware issues get weeded out. I'm not going to say they can't exist, but things like capacitor failures or resistor failures or ASIC failures just aren't that common. Now, media failures: as you look at increasing prevalence, most enterprise drives are designed to actually withstand media failures. Most enterprise drives today have something like an onboard XOR or RAID engine -- some vendors call it fail-in-place -- but basically enterprise SSDs are able to withstand failures, not only specific block or page failures, but also entire die failures. So, typically, this is not a very common failure mode, although sometimes a bad lot of NAND or bad firmware causes more media failures than normal. It's not to say that it can't happen, but it is rarer than the No. 1 cause of SSD failures, which is firmware issues.

04:12 S1: And that's just because NVMe and all SSD firmware is extremely complex -- moving data around, doing garbage collection, maintaining the logical-to-physical mapping of the SSD in firmware. SSD firmware has become really, really complex, and the majority of the failures we see -- over 50% -- are actually just some sort of firmware issue. In this case, it's interesting because most of the time there's nothing actually wrong with the drive, and upon a graceful reset -- and we'll talk a little bit about some of the features in NVMe coming to help with this -- you can kind of bring the drive back to life. But we're going to walk through how to monitor the health of SSDs to figure out, OK, if you have a drive failure or suspected failure, how do you figure out where it is? And we'll talk about some stuff like over-temperature, incompatibility and performance; if a drive is not enumerating or not coming up on the PCIe bus, what do you do? How do you figure that out?

05:13 S1: So, there are a couple of case studies that I talked through in the webcast. If you were to start from scratch and wanted to learn about SSD reliability -- which, as I mentioned, is really important if you want to understand how to monitor the health -- these are the papers I'd suggest reading. The first one is "Reliability of Solid-State Drives Based on NAND Flash Memory." That was written by a bunch of my colleagues at Intel, many of whom actually pioneered the reliability methods, endurance methods and JEDEC tests for how you monitor and demonstrate the reliability, endurance and quality of SSDs. The other paper is from FAST last year, from NetApp and the University of Toronto: "A Study of SSD Reliability in Large Scale Enterprise Deployments." That one uses -- I can't remember exactly -- something like millions of drives that NetApp had out in the field from their customers, and they have real data back from the field, which includes SMART data.

06:10 S1: Now, this study was not done with NVMe SSDs, because it was basically using drives from the last six years, and most of the drives in the study were SAS, but there's some really interesting stuff in the study, one piece being the correlation of drive size and failure rates. And, again, that goes back to the firmware issues, about firmware being more complex. It goes into the difference between TLC and MLC and the different drive types deployed today, and whether there's any correlation between those and failures. But the most interesting thing I saw out of this study was . . .

06:42 S1: One is that SSDs don't fail very often. In this study, the average failure rate was way below the rated 2 million-hour MTBF, which corresponds to a 0.44% annualized failure rate. The other thing was the rated life, percentage used. Most customers are only using a small fraction, like 1%-5%, of the total SSD endurance, which is why I mentioned that even though endurance is often assumed to be the major issue with SSDs, endurance failure is just not very common.

And then the other one: one of my old colleagues at Intel, Brennan Watt, who is now an SSD architect at Microsoft in the Azure storage group, presented at FMS last year on this topic -- basically on how SSDs fail and what we can start to do with predictive analytics and machine learning to prevent and monitor these things in real time.

07:30 S1: And so, I'm going to talk a little bit about what Microsoft has done to help in a few slides, but NVMe has a ton of features for this basic health monitoring. The most important one is the SMART log page. We're going to talk about that, but that's kind of your main health dashboard for the drive. If somebody suspects there's a drive failure -- again, in Linux or Windows or anything -- all kinds of bad things can happen at the application, at the file system, anywhere in the stack. Basically, if you want to figure out whether it's a drive issue, the SMART log is the place to go.

08:08 S1: The most important thing it has is something called the critical warning field. If you go into the NVMe spec -- and, again, use the latest one -- there's a Get Log Page command, and under it there's SMART / Health Information. You can just click on it and it will get you to the table; this is where you need to go to learn what the SMART information tells you. Now, if you go into Linux, and I'll show you an example, the commands will tell you what all this stuff means and will actually pass through the information. But the most important thing in the SMART log is this critical warning field: if it's set to anything non-zero, there's a problem with the drive, and that's the easiest thing to check.

08:54 S1: If you do the SMART log and this critical warning is not zero, then there's something wrong with the drive, and the sub-bits can tell you the type of failure, whether it's temperature or a media error or whatever. In the rest of the SMART log, there's stuff like composite temperature, which is the temperature -- most applications actually convert it to degrees C instead of Kelvin. Percentage used: that's the endurance I mentioned. Available spare and available spare threshold: those are for monitoring endurance as well. Data units read and written: basically, you can see how much data has been run through the drive. Through all these SMART attributes, you can really tell what's going on with the drive, and so on.
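
As a quick check, here is a minimal shell sketch (assuming nvme-cli and a device at /dev/nvme0) that just tests whether the critical warning field is non-zero; the exact output formatting varies by nvme-cli version, so the awk pattern may need adjusting:

    # Extract the critical_warning field from the SMART log
    cw=$(sudo nvme smart-log /dev/nvme0 | awk '/critical_warning/ {print $NF}')
    if [ "$cw" != "0" ]; then
        echo "Critical warning is $cw -- check spare capacity, temperature, media and read-only sub-bits"
    fi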

09:34 S1: Yeah, I only have 30 minutes today. So, just go download the NVMe spec; it's available for free on nvmexpress.org without having to sign up. That's the most important thing. Now, another log page is the error log page, which, as the name suggests, is for monitoring errors. When an error happens, it gets logged into the error log page with the queue and other information about where the error happened, and that can help people debug what's happening on the NVMe SSD. The other thing is called the persistent event log. This is something new in NVMe 1.4, but it's basically a human-readable, timestamped log of events. The way I describe this to people is that it's like the black box -- and I'm going to go over all these features in more detail in a few slides, so don't worry.

10:20 S1: Basically, the persistent event log records when things happen. So, when someone operating a system wants to go back and say, "Hey, the drive failed. What happened leading up to that event?" they can go figure it out. "Oh, did I update firmware? Did I format the drive first? Did I have a power failure? What happened to the drive?" And then, the most important thing: I mentioned firmware issues are the No. 1 cause of SSD failures. Something called telemetry allows device manufacturers to dump a log when there's an error, and then somebody in the field can dump this telemetry log and use it. There's some telemetry data that can be used for health monitoring, but the main purpose of the telemetry log is to collect internal logs on a failure and give that back to the SSD vendor, and then the SSD vendor can root-cause the problems, update the firmware and fix the bugs.

11:13 S1: And so, again, I stated this in the webcast: if you want to monitor the health of NVMe SSDs and you want to prevent failures, the No. 1 thing you can do is update your firmware, because firmware is the No. 1 cause of failures, and most of the vendors are spending a lot of time and effort on validation, quality and development in fixing the firmware and making it better. Part of how NVMe supports the operating system here is asynchronous event support. Basically, when things happen, the drive can notify the host of events, and this can be used in different operating systems to trigger things like dmesg entries in Linux or the event log in Windows.

11:54 S1: But an asynchronous event can happen to let operators know when things go wrong. The device self-test is one of the things we'll talk a little bit about, too, which is basically an offline diagnostic test. Many of the use cases are things like factory integration or testing. When somebody takes a drive out of the box and puts it into a new system, or if they're repurposing a drive in another system, they'd like to run a short test to make sure the drive is functioning correctly. The device self-test can do that: it runs an offline diagnostic test. It runs a SMART check, a media check, a DRAM check, a capacitor check and all that stuff. And then, of course, there's end-to-end data protection. That isn't about monitoring, it's more about protecting data, so it's kind of outside the scope of today's discussion, but NVMe has lots of stuff to prevent errors, not just monitor them.

12:47 S1: So, I mentioned the log pages. Boy, I wish I had time to go through all of these today as part of this presentation, but I don't. You can download the NVMe specs to look through them all. The ones I'm going to spend a lot of time on are the error log and the SMART log -- yeah, those are the two. The persistent event log I'll talk a little bit about. There's a lot of other useful stuff, like the LBA status information for rebuild assist and the sanitize log for after you sanitize, but for general health monitoring, the SMART log is basically the No. 1 place to start.

So, here is a picture of the output of me running nvme-cli on a desktop. The command is very easy; it's just sudo nvme smart-log /dev/nvme0, or whatever your device is. You can add the namespace as well if you want to run it per namespace, but, as you can see, it dumps the SMART log out and you have all the information, which matches one-to-one with the SMART definition -- the Get Log Page SMART / Health Information section -- here in the NVMe spec.

13:53 S1: It dumps it out, and then nvme-cli, or other things like smartctl in Linux or other applications, can actually parse this information. Going through it: with this critical warning field, you can see here that zero is good -- that means the drive is functioning correctly. Available spare is 99%, the available spare threshold is where it triggers a warning, and then percentage used. This drive I just pulled from a random drive in validation at Intel, so it's been beat up quite a bit, right? You can see 15% used, and it's probably been through a dozen firmware versions, so you can see it has logged a bunch of media errors and unsafe shutdowns.

14:27 S1: So, even though this drive is functioning correctly, all the media errors do get logged here if they happen, and you'd be able to find that. If they're not cleared -- they are persistent -- then you'd be able to find them in the error log page. If you just do sudo nvme error-log /dev/nvme0, or whatever your NVMe device is, it'll dump the entire error log. Generally, if there are errors, you'll also have a critical warning or something in the SMART log that suggests there are errors on the drive, but if you do have an error, this is where to look to find out where that error happened.
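
For example, a minimal sketch (assuming nvme-cli and /dev/nvme0; the entry count is just an example) of dumping recent error log entries:

    # Dump the 16 most recent error log entries
    sudo nvme error-log /dev/nvme0 --log-entries=16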

15:04 S1: OK, and then there are other things in the SMART log that are really interesting -- I'm going to walk through some of the temperature stuff. You have warning composite temperature time, which can tell you if the drive has exceeded its thermal limit and started to thermal throttle, critical composite temperature time, and the thermal management transition times. So, you have all these different thermal attributes to figure out whether you had some issue where the drive overheated. Many people don't know this, but if you have an NVMe SSD that overheats, it doesn't instantly cool down -- it can take a long time to cool down, especially if there's not enough airflow. So, if some workload ramps a drive up to a critical temperature, you'll find out right away here in the SMART log. Which brings us to temperature monitoring: one of the things you want to do on an NVMe SSD is monitor the temperature, and there are some hooks in newer Linux kernels to make this a lot easier to pull into other applications for monitoring, because a lot of the time people just want to monitor the temperature of the drive.

16:02 S1: And this is off a drive that I knew had actually overheated, so I grabbed the SMART log right after that happened. Basically, when the drive was overheating -- when it was over the critical temperature -- the SMART log critical warning changed to 0x2. And if you look back at the critical warning definition, that bit is set to one if the temperature is greater than or equal to an over-temperature threshold, or less than or equal to an under-temperature threshold.

16:30 S1: So, the drive was functioning correctly, it went over the critical temperature, and then the critical warning bit went off and told us that it had. And it logged the time in minutes over here in warning composite temperature time, so you know how long it was over temperature. So, again, there's lots of stuff in the SMART log, but if you're doing something like a large-block sequential write sustained for many hours, that's going to heat up the drive, and you'll know if you have a weird failure. In this case, in this validation test, when the drive overheated, it de-enumerated. It did what it was supposed to do, which was thermally shut down. Most drives actually thermally shut down when they reach their critical temperature. In Intel's case, for this drive, that was 70 degrees C composite temperature. At 70 C, it basically shut down, and it was doing what it was supposed to do: preventing the components from overheating and being damaged. So, again, if you see something like that, the temperature attributes in SMART are where to go.
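
If you just want to keep an eye on temperature from a script, here is a minimal sketch (assuming nvme-cli and /dev/nvme0; the 70 C threshold and the awk field are examples that may need adjusting for your drive and nvme-cli version):

    # Poll composite temperature once a minute and warn when it crosses the example threshold
    while true; do
        temp=$(sudo nvme smart-log /dev/nvme0 | awk '/^temperature/ {print $3}')
        [ "$temp" -ge 70 ] && echo "$(date): drive at ${temp} C -- check airflow and throttling counters"
        sleep 60
    done
    # Newer kernels also expose the same reading through the hwmon interface under /sys/class/hwmon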

17:36 S1: The other thing is, NVMe has an entire spec for management, and it's called the NVMe Management Interface (NVMe-MI) specification. NVMe-MI is a spec that allows for both in-band and out-of-band management. Out-of-band management means you are independent of the host operating system -- agnostic to the operating system. Today, that out-of-band management can be done through PCIe vendor-defined messages or, most commonly, over SMBus/I2C down to the drive. The benefit of that, obviously, is being able to get data out-of-band, so a system management console can look at the drive and report data -- SMART data -- from the drive without having to go through the operating system. Now, the in-band methods I showed you before are great because you can get all the information you could ever want out of a drive.

18:28 S1: But the benefits of out-of-band are very clear. My dear friend Austin Bolen from Dell, who is one of the workgroup chairs for the NVMe Management Interface, provided me some of these screenshots from Dell iDRAC. This is iDRAC9 Enterprise, something you get with a Dell server -- and I know HPE has something similar; I think they call it iLO -- but basically, all these different OEMs have management consoles that work out-of-band, and they use the capabilities of NVMe-MI to monitor SSDs out-of-band. You can run this through a web console independent of any operating system; it doesn't matter. The drives can report information to this web console for streaming data, and you can go read all about iDRAC. But I'm showing you some of the information you can get on SSDs -- here's an example of what it would look like if a drive failed. You could monitor this out-of-band from a management console, without even having to go to the OS: you can click on the drive and it will say, "Oh, it's failed," or you can figure out if something's going on with the PCIe negotiation if it's not linking up at the desired speed.

19:39 S1: This is one of Intel's new Gen 4 drives, and you can see it's linking up at Gen 4 x4, which is good. Again, this is what NVMe-MI is used for, and here's just an example of what it would look like in a management console. This is basically why NVMe-MI exists: OEMs like Dell and others want to be able to support this type of drive debugging without being dependent on operating system commands. Because, again, operating systems are great and super powerful, but every version is a little bit different, with different capabilities in Linux and Windows, and OEMs want to be able to do this from a common management console. So, if you want to do health monitoring of NVMe SSDs outside of the operating system, or update firmware that way, going through your OEM's management console built on NVMe-MI is a great choice. OK, so, yeah, we don't have much time -- 20 minutes in.

20:44 S1: OK, the telemetry log page. The telemetry log page, as I mentioned, is basically for when an SSD fails: you can read the telemetry log page. If you're implementing this, you probably want to know that there are host-initiated and controller-initiated telemetry logs and other details in the spec. The most important use of telemetry is that when a device fails, somebody can run the telemetry command and dump the log -- there are different data areas depending on how big a log you want -- and if a drive fails, they can send this back to the SSD vendor, the SSD developer or the OEM, who can start to root-cause and debug the issue.

The telemetry log is generally encrypted, though some of the content can be human-readable -- it depends; it's up to the vendor. The telemetry command in NVMe basically just gets you the dumped data payload; then it's up to whoever dumps it to send it to the SSD vendor to debug.
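
As an example, a minimal sketch (assuming nvme-cli and /dev/nvme0) of dumping a host-initiated telemetry log into a file you can hand back to the SSD vendor:

    # Save the host-initiated telemetry payload as a binary blob for the vendor to decode
    sudo nvme telemetry-log /dev/nvme0 --output-file=telemetry_host.bin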

21:45 S1: It's super important for NVMe. And the other one we talked about briefly was the device self-test. I mentioned that if you're repurposing a drive, or taking a new drive into a new system or something, and you want to make sure everything's operating correctly, you can run the device self-test. The short test is supposed to take two minutes or less; there's also an extended test, and the device self-test log tells you about how long it's supposed to take -- it depends on the drive vendor. Vendors can implement whatever specific device self-test they want, or they can follow the example in the NVMe specification, which goes through and checks the DRAM, checks SMART, checks the volatile memory backup, checks the metadata, NVM integrity, data integrity, does a media check, drive life, the endurance -- all that stuff.

22:33 S1: And so, yeah, if you want a simple check -- this is an NVMe 1.3 feature, so some of the new Gen 4 drives and other new NVMe 1.3 drives that are out will support this command. If you have an older NVMe drive, it probably doesn't support it, but it's one of these newer features of NVMe. It's a nice quality-of-life feature: you can do a short, easy test, it gives a nice log when it's finished, and you can identify whether the drive is working properly.
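
Here is a minimal sketch (assuming nvme-cli, /dev/nvme0 and a drive that implements device self-test) of running the short test and reading back the result:

    # Kick off the short device self-test (self-test code 1)
    sudo nvme device-self-test /dev/nvme0 --self-test-code=1
    # The short test is supposed to complete within two minutes
    sleep 120
    # Read the self-test log to see the result and any failing segment
    sudo nvme self-test-log /dev/nvme0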

The other thing is the persistent event log. As I mentioned, it's kind of like the black-box recorder for the things that have happened on the drive. Not to be too confusing, but it's in NVMe 1.4, and there are some variations and future work going on -- proposals to enhance it further.

23:23 S1: I'm not going to go through all of them, but, basically, remember that this just records things that happen on the drive and when they happened, and somebody can dump this persistent event log and get a human-readable log with timestamps of everything that's happened on the drive.
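
If you want to poke at it, here is a rough sketch (assuming nvme-cli, /dev/nvme0 and an NVMe 1.4 drive); note that fully reading this log may require establishing a read context per the spec, and newer nvme-cli builds may have a dedicated subcommand for it:

    # The persistent event log is log identifier 0x0D in NVMe 1.4; raw dump of the first 4 KiB
    sudo nvme get-log /dev/nvme0 --log-id=0x0d --log-len=4096 --raw-binary > persistent_events.bin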

So, I get a ton of questions. From my validation background, I help out people doing general development, and I get a lot of comments on this. Most of the time, NVMe SSDs today are running on top of PCI Express, and so when you have an error, boy, it's tricky, right? Again, I mentioned it could be the file system, it could be your application and, in some cases, it could be PCI Express, so knowing some common debuggability tools is helpful for identifying what's happening.

In Linux, dmesg is definitely where you want to go for information about kernel errors or driver messages. If you have PCI bus failures, retries or Advanced Error Reporting (AER) failures, they'll show up in dmesg. If the NVMe driver can't create queues on an NVMe controller, or for some reason a read is failing or something, those show up in dmesg too.
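
A minimal sketch of pulling the relevant kernel messages out of dmesg (the grep patterns are just examples to adapt):

    # Show NVMe-, AER- and PCIe-bus-error-related kernel messages with readable timestamps
    sudo dmesg -T | grep -iE 'nvme|aer|pcie bus error'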

24:35 S1: The other thing you want to do is lspci, which will give you details of the PCI topology. Again, sometimes you just need to know what's going on if a device is, for instance, not performing as well as it should. Maybe it's not linking up at PCIe Gen 3 or Gen 4, or whatever the drive supports; maybe it's not linking up at four lanes or eight lanes, whatever the drive supports. lspci will be able to tell you that information.

And then there's lsblk -- there are millions of commands in Linux; I'll just share a few -- and things like df and du for disk usage, so you can check whether drives are full and find out if weird things are happening. But, yeah, anyway, there's a lot more than I can talk about today.
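
Here is a minimal sketch of those checks (the PCI address 3b:00.0 is just a placeholder for whatever lspci shows for your drive):

    # Check the negotiated PCIe link speed and width for the drive (look at the LnkSta line)
    sudo lspci -vvv -s 3b:00.0 | grep -i 'lnksta'
    # List NVMe block devices with size and model, then check filesystem usage
    lsblk -o NAME,SIZE,MODEL | grep -i nvme
    df -h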

But here are some examples from dmesg. Basically, if you get a message like this referencing nvme0, the PCI function can point you back to the PCIe tree in lspci so you can figure out which device it's coming from. If you see timeouts -- by the way, I borrowed this slide from an old colleague I keep in touch with, Keith Busch, who now works at WD; he's one of the original developers of the NVMe driver and nvme-cli in Linux, and he maintains the NVMe driver. He likes to say that something like 999 times out of 1,000, it's the controller that's messed up, not the OS. So, if you see some weird timeout issues, that's usually what it is. And then failed initialization -- you can see, obviously, whoever wrote "initialisation" was from the UK or whatever.

26:09 S1: But, basically, the controller didn't acknowledge the enable sequence. So, if you have some weird failed initialization where you see the PCIe device but you don't see the NVMe device in nvme list, it could be something weird where the initialization sequence didn't complete. Shutdowns: again, this is really important on big drives, where it can take a long time to shut down. If drives have to dump their metadata or have large power-loss capacitors, you might see something like this where the device shutdown is incomplete and aborted, and then back in the SMART attributes you'll see the counter called unsafe shutdowns increment. So, if that happens, you'll see an unsafe shutdown logged.

Here, actually, while making this presentation, I had a weird adapter -- I use one of these M.2-to-U.2 adapters to plug a U.2 drive into a desktop -- and the drive was being flaky, and I figured out, "Oh, this is funny. There are some PCI bus errors that are being corrected." I grabbed this screenshot very recently, a couple of days ago when I saw it, but you can see this is just stuff in dmesg where you can find out if things are going wrong . . .

27:20 S1: My favorite tools -- sorry, these are basically some screenshots from my desktop. I like to use this weird black screen with green hacker text for my terminal on my Mac, but you can see these are basically two commands, iostat and dstat. These are my favorites for monitoring performance; there are a million different applications you can use to monitor performance. Basically, iostat, if you give it a specific device like nvme0n1, will monitor what's being read and written, and if you add -h on the end, it'll give you human-readable output in megabytes, along with TPS for the IOPS; dstat gives you the read and write aggregate for the entire system. If you see iowait start going up over here, you might be hitting an I/O bottleneck on the drive, or if you see a ton of paging, you could be looking at memory issues.
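
For reference, a minimal sketch of those two commands (nvme0n1 is just an example device name):

    # Extended per-device statistics in megabytes, refreshed every second
    iostat -xm 1 /dev/nvme0n1
    # Whole-system disk read/write aggregate
    dstat -d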

So, anyway, these are super important to understand if you're diagnosing a performance problem, which a lot of problems are. There's more information in the slides that I don't have time to go through, but it's there if you want to go very, very deep on NVMe.

28:36 S1: The kernel does have tracing available, so you can enable kernel tracing on NVMe commands. These are the instructions for how to enable kernel tracing for NVMe. And then, if you want to see what that looks like, this is just a dd command writing 1 megabyte in one block, and this is what the NVMe trace of that command looks like. You can see the queue ID, the namespace ID, the command ID, metadata, LBA, length and all that stuff.
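
Here is a minimal sketch of that flow (assuming root, debugfs mounted at /sys/kernel/debug and a kernel that includes the nvme tracepoints); note the dd overwrites the first megabyte of the target device, so only do this on a scratch drive:

    # Enable the NVMe driver tracepoints
    echo 1 | sudo tee /sys/kernel/debug/tracing/events/nvme/enable
    # Issue a single 1 MB direct write so one command shows up in the trace (destructive!)
    sudo dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=1 oflag=direct
    # Stream the trace output (Ctrl-C to stop)
    sudo cat /sys/kernel/debug/tracing/trace_pipe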

So, if you wanted to really go deep on debugging, you can use this -- probably not required for most people who aren't developers. The other tool that I really like, which Intel has contributed to, is called the standalone Linux IO tracer, and here's the link to it on GitHub in the Open CAS project.

29:24 S1: Basically, you install this tracer, which enables a driver, and then you run whatever workload you want; you use the tracer to collect the system traces, and it has a report summary that parses the data to CSV or JSON. Then you can do stuff like LBA distribution, so if you wanted to see locality for caching, you could do that. You can look at latency histograms of all the commands on different block devices, and see what the average queue depth over a real workload is. This is super helpful. One of the things we were using this for was debugging some stuff with a MySQL database. We weren't seeing the performance we expected, but we were seeing very high disk utilization. We ran this, and all the writes were going to the disk, but all the reads were coming from DRAM -- you could see in the trace that no reads were coming from the drive. So, this is a very useful tool. It's open source; you can just download and install it if you want to run a very specific workload and learn more about it.

30:32 S1: The last thing I'm going to talk about -- I guess I'm almost out of time for my 30 minutes -- is the OCP Cloud NVMe SSD spec. Microsoft and Facebook have done a wonderful thing: they got together and open sourced their SSD specifications into this thing called the OCP Cloud SSD spec, and it has lots of stuff like PCI Express features, NVMe features and SMART log requirements. The one I'm going to talk about today is their custom SMART log, the C0 log page, called the SMART cloud attributes log page. And this is amazing because, remember, NVMe was designed to be NVM-agnostic -- designed to support 3D NAND or Optane or storage class memory, whatever -- so a lot of the NVMe spec was written agnostic to the NVM. But, at the end of the day, most NVMe SSDs use NAND, and you want to learn about that NAND, especially when you're trying to debug issues.

What this SMART cloud attributes log page does is give you an open, documented log page in the vendor-specific space, where you can run a command and dump a page with detailed information about recoveries, bad user and system blocks, physical media units written and more. So, if you want to calculate write amplification, you have host and NAND writes; and there are error counts, such as end-to-end correction counts, if you want to see physical device issues -- all kinds of really awesome stuff is in here.
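
As an example, a minimal sketch (assuming nvme-cli and a drive that implements the OCP cloud SSD spec; the 512-byte length is an assumption to check against the spec) of grabbing the raw C0 page for offline parsing:

    # Raw dump of the OCP SMART cloud attributes log page (log ID 0xC0)
    sudo nvme get-log /dev/nvme0 --log-id=0xc0 --log-len=512 --raw-binary > ocp_smart_c0.bin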

32:00 S1: And, again, by itself this isn't that useful, but if you have tools that can parse this data and use it across a collection of drives at scale -- like how the cloud vendors are going to use it -- you can do predictive analytics and health monitoring. Again, if you see some PCIe correctable errors, that might mean, hey, there's something going on with the system topology that you need to go fix, because at the application level you might be seeing higher latencies or something. You don't know, but having this extra detailed information is going to be super helpful for this kind of predictive analytics and health monitoring.

And I cannot stress enough how much I commend my dear friends at Facebook and Microsoft for open sourcing the specification, because we've been doing some of this stuff in custom firmware for years, and I can't wait to get it out to other customers. The other thing they've done in the OCP Cloud NVMe SSD specification is the error recovery log page. Again, you can just download this spec -- it's open. The error recovery log page: I mentioned that most SSD failures are actually just firmware. Well, boy, it sure makes sense to be able to recover gracefully from firmware issues, or, when a drive fails, for the drive to be able to identify and tell the host what's wrong with it. That's exactly what this error recovery log page does: you have a panic reset wait time, panic reset action, device recovery action and panic ID.

33:16 S1: So, for instance, if a drive just has a firmware issue and it fails but can't verify the data integrity, instead of sending a technician after that drive to replace it, you could just run a command that formats the drive and starts over. Yes, you lose the data, but in a cloud application where you have the data backed up or stored on another server, another rack or another availability zone, it doesn't matter -- you'd rather just rebuild the drive and start from scratch instead of sending a technician out there to replace it. Boy, this is an amazing feature; I can't wait until this is implemented across the board on SSDs, because I think it's a very powerful tool.
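
Similarly, a minimal sketch (same assumptions as the C0 example above, with the 512-byte length again an assumption) of grabbing the raw error recovery page:

    # Raw dump of the OCP error recovery log page (log ID 0xC1)
    sudo nvme get-log /dev/nvme0 --log-id=0xc1 --log-len=512 --raw-binary > ocp_recovery_c1.bin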

So, that's it. I'm actually over my 30 minutes, but, boy, there's a whole lot more I could have gone into with all these different methods for Linux and performance -- I can't even scratch the surface. But, hopefully, some of the other very educated, very distinguished speakers in our track can help guide you through that. So, again, thanks for everything. Have a good Flash Memory Summit.
