Examining the future of the RAID volume

Essential Guide


Article
Top tips to prevent disk failure in your RAID volume

Jon Toigo shares several ways storage admins can proactively prevent the failure of a RAID volume. Examine his set of best practices regarding setup, monitoring and maintenance of RAID volumes.

Editor's note

Since 1987, the data storage industry has leveraged a set of strategies first developed by scientists at the University of California, Berkeley, known as redundant arrays of inexpensive disks, or RAID. These strategies are used to construct large volumes from multiple discrete disk drives and to protect the data stored on the resulting volume from loss in the event of hardware failures. If you're involved with data storage today, you're very likely using a RAID volume in at least part of your infrastructure.

There were five types of RAID (six, if you count no RAID at all) specified in the original RAID white paper. Various storage array vendors have added a few additional schemes over the years. Of these, RAID 1 (mirroring the data on one disk to another disk) and RAID 5 (striping with distributed parity bits) have been the most popular. The former provides a full copy of data and a simple recovery mechanism (failover) while the latter, RAID 5, provides protection more economically while facilitating faster read access for databases and other applications.
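To make the parity idea behind RAID 5 concrete, here is a minimal Python sketch of how one lost block in a stripe can be recovered by XOR-ing the surviving blocks with the parity block. The block contents and the `xor_blocks` helper are illustrative assumptions, not part of any RAID product; a real controller also rotates parity across the drives and works on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks from a single stripe (illustrative values only).
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the drive that held the second block, then rebuild that
# block from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, the same routine serves both to compute parity and to reconstruct a missing block, which is why a single-parity scheme can tolerate exactly one lost drive per stripe.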

RAID services are delivered either in software, with the most popular operating systems (OSes) offering a basic RAID capability for optional use, or in hardware, embedded on the controller of an array. The latter approach is often preferred by storage planners (because servers are busy doing other application-level chores) and storage vendors (because they can charge more for a box of commodity drives when RAID and other value-added functions are embedded on controllers).

Despite more than two decades of success, two developments have begun to call the efficacy of RAID into question. For one thing, disk drives have become larger in terms of capacity. In the case of RAID 5 and other RAID levels that use distributed parity for data protection, large-capacity drives present a challenge. RAID 5 enables data on the drive set to be "rebuilt" from parity information if a single drive in a multi-disk volume is lost (with RAID 6, a vendor-created extension, two drives can fail and parity data can still be used to reconstruct the original contents of the volume). The amount of time required to rebuild the RAID set is proportional to the capacity of the volume, so the bigger the aggregated capacity of the disks that make up the volume, the longer the rebuild will take. While this recovery is occurring, volume performance is usually reduced significantly.
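The proportionality between capacity and rebuild time can be sketched with simple arithmetic. The function below is a rough estimate only: the 100 MB/s sustained rebuild rate and the capacities in the loop are assumed figures chosen for illustration, and real rebuilds are usually slower because the array keeps serving application I/O at the same time.

```python
def rebuild_hours(capacity_tb, rebuild_rate_mb_per_s):
    """Estimate rebuild time in hours for a given capacity and sustained rate.

    Both arguments are assumptions supplied by the caller; decimal units
    are used (1 TB = 1,000,000 MB).
    """
    capacity_mb = capacity_tb * 1_000_000
    seconds = capacity_mb / rebuild_rate_mb_per_s
    return seconds / 3600


# Hypothetical capacities, all rebuilt at an assumed 100 MB/s.
for capacity_tb in (1, 4, 16):
    hours = rebuild_hours(capacity_tb, 100)
    print(f"{capacity_tb} TB at 100 MB/s: about {hours:.1f} hours")
```

Under these assumptions the estimate scales linearly: quadrupling the capacity quadruples the window during which the volume runs degraded.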

The other problem with RAID is linked to the frequency of drive failures. In 2009, two engineers published a report in the IEEE Transactions on Computers (Vol. 58, No. 3, March 2009) suggesting that the rates of disk failures were five to 1,500 times greater than vendors had suggested. There was a higher-than-expected likelihood that a second drive would fail in a RAID 5 volume even as measures were being taken to replace a first failed drive and to rebuild the RAID volume. This increased risk called into question the value of RAID 5 and 6, arguably the two most widely used RAID schemes.
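The interaction between failure rates and rebuild windows can be illustrated with a simple probability model. This sketch assumes that drive failures are independent and occur at a constant rate, which is a simplification and not the model used in the cited report; the drive count, annual failure rates and rebuild duration are hypothetical inputs.

```python
import math


def second_failure_probability(surviving_drives, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent drives with a constant failure rate, derived
    from the given annual failure rate (a fraction between 0 and 1).
    """
    hours_per_year = 8760
    hourly_rate = -math.log(1.0 - annual_failure_rate) / hours_per_year
    return 1.0 - math.exp(-surviving_drives * hourly_rate * rebuild_hours)


# Hypothetical volume: 7 surviving drives and a 24-hour rebuild, compared
# at an assumed 1% annual failure rate and at a rate five times higher.
for afr in (0.01, 0.05):
    p = second_failure_probability(7, afr, 24)
    print(f"annual failure rate {afr:.0%}: {p:.4%} chance of a second failure")
```

Even in this idealized model, the risk grows with both the failure rate and the length of the rebuild, so underestimated failure rates and larger drives compound each other.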

Some of RAID's increased vulnerability was attributed to a poorly understood phenomenon called "silent corruption." Undetected bit errors resulting from numerous causes were, according to a 2008 Usenix report, responsible for 5% to 10% of the 39,000 storage array failures included in the study. Bit errors can impact a single file, corrupting it so it can't be used, or they can impact sectors that subsequently render the entire disk or volume unreadable.

RAID controllers provided some ability to detect bit errors, via parity scanning, but most storage administrators didn't use this functionality because of its tendency to slow the performance of the RAID array. For the same reason, file system validation cycling and OS error correction code/cyclical redundancy check were also turned off.
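As a rough illustration of how a checksum exposes a silent bit error, the sketch below uses the CRC-32 function from Python's standard `zlib` module. The block contents and the flipped bit are made up for the example, and this is not how any particular controller or file system implements its checks; it also shows where the performance cost comes from, since every block must be reread and rechecked.

```python
import zlib

# A hypothetical block of data, with its checksum recorded at write time.
block = bytearray(b"payload written to disk")
stored_crc = zlib.crc32(block)

# Simulate silent corruption: flip a single bit in one byte.
block[3] ^= 0x01

# A later scan recomputes the checksum and compares it with the stored value.
corrupted = zlib.crc32(block) != stored_crc
print("corruption detected:", corrupted)
```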

The rule of thumb is that at least one bit error exists in every 67 TB of disk. With companies deploying petabytes and even exabytes of storage, the math is disconcerting.
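The math behind that rule of thumb is straightforward to work through. The sketch below takes the figure of one bit error per 67 TB from the text and applies it to a petabyte and an exabyte in decimal units; the function name is an illustrative choice, and because the rule says "at least one," the results are lower bounds.

```python
TB_PER_BIT_ERROR = 67  # rule-of-thumb figure quoted in the text


def min_expected_bit_errors(capacity_tb):
    """Lower-bound count of bit errors for a given capacity in terabytes."""
    return capacity_tb / TB_PER_BIT_ERROR


PETABYTE_TB = 1_000      # 1 PB in decimal terabytes
EXABYTE_TB = 1_000_000   # 1 EB in decimal terabytes

for label, capacity_tb in (("1 PB", PETABYTE_TB), ("1 EB", EXABYTE_TB)):
    errors = min_expected_bit_errors(capacity_tb)
    print(f"{label}: at least {errors:,.0f} bit errors")
```

By this arithmetic, a single petabyte carries roughly 15 bit errors and an exabyte nearly 15,000.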

So, is RAID dead? And if it is, what is to replace it? These are questions that many storage planners confront today and there are no easy answers. This collection of RAID tips is designed to answer the key questions about RAID and its replacement technologies.
