
What's the best way to protect against HDD failure?

HDD failure can put large amounts of data at risk. Is multi-copy mirroring or erasure coding the more efficient data protection approach?

Erasure coding and multi-copy mirroring were developed in response to traditional RAID's inability to keep up with hard disk drive (HDD) density gains. Even as HDDs have increased in areal density, neither their bit error rates nor their head counts have improved to match. The probability of hitting a non-recoverable bit error has therefore increased, raising the potential of HDD failure and subsequent RAID group data loss. And because capacity has grown much faster than throughput, each gigabyte takes longer to read back, stretching HDD rebuild times and widening the risk window in which a concurrent HDD failure destroys the RAID group.

RAID 6, RAID 60 and triple-parity RAID 6 have helped to a degree; however, long HDD rebuild times and the heart-pounding fire drills triggered by HDD failure created an urgent need for a sound alternative. This became increasingly obvious -- especially for nearline data that must be retained for years or even decades and cannot be recreated if lost.

Multi-copy mirroring solves the problem by making multiple copies of the data on different HDDs behind various storage controllers (commonly called nodes). When an HDD fails or suffers a non-recoverable bit error, a good copy of the data is simply copied to another drive. The number of concurrent HDD or node failures that must be tolerated determines the number of copies: surviving two concurrent failures requires two additional copies of the data (three instances in all), while surviving three concurrent failures requires three additional copies. Copying data from another good copy makes this a very fast data protection and recovery option, but it is very expensive. Each copy of the data consumes additional storage capacity, which adds up quickly.
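The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration, not a real storage implementation; the function names are hypothetical, and it assumes the article's convention that tolerating N concurrent failures means keeping N additional copies on distinct drives or nodes.

```python
# Toy sketch of multi-copy mirroring (hypothetical helper names).
# Assumes tolerating N concurrent failures requires N additional copies.

def mirror(data: bytes, failures_tolerated: int) -> list:
    """Return the full set of copies to spread across distinct HDDs/nodes."""
    return [data] * (failures_tolerated + 1)  # original + N extra copies

def recover(copies: list) -> bytes:
    """Any single surviving copy is enough to restore the data."""
    surviving = [c for c in copies if c is not None]
    if not surviving:
        raise RuntimeError("all copies lost -- data unrecoverable")
    return surviving[0]

copies = mirror(b"payload", failures_tolerated=2)  # 3 instances in all
copies[0] = copies[2] = None                       # two concurrent failures
print(recover(copies))                             # b'payload'
```

Recovery here is just a copy of a surviving replica, which is why mirroring rebuilds so quickly -- but note that protecting against two failures already triples the raw capacity consumed.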

Erasure coding is designed to be more efficient because it breaks data into chunks. The total number of chunks is called the width, while the number of chunks required to read the entire datagram is called the breadth. Each chunk holds part of the data or a representation of it (such as a formula), plus metadata about the whole datagram. A common width-to-breadth ratio for erasure codes is 16:10, meaning any 10 of the 16 chunks are enough to recreate the entire datagram. If any chunks (up to six) are missing, they are recreated and written to other HDDs and/or nodes.
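The chunking idea can be demonstrated with the simplest possible erasure code: XOR parity, which gives a 3:2 width-to-breadth ratio. This is a toy sketch only -- production systems use Reed-Solomon codes to achieve wider ratios such as 16:10 -- but it shows how any breadth-sized subset of chunks can rebuild the original datagram.

```python
# Toy erasure code with width 3, breadth 2: two data chunks plus one
# XOR parity chunk. Any two of the three chunks recover the datagram.

def encode(data: bytes) -> list:
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")        # pad second chunk if odd length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(chunks: list, length: int) -> bytes:
    """Rebuild the datagram from any 2 of 3 chunks (None marks a lost chunk)."""
    a, b, p = chunks
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))  # a = b XOR parity
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, p))  # b = a XOR parity
    return (a + b)[:length]                     # strip padding

chunks = encode(b"datagram")
chunks[1] = None                # one chunk lost to an HDD failure
print(decode(chunks, 8))        # b'datagram'
```

Rebuilding a lost chunk requires reading and computing over the surviving chunks, which hints at why erasure coding trades processing overhead for its capacity savings.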

Erasure coding is also much more economical than multi-copy mirroring. The 16:10 example protects against up to six concurrent HDD or node failures without losing a byte of data, yet requires only 60% additional storage vs. the 600% mirroring needs for the same protection. If the width-to-breadth ratio were 26:20, the additional storage consumed would be a mere 30%, while still protecting against up to six concurrent HDD or node failures. The downside is that chunking adds considerable processing overhead, slowing writes and reads. This makes erasure coding mostly useful for secondary data or nearline storage, such as public and private cloud object storage.
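The overhead figures above follow directly from the two schemes' arithmetic, sketched here under the same assumptions as the article: mirroring keeps N extra copies to survive N failures, and erasure coding stores width chunks for breadth chunks' worth of data.

```python
# Storage overhead arithmetic for the article's examples.

def mirroring_overhead(failures_tolerated: int) -> float:
    """Percent extra capacity: one full extra copy per failure tolerated."""
    return failures_tolerated * 100.0

def ec_overhead(width: int, breadth: int) -> float:
    """Percent extra capacity: (width - breadth) parity chunks per breadth."""
    return (width - breadth) * 100.0 / breadth

print(mirroring_overhead(6))  # 600.0 -- six extra copies
print(ec_overhead(16, 10))    # 60.0  -- tolerates six failures
print(ec_overhead(26, 20))    # 30.0  -- also tolerates six failures
```

The wider the code stripe, the lower the overhead for the same failure tolerance -- at the cost of touching more drives or nodes on every read and rebuild.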

Next Steps

Erasure coding provides drive-level protection

Three ways to use RAID to prevent multiple drive failures

Pros and cons of erasure coding vs. RAID

Video: Explore RAID vs. erasure coding for data protection
