Erasure coding brings trade-off of resilience vs. performance

An IT shop's decision on whether to use erasure coding in storage rests on the importance and required durability of the data more than on the amount of data, says storage consultant Marc Staimer.

For an IT shop, the most critical considerations when deciding whether or not to use erasure coding are the importance of the data and the length of time that the data needs to be readable, according to Marc Staimer, president of Dragon Slayer Consulting in Beaverton, Ore.

Staimer said the amount of data is a less crucial determinant, although he did note that erasure coding beyond standard RAID likely won't become a serious option until there are hundreds of terabytes of data.

In this podcast interview on erasure coding with TechTarget's SearchStorage writer Carol Sliwa, Staimer also shared his thoughts on the pros and cons of erasure coding, the potential impact of erasure coding on backups, the decision that IT shops face on how much erasure coding to do, and his vision for the ultimate use cases for erasure coding in storage.

What do you see as the main upside and the main downside of erasure coding?

Marc Staimer: Erasure coding provides much better data resilience, much better data durability or persistence, meaning the data can last a much longer period of time in spite of the media that it's being written on. Media wasn't designed to last a long time. Erasure codes make it last a long time. And it does this while using a lot less physical storage -- fewer storage nodes or storage systems and less supporting infrastructure. So, it does it all at a much lower price, and it does it with a lot less management because you have far fewer things you have to manage. That's the upside.

The downside is much higher latency. Any time you're adding processing -- which is what you're doing, because you've got to process a lot of different chunks versus just reading it all as one sequential data chunk or datagram -- you're going to add latency. And, when you have latency, that affects response time, and the latency gets especially high if you distribute the chunks geographically or across a lot of different systems.

So, generally speaking, the upside is less hardware, less software, lower cost, higher durability, higher persistence. The downside is poorer response time and higher latency.
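To put rough numbers on the "less hardware, lower cost" point, here is a minimal sketch comparing the raw capacity needed to protect the same usable data with three-copy mirroring versus a 16/10 erasure code (the same layout Staimer uses as an example later in the interview). The 100 TB figure and the layouts are illustrative assumptions, not numbers from the interview.

```python
# Rough storage-overhead arithmetic for protecting 100 TB of usable data.
# The replication factor and the 16/10 erasure coding layout are illustrative
# assumptions, not figures from the interview.

USABLE_TB = 100

# Three-copy mirroring: every byte is stored three times.
replicas = 3
mirrored_raw_tb = USABLE_TB * replicas

# 16/10 erasure coding: each object becomes 16 chunks, any 10 of which are
# enough to read it back, so raw capacity is 16/10 of the usable capacity.
total_chunks, data_chunks = 16, 10
ec_raw_tb = USABLE_TB * total_chunks / data_chunks

print(f"Mirroring (3 copies):   {mirrored_raw_tb:.0f} TB raw, "
      f"{(replicas - 1) * 100}% overhead, tolerates 2 lost copies")
print(f"Erasure coding (16/10): {ec_raw_tb:.0f} TB raw, "
      f"{round((total_chunks / data_chunks - 1) * 100)}% overhead, "
      f"tolerates {total_chunks - data_chunks} lost chunks")
```

Mirroring needs 300 TB of raw capacity to tolerate two losses; the 16/10 layout needs 160 TB to tolerate six, which is the durability-per-dollar argument Staimer is making.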

Can erasure coding eliminate the need for backups?

Staimer: It's an interesting question. If it's your primary data, the answer is no. But what most people use erasure-coded storage for -- which today is primarily object storage -- is as a target for backups. It's a target for older data: secondary data, archived data. So, candidly, it's a copy of the data anyway, and if you're using it in that way, you're not exactly going to be backing that up anyway.

But if you're trying to use it for your primary data -- and, from a performance point of view, there aren't very many systems that can do that with high performance -- let's say you are, then it doesn't eliminate backups, because it's not going to protect you against malware. It's not going to protect you against the malicious employee. It's not going to protect you against accidental deletions or human error. What it's going to protect you against is hardware failures. It's going to protect you against bit rot. It's going to protect you against silent data corruption, which makes it an ideal choice for secondary data or passive data.
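Staimer's distinction -- protection against hardware failures, bit rot and silent corruption, but not against deletion or malware -- shows up even in the simplest possible erasure code. The toy sketch below uses a single XOR parity chunk (a RAID 4/5-style code that tolerates exactly one lost chunk, unlike the multi-failure codes object stores use) to rebuild a chunk lost to a simulated drive failure. Note that if the data is deleted or encrypted by malware before it is written, the code dutifully preserves the damaged copy, which is why backups are still needed. The chunk counts and sample data are illustrative assumptions.

```python
# Toy single-parity erasure code (RAID 4/5-style): tolerates exactly one lost
# chunk. Production object stores use codes such as Reed-Solomon that survive
# many simultaneous losses. Chunk counts and sample data are illustrative.

from functools import reduce

def encode(data: bytes, data_chunks: int) -> list:
    """Split data into equal-size chunks and append one XOR parity chunk."""
    chunk_len = -(-len(data) // data_chunks)                 # ceiling division
    padded = data.ljust(chunk_len * data_chunks, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(data_chunks)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return chunks + [parity]

def reconstruct(chunks: list) -> list:
    """Rebuild at most one missing chunk (marked None) by XOR-ing the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) <= 1, "single parity tolerates only one failure"
    if missing:
        survivors = [c for c in chunks if c is not None]
        chunks[missing[0]] = bytes(
            reduce(lambda a, b: a ^ b, col) for col in zip(*survivors)
        )
    return chunks

original = b"archived record that must stay readable"
stored = encode(original, data_chunks=4)
stored[2] = None                                             # simulate a failed drive
recovered = b"".join(reconstruct(stored)[:4]).rstrip(b"\0")
assert recovered == original
print("recovered:", recovered.decode())
```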

What's the minimum threshold of data at which an IT shop should consider erasure coding?

Staimer: It's not about a minimum threshold. What it comes down to is data resilience and durability: how important is it to you that this data lasts for a long period of time? If you're talking decades, or even centuries, you're really looking at erasure coding as the only methodology that's going to provide you with some level of guarantee that that data's still readable in 30, 40, 50, 100 years. So, it's the value of being able to read that data down the road.

The second aspect is being able to make sure, for compliance reasons, that that data's always readable. I'll give you examples: HITECH [Health Information Technology for Economic and Clinical Health] and HIPAA [Health Insurance Portability and Accountability Act]. In the HIPAA/HITECH rules, you must always have a copy of your data that is readable. If you put it on storage that has erasure codes, you have a copy of your data that's readable, even if it's 20 years from now.

So, it makes sense to look at the value of the data and what you need to do with the data more than the size of the data. In reality, from a size perspective, if you want any kind of resilience, you're going to find RAID is not going to be the way to go. And multi-copy mirroring is just way too expensive when you're talking petabytes of data. So, you could start thinking about erasure coding in the hundreds of terabytes, but once you get to a petabyte, you should definitely be thinking about it. And, once you get into the exabyte range, you have to have something with erasure coding.

How does an end user go about making the decision of how much erasure coding to do?

Staimer: What you're talking about is the concept of breadth and width. How many chunks are you going to break it up into? And how many chunks do you need to read to reconstruct the data? In general, the more chunks you need to read to reconstruct the data out of the breadth, the less resilient it is. So, for example, let's say you're doing a 16 by 10. You need to read 10 of the 16 chunks -- in other words, breadth of 16; 10 is the number of chunks you have to read. Under those circumstances, you can tolerate six failures. But, let's say you want to do a 16 by 13. You can tolerate three failures.

So, it comes down to how many failures you want to have tolerance for. The more failures that you have tolerance for, candidly, the more latency you're going to be adding, because there's more metadata about all the other chunks on each chunk. So it's a trade-off of resilience versus performance. Some people want a belt, suspenders and coveralls; they may go with 100% overhead so that, for example, they have a 30 by 15. They are breaking the data up into 30 chunks, and they want to be able to read any 15 and get their data. OK, that's 15 failures they can suffer and still not lose one byte of data. Now, that's kind of a waste. You would probably be better off accepting a little more risk; but, again, it depends on your tolerance for risk, how many failures you expect and where that data is written.
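Staimer's breadth-and-width examples reduce to simple arithmetic: with n total chunks, any k of which are enough to read the data back, you can lose n - k chunks, and the raw-capacity overhead is n/k - 1. Here is a minimal sketch that reproduces his 16 by 10, 16 by 13 and 30 by 15 figures; the function name is illustrative.

```python
def ec_profile(total_chunks: int, read_chunks: int) -> dict:
    """Failure tolerance and capacity overhead for an n-by-k erasure coding layout.

    total_chunks: how many chunks each object is broken into (the breadth)
    read_chunks:  how many chunks must be read to reconstruct the data
    """
    return {
        "layout": f"{total_chunks} by {read_chunks}",
        "failures_tolerated": total_chunks - read_chunks,
        "capacity_overhead_pct": round((total_chunks / read_chunks - 1) * 100),
    }

for n, k in [(16, 10), (16, 13), (30, 15)]:
    print(ec_profile(n, k))

# 16 by 10 -> tolerates 6 failures at 60% overhead
# 16 by 13 -> tolerates 3 failures at roughly 23% overhead
# 30 by 15 -> tolerates 15 failures at 100% overhead
```

The wider the gap between total chunks and chunks needed to read, the more failures you can ride out -- at the cost of more raw capacity and more chunks to track.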

What's your vision on how erasure coding will be used in enterprise IT shops in the short term and in the long term?

Staimer: Short-term, it's primarily going to be used for passive data -- not active or hot data, not data that's being used by databases or structured applications where response time is really, really important. Although some of the players are working on making erasure coding faster, for all intents and purposes, you're going to have longer response times.

So, I don't see it going beyond passive data in the short term. Long-term, I do. I see it getting better. I see the latency being managed in silicon or FPGAs [field-programmable gate arrays]. I see the algorithms getting faster, so you're not going to have as much latency, and it can replace RAID 5 and RAID 6 for active data. But, for passive data, it's going to replace RAID, period. You don't need RAID if you're using erasure coding, because it provides better resilience and better durability than RAID.
