One benefit hyper-converged infrastructure offers to IT environments is simplicity. Suddenly, there's no more separate storage tier to manage and scaling becomes a breeze. Better yet, you can manage everything -- including centralized policies and, in some cases, even backup and recovery -- from a single console.
Hyper-convergence engineers may have done a good job providing administrators with simplicity, but underneath it all is a level of complexity that users don't typically see. It's here that data center architects should focus when trying to understand how resilient their environment is to failure, error and attack. After all, if you don't understand what can go wrong in your environment, you can't possibly build in the data resiliency necessary to withstand common issues that could take it out of service.
There are a number of factors that go into understanding how resilient a hyper-converged infrastructure (HCI) cluster is, including:
- the point at which the system acknowledges a write operation to the application;
- the RAID level, if any, used per node;
- the node-based replication factor used; and
- whether erasure coding is used.
Understanding RAID and replication factors
Let's start with the RAID level used per node for data resiliency. Not all HCI platforms on the market use RAID, and, in some cases, that's OK. However, RAID does enable a local node to withstand the loss of one or more drives without losing all of the data on the node. So, if you're using RAID 5, you can lose a single drive in a node. With RAID 6, you can lose two drives in a node before things go south.
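To put numbers on that tradeoff, here is a minimal Python sketch of how many drive failures a RAID 5 or RAID 6 group tolerates and how much capacity is left over for data. The function name, the drive count and the drive size are hypothetical and chosen only for illustration; real platforms also reserve space for metadata and spares, which this ignores.

```python
# Illustrative only: standard single-parity (RAID 5) and dual-parity (RAID 6) math.
PARITY_DRIVES = {"raid5": 1, "raid6": 2}


def raid_summary(level: str, drives: int, drive_tb: float) -> dict:
    """Return tolerated drive failures and usable capacity for one RAID group."""
    parity = PARITY_DRIVES[level]
    min_drives = parity + 2  # RAID 5 needs at least 3 drives, RAID 6 at least 4
    if drives < min_drives:
        raise ValueError(f"{level} needs at least {min_drives} drives")
    return {
        "tolerated_drive_failures": parity,
        # One drive's worth of capacity is given up per parity drive.
        "usable_tb": (drives - parity) * drive_tb,
    }


if __name__ == "__main__":
    # Hypothetical node: six 4 TB drives.
    print(raid_summary("raid5", drives=6, drive_tb=4.0))
    print(raid_summary("raid6", drives=6, drive_tb=4.0))
```

In this hypothetical six-drive node, moving from RAID 5 to RAID 6 buys tolerance for a second drive failure at the cost of one more drive's worth of capacity.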
There's a huge but here, though: RAID may not be an important factor in your HCI. It depends on how the hyper-converged platform stores data. For example, Nutanix doesn't use any kind of local RAID on individual nodes. Instead, it writes data to individual disks across multiple nodes using a replication factor mechanism.
With a replication factor of two, for example, data is written to disks in two nodes. With a replication factor of three, data is written to disks in three nodes. Only after the data is committed to persistent media on every one of those nodes will the system acknowledge the write operation back to the application.
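The acknowledgment rule is easier to see in code. The following is a toy Python model, not any vendor's implementation: the `Node` class and `replicated_write` function are hypothetical names, and an in-memory dictionary stands in for persistent media. It only shows the idea that a write is acknowledged once the required number of nodes has committed it.

```python
class Node:
    """Toy stand-in for an HCI node; a dict plays the role of persistent media."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.disk = {}

    def commit(self, key: str, value: bytes) -> bool:
        """Persist the value; return False if the node is unavailable."""
        if not self.healthy:
            return False
        self.disk[key] = value
        return True


def replicated_write(nodes, key: str, value: bytes, replication_factor: int) -> bool:
    """Acknowledge (return True) only after `replication_factor` nodes committed."""
    committed = 0
    for node in nodes:
        if committed == replication_factor:
            break
        if node.commit(key, value):
            committed += 1
    # A real system would also clean up or retry partial writes; this sketch does not.
    return committed == replication_factor


if __name__ == "__main__":
    cluster = [Node("node-a"), Node("node-b"), Node("node-c")]
    print(replicated_write(cluster, "block-1", b"data", replication_factor=2))
```

If too few healthy nodes are available to satisfy the replication factor, the function returns `False` and the application never sees an acknowledgment.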
The beauty here in regard to data resiliency is that you can, theoretically, lose a bunch of disks in a node and the rest of the disks in the node would remain available. If an entire node happens to fail, there are other nodes in which the data is available, so you can begin rebuild operations immediately.
Other HCIs, such as HPE's SimpliVity product, take a more traditional approach to local data protection, implementing RAID 5 or RAID 6 depending on the size of the node. This enables applications running on SimpliVity to write data across multiple local disks that are in a RAID configuration, while also writing data to another node to ensure resiliency in the event of a node failure. In this way, a local node can lose one or two disks, depending on the RAID level in use, and remain functional. And, in the event that too many node-based disks are lost, or if a node fails entirely, data exists on other nodes that can be used in production and for rebuilding purposes.
Obviously, you're better off with more copies of data, so some organizations prefer a replication factor of three in systems that support it. However, bear in mind that more copies of data result in higher overall costs, as you're storing data multiple times. You must find the right balance between cost and resiliency.
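A quick back-of-the-envelope calculation shows what that balance looks like. This Python sketch assumes the simplest case, in which each replica is a full copy and no data reduction is applied. The function names and the 10 TB figure are hypothetical.

```python
def raw_capacity_needed(usable_tb: float, replication_factor: int) -> float:
    """Raw capacity consumed when every replica is a full copy of the data."""
    return usable_tb * replication_factor


def node_failures_tolerated(replication_factor: int) -> int:
    """With N full copies on N different nodes, N - 1 nodes can be lost."""
    return replication_factor - 1


if __name__ == "__main__":
    for rf in (2, 3):
        raw = raw_capacity_needed(10.0, rf)  # hypothetical 10 TB of application data
        lost = node_failures_tolerated(rf)
        print(f"RF{rf}: {raw} TB raw, survives {lost} node failure(s)")
```

Under these assumptions, going from a replication factor of two to three adds tolerance for one more node failure but raises raw capacity for the same data from 20 TB to 30 TB.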
Understanding HCI erasure coding
When I talk about replication factors, there's actually a lot more going on under the hood than you would think. A replication factor of two doesn't always mean the hyper-converged system simply writes two copies of data, thereby doubling your storage overhead requirements. Likewise, a replication factor of three doesn't always mean the system simply writes three copies of data, thereby tripling your storage overhead requirements.
Data resiliency questions to ask HCI vendors
As you consider hyper-converged infrastructure systems, ask vendors about their underlying data resiliency options. Start with these basic questions:
- What steps do you take to ensure ongoing operations even in the event of a node or hardware failure?
- Are all of your data resiliency operations enabled by default?
- What kind of performance impact would enabling all the resiliency options have on my workloads?
- What level of capacity overhead do your customers typically experience with all the resiliency options enabled?
In many cases, replication is coupled with erasure coding to chunk data into fragments imbued with some parity information. Each fragment is written separately to many locations across a cluster. Erasure coding guarantees that each fragment lives on at least two or three nodes depending on the replication factor in use.
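To make fragments and parity concrete, here is a minimal Python sketch of the simplest possible erasure code: split the data into k fragments and add a single XOR parity fragment. This is an illustration under stated assumptions, not how any particular HCI product does it. Commercial systems use their own, more sophisticated schemes, and the `encode` and `recover` names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal fragments (zero-padded) plus one XOR parity fragment."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: list[bytes], missing: int) -> bytes:
    """Rebuild the fragment at index `missing` by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(fragments) if i != missing]
    return reduce(xor_bytes, survivors)


if __name__ == "__main__":
    stored = encode(b"hyper-converged", k=3)  # imagine each fragment on a different node
    rebuilt = recover(stored, missing=1)      # pretend the node holding fragment 1 failed
    print(rebuilt == stored[1])
```

With k set to 3, this scheme stores four fragments for every three fragments of data, or roughly 1.33 times the original size, and survives the loss of any one fragment. Two full copies would consume twice the original size for the same single-failure tolerance.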
The problem is erasure coding has traditionally been accompanied by significant performance degradation, as the math needed to execute erasure coding can consume swaths of CPU resources.
Datrium and Pivot3 have taken new approaches to erasure coding, with patented techniques that reduce CPU overhead, making the technology more viable. Nutanix also introduced a patented erasure coding technique that can greatly improve storage utilization, while also maintaining high levels of data resiliency. Nutanix's erasure coding is post-process, which means that it's applied after data is written.