High availability and resiliency: A DR strategy needs both

An organization's resiliency during and after a crisis depends on many factors. High availability is one aspect of overall resiliency that DR teams can't afford to overlook.

In an age where acceptable downtime is essentially zero, high availability and resiliency are critical metrics in business continuity and technology disaster recovery.

Both high availability (HA) and resiliency address disruptions to an organization from system failures, network outages and application issues. In IT, HA describes systems that function without interruption for specific periods of time. Resiliency is the ability of a system to recover from a disruption and to modify its capabilities to adapt and better respond to similar events in the future.

Despite common goals, HA and resiliency are not synonymous. A strong disaster recovery strategy incorporates both concepts. It's critical for DR teams to understand the differences between the two, their relationship to each other and other performance metrics that can affect resiliency.

What is high availability?

High availability describes a system's ability to remain operational without interruption for a specific period. It takes technology redundancy to a higher level.

Redundancy typically means that backup hardware, software and storage are available in case primary resources fail. In many cases, users must activate the backup resources.

Diagram that illustrates system redundancy.

HA improves on redundancy by reducing single points of failure, adding dynamic system monitoring to detect failures and including an automated failover capability to immediately move the production assets to an alternate platform.

Diagram that illustrates a high availability system.
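The monitoring and automated failover described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` and `FailoverMonitor` names, the caller-supplied health probe and the `max_failures` threshold are all hypothetical and do not come from any particular HA product.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """A hypothetical platform that can host the production workload."""
    name: str
    is_healthy: Callable[[], bool]  # health probe supplied by the caller


class FailoverMonitor:
    """Tracks the active node and fails over after repeated failed checks."""

    def __init__(self, nodes: List[Node], max_failures: int = 3) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = nodes
        self.max_failures = max_failures  # consecutive failures tolerated (assumed value)
        self.active = nodes[0]
        self.failures = 0

    def check(self) -> Node:
        """Run one monitoring cycle and return the node that is active afterward."""
        if self.active.is_healthy():
            self.failures = 0
            return self.active
        self.failures += 1
        if self.failures >= self.max_failures:
            standby = self._first_healthy_standby()
            if standby is not None:
                # Automated failover: production moves to the alternate platform.
                self.active = standby
                self.failures = 0
        return self.active

    def _first_healthy_standby(self) -> Optional[Node]:
        """Return the first healthy node other than the active one, if any."""
        for node in self.nodes:
            if node is not self.active and node.is_healthy():
                return node
        return None
```

Requiring several consecutive failed checks before switching is one common way to avoid failing over on a brief glitch. A real deployment would also need to handle failback and data synchronization, which this sketch leaves out.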

The backup system can be in a data center or an alternate location, such as a cloud service. The time needed to recover and restart the system following failover depends on the network bandwidth available and the technology used to enact the failover.

HA systems are typically designed to achieve a specific level of availability, often expressed as percent uptime. For example, five-nines availability means that the system is available 99.999% of the time, which equates to less than six minutes of downtime per year.
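The arithmetic behind an uptime target is easy to check. The short Python sketch below converts a percent-uptime figure into the downtime it allows per year. It assumes a 365-day year, and the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a percent-uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare common availability targets, from two nines to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

At five nines, the downtime budget works out to roughly 5.26 minutes per year. Each additional nine cuts the allowed downtime by a factor of 10.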

Greater availability usually means greater cost, but it also provides a significant boost to an organization's DR capabilities. The monitoring technology, backup resources and other resources an organization needs to establish HA performance all cost more than those for simple redundancy. It is also good practice to maintain spares of critical IT assets, power systems, network components and other resources.

What is fault tolerance?

Fault tolerance takes the model of high availability to the next level. A fault-tolerant system is designed to almost never fail, except in unusual circumstances such as natural disasters and other unanticipated events. HA and fault tolerance are typically associated with hardware and network elements. Software that fails will fail in HA and fault-tolerant systems alike.

Diagram that illustrates a fault tolerant system.

One way organizations achieve fault tolerance is by establishing fully mirrored systems that are immediately updated anytime the primary system is updated. In this scenario, single points of failure are largely eliminated. Mirrored systems are in constant standby mode, ready to take over processing from a disrupted system. When system monitoring detects an issue that crosses a preset threshold, it immediately transfers production duties to the standby resources so that production is never interrupted. These resources can be local or remotely located, typically in a cloud.
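As a rough illustration of the mirroring idea, the toy Python class below updates a primary and a mirror copy in the same write call, then switches reads to the mirror once reported errors reach a preset threshold. The `MirroredStore` name, the in-memory dictionaries and the threshold value are assumptions made for this sketch. Production systems mirror at the storage, database or hypervisor layer.

```python
class MirroredStore:
    """Toy key-value store that keeps a primary and a mirror copy in step."""

    def __init__(self, error_threshold: int = 3) -> None:
        self.primary: dict = {}
        self.mirror: dict = {}
        self.error_threshold = error_threshold  # hypothetical preset threshold
        self.error_count = 0
        self.serving_from = "primary"

    def write(self, key: str, value: str) -> None:
        """Apply the update to every copy still in service before returning."""
        if self.serving_from == "primary":
            self.primary[key] = value
        # The mirror is always updated, so it never lags behind the primary.
        self.mirror[key] = value

    def report_error(self) -> None:
        """Record a fault on the primary; hand over to the mirror at the threshold."""
        self.error_count += 1
        if self.error_count >= self.error_threshold:
            self.serving_from = "mirror"

    def read(self, key: str) -> str:
        """Read from whichever copy currently carries production."""
        source = self.primary if self.serving_from == "primary" else self.mirror
        return source[key]
```

Because every write reaches the mirror before the call returns, a read issued immediately after the switch sees the same data the primary held. That is the property that lets production continue without interruption.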

Due to the additional required systems and resources, costs to achieve fault tolerance are higher than those for high availability.

What is resiliency?

Business continuity and disaster recovery (BCDR) typically focus on restoring business processes and recovering systems, respectively. Resiliency goes a step further. Organizations must use lessons learned from previous events to adapt and improve their methods so that they are better prepared for future events. This can be applied to BCDR plans as well as IT systems and networks, backup resources, power and environmental systems, and other IT resources.

For example, a commercial power outage of two weeks might be beyond an organization's backup power system capabilities. To achieve more resilient power, the organization can install a larger system and make arrangements for scheduled refueling.

High availability applies to system availability and reliability. Resiliency addresses how these resources have been improved to better deal with future incidents. The approaches discussed here -- redundancy, HA and fault tolerance -- all contribute to resiliency. They also contribute to higher costs.

None of these approaches guarantees resiliency, but a progression toward fault tolerance is likely to result in a higher state of resiliency. Costs and resources needed to achieve the desired level of resiliency must be balanced with the organization's business needs and management's appetite for additional technology investments.

Paul Kirvan is an independent consultant, IT auditor, and technical writer, editor and educator. He has more than 25 years' experience in business continuity, disaster recovery, security, enterprise risk management, telecom and IT auditing.
