What is high availability?
High availability (HA) is the ability of a system to operate continuously without failing for a designated period of time. HA works to ensure a system meets an agreed-upon operational performance level. In information technology (IT), a widely held but difficult-to-achieve standard of availability is known as five-nines availability, which means the system or product is available 99.999% of the time.
HA systems are used in situations and industries where it is critical the system remains operational. Real-world high-availability systems include military control systems, autonomous vehicles, and industrial and healthcare systems. People's lives depend on these systems being available and functioning at all times. For example, if the system operating an autonomous vehicle fails while the vehicle is in motion, it could cause an accident, endangering its passengers, other drivers and vehicles, pedestrians and property.
Highly available systems must be well-designed and thoroughly tested before they are used. Planning for one of these systems requires all components meet the desired availability standard. Data backup and failover capabilities play important roles in ensuring HA systems meet their availability goals. System designers must also pay close attention to the data storage and access technology they use.
How does high availability work?
It is impossible for systems to be available 100% of the time, so true high-availability systems generally strive for five nines as the standard of operational performance.
The following three principles are used when designing HA systems to ensure high availability:
- Elimination of single points of failure. A single point of failure is a component whose failure would cause the whole system to fail. If a business has one server running an application, that server is a single point of failure; should it fail, the application becomes unavailable.
- Reliable crossover. Building redundancy into these systems is also important. Redundancy enables a backup component to take over for a failed one. When this happens, it's necessary to ensure reliable crossover or failover, which is the act of switching from component X to component Y without losing data or affecting performance.
- Failure detectability. Failures must be visible and, ideally, systems have built-in automation to handle the failure on their own. There should also be built-in mechanisms for avoiding common cause failures, where two or more systems or components fail simultaneously, likely from the same cause.
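To make failure detectability concrete, the sketch below shows a minimal heartbeat monitor: each node periodically reports in, and any node that stays silent past a timeout is flagged as failed. The class and node names are illustrative, not from any specific HA product.

```python
import time

# Minimal heartbeat-based failure detection sketch (illustrative names only).
HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a node is considered failed

class HeartbeatMonitor:
    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def record_heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor()
monitor.record_heartbeat("web-1", now=100.0)
monitor.record_heartbeat("web-2", now=100.0)
# web-2 keeps sending heartbeats; web-1 goes silent.
monitor.record_heartbeat("web-2", now=105.0)
print(monitor.failed_nodes(now=105.0))  # ['web-1']
```

In a real system, the list of failed nodes would feed automated remediation, such as triggering failover, rather than a print statement.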
To ensure high availability when many users access a system, load balancing becomes necessary. Load balancing automatically distributes workloads to system resources, such as sending different requests for data to different services hosted in a hybrid cloud architecture. The load balancer decides which system resource is most capable of efficiently handling which workload. The use of multiple load balancers to do this ensures no one resource is overwhelmed.
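The load balancer's core decision can be sketched in a few lines. The example below uses a least-connections policy, one common way a balancer judges which resource is "most capable" at the moment; the server names and connection counts are hypothetical.

```python
# Hypothetical least-connections load balancing sketch. Production balancers
# such as HAProxy implement this and other policies far more robustly.
servers = {"app-1": 4, "app-2": 2, "app-3": 7}  # server -> active connections

def pick_server(servers):
    # Route the next request to the server with the fewest active connections.
    return min(servers, key=servers.get)

target = pick_server(servers)
servers[target] += 1  # account for the newly routed request
print(target)  # app-2
```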
The servers in an HA system are in clusters and organized in a tiered architecture to respond to requests from load balancers. If one server in the cluster fails, a replicated server in another cluster can handle the workload designated for the failed server. This sort of redundancy enables failover where a secondary component takes over a primary component's job when the first component fails, with minimal performance impact.
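The failover behavior described above can be sketched as trying the primary first and falling back to a healthy replica. All names and the health-check function here are assumptions for illustration.

```python
# Sketch of failover from a primary server to a replicated backup (illustrative).
def handle_request(request, primary, replicas, is_healthy):
    # Try the primary first; on failure, fail over to the first healthy replica.
    for server in [primary, *replicas]:
        if is_healthy(server):
            return f"{server} handled {request}"
    raise RuntimeError("no healthy server available")

# Assumed health state: the primary is down, both replicas are up.
healthy = {"db-primary": False, "db-replica-1": True, "db-replica-2": True}
print(handle_request("query", "db-primary",
                     ["db-replica-1", "db-replica-2"],
                     lambda s: healthy[s]))
# db-replica-1 handled query
```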
The more complex a system is, the more difficult it is to ensure high availability because there are simply more points of failure in a complex system.
Why is high availability important?
Systems that must be up and running most of the time are often ones that affect people's health, economic well-being, and access to food, shelter and other fundamentals of life. In other words, they are systems or components that will have a severe impact on a business or people's lives if they fall below a certain level of operational performance.
As mentioned earlier, autonomous vehicles are clear candidates for HA systems. For example, if a self-driving car's front-facing sensor malfunctions and mistakes the side of an 18-wheeler for the road, the car will crash. Even though, in this scenario, the car was functional, the failure of one of its components to meet the necessary level of operational performance resulted in what would likely be a serious accident.
Electronic health records (EHRs) are another example where lives depend on HA systems. When a patient shows up in the emergency room in severe pain, the doctor needs instant access to the patient's medical records to get a full picture of the patient's medical history and make the best treatment decisions. Is the patient a smoker? Do they have a family history of heart complications? What other medications are they taking? Answers to these questions are needed immediately and can't be subject to delays due to system downtime.
How availability is measured
Availability can be measured relative to a system being 100% operational or never failing -- meaning it has no outages. Typically, an availability percentage is calculated as follows:
Availability = ((minutes in a month - minutes of downtime) / minutes in a month) x 100
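As a worked example, the formula above applied to a 30-day month with five minutes of downtime (an assumed figure) gives:

```python
# Availability for a 30-day month with 5 minutes of downtime (assumed values).
minutes_in_month = 30 * 24 * 60          # 43,200 minutes
minutes_of_downtime = 5

availability = (minutes_in_month - minutes_of_downtime) * 100 / minutes_in_month
print(f"{availability:.4f}%")  # 99.9884%
```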
Three metrics used to measure availability include the following:
- Mean time between failures (MTBF) is the expected time between two failures for the given system.
- Mean downtime (MDT) is the average time that a system is nonoperational.
- Recovery time objective (RTO), also known as estimated time of repair, is the total time a planned outage or recovery from an unplanned outage will take.
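These metrics relate to availability through a commonly used steady-state formula: availability = MTBF / (MTBF + MDT). The figures below are assumed for illustration.

```python
# Steady-state availability from MTBF and mean downtime (MDT):
#   availability = MTBF / (MTBF + MDT)
mtbf_hours = 1000.0   # expected time between failures (assumed figure)
mdt_hours = 2.0       # average time to restore service (assumed figure)

availability = mtbf_hours / (mtbf_hours + mdt_hours) * 100
print(f"{availability:.3f}%")  # 99.800%
```

Intuitively, of every 1,002 hours of operation, the system is expected to be down for two, so raising MTBF or cutting MDT both improve availability.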
These metrics can be used for in-house systems or by service providers to promise customers a certain level of service as stipulated in a service-level agreement (SLA). SLAs are contracts that specify the availability percentage customers can expect from a system or service.
Availability metrics are subject to interpretation as to what constitutes availability to the end user. Even if a system continues to partially function, users may deem it unusable because of performance problems. Despite this subjectivity, availability metrics are formalized concretely in SLAs, which the service provider is responsible for satisfying.
If a system or SLA provides 99.999% availability, the end user can expect the service to be unavailable for the following amounts of time:
| Time period | Time system is unavailable |
|-------------|----------------------------|
| Yearly      | 5 minutes and 15.6 seconds |
To provide context, if a company adheres to the three-nines standard (99.9%), there will be about 8 hours and 45 minutes of system downtime in a year. Downtime with a two-nines standard is even more dramatic; 99% availability equates to about three and a half days (87.6 hours) of downtime a year.
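The downtime figures above follow directly from the availability percentage, as this short calculation shows:

```python
# Annual downtime implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_pct):
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100

for pct in (99.0, 99.9, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.2f} minutes/year")
# 99.0% -> 5256.00 minutes/year   (about 3.65 days)
# 99.9% -> 525.60 minutes/year    (about 8 hours 45 minutes)
# 99.999% -> 5.26 minutes/year    (about 5 minutes 15 seconds)
```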
How to achieve high availability
The six steps for achieving high availability are as follows:
- Design the system with HA in mind. The goal of designing an HA system is to create one that adheres to performance conventions while minimizing cost and complexity. Points of failure should be eliminated with redundancy provided, as needed.
- Define the success metrics. It's necessary to determine the level of availability the system needs, and which metrics will be used to measure it. Service providers involve customers in this process through an SLA.
- Deploy the hardware. Hardware should be resilient and balance quality with cost-effectiveness. Hot-swappable and hot-pluggable hardware is particularly useful in HA systems because the hardware doesn't have to be powered down when swapped out or when components are plugged in or unplugged.
- Test the failover system. Once the system is up and running, the failover system should be checked to ensure it is ready to take over in case of a failure. Applications should be tested and retested as time goes on, and a testing schedule should be in place.
- Monitor the system. The system's performance should be tracked using metrics and observation. Any variance from the norm must be logged and evaluated to determine how the system was affected and what adjustments are required.
- Evaluate. Analyze the data gathered from monitoring, and then find ways to improve the system. Continue to ensure availability as conditions change and the system evolves.
High availability and disaster recovery
Disaster recovery (DR) is the part of security planning that focuses on recovering from a catastrophic event, such as a natural disaster that destroys the physical data center or other infrastructure. DR is about having a plan for when a system or network goes down and for dealing with the consequences of that failure. HA strategies, by contrast, address smaller, more localized failures or faults.
There is a lot of overlap between the infrastructure and strategies put in place for DR and HA. Backups and failover processes should be available for all critical components of high-availability systems, and they come into play in a DR scenario, too. These components may include servers, storage systems, network nodes, satellites and entire data centers. Backup components should be built into the system's infrastructure. For example, if a database server fails, an organization should be able to switch to a backup server.
In an HA environment, data backups are needed to maintain availability in the case of data loss, corruption or storage failures. A data center should host data backups on redundant servers to ensure data resilience and quick recovery from data loss and have automated DR processes in place.
High availability and fault tolerance
Like DR, fault tolerance helps ensure high availability. Fault tolerance is the ability of a system to endure and anticipate errors in the system's functions and to automatically respond in the event of an error. A fault tolerant system requires redundancy to minimize disruption in case of hardware failure.
To obtain redundancy, IT organizations should follow an N+1, N+2, 2N or 2N+1 strategy, where N represents the number of servers (or other components) needed to keep the system running. An N+1 model requires all the servers needed to run the system plus one more; N+2 adds two. A 2N model requires twice as many servers as the system normally needs, and 2N+1 requires twice as many plus one. These strategies ensure mission-critical components have at least one backup.
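The redundancy models can be summarized as a simple lookup. The workload size below (four servers) is an assumed example:

```python
# Servers required under common redundancy models, where n is the number
# needed to run the system at full load.
def servers_required(n, model):
    return {"N+1": n + 1, "N+2": n + 2, "2N": 2 * n, "2N+1": 2 * n + 1}[model]

n = 4  # assume the workload needs 4 servers
for model in ("N+1", "N+2", "2N", "2N+1"):
    print(model, "->", servers_required(n, model))
# N+1 -> 5
# N+2 -> 6
# 2N -> 8
# 2N+1 -> 9
```

The 2N variants cost roughly twice as much as N+1 but can survive the loss of an entire set of servers, such as a whole site, rather than a single machine.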
It is possible for a system to be highly available but not fault tolerant. For example, if an HA system has a problem hosting a virtual machine on a server in a cluster of nodes and the system is not fault tolerant, the hypervisor may try to restart the VM in the same cluster. This will likely succeed if the problem is software-based. However, if the problem lies in the cluster's hardware, restarting the VM in the same cluster will not help, because it remains on the faulty hardware.
A fault tolerant approach in the same situation would probably have an N+1 strategy in place, and it would restart the VM on a different server in a different cluster. Fault tolerance is more likely to guarantee zero downtime. A DR strategy would go a step further to ensure there is a copy of the entire system somewhere else for use in the event of a catastrophe.
High availability best practices
A highly available system should be able to quickly recover from any sort of failure state to minimize interruptions for the end user. High availability best practices include the following:
- Eliminate single points of failure or any node that would impact the system if it becomes dysfunctional.
- Ensure all systems and data are backed up for fast and easy recovery.
- Use load balancing to distribute application and network traffic across servers or other hardware. HAProxy is a widely used open source load balancer that can itself be deployed redundantly.
- Continuously monitor the health of back-end database servers.
- Distribute resources in different geographical regions in case of power outages or natural disasters.
- Implement reliable failover. For storage, a redundant array of independent disks (RAID) or a storage area network (SAN) is a common approach.
- Set up a system that detects failures as soon as they occur.
- Design system parts for high availability and test their functionality before implementation.
High availability and the cloud
As mentioned above, there is a subjective element to high availability. Depending on the system, the amount of uptime necessary will vary. In cloud computing, the level of service is especially variable.
Cloud service providers have generally promised at least 99.9% availability for their paid services; more recently, they've moved to 99.99% availability for some services. The question remains, which applications need this level of availability?