IT administrators face tremendous pressure to avoid system downtime. Server failures can result in lost business revenue, reduced productivity and a damaged reputation. At first glance, it would seem counterintuitive for organizations to adopt virtualization and risk losing multiple critical workloads if a single physical server failed.
To address this potential problem and improve overall reliability, VMware introduced VMware High Availability (HA) in 2006. It's impossible to prepare for every contingency, but in many circumstances, VMware HA can help administrators avoid unplanned downtime. Let's look at how VMware HA works.
In the event of a server failure, the VMware HA utility can restart VMs that were running on the failed server on other servers in the cluster.
Availability in the virtual world
To understand why and how VMware HA works, it helps to look at how organizations approach high availability without virtualization. One common approach for a critical application is to keep a backup server standing by, ready to restart the workload should the primary server fail.
While this approach is reasonably effective at ensuring minimal downtime, it isn't an efficient use of resources; the backup server is sitting idle until the primary fails. Additionally, when the primary server fails and the backup takes its place, the backup becomes a potential single point of failure -- unless the organization invests in multiple backup servers.
Rather than rely on idle systems, VMware HA works by using spare capacity on servers running other workloads. When designing a high availability cluster, an administrator will leave spare capacity available on each physical server to ensure that, in the event a server fails, the VMs on that server can be distributed and restarted on the remaining servers.
How VMware HA works under the covers
Hosts in the cluster communicate via a VMware HA utility, which maintains and monitors each host's heartbeat -- a periodic message that indicates the host is still running. If a host stops transmitting its heartbeat, because of a failure, for example, the other hosts in the cluster will recognize its absence and restart its VMs on the remaining hosts.
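The heartbeat check described above can be illustrated with a short Python sketch. This is not VMware's implementation; the `HeartbeatMonitor` class, its method names and the host names are hypothetical, and the 15-second window is simply borrowed from the default wait time cited later in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical model: tracks the last heartbeat seen from each host."""

    timeout_s: float = 15.0  # assumed detection window, in seconds
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, host: str, now: float) -> None:
        """Note that a heartbeat arrived from `host` at time `now` (seconds)."""
        self.last_seen[host] = now

    def failed_hosts(self, now: float) -> list[str]:
        """Return hosts whose most recent heartbeat is older than the timeout."""
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.record("host-a", now=0.0)
monitor.record("host-b", now=0.0)
monitor.record("host-a", now=10.0)  # host-b sends nothing after t=0
print(monitor.failed_hosts(now=20.0))  # ['host-b']
```

In this toy model, a host is treated as failed only once its silence exceeds the timeout, which is why host-a, last heard from 10 seconds earlier, is still considered healthy.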
Alternatively, a host may continue to run but experience a problem rendering it incapable of sending or receiving heartbeat communications -- such as a network disruption. In this case, the administrator can preselect preferences for whether VMs on the affected host should continue running or whether those VMs should be shut down and restarted on other hosts.
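As a rough illustration of that preselected preference, the sketch below models the two choices as a setting and returns the action planned for each VM on an isolated host. The `IsolationResponse` enum, its member names and the `plan_for_isolated_host` function are hypothetical and do not correspond to actual VMware settings or APIs.

```python
from enum import Enum


class IsolationResponse(Enum):
    """Hypothetical names for the two preferences described above."""

    LEAVE_RUNNING = "leave_running"
    SHUT_DOWN_AND_RESTART = "shut_down_and_restart"


def plan_for_isolated_host(vms, response):
    """Return (vm, action) pairs for a host that is running but isolated."""
    if response is IsolationResponse.LEAVE_RUNNING:
        return [(vm, "keep running on isolated host") for vm in vms]
    return [(vm, "shut down, then restart on another host") for vm in vms]


for vm, action in plan_for_isolated_host(
    ["web-01", "db-01"], IsolationResponse.SHUT_DOWN_AND_RESTART
):
    print(f"{vm}: {action}")
```

The point of the sketch is that the decision is made ahead of time by the administrator, not at the moment the network disruption occurs.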
How does VMware HA restart VMs that were running on another host? VMware HA works only if VM files are kept on shared storage, such as a SAN. Because every host in the cluster can access the VM files on this shared storage, any of them can restart the VM.
A VMware HA cluster can consist of just two physical hosts, but this configuration doesn't allow for much flexibility or resilience. Larger HA clusters allow for a more efficient use of resources -- the spare capacity maintained on each host can be smaller -- and are capable of surviving multiple host failures. In any case, administrators should ensure that they leave enough spare capacity on each host to accommodate the largest VM running in the cluster.
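One way to sanity-check a spare-capacity plan is to simulate each host failing in turn and see whether its VMs fit on the survivors. The sketch below does this for memory only, using a simple first-fit-decreasing placement. It is an illustrative model, not how VMware HA itself evaluates capacity; the function names, dictionary keys, host names and sizes are all assumptions made for the example.

```python
def can_restart_after_failure(hosts, failed):
    """Try to place the failed host's VMs on the surviving hosts.

    Uses first-fit-decreasing, a heuristic: it may report False in rare
    cases where a cleverer placement would have succeeded.
    """
    spare = {
        name: h["capacity_gb"] - sum(h["vms_gb"])
        for name, h in hosts.items()
        if name != failed
    }
    for vm in sorted(hosts[failed]["vms_gb"], reverse=True):
        target = next((n for n, free in spare.items() if free >= vm), None)
        if target is None:
            return False  # no surviving host has room for this VM
        spare[target] -= vm
    return True


def tolerates_any_single_failure(hosts):
    """True if the VMs of any one failed host fit on the remaining hosts."""
    return all(can_restart_after_failure(hosts, name) for name in hosts)


# Hypothetical clusters: memory capacity and per-VM memory, in GB.
three_hosts = {
    "host-a": {"capacity_gb": 64, "vms_gb": [16, 16, 8]},
    "host-b": {"capacity_gb": 64, "vms_gb": [16, 16, 8]},
    "host-c": {"capacity_gb": 64, "vms_gb": [16, 16, 8]},
}
two_hosts = {
    "host-a": {"capacity_gb": 64, "vms_gb": [32, 16]},
    "host-b": {"capacity_gb": 64, "vms_gb": [32, 16]},
}
print(tolerates_any_single_failure(three_hosts))  # True
print(tolerates_any_single_failure(two_hosts))    # False
```

In the two-host example, each host has only 16 GB free, so the 32 GB VM from a failed host has nowhere to restart. The three-host cluster spreads the same kind of load across two survivors and passes the check.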
VMware HA vs. Fault Tolerance
It's also important to remember the limitations of VMware High Availability and what it is not designed to do. VMware HA is not designed to eliminate downtime; it's designed to reduce downtime.
When a server fails or becomes disconnected from the cluster, its heartbeat communication stops, but the HA utility will wait 15 seconds, by default, before initiating the VM restart process. It also takes several seconds, or even minutes, for the VM to boot up on the new host -- all time during which the application is unavailable.
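A back-of-the-envelope estimate makes the point concrete. The sketch below simply adds the default 15-second wait to a VM's boot time; the function name and the 90-second boot time are assumptions chosen for illustration, and real recovery times will vary with the workload.

```python
def estimated_downtime_s(boot_time_s, detection_delay_s=15.0):
    """Rough downtime estimate: wait before restart plus VM boot time."""
    return detection_delay_s + boot_time_s


# Assumed example: a VM that takes 90 seconds to boot on the new host.
print(estimated_downtime_s(90.0))  # 105.0
```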
For workloads that require higher levels of availability, VMware Fault Tolerance can be an option. Fault Tolerance maintains a redundant copy of the VM running on a different host in the cluster. When the primary VM fails, the secondary copy of the VM takes over with no downtime.