Server cluster failure considerations to maintain a fault-tolerant data center

Server clustering can help you achieve a fault-tolerant data center, but cluster failures can cause problems if your heartbeat monitoring system is on the same network as your production workload. An expert discusses the problems he encountered with VMware High Availability (HA) heartbeat monitoring and VM host failures and how his company decided to solve the problem.

One look at any professional data center underscores the old maxim "Don't put all your eggs in one basket." Everything is geared for N+1 redundancy, whether it's redundant power supplies, failover networking equipment, multipathed storage area network (SAN) fabrics or RAID storage. This push toward fault tolerance resulted in extremely fault-tolerant but expensive servers that could not only swap out power supplies and disks, but also CPUs, without any downtime.

Eventually many data center administrators realized that clustering servers was a great way to not only provide fault tolerance but also reduce costs, since commodity server hardware could be used. On top of the cost savings, a server cluster was much easier to scale horizontally. If you needed more resources, all you had to do was spin up another server and add it to the cluster. Despite the fault-tolerance benefits clusters can bring, the heartbeat monitoring system meant to protect your cluster can wreak havoc on your data center if you aren't aware of how it is configured.

Clustering in the data center

There are two ways you can relate the adage "Don't put all your eggs in one basket" to your data center. You may believe that spreading your resources across multiple servers is putting your eggs in multiple baskets. But if you step back far enough to view your entire data center as a single basket, you see the need to implement disaster recovery strategies that put, at minimum, one duplicate set of your eggs in another basket.

These days, clustering is a major consideration in the data center. Most major Web-based operations have some kind of Web or database cluster, and even cloud computing is a result of massive server clusters. More advanced virtualization technologies, such as VMware, also make use of clustering -- the virtualization hosts act as appliances as virtual machines (VMs) migrate back and forth across members of the cluster. If you need more resources for your VMs, all you need to do is add another member to the cluster. If you need to perform repairs on a server, it's easy to take it out of the cluster without any downtime for your virtual machines.

Cluster failure and heartbeat monitoring

Clusters are great when they work, but there are some fundamental aspects of clustering technology that can bite you if you aren't careful. A poorly configured cluster can still have single points of failure. Most clusters implement some form of heartbeat monitoring so that the cluster knows when a member is unavailable, and so that each member can tell the difference between being cut off itself and the entire cluster experiencing downtime.

Many problems can arise if a host attempts to stay with a cluster when it is offline, so you will find that many clusters will "shoot the server in the head," or forcibly remove it from the cluster when there is a consensus that the host is unavailable. For instance, OCFS2 clusters (Oracle's clustering file system) will actually force a kernel panic on a machine when it doesn't respond to the heartbeat fast enough.
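To make the idea of "consensus that the host is unavailable" concrete, here is a minimal sketch of heartbeat tracking with a majority vote. It is not how OCFS2 or VMware HA implement fencing; the class name, the method names and the five-second timeout are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Tracks the last heartbeat each member has received from each peer.

    Hypothetical illustration only; not taken from any real cluster stack.
    """
    timeout_s: float  # assumed threshold; real clusters tune this value
    # last_seen[observer][target] = time the observer last heard from target
    last_seen: dict = field(default_factory=dict)

    def record(self, observer: str, target: str, now: float) -> None:
        """Note that `observer` received a heartbeat from `target` at `now`."""
        self.last_seen.setdefault(observer, {})[target] = now

    def votes_unavailable(self, target: str, now: float) -> int:
        """Count observers that have not heard from `target` within the timeout."""
        votes = 0
        for observer, seen in self.last_seen.items():
            if observer == target:
                continue  # a member does not vote on itself
            last = seen.get(target)
            if last is None or now - last > self.timeout_s:
                votes += 1
        return votes

    def should_fence(self, target: str, now: float) -> bool:
        """Fence only when a strict majority of the other members agree."""
        observers = [o for o in self.last_seen if o != target]
        if not observers:
            return False
        return self.votes_unavailable(target, now) * 2 > len(observers)


tracker = HeartbeatTracker(timeout_s=5.0)
members = ["a", "b", "c"]
for observer in members:
    for target in members:
        if observer != target:
            tracker.record(observer, target, now=0.0)

# Ten seconds later, "a" and "b" still hear each other; "c" has gone silent.
tracker.record("a", "b", now=10.0)
tracker.record("b", "a", now=10.0)

print(tracker.should_fence("c", now=12.0))
print(tracker.should_fence("a", now=12.0))
```

In this sketch the silent member "c" also holds a stale view of "a" and "b", but its single vote is not a majority, so it cannot get a healthy member fenced.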

This heartbeat monitoring can be the weak link if your cluster isn't configured correctly. A number of clusters implement the heartbeat over the network, and many best practices advocate placing heartbeat monitoring on a private network separate from your production traffic, in some cases going as far as crossover cables between hosts. That way, if a host has saturated its main network connection or there is an issue on the network tier, the hosts don't trigger their failover scripts unnecessarily.

Unfortunately, not all data centers can have a completely separate network for heartbeats, and this is where my personal story of cluster failure comes in. I have already mentioned VMware High Availability (HA) clustering technology, but what I failed to mention is that until recently in ESX, unless you changed the default clustering settings, the network became a single point of failure for the cluster.

The cluster can check that a host is available over the network and over the SAN. Each host must periodically renew its file lock on the VMs it's managing. Combined with network monitoring, this can provide a good sense of whether a host has crashed. Unfortunately, the default HA settings had the cluster fence off a host and fail over its VMs once it lost network connectivity, whether or not it still had SAN connectivity.
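The difference between the two policies can be sketched in a few lines. This is an illustration of the reasoning above, not VMware's implementation: the function names, the state names and both timeout values are assumptions made up for the example.

```python
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"    # network heartbeat seen recently
    ISOLATED = "isolated"  # network heartbeat lost, storage lock still renewed
    FAILED = "failed"      # neither signal seen within its timeout


def classify_host(now: float,
                  last_network_heartbeat: float,
                  last_lock_renewal: float,
                  network_timeout_s: float = 15.0,   # hypothetical value
                  lock_timeout_s: float = 30.0       # hypothetical value
                  ) -> HostState:
    """Combine the network and SAN signals into a single host state.

    A host that is reachable on the network but has stopped renewing its
    locks is out of scope for this sketch and is reported as HEALTHY.
    """
    network_alive = now - last_network_heartbeat <= network_timeout_s
    storage_alive = now - last_lock_renewal <= lock_timeout_s
    if network_alive:
        return HostState.HEALTHY
    if storage_alive:
        return HostState.ISOLATED
    return HostState.FAILED


def should_fail_over(state: HostState, network_only: bool) -> bool:
    """Decide whether to fail over a host's VMs.

    network_only=True models a policy that reacts to lost network
    connectivity alone; network_only=False also requires the storage
    signal to be gone before acting.
    """
    if network_only:
        return state is not HostState.HEALTHY
    return state is HostState.FAILED


# A host whose network heartbeat stopped 50 seconds ago but which renewed
# its storage lock 10 seconds ago.
state = classify_host(now=100.0, last_network_heartbeat=50.0, last_lock_renewal=90.0)
print(state)
print(should_fail_over(state, network_only=True))
print(should_fail_over(state, network_only=False))
```

Under the network-only policy the host in the example is fenced even though its VMs are evidently still running against the SAN; under the combined policy it is left alone.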

I became all too aware of this problem on more than one occasion. In my case, while we had redundant network interfaces for each VMware host, we didn't set apart a separate network for heartbeats. I remember one of the first times we were bitten by this. One weekend, the networking team had scheduled substantial maintenance on the network tier involving firmware upgrades and, in some cases, rather involved changes to the network configuration. We weren't too concerned, as we did have redundancy for our VMware hosts so that as long as only one switch was taken down at a time, the environment was supposed to stay up.

Once the maintenance was complete, the VMware hypervisor was unable to communicate with the bulk of the VMware hosts. As far as it knew, a major part of the cluster was down, so it dutifully powered down all the VMs on those hosts, migrated them to the available hosts and powered them back on. This works well when you have lost one or two hosts, but not so well when you have lost most of them. The result was two or three VMware hosts that were crammed to the top with all of our VMs. Of course these hosts were completely loaded, so even the VMs that were running weren't running well. Among the VMs that weren't running well were our LDAP and DNS servers, and when they stopped responding, the rest of the machines that relied on them slowed down even more. Long story short, it took hours even after the network was sorted out to get all of the VMs back to a functioning state.
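One way to see why failover stops helping once most hosts are gone is to check whether the survivors can actually hold the displaced VMs. The sketch below uses a simple first-fit-decreasing heuristic on memory alone; the function name and the figures are hypothetical, and a real scheduler weighs far more than memory. Because it is a heuristic, a None result means "this simple packing found no fit," not a proof that none exists.

```python
from typing import Dict, List, Optional


def place_vms(free_gb: List[float], vm_gb: List[float]) -> Optional[Dict[int, int]]:
    """Try to place every VM on a surviving host, largest VM first.

    free_gb: free memory on each surviving host (hypothetical figures).
    vm_gb:   memory needed by each VM that must be restarted.
    Returns {vm_index: host_index}, or None if some VM does not fit.
    """
    remaining = list(free_gb)
    placement: Dict[int, int] = {}
    # Largest VMs first, so big ones are not stranded by earlier small ones.
    order = sorted(range(len(vm_gb)), key=lambda i: vm_gb[i], reverse=True)
    for vm in order:
        for host, free in enumerate(remaining):
            if vm_gb[vm] <= free:
                remaining[host] -= vm_gb[vm]
                placement[vm] = host
                break
        else:
            return None  # no surviving host has room for this VM
    return placement


# Losing one host: three 8 GB VMs fit on two hosts with 16 GB free each.
print(place_vms([16.0, 16.0], [8.0, 8.0, 8.0]) is not None)

# Losing most hosts: five 8 GB VMs cannot fit in the same 32 GB.
print(place_vms([16.0, 16.0], [8.0] * 5) is None)
```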

In addition to that large failure, we had experienced enough close calls that whenever major maintenance was imminent, we disabled the HA feature on our cluster. We ultimately disabled the feature altogether after another major outage caused by a network disruption -- we had more downtime as a result of this feature than we had ever gained by it being turned on.


Our reaction was certainly extreme. After all, clustering technologies work well for many companies. For my company's needs, it was more of a cool feature than a necessary one, and the bottom line is that you should never let yourself develop a false sense of security in your fault-tolerant data center. It's hard to track down every single point of failure and almost impossible to predict the millions of failure combinations that could take down your previously rock-solid design. Fault tolerance is a lot like security -- it isn't the flaws you know about, but the flaws you don't that cut a hole in the bottom of your data center basket.

About the author:
Kyle Rankin is a systems administrator in the San Francisco Bay Area and the author of a number of books including Knoppix Hacks and Ubuntu Hacks for O'Reilly Media.
