3 ways to react to an IT system failure avalanche

In complex IT infrastructures, a system or resource failure can spur follow-up issues. Compare three approaches to prevent these dreaded domino effects and reach the root of the problem.

Whether their focus is on the network, cloud or data center, many IT operations personnel share a common fear: the fault avalanche.

When a core resource fails, it often triggers a range of secondary IT system failures in the components that depend on it. These secondary failures can themselves cause more failures, resulting in an explosion of fault reports that makes it difficult for IT operations personnel to respond, or even to understand what's happening.
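To make the domino effect concrete, the cascade can be modeled as a walk over a dependency graph: every component downstream of the failed resource raises its own fault report. The component names and dependency map below are illustrative assumptions, not taken from any particular monitoring product:

```python
from collections import deque

# Hypothetical dependency map (all names are invented for illustration):
# each key maps to the components that depend on it.
DEPENDENTS = {
    "san-array-1": ["hypervisor-a", "hypervisor-b"],
    "hypervisor-a": ["vm-web-1", "vm-web-2"],
    "hypervisor-b": ["vm-db-1"],
    "vm-web-1": ["checkout-service"],
    "vm-web-2": ["checkout-service"],
    "vm-db-1": ["checkout-service", "reporting-service"],
}

def cascade(root: str) -> list[str]:
    """Breadth-first walk from the failed root; every component reached
    represents a secondary fault report an operator would see."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in DEPENDENTS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order

print(cascade("san-array-1"))
# one storage failure produces eight fault reports in this toy graph
```

Even in this tiny example, a single storage fault generates alerts from two hypervisors, three VMs and two application services, which is the "explosion of fault reports" operators must then untangle.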

Virtualization multiplies these fault avalanches in two ways. First, virtualization is, by nature, a resource-sharing strategy. A single server hosts many VMs and/or containers, each of which represents an independent hosting point. One failure can cause a dozen more, which can then trigger application failures that cause dozens more failures in IT systems. In fact, almost any failure of a physical resource in virtualized infrastructure will create a cascade of faults.

Second, it's more complicated for IT admins to identify and fix issues with virtualized infrastructure than with physical infrastructure. Without a one-to-one relationship between machine and application, admins must isolate the actual problem, and that alone can require considerable exploration into how virtual resources map to physical ones. That exercise is difficult enough -- but when admins also have to reallocate virtual workloads to new physical resources to get around the original IT system failure, they can create changes in load that affect applications and services that were never part of the original problem. Often, addressing one fault inevitably creates new ones.

IT support teams can take three approaches to best respond to, and prevent, an avalanche of IT system failures. Fault correlation is a form of root-cause analysis that links a group of related problems to their common cause. Deployment and lifecycle modeling codifies the virtual-to-physical resource mappings associated with an application or service deployment. And capacity planning and adaptive resources is a model in which IT teams ignore application and service problems, except as a trigger to examine physical resource conditions. Each approach has benefits and limitations.

Fault correlation

Correlation is the traditional approach to deal with IT fault avalanches. It's a statistical concept that works on the assumption that when multiple faults occur within a short period, there's a reason for that synchronicity. IT admins can examine the timing of faults, along with other details, such as their locations, to correlate groups of faults and infer a likely root cause. Multiple layers of virtualization in the IT stack make it increasingly difficult to discern the relationships among faults, so for heavily virtualized IT environments, it's best to look to other options.
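As a minimal sketch of the statistical idea, alerts that arrive within a short window of one another can be grouped, with the earliest alert in each group treated as the root-cause candidate. The alert format, source names and 30-second window here are assumptions chosen for illustration:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    timestamp: float   # seconds since some epoch
    source: str        # component that raised the alert

def correlate(alerts: list[Alert], window: float = 30.0) -> list[list[Alert]]:
    """Group alerts whose timestamps fall within `window` seconds of the
    previous alert in the group; the first alert in each group is the
    likely root cause, the rest are likely secondary faults."""
    groups: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups

alerts = [
    Alert(100.0, "san-array-1"),   # primary failure
    Alert(104.2, "hypervisor-a"),  # secondary
    Alert(109.8, "vm-web-1"),      # secondary
    Alert(600.0, "switch-3"),      # unrelated, much later event
]
for group in correlate(alerts):
    print("root-cause candidate:", group[0].source)
```

Real correlation engines also weigh location, topology and alert severity, but the time-window grouping above captures the core assumption: synchronicity implies a shared cause.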

Deployment and lifecycle modeling

An application is a set of coordinated components that deploy based on a set of rules. Network services and other aspects of IT operations are similar in concept. IT teams can record the rules and policies that govern deployment, as well as the actual way in which resources are committed, as models. Those models can provide guidance on how to analyze faults and apply corrective measures -- often to the point where admins can identify the real problem, and be sure to avoid it in new deployments.

DevOps, NetOps, container orchestration and OASIS' Topology and Orchestration Specification for Cloud Applications are all examples of this model-driven approach. When teams properly design and maintain IT operations models, those models provide a reliable way to quickly isolate problems and separate secondary alerts from root causes. They also enable teams to rebuild the application or network service.
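A deployment model can separate secondary alerts from root causes because it records what the alerting components have in common. The sketch below assumes a hypothetical model that maps each application component to the virtual and physical resources it was deployed onto; intersecting the resource sets of all alerting components surfaces the shared dependency:

```python
# Hypothetical deployment model (names invented for illustration):
# each component is recorded with the resources it was deployed onto.
MODEL = {
    "checkout-service":  {"vm": "vm-web-1", "host": "hypervisor-a"},
    "cart-service":      {"vm": "vm-web-2", "host": "hypervisor-a"},
    "reporting-service": {"vm": "vm-db-1",  "host": "hypervisor-b"},
}

def shared_resources(alerting: list[str]) -> set[str]:
    """Intersect the resource sets of all alerting components; whatever
    remains is a shared dependency and therefore a root-cause candidate."""
    sets = [set(MODEL[c].values()) for c in alerting]
    return set.intersection(*sets) if sets else set()

print(shared_resources(["checkout-service", "cart-service"]))
# both services ride on hypervisor-a, so it is the candidate
```

An empty intersection is also informative: it tells admins the alerting components share no modeled resource, so the faults are probably independent.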

This entire process falls under a single mission -- lifecycle management -- that must be supported by a tool or a combination of tools. It's not always easy to properly design and maintain these complex models, or to identify the best tools to support them. Today, organizations use this approach primarily for virtualization and cloud computing, where deployment always requires the administrator to set up specific conditions.

Capacity planning and adaptive resources

The final strategy to address a domino effect in IT system failures essentially cuts through all this correlation and analysis. The organization creates a pool of resources -- such as server farms -- based on capacity planning for a given application or applications. Available resources are aligned to expected load and quality of service. IT admins then focus on ensuring resources operate according to the capacity plan, and only consider faults in physical resources. If an application or service fails, teams simply redeploy it onto free physical resources, and assume the redeployment will fix the problem because the capacity plan leaves enough headroom to absorb the change.

This model can extend further if the resource pool is self-adapting, meaning automated processes bypass and correct faults. In IP networks, for example, dynamic route discovery can direct applications' traffic around congestion or component failures without any intervention -- IT teams don't have to fix services or applications or do routing manually. Infrastructure-as-code frameworks aim for this self-healing capability as well.

This approach -- fix the resources, not the failure -- has another benefit: rather than dig through correlations between virtual and physical resources to narrow down a root cause, IT admins turn their attention only to the physical resources themselves.

A combination of policy-based models and flexible capacity deployment is likely the best way to address an avalanche of IT system failures in modern IT operations. Large organizations with expansive server farms or cloud budgets might embrace the capacity planning and adaptive resources approach, while those with more constrained resources will favor the model-driven one. Whichever approach a team chooses, careful planning and thorough management must be ongoing activities -- otherwise, even the best fault remediation techniques can get buried by negative events.
