The root cause analysis process needs all IT hands on deck

What caused the problem and where? To find out, IT will need more than one team. Take one of these three approaches to collaborative RCA for a clearer picture.

Tom Nolle, Andover Intel

Published: 07 Jun 2021

The IT world and the wide world share one critical truth: Bad decisions are often made because good, complete data is lacking.

In fault management generally -- and in root cause analysis particularly -- that problem is often caused by the separation of responsibilities found in most enterprises. Most users report that their organizations don't coordinate RCA activities effectively among teams, yet they agree that such coordination is helpful, even essential.

There are two reasons to coordinate the root cause analysis process:

Where separate teams are involved, there are often separate sets of data collected and analyzed for each team. As a result, no one has a complete picture of the situation.
Problem determination and remediation steps, if undertaken by independent teams, can collide, which creates its own set of problems.

Does this mean that fault analysis associated with RCA must be conducted by a multidisciplinary team? Not necessarily -- and, in fact, that might not be the most efficient option. Users who have successfully confronted the challenge of collecting data for RCA report that there are three possible approaches, one of which is likely to suit your own organization.

1. Collect data centrally, rather than by team

With this approach, the complete raw data must be available to all teams. Most organizations that coordinate their RCA efforts use this strategy.

When any team conducts an analysis, it has access to all events that might be related to the problem at hand, including both security events and the results of development testing. This enforces a single source of truth for operations and fault data. Limited data sharing among teams leads to blind spots in root cause analysis; central collection brings more perspectives and knowledge sets to bear on the same data, and it can eliminate redundant data collection and storage overhead.

But there are two problems with this approach: First, it requires teams to assess raw data associated with responsibilities outside their own experience and skill sets. For example, can IT operations assess security information, or would it be wiser to leave that to specialists?

Second, having data universally available might encourage parallel analysis and even correction of the same problem set, which is wasteful at best and destructive at worst.

2. Centralize fault analysis results

The goal of this second approach is to fix these problems. Here, IT teams share fault analysis results -- rather than the raw data shared in the first method. With this approach, teams continue to do specialized data collection and result analysis, but they create a shared repository of the results, and can share raw data more broadly, if necessary.

The most effective basis for this sharing and cooperation is the teams' trouble-ticket systems. Trouble tickets contain problem reports, analysis results and remediation steps, which culminate in a status, such as resolved. IT teams must adjust their trouble-ticket procedures to require that analysis results reference the specific raw data sources that contributed to them. This creates a link from the trouble ticket to the underlying performance statistics. Thus, team analysis and raw data are associated and shared in context.
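As a rough illustration of that link, the sketch below models a trouble ticket that cannot be marked resolved until its analysis cites at least one raw data source. The class names, fields and sample values are hypothetical and are not drawn from any particular ticketing product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSourceRef:
    """Pointer to the raw data an analysis relied on (all fields are hypothetical)."""
    team: str    # team that owns the data, e.g. "network"
    source: str  # name of the log or metrics store
    start: str   # ISO 8601 start of the relevant time range
    end: str     # ISO 8601 end of the relevant time range


@dataclass
class TroubleTicket:
    ticket_id: str
    owner_team: str
    problem: str
    analysis: str = ""
    remediation: str = ""
    status: str = "open"
    data_refs: List[DataSourceRef] = field(default_factory=list)

    def resolve(self, analysis: str, remediation: str) -> None:
        """Mark the ticket resolved, but only if the analysis cites its raw data."""
        if not self.data_refs:
            raise ValueError("analysis must reference at least one raw data source")
        self.analysis = analysis
        self.remediation = remediation
        self.status = "resolved"


# Example: the network team records where its evidence lives before resolving.
ticket = TroubleTicket("T-1001", "network", "Intermittent packet loss to app tier")
ticket.data_refs.append(
    DataSourceRef("network", "switch-interface-counters",
                  "2021-06-01T09:00:00+00:00", "2021-06-01T10:00:00+00:00")
)
ticket.resolve("CRC errors on uplink", "Replaced faulty optic")
```

Enforcing the reference at resolution time is only one possible design; a team could just as reasonably require it when the analysis is first recorded.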

3. Form a dedicated team -- or workflow

The biggest challenge with this approach is ensuring that all teams involved in the trouble-ticket process cooperate cohesively. To start this transition smoothly, select a point person from each team to access the trouble tickets assigned to other teams for coordinated analysis. With this approach, the team that raises a trouble report is responsible for reviewing related material from other teams.

There are alternative ways to implement this approach. One IT admin might review the trouble reports of other teams regularly and initiate a coordinated response, for example. Or, create one cohesive team to generate a trouble report. This team then determines whether another team must review said report and sends it to a designated person if so. Finally, a company can designate a dedicated review team to look at all new problem reports and coordinate access to related team-specific reports and, through them, the supporting raw data.

The first two approaches depend on at least one team recognizing a given problem and then initiating a trouble report and analysis. The third broad option for RCA coordination bypasses this condition and relies instead on statistical -- or even AI algorithm-supported -- review of the raw data, based on inputs from each team to help identify conditions to flag.
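A minimal sketch of the statistical variant follows, assuming each team supplies a baseline of readings it considers normal for its own metrics. The metric names, baseline values and the three-standard-deviation threshold are all illustrative assumptions, and a real deployment would likely use something more robust than a simple z-score.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Baseline readings each team considers normal (hypothetical metrics and values).
TEAM_BASELINES: Dict[str, List[float]] = {
    "network.latency_ms": [20, 22, 19, 21, 20, 23, 18, 21],
    "app.error_rate_pct": [0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.5, 0.4],
}


def flag_readings(
    readings: List[Tuple[str, float]],
    baselines: Dict[str, List[float]] = TEAM_BASELINES,
    threshold: float = 3.0,
) -> List[Tuple[str, float]]:
    """Return the (metric, value) readings that sit far outside the team's baseline."""
    flagged = []
    for metric, value in readings:
        baseline = baselines.get(metric)
        if baseline is None or len(baseline) < 2:
            continue  # no team input for this metric, so nothing to compare against
        sigma = stdev(baseline)
        if sigma == 0:
            continue  # a flat baseline gives no usable scale for deviation
        z_score = (value - mean(baseline)) / sigma
        if abs(z_score) > threshold:
            flagged.append((metric, value))
    return flagged


# Example: only the latency spike is far enough from its baseline to be flagged.
suspicious = flag_readings([
    ("network.latency_ms", 45.0),
    ("network.latency_ms", 21.0),
    ("app.error_rate_pct", 0.5),
])
```

Comparing new readings against a separate baseline, rather than against the same batch they belong to, keeps a large spike from inflating the standard deviation and hiding itself.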

Sometimes it's easier to train AI than to retrain a support staff, much less multiple independent teams. But even without AI assistance, IT teams can analyze all the sources of performance data to facilitate root cause analysis. The key requirement is to synchronize timestamps across those sources, so that related issues line up in time and can be analyzed in the proper context.
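One way to picture that synchronization is sketched below: each team's log entries are converted to UTC, merged into a single timeline, and grouped whenever consecutive events fall within a short window of each other. The log format, team names and the 60-second window are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

Event = Tuple[datetime, str, str]  # (UTC timestamp, team, message)


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp lacks a UTC offset: {timestamp}")
    return parsed.astimezone(timezone.utc)


def correlate(
    team_logs: Dict[str, List[Tuple[str, str]]],
    window: timedelta = timedelta(seconds=60),
) -> List[List[Event]]:
    """Merge per-team logs into one timeline and group events that occur close together."""
    timeline: List[Event] = sorted(
        (to_utc(ts), team, message)
        for team, entries in team_logs.items()
        for ts, message in entries
    )
    groups: List[List[Event]] = []
    for event in timeline:
        # Extend the current group if this event follows the previous one closely enough.
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


# Example: the apps team logs in local time (UTC-4), yet its event lands next to
# the network event once both are expressed in UTC.
logs = {
    "network": [("2021-06-07T09:00:05+00:00", "uplink flapping")],
    "apps": [("2021-06-07T05:00:30-04:00", "checkout timeouts")],
    "security": [("2021-06-07T11:15:00+00:00", "failed login burst")],
}
correlated_log = correlate(logs)
```

Rejecting timestamps without an offset is a deliberate choice in this sketch: guessing a time zone would silently misalign events, which defeats the purpose of correlating them.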

IT can review the resulting correlated log regularly to uncover issues that have gone unnoticed or, when a problem occurs, to relate conditions from all teams involved. This strategy works in conjunction with the other two approaches, and together they give your organization the best team coordination it can achieve for the root cause analysis process.

