What is root cause analysis?

By

Alexander S. Gillis, Technical Writer and Editor
Robert Sheldon

Published: Mar 04, 2025

Root cause analysis (RCA) is a method for understanding the underlying cause of an observed or experienced incident. It examines the incident's causal factors, focusing on why, how and when they occurred. An organization often initiates an RCA to get at the principal source of a problem to ensure it doesn't happen again.

Root cause analysis is a step beyond problem-solving, which focuses on taking corrective action when an incident occurs. In contrast, an RCA gets at a problem's root cause. When a system breaks or changes, investigators should perform an RCA to fully understand the incident and what caused it. Because of this added clarity, RCA is commonly used in areas like IT operations, manufacturing, healthcare, accident analysis and risk management.

An example of a fishbone diagram. In this diagram, the problem is defined in the head of the fishbone shape, and its causes and effects are splayed out behind it.

In some cases, an RCA is used to better understand why a system is operating in a certain way or is outperforming comparable systems. For the most part, however, the focus is on problems -- especially when they affect critical systems. An RCA identifies all factors that contribute to the problem, connecting events in a meaningful way so that the issue can be properly addressed and prevented from reoccurring. Only by getting to the root of the problem, rather than focusing on the symptoms, is it possible to identify how, when and why the problem occurred.

Problems that warrant an RCA can result from human error, malfunctioning physical systems, issues with business processes or operations, or other reasons.

For example, investigators might launch an RCA when machinery fails in a manufacturing plant, an airplane makes an emergency landing, or a web application experiences a service disruption. Any anomaly can potentially necessitate an RCA.

The goals of root cause analysis

The primary purpose of root cause analysis is to reduce risk to the overall organization. The information discovered in this process can be used to enhance a system's reliability. The main goals of an RCA are threefold:

Identify exactly what has been occurring, going beyond just the symptoms to get to the actual sequence of events and primary causes.
Understand what it will take to address the incident or apply what has been learned from that incident, while considering its causal factors.
Apply what has been learned to prevent the problem from recurring or, when the outcome was positive, to duplicate the underlying conditions.

When an RCA achieves these goals, it can offer several benefits to a wide range of industries. When used effectively, root cause analysis can help improve medical treatments, reduce on-the-job injuries, deliver better application performance, optimize infrastructure uptime, minimize machinery maintenance, provide safer transportation and benefit various other systems and processes.

Root cause analysis principles

Root cause analysis is flexible enough to accommodate different types of industries and individual circumstances. Yet beneath this flexibility, the following four important principles are essential to making RCA work:

1. Learn why, how and when the incident occurred. These questions work together to provide a complete picture of the underlying causes. For example, it can be difficult to know why an event occurred if you don't know how or when it happened. Investigators must uncover an incident's full magnitude and all the key ingredients that made it happen. This process includes gathering, organizing and analyzing any potentially related information.

2. Focus on the underlying causes, not the symptoms. Addressing only the symptoms when a problem arises rarely prevents that problem from recurring and can waste time and resources. An RCA effort should instead focus on the relationships between events and the incident's underlying root causes. Ultimately, this can reduce the time and resources spent on resolving issues and ensure a viable remedy over the long term. Remember that a problem might have multiple root causes, all of which need to be identified. Investigators must also remain unbiased throughout.

3. Think about prevention when using RCA to solve problems. To be effective, an RCA effort must address a problem's root causes, but that's not enough. It must also enable resolutions that prevent the problem from recurring. If the RCA doesn't help fix the problem and prevent it from happening again, much of the effort will have been wasted.

4. Do it right the first time. An RCA is only as successful as the effort behind it. A poorly executed RCA can waste time and resources. It might even make the situation worse, forcing investigators to start over. An effective root cause analysis must be carried out carefully and systematically. It requires the proper methods and tools, as well as leadership that understands what the effort involves and fully supports it. Reviews can be scheduled afterward to determine how effective specific corrective actions were.

Root cause analysis methods

One of the most popular methods for root cause analysis is the Five Whys. This approach defines the problem and then asks "why" questions for each answer. The idea is to keep digging until you uncover reasons that explain the "why" of what happened. The number five in the methodology's name is just a guide, as it might take fewer or more questions to get to the root causes of the initially defined problem.

Another popular approach to RCA is to create a cause-and-effect Ishikawa diagram, or fishbone diagram, where the problem is defined in the head of the fishbone shape, and its causes and effects are splayed out behind it. Possible causes are grouped into categories that connect to the spine, providing an overall view of the causes that might have led to the incident.
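A fishbone diagram can also be captured as plain data before anyone draws it. The following Python sketch is only an illustration of that idea: the problem statement, the category names and the causes are hypothetical placeholders, not details from a real incident.

```python
# Hypothetical fishbone (Ishikawa) diagram captured as plain data.
# The head is the problem; each "bone" is a category holding possible causes.
fishbone = {
    "problem": "Checkout page times out",  # hypothetical problem statement
    "categories": {
        "People": ["On-call engineer unfamiliar with the service"],
        "Process": ["No load test before release", "Change not peer reviewed"],
        "Technology": ["Database connection pool too small"],
        "Environment": [],  # no candidate causes identified yet
    },
}


def render_fishbone(diagram):
    """Return a text outline: the problem first, then each category and its causes."""
    lines = [f"Problem: {diagram['problem']}"]
    for category, causes in diagram["categories"].items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


def count_causes(diagram):
    """Count the candidate causes across all categories."""
    return sum(len(causes) for causes in diagram["categories"].values())


print(render_fishbone(fishbone))
print("Candidate causes:", count_causes(fishbone))
```

An empty category, such as "Environment" here, is worth keeping in the structure because it shows the team considered that area and found nothing yet.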

The following methodologies are also available to investigators when conducting a root cause analysis:

Failure mode and effects analysis. FMEA identifies different ways a system can potentially fail and then analyzes the possible effects of each failure.
Fault tree analysis. FTA provides a visual mapping of causal relationships that uses Boolean logic to determine a failure's potential causes or to test a system's reliability.
Pareto chart. This combination bar chart and line chart maps out the frequency of the most common root causes of problems, listed from left to right, starting with the most frequent.
Change analysis. This type of analysis considers how conditions surrounding the incident have changed over time, which can play a direct role in bringing about the incident.
Scatter chart. This type of diagram plots data on a two-dimensional chart with an x-axis and y-axis to uncover relationships in the data as they pertain to an incident's potential causes.
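To make the Boolean logic behind fault tree analysis more concrete, here is a minimal Python sketch. It assumes a tree built only from AND and OR gates, and every event name in it is hypothetical. Real FTA tools also handle probabilities and other gate types, which this sketch leaves out.

```python
# Minimal fault tree: leaves are basic events (True = the event occurred),
# inner nodes are AND/OR gates. All event names are hypothetical.

def evaluate(node, events):
    """Return True if the failure represented by `node` occurs."""
    if isinstance(node, str):          # basic event: look up its state
        return events[node]
    gate, children = node              # gate node: ("AND" | "OR", [children])
    results = [evaluate(child, events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"Unknown gate: {gate}")


# Top event: the service is down if both the primary and the standby fail,
# or if the shared network link fails.
service_down = ("OR", [
    ("AND", ["primary_server_failed", "standby_server_failed"]),
    "network_link_failed",
])

observed = {
    "primary_server_failed": True,
    "standby_server_failed": False,
    "network_link_failed": False,
}

# With only the primary server failed, the AND gate is not satisfied.
print(evaluate(service_down, observed))
```

Changing one entry in `observed` at a time shows which combinations of basic events are enough to trigger the top event.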

Several other approaches are also used for RCA. Professionals who focus on root cause analysis and seek continuous improvement in reliability should understand multiple methods and use the appropriate one for a given scenario. Some other examples include barrier analysis and Kepner-Tregoe analysis.
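As one illustration of the charting methods listed above, the numbers behind a Pareto chart can be computed in a few lines of Python. This is a sketch under stated assumptions: the incident labels are invented sample data, and `pareto_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical root-cause labels taken from past incident reports.
incident_causes = [
    "config error", "config error", "config error", "config error",
    "expired certificate", "expired certificate", "expired certificate",
    "disk full", "disk full",
    "hardware fault",
]


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, most frequent first."""
    counts = Counter(causes).most_common()   # sorted by descending count
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for cause, count in counts:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


for cause, count, cumulative in pareto_table(incident_causes):
    print(f"{cause:<20} {count:>3} {cumulative:>6}%")
```

The counts correspond to the bars of the chart and the cumulative percentages to its line, which makes it easy to see how few causes account for most incidents.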

Successful root cause analysis also depends on good communication within the group and staff involved in a system. Debriefing after an RCA -- often called a post-mortem -- helps ensure the key players understand the time frames of causal or related factors, their effects and the resolution methods used. Post-mortem information sharing can also lead to brainstorming around other areas that might need investigation and who should look into what areas.

How to conduct a root cause analysis

Performing a root cause analysis can be a complex undertaking that requires both time and resources. A team that's carrying out an RCA should take a systematic approach that's built on open communication and careful planning. Although there's no single approach to an RCA process, a team should consider starting with the following five basic steps:

1. Define the problem. It might seem obvious, but the first step should be to identify the problem as concisely as possible to ensure all RCA participants understand the scale and scope of the issue they're trying to address. This process includes the following:

Create a clearly defined problem statement.
Identify the specific symptoms surrounding the problem.
Document the effects of the problem on the target system as well as peripheral and supporting systems.
Ensure all key players understand and agree on the nature of the problem.
If there are multiple problems, deal with them one at a time.

2. Collect all relevant data. Investigators require whatever data is necessary to ensure they have the evidence they need to understand the full extent of the incident and the time frame in which it occurred. This process includes the following:

Data gathering should be a methodical process that's carefully documented and verified.
Investigators need access to all relevant evidence related to the incident without exception.
The data should include any information specific to the incident itself and any suspected causes.
The collected data should cover the entire applicable time frame, including data from before and after the incident.
The data should include details about any special circumstances or environmental factors that might have contributed to the incident.

3. Identify and map events. Investigators should be able to understand and track all events that contributed to the incident and how those events can be correlated. This step includes the following:

The RCA team should identify the sequence of events and the timeline in which they occurred.
The team should also determine the conditions under which the events occurred.
Events should be correlated to determine what links might exist between them.
The collected data should be examined for any causal factors that contributed to the events or that are somehow related to the events.
Any other factors that could have contributed to the incident should be examined.
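The event-mapping step can be sketched in Python as ordering the collected events on one timeline and then narrowing to those that occurred shortly before the incident. The events, timestamps, one-hour window and function names below are all hypothetical, chosen only to illustrate the idea.

```python
from datetime import datetime, timedelta

# Hypothetical events gathered from logs, change records and alerts.
events = [
    {"time": datetime(2025, 1, 15, 10, 42), "source": "alerts", "what": "Error rate above threshold"},
    {"time": datetime(2025, 1, 15, 10, 15), "source": "changes", "what": "Config change deployed"},
    {"time": datetime(2025, 1, 15, 8, 5), "source": "logs", "what": "Nightly backup finished"},
    {"time": datetime(2025, 1, 15, 10, 40), "source": "logs", "what": "Connection pool exhausted"},
]


def build_timeline(records):
    """Return the events ordered from earliest to latest."""
    return sorted(records, key=lambda record: record["time"])


def events_before(records, incident_time, window):
    """Return events that happened within `window` before the incident."""
    start = incident_time - window
    return [r for r in build_timeline(records) if start <= r["time"] < incident_time]


incident_time = datetime(2025, 1, 15, 10, 42)   # when the alert fired
candidates = events_before(events, incident_time, timedelta(hours=1))
for event in candidates:
    print(event["time"].strftime("%H:%M"), event["source"], event["what"])
```

Closeness in time only flags an event as worth examining. Whether it actually contributed to the incident still has to be established from the evidence.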

4. Identify the root cause. After collecting the data and mapping events, investigators should start identifying the incident's root causes and working toward a resolution. This process includes the following:

Investigators must analyze all contributing factors and relevant data.
From their analysis, investigators should identify any potential root causes that seem feasible within the given circumstances.
Investigators should carefully analyze each potential root cause, eliminating those least viable and digging deeper into those most likely to have contributed to the incident.
Multiple causes might have contributed to the incident, and they all need to be identified and analyzed.
After identifying the real root causes, investigators should try to confirm their validity by simulating the circumstances that led to the incident, when and where this is practical.

5. Implement an action plan. After identifying the incident's root causes, investigators should develop an action plan to address the root problem and prevent it from occurring again. This step includes the following:

The resolution should reflect the problem statement created in the first step.
Investigators should carefully outline what needs to be done and what it will take to get it done, including the potential effects on individuals or operating environments.
The RCA team, with the help of other individuals, should provide a strategy for implementing the resolution, considering such factors as timelines, budgets and specific roles.
Investigators should identify any potential roadblocks to implementing the fix.
After the remedy has been deployed, the RCA team should carefully monitor and evaluate its implementation to ensure it has effectively addressed the underlying issues.

When performing root cause analysis, investigators should use the methods and tools most appropriate for their situation. They should also implement a system for verifying each stage of the RCA effort to make sure every step is done correctly. As part of this process, investigators should carefully document each phase, starting with the problem statement and continuing to the resolution's implementation.

Benefits and drawbacks of root cause analysis

Conducting an RCA can offer numerous advantages, including the following:

Identifies the leading causes of an issue. Identifying the root cause of an issue helps solve its immediate symptoms and uncovers the underlying reasons behind a problem's occurrence.
Enhances prevention. Once the root cause of a problem is identified, teams can implement a permanent fix and templates to avoid the same issue occurring in the future.
Develops a general approach to solve core issues. RCA provides a structured framework that organizations can use repeatably with different problems.
Improves problem-solving skills. RCA encourages increased critical thinking among teams, helping them troubleshoot and solve issues more efficiently.
Encourages optimization. RCA helps optimize systems, processes or operations by providing insights into underlying issues and roadblocks.
Provides higher-quality services. It helps deliver higher-quality customer and client services by addressing issues more efficiently and thoroughly.
Helps improve communication. It leads to better in-house communication and collaboration, along with improved knowledge of the underlying systems.
Reduces costs. It lowers costs by getting to the root of the problem sooner rather than continuously treating the symptoms.

Although RCA is an important process, it does have the following limitations:

There might be more than one root cause for an issue. Some problems might have multiple main contributing factors, making it difficult to identify one root cause. This can make the RCA process more complicated.
RCA is mostly used to investigate problems. As a result, it is often overlooked when an organization wants to analyze why one system is performing better than expected.
Time and complexity. Depending on the problem, conducting an effective RCA can be time- and resource-intensive -- especially for issues that are inherently more complex and extensive.
Risk of laying blame. If RCA is not conducted carefully, it can lead to a culture of blaming specific teams or employees instead of providing a constructive problem-solving approach.

Tools for root cause analysis

Root cause analysis is a process that pairs human deduction with data gathering and reporting tools. IT teams often turn to the platforms they're already using for application performance monitoring, infrastructure performance monitoring or systems management -- including cloud management tools -- for the background data they need to carry out the RCA.

Many of these products also include features built into their platforms to help analyze root causes. In addition, some vendors offer tools that collect and correlate the metrics from other platforms to help remediate a problem or outage event. Tools that include AIOps capabilities can learn from prior events to suggest remediation actions in the future.

In addition to monitoring and analysis tools, IT organizations often rely on external sources to help with their root cause analysis. For example, IT team members might participate in Stack Overflow discussions to get others' expertise on topics related to their RCA. Other examples of root cause analysis tools include TapRoot and EasyRCA.

Root cause analysis examples

Root cause analysis is used by a range of industries and in various situations, making it a highly valuable tool flexible enough to accommodate specific circumstances. The following are examples of RCA in action, but the possibilities for its use are nearly limitless.

Example 1. An email service disruption. Users couldn't send or receive email messages for two hours, and the boss wants to know what happened. The IT team is tasked with carrying out a root cause analysis.

The team begins by defining a problem statement and collecting relevant data. Next, they use the Five Whys method to uncover the contributing events and underlying causes as follows:

Why did emails stop working? Because mail flow stopped.
Why did mail flow stop? Because someone installed patches during the day.
Why were the patches deployed during the day? Because the admin did not follow the rules in IT's processes to patch after business hours.
Why did this cause a two-hour outage? Because a patch disabled a service, and it took that long during the chaos to troubleshoot and resolve the outage.

The answers to the "why" questions outline what happened and what went wrong. From this information, the IT team can improve patching procedures and prevent this same situation from happening again.
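A chain like this is easy to record as structured data so that it can be reviewed or pasted into a report. The Python sketch below is one possible way to do that. The questions and answers are adapted from the example above, while the `format_chain` helper and its output format are assumptions made for illustration.

```python
# Question-and-answer pairs adapted from the email outage example.
five_whys = [
    ("Why did emails stop working?", "Mail flow stopped."),
    ("Why did mail flow stop?", "Someone installed patches during the day."),
    ("Why were the patches deployed during the day?",
     "The admin did not follow IT's rule to patch after business hours."),
    ("Why did this cause a two-hour outage?",
     "A patch disabled a service, and troubleshooting took that long."),
]


def format_chain(chain):
    """Number each why/because pair so the chain reads as a short report."""
    return [
        f"{number}. {question} Because: {answer}"
        for number, (question, answer) in enumerate(chain, start=1)
    ]


for line in format_chain(five_whys):
    print(line)
```

Keeping each answer next to its question makes it easier to spot a step where the "because" does not actually follow from the previous one.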

Example 2. A drop in mobile app active users. A popular mobile app's number of active users has steadily dropped over the past two weeks, and several teams within the organization are scrambling to understand what happened. Individuals from each of these teams are working together to conduct an RCA.

After gathering the necessary data, the RCA team generates a fishbone diagram like the one in Figure 1 to better understand possible causes and their effects.

Basic steps of RCA. Although there are numerous approaches to root cause analysis, a team should start with these five basic steps.

The diagram helps them identify all the potential root causes. They can then drill into each one to determine its viability. For example, they can use data generated by their monitoring software to verify whether there have been any issues with infrastructure performance or the back-end systems.
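One way to check a candidate cause against monitoring data is to test whether the metric in question moves together with the drop in users. The Python sketch below computes a Pearson correlation by hand. The daily figures are invented for illustration, and the metric names are hypothetical.

```python
from math import sqrt

# Hypothetical daily figures for the two weeks under investigation.
daily_active_users = [980, 965, 950, 940, 925, 910, 900, 880, 870, 860, 845, 830, 820, 800]
api_latency_ms = [210, 205, 215, 208, 212, 209, 211, 207, 214, 206, 213, 210, 208, 212]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(daily_active_users, api_latency_ms)
print(f"Correlation between active users and latency: {r:.2f}")
```

A coefficient close to zero would suggest that latency is an unlikely explanation for the decline, while a strong correlation would justify a closer look. Correlation alone does not establish a cause in either case.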

After analyzing each potential root cause, the RCA team determines that the most likely cause was the recent release of a similar app by a top competitor. The app was well marketed, included cutting-edge technology and integrated with several third-party services.

From this information, the team develops a strategy for accelerating the next update of their application to provide a competitive edge over the other app. They also communicate this information with the marketing and customer support teams so that they're prepared for the next release.

For the root cause analysis process to be effective, an organization must coordinate its RCA activities among its various teams.
