Root cause analysis is a way to determine how a problematic event occurred by examining, after the fact, why, how and when the causal factors happened. When a system breaks or changes, an investigation should be performed to fully understand the event. Root cause analysis goes a step beyond problem solving, which is the corrective action taken when the event occurs.
The purpose of root cause analysis is to reduce risk to the overall organization. The information discovered in this process can inform how teams enhance systems' reliability. The main areas that benefit from the information discovered in root cause analysis are:
- Process improvement;
- Configuration changes;
- System improvements; and
- Staff training and knowledge improvement.
A feedback loop from the troubleshooters to the operators lets a company learn which events caused a problem and how to prevent it from recurring, where prevention is possible.
Root cause analysis methods
The most popular method of root cause analysis is known as the 5 Whys: Define the problem, then ask 'why' of each answer in turn. Keep digging until the answers explain the underlying reasons for what happened. The number five in the methodology's name is just a guide, as it may take fewer or more 'why' questions to reach answers that show the root cause of the originally defined problem.
Beyond the 5 Whys method, another popular approach to root cause analysis is to create a cause and effect diagram, also known as a fishbone diagram, where the problem is defined in the head of the fishbone shape, and the possible causes branch out from the spine behind it. Possible causes are grouped into categories, which all connect to the spine of the fishbone and give an overall view of which areas contributed which problems to the event. There are several other root cause analysis methodologies, and professionals who focus on root cause analysis and reliability improvement should understand multiple methods so they can use the appropriate one for a given scenario.
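As a rough illustration of the fishbone structure described above, the sketch below stores one problem and groups possible causes by category. The class name, the problem statement and the categories are all hypothetical, chosen only for illustration; a real diagram would use whatever categories fit the system under investigation.

```python
from collections import defaultdict


class Fishbone:
    """A cause and effect (fishbone) diagram: one problem, causes grouped by category."""

    def __init__(self, problem):
        self.problem = problem            # the "head" of the fishbone
        self.causes = defaultdict(list)   # category -> list of possible causes

    def add_cause(self, category, cause):
        """Attach a possible cause to a category branch on the spine."""
        self.causes[category].append(cause)

    def render(self):
        """Return a plain-text outline, with categories in alphabetical order."""
        lines = [f"Problem: {self.problem}"]
        for category in sorted(self.causes):
            lines.append(f"  {category}")
            for cause in self.causes[category]:
                lines.append(f"    - {cause}")
        return "\n".join(lines)


# Hypothetical example data, not taken from a real incident.
diagram = Fishbone("Checkout page times out")
diagram.add_cause("Process", "No load test before release")
diagram.add_cause("Configuration", "Connection pool limit too low")
diagram.add_cause("People", "On-call engineer unfamiliar with the service")
diagram.add_cause("Configuration", "Cache disabled in production")
print(diagram.render())
```

Grouping by category makes it easy to see at a glance which area has accumulated the most candidate causes, which is the overall view the diagram is meant to provide.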
Successful root cause analysis depends on good communication among the group and staff involved in a system. Debriefing after an event has occurred -- often called a post-mortem -- can cover the already-known information about the event so that everyone involved knows the time frames of causal or related factors, their impact and the resolution methods used. Post-mortem information sharing can lead to brainstorming about what needs to be investigated to find the root cause, and who should look into which area.
Tools for root cause analysis
Root cause analysis is a process of human deduction paired with reporting tools. In IT organizations, application performance monitoring, infrastructure performance monitoring, systems management and cloud management tools collect data to inform root cause analysis. Some vendors also offer tools that collect and correlate the metrics from these various tools to suggest paths to remediate a problem or outage event. Tools that learn from prior events to suggest remediation actions in the future fall into the AIOps category.
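To make the idea of correlating data from several tools concrete, here is a minimal sketch that merges timestamped events from different sources and lists those that occurred shortly before an outage began. It is not how any particular vendor's product works; the function name, the 30-minute window, the source labels and the sample events are all assumptions made for illustration. Closeness in time only suggests where to look first and does not prove causation.

```python
from datetime import datetime, timedelta


def events_near(events, outage_start, window_minutes=30):
    """Return events within `window_minutes` before the outage start, newest first.

    `events` is an iterable of (timestamp, source, message) tuples.
    """
    window = timedelta(minutes=window_minutes)
    candidates = [
        (ts, source, message)
        for ts, source, message in events
        if outage_start - window <= ts <= outage_start
    ]
    return sorted(candidates, key=lambda event: event[0], reverse=True)


# Hypothetical events collected from different tools.
events = [
    (datetime(2024, 1, 10, 9, 0), "deploy", "Config push to web tier"),
    (datetime(2024, 1, 10, 13, 50), "patching", "Patches installed on mail server"),
    (datetime(2024, 1, 10, 13, 58), "monitoring", "Mail queue depth rising"),
    (datetime(2024, 1, 10, 15, 0), "monitoring", "Mail queue back to normal"),
]
outage_start = datetime(2024, 1, 10, 14, 0)

suspects = events_near(events, outage_start)
for ts, source, message in suspects:
    print(f"{ts:%H:%M} [{source}] {message}")
```

The output is a short list of candidate events for a person to review, which matches the pairing of tool-collected data with human deduction described above.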
In addition to monitoring and analysis tools, IT organizations rely on external sources for outage information. For example, an IT team might check Twitter to stay up-to-date on a cloud provider outage, or discuss a problem in a community Slack channel to get others' expertise on the root cause.
Root cause analysis example
Users couldn't send or receive email messages for two hours, and the boss wants to know what happened. The IT team is tasked with root cause analysis.
Using the 5 Whys method, they approach the issue:
Why did emails stop working? Because mail flow stopped.
Why did mail flow stop? Because someone installed patches during the day.
Why did this cause a two-hour outage? Because a patch disabled a service and it took that long during the chaos to troubleshoot and resolve the outage.
Why did the patch deploy during the day? Because the admin did not follow the rules in IT's processes to patch after business hours.
The answers to the 'why' questions give an outline of what happened and what went wrong. From this basis, the IT team can take action to improve the procedure for patches, and prevent this same situation from happening in the future.
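The question-and-answer chain above can also be captured as data, which some teams may find handy for post-mortem records. The sketch below is one possible representation, not a standard format; the Why class and the root_cause and render helpers are hypothetical names. It treats the last answer in the chain as the candidate root cause, which is an assumption that a reviewer should still confirm.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5 Whys chain."""
    question: str
    answer: str


def root_cause(chain):
    """Treat the final answer in the chain as the candidate root cause."""
    if not chain:
        raise ValueError("chain is empty")
    return chain[-1].answer


def render(chain):
    """Return the chain as numbered lines for a post-mortem record."""
    return "\n".join(
        f"{i}. {why.question} {why.answer}" for i, why in enumerate(chain, start=1)
    )


# The email outage example, which needed only four 'why' questions.
chain = [
    Why("Why did emails stop working?", "Because mail flow stopped."),
    Why("Why did mail flow stop?", "Because someone installed patches during the day."),
    Why(
        "Why did this cause a two-hour outage?",
        "Because a patch disabled a service and it took that long to troubleshoot.",
    ),
    Why(
        "Why did the patch deploy during the day?",
        "Because the admin did not follow the process to patch after business hours.",
    ),
]

print(render(chain))
print("Candidate root cause:", root_cause(chain))
```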