Incident response metrics help cybersecurity professionals and corporate leadership assess their organizations' ability to deal with cybersecurity incidents effectively, quickly and responsibly. And, in cases where response efforts fail to meet the mark, these figures help pinpoint what needs to change.
Why are metrics important in incident response?
If an organization only ever saw a couple of isolated cyberattacks, tracking these KPIs wouldn't be useful. But, for most enterprises, security incidents are ongoing and increasing in size and impact every year.
To that end, the typical company needs a way to monitor and evaluate outcomes. Based on the KPIs, are incident response efforts getting faster, more effective and more efficient? Or, at the very least, are they holding steady? If not, then it's likely time to revise the incident response plan.
After making any substantive changes, take the following steps to assess their effectiveness:
- Put the updates to the test in tabletop incident response drills.
- Continue to track performance in future security incidents, and compare the results against baseline benchmarks.
9 key incident response metrics
Organizations can monitor a plethora of incident response metrics to measure how effectively they respond to security incidents, depending on available resources and data. At minimum, however, consider tracking the following important metrics.
When it comes to incident response, speed is clearly of the essence. The faster the security team can contain a threat, the less damage the threat can do. On the other hand, a relatively minor incident can become a major one if left unchecked. Speed metrics are inarguably critical in measuring the effectiveness of incident response.
- Mean time to contain (MTTC)
- MTTC is the average time it takes to contain a security threat -- that is, prevent it from doing any further harm, communicating with controllers or spreading itself any further.
- Of all the incident response metrics, MTTC is the most important. It measures the time period required to stop the damage from getting worse. It encompasses all the actions an organization must take to repel an attack -- from detecting the fact that something is happening and diagnosing what is going on to knowing what has to be done in response and taking the necessary actions to contain the threat.
Full recovery from any damage done may take additional time and effort, but the essence of incident response is getting the situation under control.
- Three components underpin MTTC: time to detect, time to identify and time to respond, all explained in detail below. The smaller the MTTC, the better. Organizations should track the median MTTC across incidents and try to make sure it declines over time.
- Mean time to detect (MTTD)
- MTTD is the average amount of time it takes to realize there is an incident to respond to. In most cases, it has to be worked out after the fact. Once it is clear something is going on, some research is required to backtrack through security logs and system information to determine when it started.
- Incident detection and MTTD are linchpins of incident response, as the length of time it takes to detect a threat is a critical component of overall MTTC. An organization can't respond to an incident if it does not know one has occurred.
- A lower MTTD is better. Again, organizations should also track the median over time -- it should be declining both generally and, ideally, for each separate type of security incident.
- Mean time to identify (MTTI)
- MTTI measures how long it takes to diagnose an attack after initial detection. This includes identification of what the incident is and what to do about it, at least in broad terms.
- MTTI is a crucial benchmark in recording the responsiveness of the organization's cybersecurity team and processes.
- A lower MTTI is better. Again, organizations should also track the median over time -- it should be declining generally and, ideally, for each separate kind of incident as well.
- Mean time to respond (MTTR)
- MTTR is a measurement of the incident response time, or the time it takes to act on the knowledge of what an incident is and how to contain it. For example, suppose that, while identifying the problem, incident responders discover that blocking certain IP addresses and ports stops a threat from spreading. MTTR is the time it takes to plan firewall, router and switch configuration changes and to execute on that plan.
- MTTR represents the part where the organization ends the active threat, clearing the way for recovery from the damage. As such, it is a crucial measure of the organization's ability to protect itself.
- A lower MTTR is better. Again, organizations should also track the median over time -- it should be declining generally and, ideally, for each separate kind of incident as well.
- Mean time to normal (MTTN)
- MTTN is also known as mean time to restore or resolve -- another MTTR acronym, confusingly. It measures how long the organization takes to fix anything that was broken as a result of the now-contained threat. For example, the incident response team might need to reimage affected systems or restore corrupt files from backups.
- Also known as resolution time, this measures the organization's ability to resolve disruptions and get back to business as usual in serving end users.
- As with the prior yardsticks, a lower MTTN is better, and the organization should try to get the median to trend downward over time.
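As a rough illustration of how the speed metrics above relate to one another, the following Python sketch derives them from per-incident timestamps. The `Incident` record, its field names and the sample data are hypothetical, and the choice to measure MTTN from the moment of containment is an assumption; adapt both to whatever the organization's ticketing or SIEM data actually records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class Incident:
    # Hypothetical field names; map them to your own incident records.
    started: datetime     # when the attack began (usually worked out after the fact)
    detected: datetime    # when the team realized there was an incident
    identified: datetime  # when the incident was diagnosed
    contained: datetime   # when the threat was stopped
    restored: datetime    # when normal operations resumed


def hours(delta: timedelta) -> float:
    """Convert a time difference to hours."""
    return delta.total_seconds() / 3600


def speed_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Average each phase across incidents; also report the median time to contain."""
    detect = [hours(i.detected - i.started) for i in incidents]
    identify = [hours(i.identified - i.detected) for i in incidents]
    respond = [hours(i.contained - i.identified) for i in incidents]
    contain = [hours(i.contained - i.started) for i in incidents]
    # Assumption: time to normal is counted from containment, not from the start.
    normal = [hours(i.restored - i.contained) for i in incidents]
    return {
        "MTTD": mean(detect),
        "MTTI": mean(identify),
        "MTTR": mean(respond),
        "MTTC": mean(contain),
        "MTTN": mean(normal),
        "median_TTC": median(contain),
    }


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 4),
             datetime(2024, 1, 1, 6), datetime(2024, 1, 1, 9),
             datetime(2024, 1, 2, 9)),
    Incident(datetime(2024, 2, 1, 0), datetime(2024, 2, 1, 2),
             datetime(2024, 2, 1, 3), datetime(2024, 2, 1, 4),
             datetime(2024, 2, 1, 16)),
]
metrics = speed_metrics(incidents)
print(metrics)
```

Because each incident's time to contain is the sum of its detect, identify and respond phases, MTTC in this sketch equals MTTD + MTTI + MTTR. Grouping incidents by type before calling `speed_metrics` would give the per-type trend lines described above.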
Speed is not the only yardstick. Another set of incident response metrics hinges on the permanence, or durability, of the resolution. For example, it's great if the organization can detect and remove malware from a compromised host once it has begun launching lateral attacks on uncompromised ones. It's even better if the organization also identifies through root cause analysis (RCA) the security vulnerability that led to the original compromise and fixes it, whether through patching, configuration changes, firewall modifications or other corrective actions.
Failing to address and measure the response's effectiveness can lead to situations where MTTC is low and getting lower, yet the same compromises keep happening again and again.
Consider the following effectiveness metrics.
- Percentage of incidents undergoing RCA
- RCA can be a significant amount of work, but it is usually work that pays off by preventing future security incidents and the need for subsequent response efforts.
- This is the best way to decrease incidents of a specific type -- by removing the conditions that enable a particular incident and making it impossible for that incident to recur.
- A higher figure is better here. The organization wants to understand the root causes of as many incidents as possible and head them off before they reoccur, shrinking the overall risk surface.
- Percentage of prescribed fixes completed on time
- This measures how much of the activity required to prevent a recurrence of a particular security incident happens on schedule.
- Knowing how to fix something is quite different from doing it. The ability to follow through and fix a root problem is a core component of an effective long-term response beyond the heat of the immediate incident.
- A higher figure is better. The better the organization is at following through on preventive measures, the lower the risk it faces.
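The two effectiveness percentages can be computed from simple post-incident records. The sketch below is one possible approach, not a standard formula: the `IncidentReview` and `Fix` types are hypothetical, and it assumes that a fix counts as on time only if it was completed on or before its due date, and that fixes not yet due are left out of the calculation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Fix:
    # Hypothetical record of one prescribed corrective action.
    due: date
    completed: Optional[date] = None  # None means the fix is still open


@dataclass
class IncidentReview:
    # Hypothetical post-incident record.
    rca_done: bool
    fixes: list[Fix] = field(default_factory=list)


def pct_incidents_with_rca(reviews: list[IncidentReview]) -> float:
    """Share of incidents that underwent root cause analysis, as a percentage."""
    if not reviews:
        return 0.0
    return 100 * sum(r.rca_done for r in reviews) / len(reviews)


def pct_fixes_on_time(reviews: list[IncidentReview], today: date) -> float:
    """Share of fixes due by `today` that were completed by their due date."""
    # Assumption: fixes whose due date is still in the future are not counted.
    due_fixes = [f for r in reviews for f in r.fixes if f.due <= today]
    if not due_fixes:
        return 0.0
    on_time = sum(
        1 for f in due_fixes if f.completed is not None and f.completed <= f.due
    )
    return 100 * on_time / len(due_fixes)


# Made-up sample data for illustration only.
reviews = [
    IncidentReview(True, [Fix(date(2024, 3, 1), date(2024, 2, 20)),   # on time
                          Fix(date(2024, 3, 15), date(2024, 3, 20))]),  # late
    IncidentReview(False),
    IncidentReview(True, [Fix(date(2024, 4, 1))]),                    # not yet due
    IncidentReview(True, [Fix(date(2024, 3, 10), date(2024, 3, 10))]),  # on time
]
rca_pct = pct_incidents_with_rca(reviews)
on_time_pct = pct_fixes_on_time(reviews, today=date(2024, 3, 31))
print(rca_pct, round(on_time_pct, 1))
```

In this sample, three of four incidents had an RCA and two of the three fixes that had come due were finished on schedule.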
Finally, it is important to track how efficiently an organization responds to incidents. Resources, especially cybersecurity staff resources, are limited and usually oversubscribed. Let's examine a few important benchmarks and how they fit within the overall incident response process.
- Total cost of incident
- This is the total sum of a variety of factors, including the following:
- How much time did security operations spend on a particular incident?
- How much business did the organization lose or fail to transact as a result of the incident itself or the recovery process?
- What other resources went into the response -- e.g., new software, new hardware, new security services or third-party consulting services?
- What fines or penalties did the organization have to pay?
- An organization has no choice but to respond to security incidents, but it must be able to quantify its response costs. This lets the company assess, for example, whether outsourcing security services might be more cost-effective. In another scenario, the total cost of an incident might help identify a given business activity that invites security incidents and, ultimately, costs more money than it makes.
- A lower total cost of incident figure is better.
- Security staff time on incident
- This is a critical component of the total cost, as it records the degree of human intervention -- the most precious resource in cybersecurity -- required to achieve incident resolution.
- Recruiting and retaining cybersecurity staff are ongoing challenges. It's crucial to know how much of team members' time is going into incident response and how it is divided between containment and longer-term resolution and prevention.
- Try to get the overall amount of time security staff spends on incident response to trend downward, as activity shifts away from containment and toward prevention and the number of incidents decreases.
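To make the efficiency metrics concrete, here is a minimal sketch that adds up the cost factors listed above and splits staff time between containment and longer-term work. The `IncidentCost` type, its field names, the flat hourly rate and the figures are all assumptions for illustration; real cost accounting will likely be more detailed.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    # Hypothetical cost record for a single incident.
    containment_hours: float  # staff time spent containing the threat
    prevention_hours: float   # staff time spent on resolution and prevention
    hourly_rate: float        # assumed flat, fully loaded staff cost per hour
    lost_business: float      # revenue lost or not transacted
    other_resources: float    # new software, hardware, services, consulting
    fines: float              # fines or penalties paid

    @property
    def staff_hours(self) -> float:
        """Total security staff time on the incident."""
        return self.containment_hours + self.prevention_hours

    @property
    def total_cost(self) -> float:
        """Sum of staff cost, lost business, other resources and fines."""
        return (
            self.staff_hours * self.hourly_rate
            + self.lost_business
            + self.other_resources
            + self.fines
        )

    @property
    def containment_share(self) -> float:
        """Fraction of staff time that went to containment."""
        if self.staff_hours == 0:
            return 0.0
        return self.containment_hours / self.staff_hours


# Made-up figures for illustration only.
cost = IncidentCost(
    containment_hours=30,
    prevention_hours=10,
    hourly_rate=100,
    lost_business=20_000,
    other_resources=5_000,
    fines=0,
)
print(cost.staff_hours, cost.total_cost, cost.containment_share)
```

Tracking `containment_share` across incidents is one way to see whether activity is shifting away from containment and toward prevention.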
John Burke is CTO and principal research analyst with Nemertes Research. With nearly two decades of technology experience, he has worked at all levels of IT, including end-user support specialist, programmer, system administrator, database specialist, network administrator, network architect and systems architect. His focus areas include AI, cloud, networking, infrastructure, automation and cybersecurity.