10 important incident response metrics and how to use them
In incident response, security teams can improve their work by knowing how long it takes to respond to and remediate threats. These are the key metrics to track.
Incident response metrics help an organization assess its ability to deal with cybersecurity incidents effectively, quickly and responsibly. Where response efforts are inadequate, metrics can help cybersecurity teams and corporate leadership pinpoint what needs to change.
If an organization only ever experienced a couple of isolated cyberattacks, tracking these KPIs would be a wasted effort. For most enterprises, however, security incidents are ongoing and, for many, increasing in frequency and impact every year.
Faced with the continual need to respond, an organization needs ways to monitor and evaluate outcomes. Tracking useful metrics helps the organization determine whether incident response is getting faster, more effective and more efficient.
When metrics show that responses are not improving in all three ways, it's likely time to revise the incident response plan, upskill staff or upgrade the cybersecurity tool set. An organization that makes substantial changes to its response plan should put the updated plan to the test in tabletop incident response drills and adjust it if needed.
With a revised incident response plan in place, an organization should take the following steps to assess its effectiveness:
- Revise the relevant metrics as needed -- i.e., add or drop metrics.
- Adjust targets for metrics for the coming year.
- Continue to track metrics and their medians through future security incidents.
Key incident response metrics
Organizations can monitor a variety of response metrics to measure how effectively they respond to security incidents. What they can measure depends on the available resources and data. At minimum, every organization should try to track metrics that measure speed, effectiveness and efficiency.
Speed metrics
With cybersecurity incident response, speed is crucial. As bad actors have ramped up the use of AI and other automation in their operations, the lag time between breach of a network and exploitation of the breach has shrunk. Even something that starts as a relatively minor incident can become a major one if left unchecked for too long.
Mean time to contain (MTTC)
Of all the speed metrics, mean time to contain is the most important. It is just what the name says: the time it takes the organization to contain a security threat so that an active attack can do no further harm. Full recovery from any damage done could take additional time and effort; that, too, should be tracked, but separately. The essence of incident response is stopping further damage and gaining control of the situation.
Time to contain is the sum of the following components:
- Time to detect.
- Time to identify.
- Time to respond.
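The breakdown above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Incident` class, its four timestamp fields and the sample data are all assumptions made for the example, and a real incident tracker may record these moments differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical timestamps; field names are assumptions for this sketch.
    started: datetime     # when the underlying condition began
    detected: datetime    # when the team realized a response was required
    identified: datetime  # when the attack was diagnosed and a plan chosen
    contained: datetime   # when the active threat could do no further harm

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_identify(self) -> timedelta:
        return self.identified - self.detected

    def time_to_respond(self) -> timedelta:
        return self.contained - self.identified

    def time_to_contain(self) -> timedelta:
        # Sum of the three components; equal to contained - started.
        return (self.time_to_detect() + self.time_to_identify()
                + self.time_to_respond())


def mean_hours(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in hours."""
    return mean(d.total_seconds() for d in durations) / 3600


# Illustrative data only.
incidents = [
    Incident(datetime(2024, 3, 1, 8), datetime(2024, 3, 1, 10),
             datetime(2024, 3, 1, 11), datetime(2024, 3, 1, 14)),
    Incident(datetime(2024, 4, 10, 22), datetime(2024, 4, 11, 2),
             datetime(2024, 4, 11, 3), datetime(2024, 4, 11, 6)),
]

mttd_hours = mean_hours([i.time_to_detect() for i in incidents])
mtti_hours = mean_hours([i.time_to_identify() for i in incidents])
mttr_hours = mean_hours([i.time_to_respond() for i in incidents])
mttc_hours = mean_hours([i.time_to_contain() for i in incidents])

print(mttd_hours, mtti_hours, mttr_hours, mttc_hours)  # 3.0 1.0 3.0 7.0
```

Because each incident's time to contain is the sum of its three components, the mean time to contain equals the sum of the three component means when all four are computed over the same set of incidents.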
Mean time to detect (MTTD)
Incident detection is crucial to incident response. An organization can't respond to an incident if it does not know that one has occurred.
Mean time to detect is the time it takes for the organization to realize an incident requires a response. In most cases, this metric is worked out after the fact. Seeing clear evidence that something is happening is different from knowing when the underlying condition began. Organizations need to investigate, backtracking through logs and other data, to determine with certainty when the trouble started.
Organizations should track MTTD over time. It's a number that should decline generally and, ideally, for each separate type of security incident.
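One way to check whether MTTD is declining for each incident type is to group detection times by type and reporting period. The sketch below assumes quarterly reporting and a simple `(quarter, incident type, hours to detect)` record layout; both are illustrative choices, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean

# (quarter, incident type, hours to detect) -- illustrative values only.
records = [
    ("2024-Q1", "phishing", 10.0),
    ("2024-Q1", "phishing", 14.0),
    ("2024-Q1", "malware", 30.0),
    ("2024-Q2", "phishing", 8.0),
    ("2024-Q2", "malware", 36.0),
]


def mttd_by_type_and_period(rows):
    """Return {(incident type, quarter): mean hours to detect}."""
    buckets = defaultdict(list)
    for period, kind, hours in rows:
        buckets[(kind, period)].append(hours)
    return {key: mean(values) for key, values in buckets.items()}


def is_declining(mttd, kind):
    """True if MTTD for one incident type fell in every successive quarter."""
    # Quarter labels such as "2024-Q1" sort chronologically as strings.
    series = [value for (k, _period), value in sorted(mttd.items()) if k == kind]
    return all(later < earlier for earlier, later in zip(series, series[1:]))


mttd = mttd_by_type_and_period(records)
print(is_declining(mttd, "phishing"))  # True: 12.0 hours, then 8.0
print(is_declining(mttd, "malware"))   # False: 30.0 hours, then 36.0
```

A type-by-type view like this can reveal that an overall improvement hides a category where detection is getting slower.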
Mean time to identify (MTTI)
Mean time to identify is how long it takes to diagnose an attack after initial detection. This includes understanding what the incident is and determining what to do about it -- in broad terms, if not in deep detail.
MTTI is a crucial measurement of the responsiveness of the organization's cybersecurity team and processes. The faster the organization can determine what to do about an incident, the sooner it can proceed to an actual response. An organization should track its MTTI to measure its progress.
Mean time to respond (MTTR)
Mean time to respond is the time it takes the organization to end the active threat, clearing the way for full recovery. This is the span during which the organization acts on its knowledge of the incident and its decisions about how to contain that incident.
Imagine, for example, that while identifying a breach, incident responders discover that blocking certain IP addresses and network ports prevents a threat from spreading. MTTR would then be the length of time needed to plan and execute the changes to firewall, router and switch configurations necessary to implement those blocks, along with isolating already-infected nodes for further remediation.
Because it measures the agility of the actual response phase, MTTR is a crucial metric of the organization's ability to protect itself. A declining time to respond is an indication that a team is succeeding in its incident response work.
Mean time to normal (MTTN)
Mean time to normal, also known as mean time to restore or mean time to resolve, is the time it takes the organization to fix anything that was broken as a result of the now-contained threat. For example, the incident response team might need to reimage affected systems or restore corrupted files from backups.
MTTN measures the whole organization's ability to return to normal operations. Organizations should track median MTTN and strive to see it trend downward over time.
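The difference between a mean and a median matters when one incident takes far longer to recover from than the rest. The following sketch uses made-up recovery times to show how a single long recovery pulls the mean upward while the median stays close to the typical case.

```python
from statistics import mean, median

# Hours from containment to normal operations; illustrative values only.
hours_to_normal = [6.0, 8.0, 7.0, 9.0, 120.0]

mean_ttn = mean(hours_to_normal)
median_ttn = median(hours_to_normal)

print(mean_ttn)    # 30.0
print(median_ttn)  # 8.0
```

Tracking both values side by side is one option: the median shows the typical recovery, and a widening gap between the two signals that a few incidents are taking unusually long to resolve.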
Effectiveness metrics
Speed is not the only yardstick. Another set of incident response metrics hinges on the permanence, or durability, of the resolution. For example, it's great if the organization can detect and remove malware from a compromised host once it has begun launching lateral attacks. It's even better if the organization identifies through root cause analysis (RCA) the security vulnerability that led to the original compromise and fixes it, whether through patching, configuration changes, firewall modifications or other corrective actions.
Failing to address and measure the response's effectiveness can lead to situations where MTTC is low and getting lower, yet the same compromises occur repeatedly.
Consider the following effectiveness metrics.
Percentage of incidents undergoing RCA
RCA can be a significant amount of work, but modern AI-powered SIEM systems can speed up these efforts. RCA pays off by preventing future security incidents and the need for subsequent responses. This analysis is the best way to decrease incidents of a specific type -- by removing the conditions that make it possible for them to recur.
With the percentage of incidents undergoing RCA, a higher number is better. When an organization understands the root causes of as many incidents as possible, it reduces risk.
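As a rough sketch, the percentage can be computed per incident type as well as overall, which helps show where root causes are going unexamined. The record layout and the sample data below are assumptions for illustration.

```python
from collections import Counter

# (incident type, whether RCA was performed) -- illustrative values only.
incidents = [
    ("phishing", True),
    ("phishing", False),
    ("malware", True),
    ("malware", True),
]


def rca_percentage_by_type(rows):
    """Return {incident type: percentage of incidents that underwent RCA}."""
    totals, analyzed = Counter(), Counter()
    for kind, rca_done in rows:
        totals[kind] += 1
        if rca_done:
            analyzed[kind] += 1
    return {kind: 100.0 * analyzed[kind] / totals[kind] for kind in totals}


def rca_percentage_overall(rows):
    """Return the percentage of all incidents that underwent RCA."""
    if not rows:
        return 0.0
    return 100.0 * sum(1 for _kind, rca_done in rows if rca_done) / len(rows)


by_type = rca_percentage_by_type(incidents)
overall = rca_percentage_overall(incidents)
print(by_type)  # {'phishing': 50.0, 'malware': 100.0}
print(overall)  # 75.0
```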
Percentage of prescribed fixes completed on time
When a cybersecurity team identifies preventive measures that will reduce the threat surface, it is important to track how many of those actions are completed on schedule. Knowing how to fix something, after all, is not the same as fixing it. The ability to follow through and correct a root problem is a core competence for a cybersecurity organization and a key measurement of its response effectiveness. This makes the percentage of prescribed fixes completed on time a good complement to MTTC.
The better an organization is at following through on preventive measures, the lower the risk it faces.
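A minimal sketch of this metric follows. It assumes each prescribed fix carries a due date and, once finished, a completion date; the field names are hypothetical. Fixes that are not yet due are left out of the calculation, which is one reasonable convention among several.

```python
from datetime import date

# Hypothetical records: a completion date of None means the fix is still open.
fixes = [
    {"due": date(2024, 5, 1), "completed": date(2024, 4, 28)},
    {"due": date(2024, 5, 15), "completed": date(2024, 5, 20)},
    {"due": date(2024, 6, 1), "completed": None},
    {"due": date(2024, 6, 10), "completed": date(2024, 6, 10)},
]


def on_time_percentage(items, as_of):
    """Percentage of fixes due on or before as_of that were finished by their due date."""
    due = [f for f in items if f["due"] <= as_of]
    if not due:
        return None  # nothing has come due yet, so the metric is undefined
    on_time = sum(
        1 for f in due
        if f["completed"] is not None and f["completed"] <= f["due"]
    )
    return 100.0 * on_time / len(due)


pct_on_time = on_time_percentage(fixes, as_of=date(2024, 6, 30))
print(pct_on_time)  # 50.0
```

Counting open, overdue fixes as late, as this sketch does, keeps unfinished work from quietly dropping out of the metric.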
Efficiency metrics
It is important to track how efficiently an organization responds to incidents. Resources, especially cybersecurity staff resources, are limited and usually oversubscribed. Some key efficiency metrics follow.
Total cost of incident
To determine the total cost of an incident, calculate the sum of relevant cost factors, including the following:
- How much time did security operations staff spend on a particular incident?
- How much business did the organization lose or fail to transact because of the incident itself or the recovery process?
- What other resources went into the response -- e.g., did the organization need new hardware, software or licenses, or third-party consulting services?
- What fines or penalties did the organization pay?
An organization has no choice but to respond to security incidents, but it must be able to quantify its response costs. This lets it assess, for example, whether outsourcing incident response services might be more cost-effective than handling them in-house -- or vice versa. In another scenario, the total cost of an incident could help identify a given business activity that invites a lot of security incidents and, ultimately, costs so much to secure that there is too little profit or justification to continue it.
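The cost factors listed above can be added up in a simple structure like the one below. This is a sketch under stated assumptions: the `IncidentCost` class, its field names and the hourly rate used to convert staff time into money are all hypothetical, and the figures are invented.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    staff_hours: float      # security operations time spent on the incident
    hourly_rate: float      # assumed fully loaded cost of one staff hour
    lost_business: float    # revenue lost or not transacted
    other_resources: float  # hardware, software, licenses, consulting
    fines: float            # fines or penalties paid

    def staff_cost(self) -> float:
        return self.staff_hours * self.hourly_rate

    def total(self) -> float:
        return (self.staff_cost() + self.lost_business
                + self.other_resources + self.fines)


# Illustrative figures only.
cost = IncidentCost(staff_hours=40, hourly_rate=100.0,
                    lost_business=25_000.0, other_resources=6_000.0, fines=0.0)
print(cost.total())  # 35000.0
```

Keeping the components separate, rather than storing only the total, makes it easier to compare in-house and outsourced response or to see which factor dominates for a given business activity.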
Security staff time on incident
This is a critical component of the total cost because it records the degree of human intervention -- the most precious resource in cybersecurity -- required to achieve incident resolution.
Recruiting and retaining cybersecurity staff are ongoing challenges. It's crucial to know how much of team members' time goes into incident response and how that time is divided among containment, longer-term resolution and prevention.
Security staff time on incident response should ideally trend downward, as activity shifts away from containment and toward prevention.
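To see both the total and the shift in emphasis, staff hours can be logged by phase and summarized per period. The phase names, the quarterly grouping and the numbers in this sketch are illustrative assumptions.

```python
# Staff hours per quarter, split by phase; illustrative values only.
staff_hours = {
    "2024-Q1": {"containment": 120.0, "resolution": 60.0, "prevention": 20.0},
    "2024-Q2": {"containment": 80.0, "resolution": 40.0, "prevention": 40.0},
}


def summarize(hours_by_phase):
    """Return total hours and each phase's share of that total as a percentage."""
    total = sum(hours_by_phase.values())
    shares = {phase: 100.0 * hours / total
              for phase, hours in hours_by_phase.items()}
    return total, shares


summary = {quarter: summarize(phases) for quarter, phases in staff_hours.items()}

for quarter, (total, shares) in summary.items():
    print(quarter, total, shares["containment"], shares["prevention"])
# 2024-Q1 200.0 60.0 10.0
# 2024-Q2 160.0 50.0 25.0
```

In this made-up data, total hours fall from 200 to 160 while the prevention share rises from 10% to 25%, which is the pattern the metric is meant to reveal.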
Percentage of incidents contained without human intervention
With better, context-aware automation of detection, identification and containment, an organization should be able to reduce the amount of staff time consumed by incident response. A business experiencing this evolution should consider adding the percentage of incidents contained entirely by automation as a complementary metric. Because agentic AI seems certain to become part of enterprise security and incident response, tracking this metric will be important. Doing so will help a team understand not just the organization's security posture, but also the effectiveness of the AI and other automation technologies it has put into use.
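A minimal sketch of the calculation is shown here. It assumes each incident record notes who or what achieved containment; the `contained_by` field and its values are hypothetical labels chosen for the example.

```python
# Hypothetical records; "contained_by" is an assumed field, not a standard one.
incidents = [
    {"id": 1, "contained_by": "automation"},
    {"id": 2, "contained_by": "analyst"},
    {"id": 3, "contained_by": "automation"},
    {"id": 4, "contained_by": "analyst"},
    {"id": 5, "contained_by": "analyst"},
]


def automated_containment_percentage(rows):
    """Percentage of incidents contained with no human intervention."""
    if not rows:
        return 0.0
    automated = sum(1 for r in rows if r["contained_by"] == "automation")
    return 100.0 * automated / len(rows)


pct_automated = automated_containment_percentage(incidents)
print(pct_automated)  # 40.0
```

An incident counts here only when automation handled containment end to end; any case in which an analyst had to step in is recorded as human-contained.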
John Burke is CTO and a research analyst at Nemertes Research. Burke joined Nemertes in 2005 with nearly two decades of technology experience. He has worked at all levels of IT, including as an end-user support specialist, programmer, system administrator, database specialist, network administrator, network architect and systems architect.