IT incidents happen, but what matters is how they're handled. The modern IT environment has a wealth of moving parts, each of which must be orchestrated and managed both independently and as part of the greater whole. Key performance indicators (KPIs) are categorical metrics IT admins can use to measure both how well the environment performs and how effectively the operations team handles errors.
An incident management KPI gives the operations admins information to address an issue and prevent it from snowballing or happening again.
Use this list of common IT incident management KPIs to start tracking relevant metrics around outages, or compare it with the metrics you already have in place to find areas to improve.
Mean time to identify (MTTI): A problem can't be solved unless you know it is there. Also referred to as mean time to detect (MTTD), MTTI is the average amount of time it takes an IT admin or monitoring system to identify an issue, which makes it a vital incident management KPI. It indicates how efficiently, on average, a system or admin can locate the point of failure so that remediation can begin.
Mean time to repair (MTTR): MTTR is the average amount of time it takes to resolve an issue. As used here, this measurement includes the time it takes to identify the issue, so a slow detection step raises MTTR as well as MTTI. MTTR is also a gauge of system or component resiliency: The lower it is, the more efficiently that component or system can be brought back online.
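As a rough illustration of how MTTI and MTTR could be derived from an incident log, the sketch below averages the gaps between timestamps. The records and the field names `started`, `detected` and `resolved` are hypothetical, and MTTR is measured from the start of the failure so that it includes identification time, as described above.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log; the field names are assumptions for illustration.
incidents = [
    {"started": datetime(2024, 1, 3, 9, 0),
     "detected": datetime(2024, 1, 3, 9, 10),
     "resolved": datetime(2024, 1, 3, 10, 0)},
    {"started": datetime(2024, 1, 9, 14, 0),
     "detected": datetime(2024, 1, 9, 14, 20),
     "resolved": datetime(2024, 1, 9, 15, 30)},
    {"started": datetime(2024, 1, 20, 8, 0),
     "detected": datetime(2024, 1, 20, 8, 30),
     "resolved": datetime(2024, 1, 20, 9, 30)},
]


def mean_minutes(durations):
    """Average a list of timedelta values and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# MTTI: average gap between the start of a failure and its detection.
mtti = mean_minutes([i["detected"] - i["started"] for i in incidents])

# MTTR: average gap between the start of a failure and its resolution,
# so identification time is included.
mttr = mean_minutes([i["resolved"] - i["started"] for i in incidents])

print(f"MTTI: {mtti:.1f} min, MTTR: {mttr:.1f} min")
```

With this sample data, MTTI comes out to 20 minutes and MTTR to 80 minutes. Teams that define MTTR as starting at detection would subtract `detected` instead of `started`.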
Mean time between failures (MTBF): Sometimes referred to as incident recurrence rate, MTBF is the average amount of time a system or component stays operational between failures. It is an incident management KPI focused on the system, but it can be influenced by the IT operations team's ability to manage that system. No IT component is ever foolproof, insofar as it cannot be expected to withstand any possible permutation of activity without failure. The more time it can spend operational, however, the better off the environment -- and the environment's management team -- will be.
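One common way to estimate MTBF is to divide the operational time in an observation window by the number of failures in that window. The sketch below assumes a 30-day window and a hypothetical list of outage durations; the function name `mtbf_hours` is made up for this example.

```python
def mtbf_hours(period_hours, downtime_hours):
    """Return operational hours divided by the number of failures in the period.

    downtime_hours holds one entry per failure: the length of that outage.
    """
    failures = len(downtime_hours)
    if failures == 0:
        raise ValueError("MTBF is undefined for a period with no failures")
    operational_hours = period_hours - sum(downtime_hours)
    return operational_hours / failures


# Hypothetical data: three outages over a 30-day observation window.
outages = [2.0, 3.0, 1.0]
mtbf = mtbf_hours(30 * 24, outages)

print(f"MTBF: {mtbf:.1f} hours")
```

Here the component was operational for 714 of 720 hours and failed three times, which gives an MTBF of 238 hours.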
Post-mortem: Also called a retrospective, a post-mortem is not a specific metric but a process by which an IT organization evaluates a failure and its repair or recovery. This conversation should dig into the initial point of failure and its causes, along with the actions IT admins took to resolve the issue and how best to prevent the issue from recurring. The lack of well-run post-mortems in an IT organization could indicate problems with incident management.
First-time resolution (FTR) rate and escalation rate: Also referred to as first-call resolution (FCR) rate, the FTR rate is the percentage of issues that are corrected at first notice, without further work required. Conversely, the escalation rate indicates the percentage of issues that must be passed along to somebody with more troubleshooting capabilities, whether due to higher-level permissions or a greater knowledge base, to resolve. These two numbers are usually inversely correlated: As one rises, the other tends to fall. The better an organization's FTR rate, the better off its incident management process is.
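Both rates can be computed as simple proportions of a ticket set. The following sketch uses a hypothetical batch of ten tickets with made-up boolean fields `first_contact_fix` and `escalated`; a real ticketing system would expose its own field names.

```python
# Hypothetical ticket data: 7 fixed at first contact, 2 escalated,
# and 1 that needed follow-up work but no escalation.
tickets = (
    [{"first_contact_fix": True, "escalated": False}] * 7
    + [{"first_contact_fix": False, "escalated": True}] * 2
    + [{"first_contact_fix": False, "escalated": False}]
)


def rate(ticket_list, field):
    """Return the fraction of tickets for which the given boolean field is true."""
    if not ticket_list:
        raise ValueError("cannot compute a rate over zero tickets")
    return sum(1 for t in ticket_list if t[field]) / len(ticket_list)


ftr_rate = rate(tickets, "first_contact_fix")
escalation_rate = rate(tickets, "escalated")

print(f"FTR rate: {ftr_rate:.0%}, escalation rate: {escalation_rate:.0%}")
```

The sample yields a 70% FTR rate and a 20% escalation rate. The two do not sum to 100% here because one ticket needed repeat work without being escalated.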
Service-level agreement (SLA): An SLA is a contract, usually between an IT organization and a tool provider or external user, that indicates items such as guaranteed standards for performance and uptime. An SLA could outline the availability of professional support -- for example, in terms of business hours or cost. It also includes agreed-upon penalties for breach of the SLA.
Uptime and downtime: Uptime describes the time during which a tool, software or environment is available or operational; downtime is the time during which it is not. For example, if a cloud platform's SLA promises 99.999% annual uptime -- also known as five nines -- the platform will be unavailable, or down, for only 0.001% or less of the time over 12 months, which works out to a little more than five minutes.
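To make the five nines arithmetic concrete, the sketch below converts an uptime percentage into a downtime budget and, in the other direction, converts measured downtime into an uptime percentage. It assumes a 365-day year, and the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    return (100 - uptime_percent) / 100 * period_minutes


def measured_uptime_percent(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Return the uptime percentage for a period with the given total downtime."""
    return (period_minutes - downtime_minutes) / period_minutes * 100


budget = allowed_downtime_minutes(99.999)
print(f"Five nines allows about {budget:.2f} minutes of downtime per year")

# A hypothetical year with 45 minutes of total downtime.
print(f"Measured uptime: {measured_uptime_percent(45):.4f}%")
```

Five nines leaves roughly 5.26 minutes of downtime per year, so a single 45-minute outage would put a provider outside that target for the year.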