This content is part of the Essential Guide: Data center metrics and standards guide


Checklist of IT KPIs for responsive data center ops

The right metrics, aligned with business needs, strengthen data center monitoring and capacity planning. KPIs orient IT performance to specific goals.

The traditional focus on hardware and software monitoring is shifting to a business focus using key performance indicators (KPIs). IT KPIs gauge abstract targets, such as user experience or job ticket turnaround effectiveness.

The difference between IT KPIs and common IT monitoring is the involvement of business leadership. Any organization can deploy tools to track the computing resources assigned to VMs or watch the bandwidth utilization on servers. Granular technical factors can be helpful to IT staff, but have little practical value to the business. Insights from KPIs allow management to justify investments to remediate issues.

KPIs help business executives gauge the impact or success of IT. For example, a business dependent on Web-based sales or content delivery would implement KPIs on overall computing capacity, application performance and system utilization. Additional KPIs measure the state of the IT infrastructure, including transactions, efficiency and agility.

Although ITIL has a general set of suggested key performance indicators, there is no single set of universal KPIs that fits every purpose. IT KPIs typically fall into three categories: service delivery effectiveness, service or performance efficiency, and agility (responding to change). Organizations like IT service providers also use service availability KPI metrics.

Service delivery effectiveness

  • Throughput

Throughput is the user load or demand on an application or system. It is often expressed as the number of transactions or another measure of computing work.

  • Response time

This KPI covers how much time is needed to complete a transaction. Response time may span multiple infrastructure elements, including servers, networking and storage. It may be tied to service-level agreements (SLAs).

  • Utilization

The amount of physical or virtual computing resources or capacity that is actually used, as compared to total capacity, yields a utilization rate. For example, if a VM is assigned 10 GB of memory and it uses 10 GB of memory, utilization is 100%.

  • Uptime

Uptime measures the percentage of time that an application or system is running. Technologies like clustering, resilient servers and network failover all help gird uptime against individual failures.

The organization uses these metrics to compute its custom service delivery KPI. For example, if throughput and uptime are high and response time is low, the service delivery score will likely be good -- regardless of utilization. But if utilization and response times increase, and throughput or uptime fall, the waning score brings to light potential IT issues.
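A custom service delivery score of this kind can be sketched in Python. This is a minimal illustration rather than a standard formula: the function names, the target values and the equal weighting of the three components are all assumptions. Utilization is computed separately because, as noted above, a good score is possible regardless of utilization.

```python
def utilization_rate(used, capacity):
    """Return used capacity as a percentage of total capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def service_delivery_score(throughput, response_time, uptime_pct,
                           throughput_target, response_target):
    """Combine three metrics into a 0-100 score.

    Hypothetical scheme: each component is scaled to 0..1 against a
    target and the three components are weighted equally.
    """
    # Higher throughput is better; cap the component at 1.0.
    throughput_part = min(throughput / throughput_target, 1.0)
    # Lower response time is better, so the ratio is inverted.
    response_part = min(response_target / response_time, 1.0)
    uptime_part = uptime_pct / 100.0
    return round(100.0 * (throughput_part + response_part + uptime_part) / 3, 1)


# A VM assigned 10 GB of memory that uses 10 GB is 100% utilized.
print(utilization_rate(10, 10))                              # 100.0
# Targets met and full uptime give the top score.
print(service_delivery_score(1000, 0.5, 100.0, 1000, 0.5))   # 100.0
# Half the throughput, double the response time, 90% uptime.
print(service_delivery_score(500, 1.0, 90.0, 1000, 0.5))     # 63.3
```

In practice, the business and IT leaders would agree on the targets and weights, since those choices decide what the score rewards.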

Service efficiency or performance

  • Workload efficiency or performance

This derived index compares a workload's allocated resources to utilized resources. It shows if a workload is wasting resources, oversubscribed (resource starved) or just right.

  • System efficiency or performance

System efficiency is another derived index comparing a server's allocated resources to its available resources at an optimum load. This shows if the server is wasting resources or is oversubscribed.

The metrics for workloads and systems are typically aggregated across the data center to calculate a weighted average. The team can quickly gauge status for the period and compare it to previous periods -- justifying new technology initiatives and investments to preserve and enhance efficiency figures. A poor KPI may require workload balancing or a technology refresh project.
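As a rough sketch of the derived index and the weighted average, the Python below compares utilized to allocated resources per workload and then aggregates across workloads. The thresholds of 0.5 and 0.9, the labels and the choice to weight by allocated capacity are assumptions made for illustration.

```python
def efficiency_index(used, allocated):
    """Ratio of utilized to allocated resources for one workload."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    return used / allocated


def classify(index, low=0.5, high=0.9):
    """Label a workload using hypothetical thresholds."""
    if index < low:
        return "wasting resources"
    if index > high:
        return "oversubscribed"
    return "just right"


def weighted_average(indexes, weights):
    """Aggregate per-workload indexes, weighting each by its allocation."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(i * w for i, w in zip(indexes, weights)) / total


# (used GB, allocated GB) for three example workloads
workloads = [(2, 8), (7, 8), (16, 16)]
indexes = [efficiency_index(u, a) for u, a in workloads]
weights = [a for _, a in workloads]

print([classify(i) for i in indexes])
# ['wasting resources', 'just right', 'oversubscribed']
print(weighted_average(indexes, weights))  # 0.78125
```

Tracking the aggregate figure from one period to the next gives the comparison described above.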

System agility

  • Service requests resolved

This measures the number of help desk tickets, support calls or other service requests addressed and resolved successfully within an acceptable time period.

  • Time to resolution (TTR)

TTR tracks the amount of time needed to address service requests. Examples include how long it takes to evaluate, justify, approve and provision a new VM once a request is received, or make changes to resource allocations once a performance impairment is detected.

As the number of IT service requests increases and TTR decreases, it can be inferred that IT is agile and able to respond to changes in workload or user demand. If the number of requests increases and TTR increases, IT faces pronounced agility concerns.
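The inference above can be expressed as a small comparison between two reporting periods. In this sketch, each list holds the TTR in hours for every request resolved in a period; the function name and the returned labels are assumptions, and cases the article does not cover are reported as inconclusive.

```python
from statistics import mean


def agility_trend(prev_ttrs, curr_ttrs):
    """Compare request volume and mean TTR between two periods."""
    if not prev_ttrs or not curr_ttrs:
        raise ValueError("each period needs at least one resolved request")
    volume_up = len(curr_ttrs) > len(prev_ttrs)
    ttr_down = mean(curr_ttrs) < mean(prev_ttrs)
    ttr_up = mean(curr_ttrs) > mean(prev_ttrs)
    if volume_up and ttr_down:
        return "agile"
    if volume_up and ttr_up:
        return "agility concern"
    return "inconclusive"


# More requests, resolved faster on average.
print(agility_trend([4, 6], [3, 4, 2]))    # agile
# More requests, resolved more slowly on average.
print(agility_trend([4, 6], [8, 9, 10]))   # agility concern
```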

Service availability

IT service providers -- or any IT organization bound by SLAs -- might adopt an SLA KPI that involves a wide range of metrics.

  • Resolution percentage

This IT KPI metric measures the percentage of help or service requests addressed within an acceptable period.

  • Uptime

Uptime is how much the service was available over the billing cycle. Some amount of service disruption is unavoidable, but uptime is a means of gauging SLA adherence and business performance.

  • Mean time between failures (MTBF) / mean time to repair (MTTR)

MTBF gauges how often faults occur; MTTR gauges how long it takes to fix them.

  • Number of action items

This is the number of complaints or service requests that IT receives. Increases in this number indicate problems with certain systems or platforms.

Collecting all of this data, weighting it and formulating a result gives business leaders an important early warning of SLA problems, offers a basis for follow-up discussions with IT leaders, or prompts a business goal for service improvement.
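Several of the service availability metrics can be derived from a simple outage log. The sketch below assumes a billing cycle measured in hours and a list of outage durations, and it treats each outage's duration as its repair time; the function name and dictionary keys are hypothetical. MTBF is calculated as operating time divided by the number of failures, and MTTR as total repair time divided by the number of failures.

```python
def sla_metrics(period_hours, outage_hours, requests_total, requests_on_time):
    """Summarize availability metrics for one billing cycle.

    outage_hours: duration of each failure during the period, in hours.
    """
    downtime = sum(outage_hours)
    failures = len(outage_hours)
    uptime_hours = period_hours - downtime
    return {
        "uptime_pct": 100.0 * uptime_hours / period_hours,
        # MTBF and MTTR are undefined when no failures occurred.
        "mtbf_hours": uptime_hours / failures if failures else None,
        "mttr_hours": downtime / failures if failures else None,
        "resolution_pct": (100.0 * requests_on_time / requests_total
                           if requests_total else None),
    }


# A 30-day cycle (720 hours) with three outages and 200 service requests.
print(sla_metrics(720, [1.0, 2.0, 1.5], 200, 190))
# {'uptime_pct': 99.375, 'mtbf_hours': 238.5, 'mttr_hours': 1.5, 'resolution_pct': 95.0}
```

How these figures are weighted into a single SLA KPI depends on the terms of the agreement.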

KPI oversights

  1. Subjective metrics, such as user satisfaction, are based on objective characteristics like app performance or throughput, but perception of the KPI may be skewed, or the KPI may be used improperly. Always use objective, measurable parameters.
  2. Business and IT leaders look at the same KPIs the same way forever. For example, a business might track system utilization and uptime. Early on, utilization is more important, but as services evolve and utilization reaches its goals, uptime takes on greater significance.
  3. Business leaders fail to update KPIs as the business model matures. For example, a greenfield data center build might focus on energy consumption and cost control. Once it meets those goals, the focus should shift to improving IT service quality or agility.

Select the KPIs to measure

Perhaps the most important -- but often overlooked -- aspect of KPIs is business relevance. Accounting, marketing, sales and IT can manage and report on granular metrics generated by a myriad of sources, systems and tools. But not every metric is essential for business decisions or measuring goals. And essential metrics vary from company to company, even project to project.

Select IT KPIs by first understanding the goals. A business that focuses on IT services will pay attention to transactional- or throughput-related measurements under various load conditions and activity levels. Conversely, a business concerned with controlling IT costs selects KPIs around computing resource availability, utilization and system power consumption.

Then, select areas that can be measured, establish thresholds of performance, implement measurements with monitoring or management tools and generate periodic or ad-hoc reports that show KPI data over time.
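The threshold step can be illustrated with a short check that flags any KPI outside its agreed limit. The KPI names, limits and the `kpi_report` function are hypothetical; a real deployment would pull measurements from its monitoring or management tools.

```python
def kpi_report(measurements, thresholds):
    """Return the KPIs whose measured value breaches its threshold.

    thresholds maps a KPI name to (direction, limit), where direction
    is "min" (value must not fall below) or "max" (must not exceed).
    """
    breaches = {}
    for name, (direction, limit) in thresholds.items():
        value = measurements.get(name)
        if value is None:
            continue  # no measurement collected for this KPI
        if direction == "min" and value < limit:
            breaches[name] = value
        elif direction == "max" and value > limit:
            breaches[name] = value
    return breaches


thresholds = {"uptime_pct": ("min", 99.9), "response_time_s": ("max", 2.0)}
measurements = {"uptime_pct": 99.5, "response_time_s": 1.2}
print(kpi_report(measurements, thresholds))  # {'uptime_pct': 99.5}
```

Running the same check on each reporting period produces the KPI history over time that the reports need.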

Stephen J. Bigelow is a senior technology editor at TechTarget, covering data center and virtualization technologies. He's acquired many CompTIA certifications in his more than two decades writing about the IT industry.
