Availability shouldn't be discussed in a vacuum. It's not a matter of aiming for the highest possible uptime percentage; rather, it's about achieving the appropriate level of availability for a business's applications and users.
To do that, an IT team needs to understand availability management and how it fits under the ITIL/IT service management umbrella. Let's take a closer look at that process and identify the key metrics needed to gauge a service's availability, as well as how to set policies to meet the corresponding goals.
Understand availability management basics
Availability loosely means uptime, subject to the SLA. Availability management is the formal set of processes, tools, plans and metrics used to monitor and report on availability.
Part of the ITIL methodology, availability management falls under the Service Operation and Continual Service Improvement guidelines in the ITIL service lifecycle.
Availability management strives to be practical rather than theoretical. It's an understanding that there will be downtime and that 100% uptime would be cost prohibitive. The goal is to minimize downtime rather than eliminate it.
Calculate the cost of uptime vs. downtime
The cost of uptime is a curve that bends sharply upward as it gets closer to 100%. For example, an increase in uptime for from 96% to 97% for a given system could be five times more expensive than an increase from 95% to 96%. On the other hand, the cost of downtime for that system remains a constant.
At some point, it become more expensive to boost incremental uptime than to suffer from the loss of business caused by downtime. That's the point where the cost of additional redundancy and other steps aren't worth the effort.
Additionally, not all downtime is the same. Some outages are more expensive than others, and those calculations depend on multiple factors, including whether the application is customer facing or internal.
Identify availability management metrics
IT teams need to track key metrics, or key performance indicators (KPIs), for effective availability management. Some notable availability management metrics are listed in Figure 1.
Metrics are tracked at the aggregate, e.g., the whole of the SAP warehouse module, plus the key components thereof, e.g., replenishment.
Set availability management policies
Availability management requires written plans, procedures and monitoring. This is used to track progress, make course corrections and inject policies into IT change management.
For an enterprise, the availability manager would need the teams that oversee each critical system to submit their own availability plans. In smaller organizations, teams can provide data to an availability manager, who would then write the plan.
Availability plans include risk analysis and business impact statements. Important risks include system failures and susceptibility to cyber attacks. An importance rating is tied to each component of the risk analysis; this is necessary to calculate the cost of downtime.
For example, a Lightweight Directory Access Protocol (LDAP), such as Microsoft Active Directory, is more critical than employee onboarding. Onboarding employees is important, but downtime there can be fixed in hours rather than minutes. Customer-facing systems are high on the priority list, and LDAP outages affect all persons and systems.
Make the plan work
Once an IT team identifies the metrics and policies it requires to hit its availability goals, it needs buy-in from important stakeholders.
Monitoring systems, both manual and automated, can gather metrics used to calculate availability. The service desk is a key system in the process of reporting and tracking outages. From there, admins know the impact of the outage or partial outage. However, don't confuse these with monitoring systems that track performance, which exist for those who create tickets.
Additionally, availability must be incorporated into the change management process. Availability managers need to be aware of key changes so they can raise an issue with each project team in case those changes could disrupt the availability plan.
It should be incorporated as a touch point in whichever issue tracking system the organization uses, such as Jira or HubSpot. There needs to be a mechanism to broadcast availability plan changes to the responsible parties so system designers are aware.
Dig Deeper on IT systems management and monitoring
Length, cost and severity of datacentre outages continue to rise, Uptime Institute research confirms
Uptime Institute's data center tier standards
Uptime Institute to help financial services organisations reduce infrastructure outage risks
Emerging digital service models: Addressing the need to prevent downtime