From a business perspective, zero downtime is the ideal outcome of an operational disruption. Unfortunately, that is not always possible or realistic. Between weather-related outages and an increase in cyberattacks, business leaders are increasingly less concerned with avoiding downtime altogether and more concerned with reducing it.
Maximum allowable downtime, also referred to as maximum tolerable downtime (MTD), is the absolute longest amount of downtime an organization can tolerate before facing serious repercussions. These can include loss of business or reputational damage.
To prepare for potential crises, disaster recovery (DR) teams must know how to calculate maximum allowable downtime and how to effectively manage downtime.
Calculate maximum allowable downtime
Maximum allowable downtime denotes the maximum time a business can tolerate the absence or unavailability of a particular business function. Different business functions are likely to have different maximum allowable downtime values. The higher the function's criticality, the shorter its maximum allowable downtime will be.
Business function downtime is based on two elements: the systems or technology recovery time objective (RTO) and the people-based work recovery time (WRT). As such, the formula for maximum allowable downtime is the following:
Maximum allowable downtime = RTO + WRT
For example, if a critical business process has a three-day maximum allowable downtime, the RTO for systems, networks and data might be one day. This is the time the organization needs to recover technology. The remaining two days are for work recovery.
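The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name is hypothetical, and the one-day and two-day values are taken from the example.

```python
from datetime import timedelta


def max_allowable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """Maximum allowable downtime = RTO + WRT."""
    return rto + wrt


# Values from the example: one day to recover technology,
# two days to recover work.
rto = timedelta(days=1)
wrt = timedelta(days=2)

mtd = max_allowable_downtime(rto, wrt)
print(mtd.days)  # 3
```

Using a duration type rather than bare numbers keeps the units explicit, which helps when some functions are measured in hours and others in days.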
The following chart depicts the relationship among the metrics comprising maximum allowable downtime.
In the above image, before an incident occurs, the organization backs up mission-critical operational data and systems, and performs business functions as normal. The subsequent four points in time are key to dissecting MTD.
- Point 1: Recovery point objective (RPO). This is the maximum data loss the organization can sustain, based on backup schedules, data needs and system availability.
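As a rough illustration of point 1, the data at risk in a disruption is whatever changed since the last successful backup. The sketch below assumes backups run at a fixed interval; the function names, timestamps and four-hour RPO are hypothetical and are not taken from this article.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disruption: datetime) -> timedelta:
    """Time span of changes made since the last backup (assumed lost)."""
    return disruption - last_backup


def backup_schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, the disruption hits just before the next backup runs,
    so the interval itself must not exceed the RPO."""
    return backup_interval <= rpo


rpo = timedelta(hours=4)                   # hypothetical RPO
last_backup = datetime(2024, 3, 4, 6, 0)   # hypothetical timestamps
disruption = datetime(2024, 3, 4, 9, 0)

loss = data_loss_window(last_backup, disruption)
print(loss <= rpo)                                         # True
print(backup_schedule_meets_rpo(timedelta(hours=6), rpo))  # False
```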
Once the disruption occurs, the organization launches incident response activities. If it cannot bring the disruption under control quickly, DR teams launch data or system disaster recovery activities to return operations to normal as quickly as possible.
- Point 2: Recovery time objective. This is the amount of time an organization needs to bring critical systems back online. This is where disaster recovery activities typically occur.
Depending on the success of DR plans and team efforts, the RTO will hopefully be met within the planned time frames. Shorter RTOs mean that systems are likely to return to normal operation more quickly, and the most current data will be available. This will enable the business to begin normal operations again.
If RTOs are exceeded due to unanticipated factors, such as extended commercial power outages or physical damage to equipment necessitating a replacement, it may be necessary to launch business continuity plans. These strategies deploy alternate arrangements so that the business can resume operations as much as possible before a full recovery occurs.
- Point 3: Work recovery time. Once mission-critical systems and data resources are recovered and operational again, this is the time needed to return to business-as-usual operating conditions. Activities during this period include the following:
- recovery of lost data (based on RPO);
- reentry of data from work backlogs, such as those manually generated during the outage;
- return of employees to their work areas;
- reactivation of systems, workstations, laptops, communications and other tools; and
- reengagement of linkages across operating units that make the company operate normally.
Together, points 2 and 3 (RTO + WRT) form the maximum allowable downtime. This is the time required to get the business back to work.
- Point 4: At this point in time, the organization is back to business as usual, and it is time to review what happened during the event. DR teams must note what worked, what didn't work, what changes need to be made and the next steps going forward to deal with future disruptions.
Once systems are operational, the RTO has been achieved. Additional steps must then be undertaken during the WRT to relaunch the business; these are typically in business continuity plans.
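For a post-event review, the timeline described above can be compared against the planned objectives. The following is a minimal sketch, assuming the team records three timestamps for an event: when the disruption began, when systems were restored and when business as usual resumed. The class, field names and sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OutageTimeline:
    disruption: datetime         # incident begins
    systems_restored: datetime   # technology recovery complete
    business_as_usual: datetime  # work recovery complete

    @property
    def technology_recovery(self) -> timedelta:
        return self.systems_restored - self.disruption

    @property
    def work_recovery(self) -> timedelta:
        return self.business_as_usual - self.systems_restored

    @property
    def total_downtime(self) -> timedelta:
        return self.business_as_usual - self.disruption


def review(timeline: OutageTimeline, rto: timedelta, wrt: timedelta) -> dict:
    """Compare actual recovery times with the planned RTO, WRT and their sum."""
    return {
        "rto_met": timeline.technology_recovery <= rto,
        "wrt_met": timeline.work_recovery <= wrt,
        "mtd_met": timeline.total_downtime <= rto + wrt,
    }


timeline = OutageTimeline(
    disruption=datetime(2024, 3, 4, 9, 0),
    systems_restored=datetime(2024, 3, 5, 15, 0),   # 30 hours later
    business_as_usual=datetime(2024, 3, 7, 8, 0),   # 41 hours after that
)
result = review(timeline, rto=timedelta(days=1), wrt=timedelta(days=2))
print(result)  # {'rto_met': False, 'wrt_met': True, 'mtd_met': True}
```

In this sample, technology recovery overran the one-day RTO, but faster work recovery kept total downtime at 71 hours, inside the three-day maximum allowable downtime.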
How to deal with downtime
Dealing with downtime must start at the top, with senior leadership setting the bar for how an organization responds to disruptive events. Investments in resilience activities, such as business continuity and disaster recovery, will help an organization recover after a disaster more rapidly and effectively.
Some organizations do not have a complex way of conducting business and might be able to return to business quickly, simply because their technology is less sophisticated and easier to recover. By contrast, organizations with very complex business processes and highly sophisticated mission-critical systems will need to invest in resilience or risk loss of business, loss of reputation or even loss of employees.
Assess all possible risks
Considering the many different types of disruptive incidents possible, business leaders should take an "all-hazards" approach to disaster preparation. They must carefully examine potential physical disasters, technology disasters and people-based disasters from a risk perspective to identify their effects on downtime. Even with the completion of risk, threat and vulnerability analyses, organizations might still be unprepared for events that occur outside the boundary of their analyses.
Exceeding maximum allowable downtime does not mean a business will fail. Other factors, such as availability of multiple offices and multiple data centers, can be present to help the business survive an event. Organizations that do not have alternate work arrangements or even access to remote working can be at a greater risk of failure.
Maximum allowable downtime is an important business metric, and while there can be multiple values for different aspects of a business, knowledge of those metrics is essential when developing plans to achieve operational resilience.