Why maximum tolerable downtime is a key business metric

Downtime cannot be eliminated, but it can be governed. By defining maximum tolerable downtime, data leaders align recovery planning with business priorities and risk tolerance.

While zero downtime remains the ideal, it's not realistic for enterprises. But a solid operational resilience plan that keeps disruptions to tolerable limits is the next best thing.

Many organizations run increasingly automated workloads across multi-petabyte data estates, multi-cloud setups and various SaaS applications and APIs. That complexity makes outages a remarkably common occurrence.

In a resilience-related survey conducted by database vendor Cockroach Labs in 2024, 100% of 1,000 senior technology executives reported revenue losses due to outages in their organizations over the previous 12 months. Surveyed enterprises experienced an average of 86 outages annually, with average downtime of more than three hours per outage.

Business, IT and disaster recovery leaders are now less focused on avoiding downtime and more concerned with reducing it. The goal is to keep a critical business function within the acceptable disruption window and resume operations before an outage exceeds maximum tolerable downtime (MTD). This is why MTD matters: it frames disruption as a business-level risk rather than just a technical issue.

Moving from uptime targets to disruption limits

Many organizations use MTD to describe the outer limit for an outage. Some sources refer to the same concept as maximum allowable downtime (MAD). As defined in NIST guidance, MTD is the amount of time a mission or process can be disrupted before the impact becomes unacceptable. In practice, it marks the point where an IT incident becomes a business problem, such as financial loss, reputational harm or legal exposure.

MTD tells leaders how long the enterprise can withstand lost access to data, analytics interruptions, delayed decisions and stalled employee workflows before consequences escalate.

Think of MTD as the ceiling for total downtime with two components below it:

  • Recovery time objective. RTO is the target time to restore IT platforms and data services to an operating state.
  • Work recovery time. WRT is the time it takes to validate that systems are functioning properly and restore the business to dependable operations after an outage is resolved.

To avoid business problems when a disruption occurs, MTD should be equal to or greater than the sum of RTO and WRT. RTO specifies how quickly core systems and data services can be brought back. WRT covers the harder work that follows, such as synchronizing distributed data, validating the integrity of AI models and verifying data availability.
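The relationship between the three metrics can be checked with simple arithmetic. The following Python sketch is illustrative only: the business functions, the hour values and the RecoveryTargets class are hypothetical and do not come from NIST guidance or any DR product. It flags any function whose planned RTO plus WRT exceeds its MTD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical recovery targets for one business function, in hours."""
    function: str
    mtd_hours: float  # maximum tolerable downtime
    rto_hours: float  # recovery time objective
    wrt_hours: float  # work recovery time

    @property
    def planned_recovery_hours(self) -> float:
        # Total planned time to return to dependable operations.
        return self.rto_hours + self.wrt_hours

    @property
    def margin_hours(self) -> float:
        # Positive margin means the plan fits inside MTD; negative means it does not.
        return self.mtd_hours - self.planned_recovery_hours

    @property
    def within_mtd(self) -> bool:
        # The rule from the text: MTD >= RTO + WRT.
        return self.margin_hours >= 0


# Example values are assumptions chosen for illustration.
targets = [
    RecoveryTargets("order processing", mtd_hours=8, rto_hours=4, wrt_hours=2),
    RecoveryTargets("executive reporting", mtd_hours=24, rto_hours=12, wrt_hours=16),
]

for t in targets:
    status = "within MTD" if t.within_mtd else "exceeds MTD"
    print(f"{t.function}: RTO + WRT = {t.planned_recovery_hours}h, "
          f"margin = {t.margin_hours}h ({status})")
```

In this example, the second function has an RTO that looks comfortable against its MTD, yet the plan still fails once WRT is counted. That is the gap the MTD framing is meant to expose.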

Leaders should use MTD to evaluate DR platforms, architectures and operating models -- not just by how fast they restart systems but by how much they shorten the way back to business operations and decision-making based on trusted data.

Why meeting MTD has become more challenging

The underlying risk profile has shifted. Issues are no longer limited to weather events, facility outages or hardware failures. Increased ransomware attacks and other cybersecurity incidents compress the time to detect, contain and recover. As attackers move faster, the window to stay within MTD shrinks.

Enterprises today must also manage -- and recover -- more than just databases and storage systems. The modern data stack includes streaming data pipelines, analytics systems, AI models, hybrid networks and various cloud resources. Recovery that stops at infrastructure and ignores downstream data products, reporting workflows and customer-facing processes will not restore business operations to a trusted state.

These demands require more rigorous practices to meet MTD targets. Automation and AI-assisted orchestration play a growing role by reducing the manual work involved in tasks such as failovers, system validation and incident response.

For data and technology leaders who evaluate products, the key question is whether a platform helps keep a critical business function within its tolerable disruption window. That shifts the emphasis to practical capabilities, including clean recovery paths, immutable or well-protected data backups, cross-environment observability, workload prioritization and post‑recovery validation. Another factor to consider is the DR team's expertise level and whether the current staff can execute the recovery runbook.

Elevate MTD to the board and C-suite

DR is not simply an IT exercise; the organization's business performance depends on prompt, successful recovery from outages. Business continuity guidance prioritizes identifying business-critical functions and their dependencies. It also calls for elevating MTD discussions to the board and C-suite rather than leaving them solely to the architects of IT and disaster recovery strategies.

The established all-hazards approach to disaster preparation still holds up. Enterprises must assess plans for recovering from physical events, technology failures and disruptions caused by employees. What's different is the weighting. Cyber resilience is a key driver of MTD -- the primary one in some organizations. Unlike a hardware failure, which is relatively easy to recover from, a cyber incident often requires the business to press pause to perform data integrity checks and other DR tasks, which can push a disruption past the MTD threshold.

MTD also now encompasses a more complex operational reality. RTO and WRT are affected by factors such as data scale, attack speed, network interdependencies, team skills, automation maturity and governance discipline.

As a result, staying within MTD when outages occur requires more than a speedy infrastructure recovery. It depends on an operational resilience plan that:

  • Sets clear ownership at the board level for the level of disruption the organization can withstand.
  • Connects those limits to the most critical business functions and the technologies they rely on.
  • Ensures data platforms and identity systems can be restored successfully, not just the applications they support.

It also requires regular testing and practice so teams know they can resume reliable operations -- not just get systems back online -- under real pressure and before downtime becomes intolerable.
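The dependency point in the list above can be made concrete. The following Python sketch estimates when each component could be back in service if nothing is restored before the systems it relies on. The component names, restore durations and the earliest_ready function are hypothetical, and the sketch assumes independent components can be restored in parallel, which will not hold for every DR team.

```python
from graphlib import TopologicalSorter

# Hypothetical restore durations in hours for each component.
restore_hours = {
    "identity": 1.0,
    "data_platform": 3.0,
    "analytics": 2.0,
    "order_app": 1.5,
}

# Each component maps to the components that must be restored before it.
depends_on = {
    "identity": set(),
    "data_platform": {"identity"},
    "analytics": {"data_platform"},
    "order_app": {"identity", "data_platform"},
}


def earliest_ready(restore_hours, depends_on):
    """Return the earliest hour each component is restored.

    Assumes a component's restore starts as soon as all of its
    dependencies are ready, and that unrelated restores run in parallel.
    """
    ready = {}
    # static_order() yields every component after its dependencies.
    for component in TopologicalSorter(depends_on).static_order():
        start = max((ready[d] for d in depends_on[component]), default=0.0)
        ready[component] = start + restore_hours[component]
    return ready


ready = earliest_ready(restore_hours, depends_on)
for component, hour in sorted(ready.items(), key=lambda item: item[1]):
    print(f"{component}: ready at hour {hour}")
```

With these example values, the order application needs only 1.5 hours of its own restore work but cannot be ready before hour 5.5, because identity and the data platform come first. An RTO that counts only the application's own restore time would understate the real recovery window.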

Organizations with a firm grasp of those concepts will make better choices about platforms, DR design and operational priorities. Those without it might find that even the most advanced disaster recovery technology stack still misses the MTD window, turning a recoverable outage into a material event.

Editor's note: TechTarget editors revised this article in April 2026 to add new information and improve its timeliness.

Tom Walat is an editor and reporter for TechTarget, where he covers data technologies.

Consultant and technical writer Paul Kirvan contributed to this article, which he originally wrote in 2022.
