AI orchestration modernizes approaches to disaster recovery

As ransomware gets more dangerous and infrastructure grows more complex, AI-assisted DR can shorten recovery time by improving readiness and decision-making under pressure.

Traditional disaster recovery approaches are inadequate in the face of escalating cyber threats, complex infrastructures and business continuity pressures.

Historically, DR relied on predefined runbooks, scheduled tests and manual escalation. That approach struggles with today's hybrid, large-scale environments -- spanning mainframes, cloud, SaaS, third-party platforms and security tooling -- because it lags behind rapidly changing dependencies. That's one reason more teams are exploring AI-orchestrated DR, which uses analytics and automation to check recovery paths more frequently, find drift earlier and prioritize what to restore first when time is of the essence.

"The biggest shift is that disaster recovery is moving from a binder-on-the-shelf exercise to a living capability," said Jim Piazza, chief AI officer at Ensono.

Its proponents say AI-orchestrated DR combines predictive analytics tools, automated runbooks and human oversight to reduce recovery times, improve operational resilience, support recovery point objectives and control costs across different disaster scenarios.

From passive backup to proactive resilience

Historically, DR meant creating backups regularly and testing recovery once a year. Now it's often treated as a program of continuous monitoring, replication, automation and validation to keep systems stable and return them to a ready state if necessary. This shift can help teams respond to threat signals earlier or limit damage when incidents occur.

Organizations must move from passive backup to anticipatory cyber-resilience, said Jim Jones, senior product infrastructure architect at 11:11 Systems.

AI-assisted analytics continuously mine telemetry and threat indicators to identify potential threats and recommend clean restore points to start recovery. When that intelligence is wired into orchestration, runbooks and failover sequences can be codified and executed across backup, DRaaS and cyber recovery environments, rather than improvising mid-incident.
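As a minimal sketch of how a restore-point recommendation could work -- the `Snapshot` fields, thresholds and function names here are hypothetical illustrations, not any vendor's API -- an orchestrator might filter backup snapshots on threat-indicator matches and anomaly scores, then propose the most recent clean one:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Snapshot:
    taken_at: datetime
    ioc_hits: int         # matches against known threat-intel indicators
    anomaly_score: float  # e.g., an entropy spike suggesting encryption activity

def latest_clean_restore_point(snapshots, max_anomaly=0.2):
    """Pick the most recent snapshot with no indicator hits and a low anomaly score."""
    clean = [s for s in snapshots if s.ioc_hits == 0 and s.anomaly_score <= max_anomaly]
    return max(clean, key=lambda s: s.taken_at) if clean else None
```

A human operator would still confirm the choice; the point is that the scanning and ranking happen continuously, before the incident, rather than being improvised during it.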

This process becomes more challenging as scenarios grow more complex. Take a ransomware attack dwelling in the environment for weeks before its final encryption event. Organizations face difficult choices -- which only become harder mid-attack -- such as:

  • Determining which backups taken during the dwell period are clean, tainted or partially corrupted.
  • Deciding the restore order for dependent systems to avoid re-infecting the environment.
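The restore-order problem in the second point is essentially dependency ordering. A minimal sketch -- the system names and dependency map are hypothetical -- uses Python's standard-library topological sort to guarantee each system comes back only after everything it depends on:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: each system lists what must be restored before it.
dependencies = {
    "auth": [],
    "database": ["auth"],
    "app-tier": ["database", "auth"],
    "web-frontend": ["app-tier"],
}

# static_order() emits prerequisites before the systems that depend on them.
restore_order = list(TopologicalSorter(dependencies).static_order())
```

Restoring in this order avoids bringing a system online before its prerequisites, which is one way orchestration can reduce the risk of re-infection or cascading restarts.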

Beyond faster recovery, AI helps teams prioritize under pressure, Piazza said. During a crisis, not every workload carries the same business impact. AI connects technical events to customer disruption, regulatory exposure and revenue risk, so teams know what to restore first.
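One way to express that connection is a weighted impact score per workload. This is an illustrative sketch only -- the weights and field names are assumptions, and a real system would draw these factors from live business and telemetry data:

```python
def restore_priority(workloads):
    """Sort workloads so the highest weighted business impact is restored first."""
    weights = {"customer_disruption": 0.5, "regulatory_exposure": 0.3, "revenue_risk": 0.2}

    def score(w):
        # Each factor is a 0-1 estimate; the weights reflect relative business concern.
        return sum(weights[k] * w[k] for k in weights)

    return sorted(workloads, key=score, reverse=True)
```

The output is a restore queue ordered by business impact rather than by whichever alert arrived first.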

Why AI makes DR accountability harder

As more vendors and enterprises hype and push new AI capabilities, other businesses might fear being left behind.

"This has caused some organizations to overcorrect and decide that they must deploy these solutions as rapidly as possible," said Matthew Mettenheimer, director of cyber risk and resilience at S-RM, a global cybersecurity and intelligence consultancy.

Before enterprises consider using AI, they must address capability gaps that can confound AI-orchestrated DR ambitions, including minimal or absent risk management practices and limited visibility into data, access and processes. When businesses don't identify potential risks, they might use AI in sensitive areas without appropriate safeguards, risking noncompliance with corporate or legal policies.

Mettenheimer said using AI does not absolve the business of owning the outcomes. If anything, it raises the stakes, because AI introduces new security risks in sensitive areas. Cybersecurity has always been a component of DR, but AI has brought it to the forefront more than ever.

"AI has fundamentally changed the way that business is approaching cyber security, both from internal and external perspectives," Mettenheimer said.

Internal security threats

Internally, organizations are recognizing how AI tools can introduce significant operational risks. While third-party risk is not new, deeper integration of AI automation into core workflows can increase the effects of service disruption.

The more embedded these tools and processes become, the more downtime can cascade, highlighting the need for strong business continuity procedures that keep the company running at a basic level.

External security threats

Externally, there's a growing concern that threat actors will increase their use of AI. For example, Anthropic has said its Claude Mythos Preview could identify and exploit zero-day vulnerabilities. While that model was not made public, its capabilities could be replicated using other AI tools to gain access to environments assumed to be secure.

"It is now more important than ever to develop defense-in-depth principles to reduce the risk of single points of failure," Mettenheimer said.

Connecting resilience silos

A rising trend in DR is bringing multiple resilience disciplines and processes into a common workflow, even when not every issue requires all of them. Not every outage is a security incident, but security incidents often become availability events.

"The organizations getting this right have stopped treating disaster recovery and security as separate disciplines with separate teams and separate budgets," said Brandon Willitts, director of cyber resilience at Everpure.

Another problem is keeping recovery plans in a document that the compliance team updates annually, but the engineering team never reads, much less tests. Recovery plans should be kept current using version control, links to enterprise systems and daily workflows that engineers already use.

AI can't do much to streamline a disconnected documentation process. Once that foundation is in place, however, AI can monitor fleet behavior, flag drift, validate recovery paths and shorten the time between detection and restoration.
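Drift flagging of this kind can start as something as simple as diffing the documented recovery plan against live configuration. A minimal sketch, with hypothetical setting names standing in for real ones:

```python
def detect_drift(plan, live):
    """Report settings where live configuration no longer matches the recovery plan."""
    drift = {}
    for key, expected in plan.items():
        actual = live.get(key)  # None if the setting has disappeared entirely
        if actual != expected:
            drift[key] = {"expected": expected, "actual": actual}
    return drift
```

Run on a schedule, a check like this surfaces the gap between the plan on paper and the environment as it actually exists, long before an incident forces the comparison.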

AI should operate like a site reliability engineer with bounded privileges. It has context from the fleet, understands configuration and disruption patterns, and stages a validated recovery option, but a human must always approve any consequential action.
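That bounded-privilege pattern amounts to a human approval gate between what the AI stages and what actually executes. A minimal sketch, with a hypothetical action structure:

```python
def execute_with_approval(action, approver):
    """Run an AI-staged recovery action only after a human approver signs off."""
    if approver(action):        # approver is a human decision, not another model
        return action["run"]()  # the consequential step executes only on approval
    return None                 # rejected: nothing changes in the environment
```

The design choice is that the AI prepares and validates the option, but the irreversible step always passes through a person.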

One misconception is that AI will soon support a shift to fully autonomous failover. That's misleading.

"The shift is from reactive recovery to continuously validated resilience, where the plan, the systems, and the people who operate them are always aligned, always tested, and always ready," Willitts said.

George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
