
If technology breaks, can you keep your business running?

Business leaders must prepare for technology failures by implementing resilient architecture, incident-response plans and strategies to maintain operations and minimize the impact.

Recent outages at major vendors – Amazon Web Services (AWS), Cloudflare and Microsoft Azure – have put a spotlight on how vendor failures affect businesses.

The outages impacted millions globally and caused a range of services, from SaaS platforms to airlines, to go down, revealing just how heavily businesses rely on cloud services. Cloudflare outages in June and November 2025 left several websites and apps unavailable.

According to Gartner estimates, Amazon accounted for 37.7% of all infrastructure-as-a-service spending in 2024, compared with Microsoft's 23.9% market share.

Many organizations assume that partnering with top cloud vendors will guarantee reliability, but that isn't always the case. Even the top-tier cloud providers can experience disruptions and failures that can affect businesses.

To stay resilient and reliable when technology fails, organizations need to be prepared, strategic and adaptable, with clear incident-response and business continuity plans in place.

Lessons from cloud outages

Recent major cloud outages have shed light on the risks associated with reliance on cloud vendors to keep businesses operational and the far-reaching effects they can have on an unprepared business.

For example, the hours-long global AWS outage on Oct. 20, 2025, impacted thousands of customers. Just days later, on Oct. 29, an Azure outage caused similar chaos. That outage lasted more than eight hours and caused failures across several Microsoft products, including Microsoft 365 software and Copilot.

These outages aren't just inconvenient – they have real-world consequences for businesses that rely on these cloud services to run their own products. According to CyberCube, preliminary insured loss estimates for the recent AWS outage range from $38 million to $581 million. In 2024, the massive CrowdStrike outage resulted in an estimated direct financial loss of $5.4 billion for Fortune 500 companies, according to Parametrix.

A cloud outage can also lead to reputational damage and customer frustration if it blocks a product or service from being used altogether or causes a poor user experience, such as delayed loading times for an app, program or website.

These incidents underscore the risks of vendor dependency, particularly overreliance on a single vendor or region. Unforeseen cascading effects can cause problems even after service is restored. For example, although the AWS outage began with a DNS failure, it spread to services that depend on DNS, including authentication.
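
One way to anticipate cascading effects is to map which services depend on which, then ask what is affected when a shared dependency fails. The following is a minimal Python sketch of that idea. The service names and the dependency map are hypothetical and are not drawn from any vendor's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "dns": [],
    "authentication": ["dns"],
    "payments-api": ["dns"],
    "checkout": ["authentication", "payments-api"],
    "static-site": [],
}

def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)

print(impacted_by("dns", DEPENDS_ON))
# prints ['authentication', 'checkout', 'payments-api']
```

In this toy map, checkout never calls DNS directly, yet it still appears in the result because the services it relies on do.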

Defining business continuity beyond uptime

While many businesses may equate operational continuity with uptime, there are distinct differences between system availability and operational continuity.

System availability refers to whether systems and infrastructure are online and all components are working as expected. However, restored availability doesn't mean that systems and operations are fully functional. Operational continuity refers to an organization's ability to perform its core business functions and run the processes that sustain the business without delays or other disruptions.

System availability describes the technical state of the infrastructure; it doesn't necessarily mean that operations, workflows and integrations are fully restored for all users. That's why it's critical to create business continuity plans that bridge the gap between system availability and operational continuity during and after a disruption.

Business continuity includes "planning for the manual workarounds needed when systems are down and ensuring we have the right human staff and facilities to support those temporary processes," said Justin Kates, senior business continuity advisor at Wawa, Inc.

According to McKinsey, resilient organizations were better able to handle disruptions in 2020-21. To stay prepared, organizations should have metrics that can measure and track the resiliency of the organization, from both a technical and operational standpoint.

Cameron Daniel, chief technology officer at Megaport, listed key metrics for resiliency:

  • Recovery time objective.
  • Recovery point objective.
  • Mean time to recovery.
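
As a rough illustration, these three metrics can be calculated from basic incident records. The sketch below assumes a hypothetical `Incident` record, made-up timestamps and example targets of three hours (recovery time objective) and one hour (recovery point objective). None of these figures come from the organizations quoted here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started: datetime           # when the service was lost
    restored: datetime          # when the service was back
    last_good_backup: datetime  # most recent recoverable data point before the loss

def mean_time_to_recovery(incidents):
    """Average downtime across incidents."""
    total = sum((i.restored - i.started for i in incidents), timedelta())
    return total / len(incidents)

def check_objectives(incident, rto, rpo):
    """Compare one incident against the recovery time and recovery point objectives."""
    downtime = incident.restored - incident.started
    data_loss_window = incident.started - incident.last_good_backup
    return {"rto_breached": downtime > rto, "rpo_breached": data_loss_window > rpo}

# Hypothetical incident history and targets, for illustration only.
incidents = [
    Incident(datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 11, 0),
             datetime(2025, 1, 10, 8, 45)),
    Incident(datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 2, 18, 0),
             datetime(2025, 3, 2, 12, 30)),
]
RTO = timedelta(hours=3)
RPO = timedelta(hours=1)

print(mean_time_to_recovery(incidents))  # prints 3:00:00
for incident in incidents:
    print(check_objectives(incident, RTO, RPO))
```

Here the first incident meets both targets, while the second misses both: four hours of downtime against a three-hour objective, and 90 minutes of unrecoverable data against a one-hour objective.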

"These metrics are important for knowing how quickly workloads can reach their recovery time and frequency, especially at a time of increasing adoption of hybrid clouds," said Daniel.

Building an incident-response plan

Cloud outage preparedness and cloud outage response are critical in a cloud-dominated IT environment. Predefined plans ensure businesses stay efficient and organized, even when disruptions cause chaos. These plans enable normal operations to be resumed as soon as possible, potentially mitigating risks such as financial loss, damage to brand reputation and customer impact.

Here are the four main parts of an incident-response plan:

  • Detection. Staying proactive about detecting incidents can help organizations catch problems early on before they become major disruptions. Organizations can implement detection measures such as consistent monitoring of system health, vendor statuses and integration points. Automated alerts can streamline incident detection and ensure that issues are addressed promptly.
  • Escalation. Clear guidance on if and when incidents need to be escalated within the business, and who needs to be informed, can keep incident responses organized during chaos. "The absolute first step is identifying the escalation triggers and getting consensus on who needs to be notified and the minimum information they need to make decisions," said Kates. "The escalation processes should be simple guides, much like a pilot checklist."
  • Communication. A strong incident-response plan should clearly lay out who needs to stay informed during incidents, as well as when and how the information should be communicated. Communication needs may vary depending on the type of incident, such as cyberattacks versus outages. Other affected business areas, such as customer support, and external stakeholders should also be made aware of incidents.
  • Failover. Failover strategies – such as switching to redundant servers or alternative vendors – help minimize downtime and maintain business continuity. They should be predetermined and well-documented to limit disruptions to operations while incidents are being rectified, and they should be reviewed and tested at regular intervals to ensure the organization is prepared.
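
The detection and escalation steps above can be sketched in a few lines of Python. This is an illustrative outline, not a production monitor: the class names, the thresholds and the contact roles are all assumptions made for the example, and the health-check results are simulated rather than read from a real system.

```python
from dataclasses import dataclass, field

@dataclass
class EscalationPolicy:
    # Hypothetical triggers: consecutive failed checks -> who is notified.
    steps: dict = field(default_factory=lambda: {
        2: "on-call engineer",
        4: "incident commander",
        6: "executive sponsor",
    })

class HealthMonitor:
    def __init__(self, policy):
        self.policy = policy
        self.consecutive_failures = 0
        self.notified = []  # running log of everyone alerted so far

    def record(self, healthy):
        """Record one health-check result; return who must be notified now, if anyone."""
        if healthy:
            self.consecutive_failures = 0
            return None
        self.consecutive_failures += 1
        contact = self.policy.steps.get(self.consecutive_failures)
        if contact:
            self.notified.append(contact)
        return contact

# Simulated check results: one healthy check, four failures, then recovery.
monitor = HealthMonitor(EscalationPolicy())
for result in [True, False, False, False, False, True]:
    contact = monitor.record(result)
    if contact:
        print(f"escalate to {contact}")

print(monitor.notified)  # prints ['on-call engineer', 'incident commander']
```

Keeping the triggers in one small table mirrors the checklist approach Kates describes: anyone can read who is called and when.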

Incident-response plans should be practiced with all teams that are involved. "One of our customers, a large global service provider, runs short tabletop exercises every month, one with business teams like legal or marketing, one with the board [and] one with outside counsel," said Arvind Parthasarathi, CEO and founder at CYGNVS. "Because they made the process repeatable and bite-sized, it's now part of their culture. Each exercise strengthens coordination across the organization."

However, it's important to set the team up for success and make testing realistic. "Use a scenario to prompt the team to walk through the response, which helps you identify missing plans, teams, or training before you put them under exercise pressure," said Kates. "It's also critical to conduct a blameless postmortem after any exercise to identify action items and make improvements to your plans and procedures."

Designing resilient architectures

Resilient architecture is essential to a sound business continuity strategy. "Maintaining business operations during large-scale cloud outages requires organizations to shift from passive observation to proactive resilience," said Ryan Whelan, global cyber intelligence lead at Accenture.

Architecture should be resilient enough to mitigate incidents and recover from disruptions as soon as possible. "For their most critical applications and data, organizations are increasingly leveraging a mix of public cloud vendors, private clouds and on-premises systems to balance business resilience with costs," said Whelan.

Multi-cloud systems can help reduce reliance on a single vendor, enabling operations to be split strategically between clouds to ensure that if one system experiences a disruption, other core business functions can continue to run smoothly.

Hybrid systems use both cloud-based and on-site infrastructure to reduce dependency on vendors and mitigate the risks associated with vendor incidents. On-site infrastructure can also help keep data safe and secure, even in the event of cyberattacks or other security incidents.

Resilient architecture should incorporate redundancies and have critical operations in multiple environments to mitigate the risk of failures and outages that can disrupt operations.
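
In code, running a critical operation across multiple environments often comes down to trying them in priority order. The Python sketch below shows that pattern under simplifying assumptions. The environment names and handler functions are stand-ins invented for illustration, and a real deployment would also need health checks, data replication and traffic routing that this example leaves out.

```python
class ProviderDown(Exception):
    """Raised by a handler when its environment cannot serve the request."""

def call_with_failover(providers, request):
    """Try each (name, handler) pair in priority order; return the first success."""
    errors = {}
    for name, handler in providers:
        try:
            return name, handler(request)
        except ProviderDown as exc:
            errors[name] = str(exc)  # keep a record for the postmortem
    raise RuntimeError(f"all environments failed: {errors}")

# Stand-in handlers; the primary simulates a regional outage.
def primary_cloud(request):
    raise ProviderDown("region unavailable")

def secondary_cloud(request):
    return f"handled {request!r} in secondary cloud"

def on_premises(request):
    return f"handled {request!r} on-premises"

PROVIDERS = [
    ("primary-cloud", primary_cloud),
    ("secondary-cloud", secondary_cloud),
    ("on-premises", on_premises),
]

served_by, response = call_with_failover(PROVIDERS, "order-123")
print(served_by, "->", response)
# prints secondary-cloud -> handled 'order-123' in secondary cloud
```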

Building out resilient architecture – even though it may seem costly upfront – can be the difference between a complete halt in operations and keeping core business functions running, even when technology fails. "[Resilient architecture] pays for itself," said Parthasarathi. "We see crisis mobilization drop from up days to under 60 minutes, mean-time-to-remediate shrink by days, and every action can be captured for compliance and chain of custody."

Leadership and governance: Call to action

Clearly, vendor outages can have a significant impact far beyond IT. Outages can affect everything from customer service to operations. That's why it's crucial for senior leaders, such as CIOs and CTOs, to be involved in everything from oversight and risk assessment to board reporting.

"What is changing is the board and executive interest and appetite to engage on these events," said Parthasarathi. "They don't want to be distant observers; they want to understand their role and participate when a crisis hits because these are truly business events affecting the business."

Technology failures are inevitable, but their impact on the business depends on how prepared leaders are for them. Tabletop exercises and post-mortem reviews help ensure incident-response plans are refined and ready to be deployed when outages occur.

"Through Chaos Engineering experiments and our regular Failure Fridays, teams intentionally inject controlled failures to uncover hidden dependencies, validate recovery mechanisms and strengthen response muscle memory," said Rukmini Reddy, senior vice president of engineering at PagerDuty. "As AI becomes more deeply embedded into engineering workflows, resilience increasingly depends on human judgment -- building systems and teams that can bend but never break."

Building up digital resilience shouldn't be an afterthought; it should be baked into core competencies. Resiliency should also be regularly reviewed through post-incident audits and simulations to create IT resilience that can thrive through cloud downtime and technology failures.
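
The controlled failure injection described above can be illustrated at small scale. The Python sketch below wraps a function so that it sometimes fails on purpose, then checks whether a simple retry policy recovers. It is not PagerDuty's tooling or any particular chaos engineering product: the function names, the 30% failure rate and the retry count are assumptions chosen for the example.

```python
import random

class InjectedFailure(Exception):
    """Marks a failure that was introduced deliberately by the experiment."""

def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected fault")
        return func(*args, **kwargs)
    return wrapper

def with_retries(func, attempts):
    """Wrap `func` so that injected failures are retried up to `attempts` times."""
    def wrapper(*args, **kwargs):
        last_error = None
        for _ in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFailure as exc:
                last_error = exc
        raise last_error
    return wrapper

def lookup_price(item):
    # Hypothetical stand-in for a call to a downstream service.
    return {"coffee": 3.5}.get(item)

# Experiment: inject faults into 30% of calls and see how often retries recover.
rng = random.Random(42)  # seeded so the experiment is repeatable
flaky = with_fault_injection(lookup_price, failure_rate=0.3, rng=rng)
resilient = with_retries(flaky, attempts=3)

successes = 0
for _ in range(1000):
    try:
        resilient("coffee")
        successes += 1
    except InjectedFailure:
        pass  # the retry budget was exhausted for this call
print(f"{successes} of 1000 calls succeeded despite injected faults")
```

Running the same experiment with a different failure rate or retry budget shows how much margin the recovery mechanism actually has before customers would notice.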

Alison Roller is a freelance writer with experience in tech, HR and marketing.
