Recent outages at major vendors – Amazon Web Services (AWS), Cloudflare and Microsoft Azure – have put a spotlight on the business impact of vendor outages. The outages affected millions globally and took down a range of services, from SaaS platforms to airlines, revealing just how heavily businesses rely on cloud services. Cloudflare outages in June and November 2025 left several websites and apps unavailable. According to Gartner estimates, Amazon accounted for 37.7% of all infrastructure-as-a-service spending in 2024, compared with Microsoft's 23.9% market share.

Many organizations assume that partnering with top cloud vendors guarantees reliability, but that isn't always the case. Even top-tier cloud providers experience disruptions and failures that affect businesses. To stay resilient when technology fails, organizations need to be prepared, strategic and adaptable, with clear incident-response and business continuity plans in place.

Lessons from cloud outages

Recent major cloud outages have highlighted the risks of relying on cloud vendors to keep businesses operational and the far-reaching effects an outage can have on an unprepared business. For example, the hours-long global AWS outage on Oct. 20, 2025, affected thousands of customers. Just days later, on Oct. 29, an Azure outage caused similar chaos. That outage lasted more than eight hours and caused failures across several Microsoft products, including Microsoft 365 software and Copilot.

These outages aren't just inconvenient – they have real-world consequences for businesses that rely on these cloud platforms to run their own services. According to CyberCube, preliminary insured loss estimates for the recent AWS outage range from $38 million to $581 million. In 2024, the massive CrowdStrike outage resulted in an estimated direct financial loss of $5.4 billion for Fortune 500 companies, according to Parametrix. A cloud outage can also lead to reputational damage and customer frustration if it blocks a product or service from being used altogether or degrades the user experience, such as slow loading times for an app, program or website.

These incidents underscore how cloud dependencies, particularly overreliance on a single vendor or region, leave businesses vulnerable. Unforeseen cascading effects can cause problems even after service is restored. For example, although the AWS outage was caused by a DNS failure, the problem spread to services that rely on DNS, including authentication.

Defining business continuity beyond uptime

While many businesses equate operational continuity with uptime, system availability and operational continuity are distinct. System availability refers to whether systems and infrastructure are online and all components are working as expected.
However, restored system availability doesn't mean that systems and operations are fully functional. Operational continuity refers to an organization's ability to perform its core business functions and run the processes that sustain the business without delays or other disruptions. System availability describes the technical state of the infrastructure; it doesn't necessarily mean that operations, workflows and integrations are fully restored for all users. That's why it's critical to create business continuity plans that bridge the gap between system availability and operational continuity during and after a disruption.

Business continuity includes "planning for the manual workarounds needed when systems are down and ensuring we have the right human staff and facilities to support those temporary processes," said Justin Kates, senior business continuity advisor at Wawa, Inc. According to McKinsey, resilient organizations were better able to handle disruptions in 2020-21.

To stay prepared, organizations should have metrics that measure and track the organization's resiliency from both a technical and an operational standpoint. Cameron Daniel, chief technology officer at Megaport, listed key metrics for resiliency:

Recovery time objective.

Recovery point objective.

Mean time to recovery.

"These metrics are important for knowing how quickly workloads can reach their recovery time and frequency, especially at a time of increasing adoption of hybrid clouds," said Daniel.
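To make the three metrics concrete, here is a minimal Python sketch of how a team might track them from its own incident log. It assumes the conventional definitions: recovery time objective (RTO) as the maximum acceptable downtime, recovery point objective (RPO) as the maximum acceptable window of lost data, and mean time to recovery (MTTR) as the average time taken to restore service. The `Incident` record, its field names, the targets and the sample timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One outage record. Field names are illustrative, not a standard schema."""
    started: datetime      # when the service became unavailable
    restored: datetime     # when the business function was usable again
    last_backup: datetime  # most recent good backup or replica before the outage

    @property
    def downtime(self) -> timedelta:
        return self.restored - self.started

    @property
    def data_loss_window(self) -> timedelta:
        # Work done after the last backup is what could be lost.
        return self.started - self.last_backup


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average downtime across all recorded incidents."""
    seconds = mean(i.downtime.total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def breaches(incidents: list[Incident], rto: timedelta, rpo: timedelta):
    """Return (index, missed_rto, missed_rpo) for incidents that missed a target."""
    flagged = []
    for idx, inc in enumerate(incidents):
        missed_rto = inc.downtime > rto
        missed_rpo = inc.data_loss_window > rpo
        if missed_rto or missed_rpo:
            flagged.append((idx, missed_rto, missed_rpo))
    return flagged


# Made-up sample data, not drawn from any real outage.
incidents = [
    Incident(datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 11, 0),
             datetime(2025, 1, 10, 8, 45)),
    Incident(datetime(2025, 3, 2, 14, 0), datetime(2025, 3, 2, 20, 0),
             datetime(2025, 3, 2, 12, 0)),
]
RTO = timedelta(hours=4)     # assumed target
RPO = timedelta(minutes=30)  # assumed target

mttr = mean_time_to_recovery(incidents)
flagged = breaches(incidents, RTO, RPO)
print(mttr)     # 4:00:00
print(flagged)  # [(1, True, True)]
```

In this sample, the first incident stays within both targets, while the second misses both: six hours of downtime against a four-hour RTO, and a two-hour gap since the last backup against a 30-minute RPO. Note that `restored` here means the business function was usable again, not merely that the infrastructure was back online, which reflects the distinction between operational continuity and system availability.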

Building an incident-response plan

Cloud outage preparedness and response are critical in a cloud-dominated IT environment. Predefined plans keep businesses efficient and organized, even when disruptions cause chaos. These plans enable normal operations to resume as soon as possible, potentially mitigating risks such as financial loss, damage to brand reputation and customer impact. Here are the four main parts of an incident-response plan:

Detection. Staying proactive about detecting incidents helps organizations catch problems early, before they become major disruptions. Organizations can implement detection measures such as continuous monitoring of system health, vendor statuses and integration points. Automated alerts can streamline incident detection and help ensure that issues are addressed promptly.
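As an illustration of the detection step, the following Python sketch polls a set of named checks and raises a single alert once a check has failed a set number of times in a row. It is a simplified sketch under stated assumptions, not a production monitoring tool: the `Monitor` class, the `http_check` helper, the check names and the threshold are hypothetical, and a real deployment would typically rely on a dedicated monitoring service and the vendor's own status feeds.

```python
import urllib.request
from dataclasses import dataclass, field
from typing import Callable


def http_check(url: str, timeout: float = 5.0) -> Callable[[], bool]:
    """Build a probe that passes when `url` answers with a 2xx status."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except OSError:  # covers URLError, HTTPError and timeouts
            return False
    return probe


@dataclass
class Monitor:
    checks: dict[str, Callable[[], bool]]  # check name -> probe function
    alert: Callable[[str], None]           # e.g. a pager or chat integration
    threshold: int = 3                     # consecutive failures before alerting
    failures: dict[str, int] = field(default_factory=dict)

    def poll_once(self) -> None:
        """Run every probe once and alert when a failure streak hits the threshold."""
        for name, probe in self.checks.items():
            if probe():
                self.failures[name] = 0  # a healthy result resets the streak
                continue
            self.failures[name] = self.failures.get(name, 0) + 1
            # Alert exactly once per streak so responders aren't flooded.
            if self.failures[name] == self.threshold:
                self.alert(f"{name}: {self.threshold} consecutive failed checks")


# Demo with stand-in probes; a real setup might use http_check(...) with the
# organization's own health endpoints and its vendors' status pages.
sent: list[str] = []
monitor = Monitor(
    checks={"vendor-status": lambda: False, "internal-api": lambda: True},
    alert=sent.append,
    threshold=2,
)
for _ in range(3):
    monitor.poll_once()
print(sent)  # ['vendor-status: 2 consecutive failed checks']
```

Requiring several consecutive failures before alerting is one way to filter out brief blips, at the cost of slightly slower detection. The right threshold and polling interval depend on how much downtime the business can tolerate.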

