AWS cloud outage reveals vendor concentration risk

AWS's October 2025 outage underscored systemic cloud risk—driving IT leaders to rethink multi-cloud strategies, resilience testing and vendor oversight.

Executive summary

  • The AWS outage revealed how dependencies on a few cloud regions can cause disruptions across industries—even for companies not directly hosted on AWS.
  • Multi-region, multi-cloud strategies, dependency mapping and tested failover processes are essential to mitigate concentration risk and maintain continuity during provider outages.
  • CIOs must elevate cloud dependency and business continuity discussions to executive and governance levels, ensuring resilience is treated as an enterprise capability.

On Oct. 20, 2025, Amazon Web Services (AWS) experienced a major outage, affecting cloud services worldwide for most of the day.

The incident was centered in the AWS US-EAST-1 region, located in Virginia, and was first reported in the early hours of Oct. 20, with AWS reporting increased error rates and latencies across multiple services.

The root cause was a domain name system (DNS) resolution failure affecting DynamoDB database service endpoints, which cascaded to IAM, EC2 instance launches and dozens of other AWS services. The outage lasted approximately nine hours, with AWS reporting that services had returned to normal by the evening of Oct. 20. However, some customers reported residual errors and backlogs for several hours after the incident.
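
Detecting this class of failure quickly is, in practice, a monitoring problem. The following minimal Python sketch, offered as an illustration rather than a prescription, probes whether the regional endpoints an application depends on still resolve; the hostnames are the standard public endpoints for the services named, and the alerting step is a placeholder for whatever tooling an organization already uses.

```python
import socket
import sys

# Regional endpoints to probe. These are the standard public endpoints for
# DynamoDB and EC2 in US-EAST-1 and the global STS endpoint; substitute the
# endpoints your own workloads actually depend on.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "ec2.us-east-1.amazonaws.com",
    "sts.amazonaws.com",
]

def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

if __name__ == "__main__":
    failed = [host for host in ENDPOINTS if not resolves(host)]
    for host in failed:
        # Placeholder: replace with a real alerting hook (pager, chat, etc.).
        print(f"DNS resolution failed for {host}", file=sys.stderr)
    sys.exit(1 if failed else 0)
```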

The disruption affected thousands of services globally; Snapchat, Ring, Robinhood, McDonald's mobile ordering, Signal messaging and Fortnite game servers were among the notable services that went down. The blast radius was even larger: organizations without direct AWS contracts also experienced downtime because their SaaS vendors, payment processors and authentication services depended on the US-EAST-1 infrastructure.

"The AWS incident today is a timely reminder that the internet is far more interconnected and complex than most people realize," James Barnes, CEO and founder of website monitoring vendor StatusCake, said. "Even if your company doesn't directly host on AWS in the impacted region, it may well depend on services that do; whether that's authentication services, analytics platforms, payment gateways, your CRM or customer services platform, APIs and CDNs (content delivery networks).

The cloud dependency dilemma

Among the original promises of the cloud was the availability of regionally distributed and highly resilient infrastructure for applications.

However, much of the infrastructure is consolidated among just a few providers, with AWS, Microsoft Azure, and Google Cloud controlling a significant portion of the global infrastructure-as-a-service market. This consolidation creates concentrated points of failure, where a single regional outage can affect multiple industries simultaneously.

Betsy Cooper, executive director at Aspen Policy Academy, frames the concentration dynamic in terms of trade-offs. She noted that the internet now relies on a handful of large tech companies for its core infrastructure.

"On the one hand, this isn't necessarily a bad thing; these companies have huge incentives to keep data secure and to prevent outages," Cooper said. "But on the other hand, when failures happen, they can cause widespread harm."

Chirag Mehta, vice president and principal analyst at Constellation Research, explained how those harms propagate through technical dependencies.

"Many organizations were impacted indirectly because their software supply chain relies on AWS, even when they don't realize it," Mehta said. "SaaS applications, APIs, authentication providers and data-integration tools often sit on AWS. When one layer of that chain fails, it cascades quickly across dependent systems."

Regulatory bodies are responding to these systemic risks. The U.K.'s Financial Conduct Authority and the European Banking Authority now classify major cloud providers as critical third parties, subject to operational resilience requirements. These requirements necessitate that financial institutions map dependencies, assess concentration risk and demonstrate continuity capabilities in the event of provider failures.

The business continuity blind spot

Most business continuity and disaster recovery plans address internal failure scenarios, such as server crashes, storage failures, network partitions, and data center incidents. They assume operational control over the failure domain. Cloud outages introduce external dependencies that standard disaster recovery frameworks often fail to capture.

The AWS outage exposed specific architectural gaps:

  • DNS resolution services failed, preventing automated failovers.
  • Control plane APIs became unavailable, blocking infrastructure changes needed to reroute traffic.
  • Database connection pools were exhausted when primary regions stopped responding (a client-side mitigation is sketched after this list).
  • Shared services, such as IAM, CloudWatch and Systems Manager, created single points of failure across regions, even for multi-region deployments.
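
The third gap, exhausted connection pools, is one that client-side design can bound. The sketch below is a minimal, illustrative circuit breaker in Python; `query_primary_region` and the thresholds are hypothetical stand-ins. Once a dependency fails repeatedly, callers fail fast or fall back instead of tying up connections waiting on a region that isn't answering.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, callers fail fast
    instead of exhausting connection pools waiting on an unhealthy region."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before one trial call is allowed
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency marked unhealthy")
            self.opened_at = None       # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Hypothetical usage: query_primary_region stands in for a database call
# that already carries a short client-side timeout.
breaker = CircuitBreaker()

def query_primary_region():
    ...

def handle_request():
    try:
        return breaker.call(query_primary_region)
    except Exception:
        return "degraded response served from cache or a secondary region"
```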

Dominic Green, cloud practice lead at Northdoor, explained that the October AWS incident exposed systematic underestimation of risk.

"This event exposes a critical blind spot in risk assessments—companies often underestimate the depth of their supply chain and cloud provider dependencies," Green said.

The outage revealed specific gaps in enterprise preparedness.

Sophistication doesn't equal readiness

Green noted that even highly sophisticated IT organizations can be underprepared for widespread failures by cloud providers.

"Many CIOs focus contingency plans on classic disasters, hardware failure, cyberattacks or data center loss, yet often overlook the systemic vulnerabilities introduced by single-region reliance or untested failover strategies," he said.

Misplaced trust in availability metrics

Todd Renner, senior managing director in FTI Consulting's Cybersecurity practice, told Informa TechTarget that organizations have been transitioning from on-premises environments to fully cloud-native ecosystems for years, and IT leaders have trusted providers to ensure they are meeting availability requirements.

"This outage is a reminder that many cloud providers measure their available uptime in terms of '6 nines' (99.9999%) uptime, not 100%, and these servers are frequently outside of the physical control of an organization's IT team," Renner said.

The vendor concentration risk

Vendor lock-in typically refers to the switching costs associated with proprietary APIs and data formats. The October AWS outage revealed a different dimension: reliability concentration. When a single provider hosts most of an organization's workloads, that provider's availability becomes the ceiling for the organization's overall availability.
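
The ceiling effect is easy to quantify. If every request path depends on one provider with availability a, the workload's availability cannot exceed a; with genuinely independent active-active deployments across two providers, combined availability rises to 1 - (1 - a1)(1 - a2). The snippet below works through illustrative numbers; the figures are examples, not any provider's actual SLA.

```python
# Illustrative figures only, not any provider's actual SLA.
single_provider = 0.999              # every request path runs through one provider
provider_a, provider_b = 0.999, 0.995

# Serial dependency: the workload is up only when the sole provider is up.
print(f"single-provider ceiling:   {single_provider:.4%}")

# Independent active-active: the workload is down only when both providers are.
combined = 1 - (1 - provider_a) * (1 - provider_b)
print(f"independent dual-provider: {combined:.4%}")
```

In practice, two providers' failure modes are rarely fully independent, because shared DNS, SaaS and authentication dependencies couple them, which is precisely why dependency mapping matters.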

"The companies that went dark without even using AWS discovered just how entangled today's software supply chains are," Mehta said. "Even indirect dependencies, SaaS vendors, APIs or auth systems, can bring you down."

This mirrors supply chain concentration risk in physical industries. Manufacturers with single-source suppliers face production halts when those suppliers fail. Organizations with single-cloud architectures are vulnerable to service disruptions when their providers fail.

Cooper noted that third-party companies have always been a significant vulnerability. For example, the massive 2013 Target breach occurred because attackers first compromised the retailer's third-party HVAC vendor and used its credentials to get inside Target's network.

"Companies need to realize that it doesn't matter whether a disruption occurs because of internal or external causes," Cooper said. "What matters is how you respond and restore services quickly to your customers."

Lessons and leadership imperatives

The October AWS outage made it clear that cloud resilience requires deliberate architectural and governance decisions, rather than relying on assumptions about provider reliability. Technology leaders must address the following areas:

  • Visibility. Organizations must map all critical workloads, SaaS dependencies and cloud regions in use. Green emphasized that conducting a thorough dependency mapping to understand all cloud and third-party service interconnections provides transparency essential for identifying single points of failure hidden deep in the supply chain. Mehta views visibility as a key component of a software supply chain audit. "Identify every critical application, data flow and third-party service that touches cloud infrastructure, directly or indirectly," Mehta said. (A minimal starting point for such an inventory is sketched after this list.)
  • Diversification. For critical systems, evaluate multi-region and multi-cloud strategies. Renner advises organizations to identify critical applications and system owners, understand the data implications and consider implementing redundant/failover solutions with multiple cloud providers. Mehta cited Netflix as proof that architectural choices matter: "Netflix stayed up because reliability was engineered into their DNA. The lesson isn't to avoid AWS; it's to architect for failure."
  • Governance. Elevate cloud risk to board level. Green recommends that boards should mandate the implementation of multi-region or hybrid-cloud strategies to reduce reliance on any one service or location, complemented by regular failover testing. Renner emphasizes the importance of executive engagement, noting that the C-suite, board, and possibly investors should be made aware of the risks and costs associated with using a single cloud provider versus multiple cloud providers, on-premises hosting and multiple co-locations.
  • Communication. Develop crisis response protocols for third-party outages, including pre-drafted customer communications, internal status pages and clear accountability for external vendor failures.
  • Collaboration. Engage cloud vendors beyond SLA credits. Renner advises organizations to review their service contracts and outline third-party risks associated with hosting data and applications in fully cloud-native environments. This includes transparency discussions about shared responsibility boundaries and advance notice of maintenance windows.
  • Practice. Don't wait for the next outage; prepare and test responses as part of operational resilience. Renner stressed that preparation is critical. He noted that organizations should rehearse and practice using tabletop exercises for the next unanticipated event that disrupts a critical business system. Mehta added that organizations must fund site reliability engineering (SRE) capacity, perform 'game-day' simulations and include resilience KPIs in business performance reviews.
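
As a starting point for the visibility step above, dependency mapping can begin as a simple inventory that is queried for shared points of failure. The sketch below is a hypothetical Python example; the service names and dependencies are invented, and a real inventory would be generated from configuration management and vendor records rather than typed by hand.

```python
from collections import defaultdict

# Hypothetical inventory: critical service -> the cloud regions and
# third-party providers it depends on, directly or via its SaaS vendors.
DEPENDENCIES = {
    "checkout":       {"aws:us-east-1", "payments-gateway", "auth-provider"},
    "mobile-app":     {"aws:us-east-1", "auth-provider", "push-notifications"},
    "support-portal": {"crm-saas", "auth-provider"},
    "data-pipeline":  {"aws:us-east-1", "aws:eu-west-1"},
}

def shared_dependencies(inventory: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return dependencies relied on by more than one critical service."""
    users = defaultdict(list)
    for service, deps in inventory.items():
        for dep in deps:
            users[dep].append(service)
    return {dep: svcs for dep, svcs in users.items() if len(svcs) > 1}

if __name__ == "__main__":
    for dep, services in sorted(shared_dependencies(DEPENDENCIES).items()):
        print(f"{dep} is a shared dependency of: {', '.join(sorted(services))}")
```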

Conclusion: Building a resilient digital core

The concentration of computing infrastructure among three major providers means regional outages will continue to affect multiple organizations simultaneously.

While the October 2025 cloud outage occurred at AWS, Mehta noted that it could have just as easily occurred at Azure or Google Cloud.

"The takeaway isn't to leave the cloud; it's to design for failure, test relentlessly and make resilience a board-level conversation," Mehta said.

Outages aren't going away, and another one is only a matter of time. While preventing outages is a valuable goal, being resilient when they happen matters even more.

"There is no reasonable way, no matter what governance structure is put in place, to prevent cloud outages on occasion," Cooper said. "So, companies need to prepare themselves by having a plan in place, if they need to go offline, to keep things operating as smoothly as possible."

Over the long term, the organizations that treat the October 2025 outage as a wake-up call will separate themselves from those that don't by taking the following steps:

  • Cross-functional integration. Renner notes that organizations that capitalize on the opportunity to transform and integrate cybersecurity and IT programs, bringing key stakeholders into the process, will be better positioned to minimize disruptions and restore operations independently.
  • Enterprise-wide ownership. Green emphasizes that business continuity, finance, legal and customer relations must all engage actively in cloud risk assessments and response planning. Elevating these discussions to executive and board levels ensures resilience becomes a shared responsibility.
  • Capability development. Organizations that invest in the skills, tooling and processes to manage operational uptime and resilience will be better prepared for the next disruption.

"Cloud resilience isn't a technical metric anymore; it's an enterprise capability," Mehta said. "Boards that treat it as business-continuity, not downtime management, will come out stronger after the next outage."

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.
