In the wake of its Dec. 7 service outage, AWS has acknowledged that its existing outage response was inadequate and that its methods for communicating service health updates also failed.
A Dec. 10 blog post detailed how providing updates via a global banner on the Service Health Dashboard left some customers in the dark. AWS promised to rearchitect its support system and release an updated version of the service performance console.
The outage occurred after an attempt to scale up the capacity of an unspecified service triggered "unexpected activity" in AWS's internal network, which handles foundational services including monitoring, DNS and authorization, said the company in a blog post. The unusual activity caused a surge in connection activity that overwhelmed the devices responsible for communicating between the main and internal networks.
The resulting loss of communication between the two networks took down customer services ranging from Coinbase to Roomba to Netflix. The outage also affected parent company Amazon's e-commerce site.
The outage affected the northern Virginia region, home to AWS's largest data center. The disruption lasted from 7:30 a.m. to 2:22 p.m. PST on Dec. 7, but response times for some services remained heightened until 6:40 p.m. PST.
AWS said it would replace its existing failover setup with an active-active multi-region support system architecture to improve its response time to outages.
The new system is more complex to set up and maintain because it requires more communication between servers to remain synchronized. However, it will ensure that services continue to be delivered even if one region experiences an outage, said Mattias Andersson, senior community training architect at A Cloud Guru, which provides IT training on cloud use. Pluralsight, an online educational company, owns A Cloud Guru.
"The difficulty is in making [an active-active architecture] seem like it's a single system, but it happens to be running in multiple places … data synchronization across large areas is really the core difficulty around building a system of this kind," Andersson said.
A system that relies on communication between two networks to switch to another region during an outage is vulnerable to communication disruptions caused by that same outage. But a system that balances all loads among all available processing capacity all the time would not be affected by such disruptions.
"Honestly, I think it's a really good approach," Andersson said.
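The difference between the two designs can be sketched in a few lines of Python. This is a toy model for illustration only, not AWS's actual architecture: the `Region` class, the `control_link_up` flag and the region names are all hypothetical, and real systems must also solve the data synchronization problem Andersson describes, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for one deployment region."""
    name: str
    healthy: bool = True


def route_failover(primary: Region, standby: Region,
                   control_link_up: bool) -> Optional[str]:
    """Failover model: the standby only takes over if the switch signal
    can travel over the control link between the two networks."""
    if primary.healthy:
        return primary.name
    if control_link_up and standby.healthy:
        return standby.name
    return None  # the outage also broke the link needed to fail over


def route_active_active(regions: List[Region],
                        request_id: int) -> Optional[str]:
    """Active-active model: every healthy region serves traffic all the
    time, so there is no separate switchover step that can fail."""
    live = [r for r in regions if r.healthy]
    if not live:
        return None
    return live[request_id % len(live)].name


# One region is down, and the outage has also cut the control link.
region_a = Region("region-a", healthy=False)
region_b = Region("region-b")

print(route_failover(region_a, region_b, control_link_up=False))  # None
print(route_active_active([region_a, region_b], request_id=7))    # region-b
```

In the failover model the request is dropped because the signal to switch never arrives, while the active-active model simply keeps sending traffic to the region that is still up.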
Defending against outages
The AWS outage lasted so long in part because it affected the visibility of network monitoring data. It also affected the Service Health Dashboard, which left AWS unable to communicate quickly with customers. AWS will update the dashboard in early 2022 to improve communication during outages.
Running workloads in the cloud is "essentially handing over the keys of the kingdom," said Gartner analyst Sid Nag. There isn't much that customers can do to keep their workflows moving during an outage.
However, enterprises can become more prepared in advance.
"I think that the best way for people to mitigate these sorts of issues is to learn about and understand how the cloud really works," Andersson said. "People need to not just follow some set of rules, but rather understand how the cloud actually works so they can use it to good advantage and work around a lot of the risks that they might never have understood before."
Achieving 100% uptime for all services all the time is prohibitively expensive and rarely necessary, Andersson said. Pursuing it would also reduce the cloud's speed and flexibility.
Instead, he recommended that IT departments explore the AWS Well-Architected Framework to evaluate whether they are using AWS cloud resources in an outage-resilient way.
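To put the 100% uptime point in perspective, the arithmetic below converts the main disruption window reported above (7:30 a.m. to 2:22 p.m.) into a yearly availability figure and shows how little downtime common availability targets allow. It is a rough sketch: the 365-day year and the listed targets are illustrative assumptions, not figures from AWS or Andersson, and it assumes the outage was the only downtime in the year.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


# Main disruption window from the article: 7:30 a.m. to 2:22 p.m.
outage = timedelta(hours=14, minutes=22) - timedelta(hours=7, minutes=30)
outage_minutes = outage.total_seconds() / 60

# Availability over a year if this outage were the only downtime.
availability_pct = 100 * (1 - outage_minutes / MINUTES_PER_YEAR)

print(f"Outage length: {outage_minutes:.0f} minutes")
print(f"Implied yearly availability: {availability_pct:.4f}%")

# Illustrative targets: each extra "nine" cuts the yearly budget tenfold.
for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% allows about {budget:.1f} minutes per year")
```

Each additional nine shrinks the yearly downtime budget by a factor of 10, which helps explain why the cost of chasing ever-higher uptime climbs so steeply.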
Madelaine Millar is a news writer covering network technology at TechTarget. She has previously written about science and technology for MIT's Lincoln Laboratory and the Khoury College of Computer Science, as well as covering community news for Boston Globe Media.