AWS, Microsoft Azure and Google Cloud all experienced service degradations or outages this week, an outcome that suggests customers should accept that cloud outages are a matter of when, not if.
In AWS's Frankfurt region, EC2, Relational Database Service, CloudFormation and Auto Scaling were all affected Nov. 11, with the issues now resolved, according to AWS's status page.
Azure DevOps services for Boards, Repos, Pipelines and Test Plans were affected for a few hours early on Nov. 11, according to its status page. Engineers determined that the problem had to do with identity calls and rebooted access tokens to fix the system, the page states.
Google Cloud said some of its APIs in several U.S. regions were affected, and others experienced problems globally on Nov. 11, according to its status dashboard. Affected APIs included those for Compute Engine, Cloud Storage, BigQuery, Dataflow, Dataproc and Pub/Sub. Those issues were resolved later in the day.
Google Kubernetes Engine also went through some hiccups over the past week, as nodes in some recently upgraded container clusters experienced high levels of kernel panics. Kernel panics, roughly the Linux counterpart to the Windows "blue screen of death," are conditions wherein a system's OS can't recover from an error quickly or easily.
The company rolled out a series of fixes, but as of Nov. 13, the status page for GKE remained in orange status, which indicates a small number of projects are still affected.
AWS, Microsoft and Google have yet to provide the customary post-mortem reports on why the cloud outages occurred, although more information could emerge soon.
Move to cloud means ceding some control
The cloud outages at AWS, Azure and Google this week were far from the worst experienced by customers in recent years. In September 2018, severe weather in Texas caused a power surge that shut down dozens of Azure services for days.
Cloud providers have aggressively pursued region and zone expansions to help with disaster recovery and high-availability scenarios. But customers must still architect their systems to take advantage of the expanded footprint.
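As a rough illustration of what architecting for multiple regions can mean in practice, the sketch below routes to the first region in a priority list that passes a health check. It is a minimal example under stated assumptions, not any provider's API: the function name `pick_healthy_region`, the region list and the simulated health data are all hypothetical, and a real deployment would replace the lambda with an actual probe.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose health check passes, or None."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None


# Hypothetical priority list; the region names are illustrative only.
REGIONS = ["eu-central-1", "eu-west-1", "us-east-1"]

# Simulated health state standing in for a real probe (assumption).
simulated_status = {
    "eu-central-1": False,  # pretend the primary region is degraded
    "eu-west-1": True,
    "us-east-1": True,
}

chosen = pick_healthy_region(REGIONS, lambda r: simulated_status[r])
print(chosen)  # eu-west-1
```

The health check is passed in as a function so the same selection logic can be exercised against simulated outages before it is relied on during a real one.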
Still, customers have much less control when it comes to public cloud usage, according to Stephen Elliot, an analyst at IDC. That reality requires some operational sophistication.
"Networks are so interconnected and distributed, lots of partners are involved in making a service perform and available," he said. "[Enterprises] need a risk mitigation strategy that covers people, process, technologies, SLAs, etc. It's a myth that outages won't happen. It could be from weather, a black swan event, security or a technology glitch."
This fact underscores why more companies are experimenting with and deploying workloads across hybrid and multi-cloud infrastructures, said Jay Lyman, an analyst at 451 Research. "They either control the infrastructure and downtime with on-premises deployments or spread their bets across multiple public clouds," he said.
Ultimately, enterprise IT shops that weigh the challenges and costs of running their own infrastructure against what public cloud providers offer will find the providers difficult to match, said Holger Mueller, an analyst at Constellation Research.
"That said, performance and uptime are validated every day, and should a major and longer public cloud outage happen, it could give pause among less technical board members," he added.