Failures and outages are an understandable, if not inevitable, outcome for cloud providers that operate at massive scale. AWS, Google and Microsoft must strike a balance between operational efficiency and high availability -- the former improved through automation and the latter bolstered by physically distinct zones and regions.
Availability zones are intended to provide redundancy, but widespread system failures can still occur within them. Regions provide an even greater degree of logical separation, but they introduce their own sets of problems for organizations that replicate workloads across them.
Despite the complexities and tradeoffs, enterprises need to understand the variables associated with regional isolation so they can design applications that are highly available and more resilient to systemic failures.
How availability zones and regions work together
AWS, Microsoft Azure and Google Cloud each rely on a three-tiered hierarchy of interconnected cloud facilities: individual data centers, loosely coupled availability zones and geographically separated regions.
An availability zone is a collection of two or more data centers that are geographically close, usually in the same metropolitan area, and connected by redundant, extremely high-speed, low-latency network circuits. A group of availability zones makes up a region.
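The three-tier hierarchy described above can be sketched as a simple data model. This is an illustrative Python sketch; the region, zone and data center names are made up, not real provider identifiers.

```python
from dataclasses import dataclass

# Hypothetical model of the hierarchy: data centers group into
# availability zones, and availability zones group into a region.

@dataclass
class DataCenter:
    name: str

@dataclass
class AvailabilityZone:
    name: str
    data_centers: list  # two or more nearby facilities

@dataclass
class Region:
    name: str
    zones: list  # a group of availability zones

us_east = Region(
    name="us-east-example",
    zones=[
        AvailabilityZone("az-a", [DataCenter("dc-1"), DataCenter("dc-2")]),
        AvailabilityZone("az-b", [DataCenter("dc-3"), DataCenter("dc-4")]),
        AvailabilityZone("az-c", [DataCenter("dc-5"), DataCenter("dc-6")]),
    ],
)

print(len(us_east.zones))                  # zones in the region -> 3
print(len(us_east.zones[0].data_centers))  # data centers per zone -> 2
```

The key structural point is that redundancy exists at every tier: each zone contains multiple data centers, and each region contains multiple zones.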
Regions are connected via dedicated, high-performance circuits within what Microsoft describes as a "latency-defined perimeter." Providers often house more than one region in a country; for example, AWS has four commercial regions in the U.S. plus two dedicated to the public sector.
Many cloud services, including some databases, storage tiers and load balancers, are designed to automatically provision resources across availability zones. Other services, like AWS Identity and Access Management or the Azure CDN, are automatically available across all regions. As the Google outages demonstrate, the coupling of data centers can cause an entire availability zone to go down under the right conditions.
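The value of spreading a service across zones can be illustrated with a short sketch. This is not a real cloud API; it is a hypothetical round-robin placement showing why replicas in multiple zones survive the loss of any single zone.

```python
# Illustrative only: place service replicas across availability zones,
# then check which replicas survive if one entire zone fails.

ZONES = ["az-a", "az-b", "az-c"]

def place_replicas(service, count):
    """Round-robin one replica per zone, as a multi-AZ service might."""
    return [(service, ZONES[i % len(ZONES)]) for i in range(count)]

def surviving(replicas, failed_zone):
    """Replicas still reachable after a whole-zone outage."""
    return [r for r in replicas if r[1] != failed_zone]

replicas = place_replicas("db", 3)
# If az-b goes down entirely, two of the three replicas remain.
print(len(surviving(replicas, "az-b")))  # -> 2
```

A single-zone deployment, by contrast, would lose every replica in the same failure, which is the scenario zone-level isolation is meant to prevent.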
Variables in cloud vendors' approaches
The Google incidents also highlight differences in how cloud vendors manage availability zones. For example, AWS claims to have the greatest degree of facility independence. In a 2018 blog, Amazon CTO Werner Vogels said that if there's a power outage in one availability zone, it won't affect another, while a software bug in one availability zone is unlikely to spread to another. AWS also specifically builds tools and procedures so they can't impact multiple zones, he added.
Meanwhile, Microsoft has been slow to introduce availability zones and, as of publication, offers them in only 10 of its 54 global regions. This lack of widespread availability zones contributed to a massive Azure outage in its South Central US region last September.
Underscoring the value of low-latency connections within an availability zone, the Azure postmortem notes that cross-region replication adds too much latency for many services.
As for the connections between these data centers, cloud operators are notoriously secretive about the design of their internal facilities, systems and networks. However, a few public details shed light on their scale and sophistication.
AWS says the private network that connects its various data centers and regions "is built on a global, fully redundant, parallel 100 GbE metro fiber network that is linked via trans-oceanic cables."
Google operates a global private network that it estimates carries about 25% of all internet traffic. In 2018, Google said it spent $30 billion over the previous three years to build out its cloud infrastructure, including the addition of new regions and several submarine cables.
Google uses a homegrown software-defined network with a distributed control plane and network-fabric manager that provides high availability by replicating and sharing configuration data across multiple data centers. It also uses container clusters to support live workload migration and "hitless" data plane upgrades.
Microsoft says that Azure regions are designed to isolate failures to a single availability zone, so a failure in one zone should not affect the others.
Using regions within and between clouds
Distributing application workloads and databases across several regions is the best way to achieve high availability with cloud infrastructure. However, organizations must understand and design for network performance variability when doing so.
Cloud providers differ in how they route traffic, which can significantly affect network latency. None of the cloud providers guarantees or even publishes figures on their inter-region performance.
Yet, a detailed November 2018 study by ThousandEyes, a network monitoring vendor, provides a plethora of useful insights. While the entire presentation is worth studying, the following are some key points regarding cloud wide-area network performance.
- Performance within an availability zone is excellent and consistently meets the goal of sub-2 ms network latency.
- Inter-region performance is best in the U.S. and Europe, and worst in Asia and Oceania. Performance between U.S. and European regions is also quite good.
- Performance differs greatly among providers in Asia, where AWS has the highest network variability.
- AWS, Azure and Google Cloud peer their networks with each other, meaning that traffic between them does not leave their backbones to traverse the public internet. Thus, there is negligible packet loss and jitter when moving data between clouds, although there will be the same differences in inter-region latency as in single-cloud scenarios.
When using multiple cloud regions to improve application availability, consider the source of application traffic -- in other words, where the majority of end users reside -- as well as variability in inter-region network latency, and differences in how the three major cloud providers route global traffic.
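Weighing user location against inter-region latency can be reduced to a simple lookup, sketched below. The latency figures and region names are invented for illustration, not measured values from the ThousandEyes study.

```python
# Hypothetical inter-region latency table (milliseconds) from the
# perspective of two user populations. Figures are made up.
latency_ms = {
    ("us-users", "region-us"): 20,
    ("us-users", "region-eu"): 90,
    ("eu-users", "region-us"): 95,
    ("eu-users", "region-eu"): 25,
}

def best_region(user_group, regions):
    """Pick the serving region with the lowest latency for a user group."""
    return min(regions, key=lambda r: latency_ms[(user_group, r)])

regions = ["region-us", "region-eu"]
print(best_region("us-users", regions))  # -> region-us
print(best_region("eu-users", regions))  # -> region-eu
```

In practice the table would come from continuous measurement rather than static figures, since, as noted above, providers neither guarantee nor publish inter-region performance numbers.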