

Design data center redundancy based on cloud outage lessons

Public cloud providers have had catastrophic outages, but IT administrators can learn from cloud redundancy failures and successes -- and apply them to on-premises infrastructures.

IT administrators can use cloud redundancy examples and lessons learned from recent outages in the design of their on-premises data center redundancy.

Cloud platforms, especially large, public ones, have many levels of redundancy, but none are impervious to the occasional unexpected outage. Cloud platforms, like every technology inside a data center, can encounter issues that lead to downtime.

On-premises hardware failures and software bugs are inevitable, but the public cloud gives IT administrators ways to mitigate these problems. Public clouds have availability sets, redundant data centers, availability zones and regions that enable admins to plan for disruption. These various tactics are important, but admins must design reliable applications that take advantage of public cloud redundancy features.

The idea that the resiliency of an organization's IT assets depends more on the application than on the infrastructure differs from the traditional way of thinking. For the past two decades, admins have maintained resilience through backups, replication and other infrastructure-centric technologies. However, with the vast majority of public cloud platforms, this strategy no longer works.

In recent years, the cloud provider industry has suffered more than its fair share of outages. Each time a cloud provider has an outage, the industry appears to learn lessons about designing native cloud redundancy. Admins can apply these lessons to traditional virtualized data center redundancy.

Rethink data center redundancy

Some of the services that are fundamental to nearly every traditional organization, such as the Network Time Protocol and network routing, are designed to be highly redundant. However, just because a system is designed for redundancy doesn't mean its configuration takes full advantage of it. For example, an NTP client configured with a single time source gains nothing from the protocol's ability to use several.


Some core services don't have an option that makes them highly available. IT departments almost always have technical debt that they must deal with, which demands the support of legacy systems that don't behave optimally. For example, some legacy application authentication systems can only exist on a single server, which limits data center redundancy.

Admins shouldn't put all their eggs in one basket. For most traditional deployments, it's best to have redundant hardware inside redundant data centers. Admins can take this tactic further by using redundant virtualization clusters that don't share systems -- similar to a cloud provider that has several availability zones. This can enable an application to rely on a higher level of data center redundancy, but it's only worthwhile if the business requirements warrant that level of protection.
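To make the availability zone analogy concrete, the following is a minimal sketch of spreading an application's replicas across independent clusters and checking how many survive if the busiest one fails. The function names and the zone labels are hypothetical, and round-robin placement is only one assumed policy; real schedulers weigh capacity, licensing and other constraints.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(replica_count, zones):
    """Assign replicas to zones round-robin so no zone is favored.

    `zones` is a list of labels for clusters that share no systems
    (hypothetical names, for illustration only).
    """
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def worst_case_survivors(placement):
    """Return how many replicas remain if the most heavily used zone fails."""
    if not placement:
        return 0
    counts = Counter(placement)
    return len(placement) - max(counts.values())


if __name__ == "__main__":
    placement = spread_across_zones(4, ["zone-a", "zone-b"])
    print(placement)
    print(worst_case_survivors(placement))
```

A result of zero from `worst_case_survivors` means every replica sits in one zone, which is the "all eggs in one basket" case.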

When the same tactics are in place in both the cloud and the data center, it can help to use public cloud provider terminology, such as availability zones, when talking about the analogous traditional infrastructure. This language makes infrastructure concepts easier for developers to understand because they're likely already familiar with the equivalent cloud concepts.

Cloud redundancy isn't perfect, and outages offer lessons

A highly redundant system isn't immune to performance degradation. During a recent public cloud outage, a directory service could not stay up as traffic shifted from one region to another. The rerouted traffic overloaded the remaining regions, which left the service unable to keep up with demand.

When admins design data center redundancy, they must plan around the load in the event of an outage. An admin might have two servers to support data center redundancy, but one server might not be able to handle the entire load. The key is to design systems that meet business requirements under normal and abnormal states.
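The two-server example can be checked with simple arithmetic. The sketch below assumes each server's capacity and the peak load are known in the same unit (requests per second here, chosen only for illustration); the function names are hypothetical.

```python
import math


def survives_failures(capacities, peak_load, failures=1):
    """Return True if the surviving nodes can carry peak_load.

    Worst case is assumed: the `failures` largest nodes are the ones lost.
    """
    if failures >= len(capacities):
        return False
    remaining = sorted(capacities)[: len(capacities) - failures]
    return sum(remaining) >= peak_load


def nodes_needed(peak_load, per_node_capacity, failures=1):
    """Identical nodes required to carry peak_load with `failures` nodes down."""
    return math.ceil(peak_load / per_node_capacity) + failures


if __name__ == "__main__":
    # Two servers rated at 1,000 requests/s each, with a peak of 1,400 requests/s.
    print(survives_failures([1000, 1000], 1400))
    print(nodes_needed(1400, 1000))
```

With these assumed numbers, two servers are redundant on paper but cannot carry the peak alone, so a third is needed to meet the requirement in the abnormal state.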

Many organizations assume things are configured correctly but find out otherwise during an outage. Netflix's Chaos Monkey, a tool that deliberately terminates production instances, is known for exposing exactly this kind of gap. If admins don't practice dealing with an actual outage, they'll never know how the system will react. No system is an island, and each application and service has dependencies that further complicate testing.
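A paper version of such a drill can be sketched before touching production. The code below is not Chaos Monkey; it is a simplified simulation that takes each node down in turn and reports which services go dark, including those that fail only through a dependency. The service and node names are hypothetical, and the dependency graph is assumed to have no cycles.

```python
def is_available(service, replicas, depends_on, down):
    """A service is up if one replica is up and every dependency is up.

    Assumes the dependency graph is acyclic; a cycle would recurse forever.
    """
    if all(node in down for node in replicas[service]):
        return False
    return all(
        is_available(dep, replicas, depends_on, down)
        for dep in depends_on.get(service, [])
    )


def single_failure_drill(replicas, depends_on):
    """Fail each node in turn and map it to the services that become unavailable."""
    impact = {}
    for nodes in replicas.values():
        for node in nodes:
            lost = [
                service
                for service in replicas
                if not is_available(service, replicas, depends_on, {node})
            ]
            if lost:
                impact[node] = sorted(lost)
    return impact


if __name__ == "__main__":
    # Hypothetical inventory: the legacy auth system lives on a single server.
    replicas = {
        "web": ["web-1", "web-2"],
        "auth": ["auth-1"],
        "db": ["db-1", "db-2"],
    }
    depends_on = {"web": ["auth", "db"]}
    print(single_failure_drill(replicas, depends_on))
```

In this assumed inventory, the web tier has two servers yet still goes down when the lone authentication server fails, which is the kind of hidden dependency a real drill is meant to surface.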

Outages happen. This is true in the private data center and the public cloud, but inside the data center, admins typically know and understand the inner workings of the system. When a cloud provider has an outage, it's easy to feel as though there aren't lessons to learn from the outage, but that's far from the truth. Public cloud providers use different tools and methods than traditional data centers, but the lessons about building and sizing redundancy are universal.
