Implement cloud failover, replication for workload protection

Businesses increasingly use public cloud services as workload protection targets. But cloud failover and replication are hardly hands-off processes for IT teams.

Proper workload protection, through practices such as failover and replication, ensures adequate availability for users and helps enterprises meet regulatory and governance obligations.

Public cloud is a practical failover and replication target for many businesses -- but it doesn't automatically eliminate the technical and logistical issues involved with workload protection. Follow these steps to ensure things go smoothly.

Carefully select cloud failover, replication targets

Replication and failover are vastly different technologies that address two different levels of workload protection. Each approach also offers various subsets of capabilities. For example, failover can be cold, warm or hot, while replication can range from regular VM snapshots to periodic system backups.

Deciding which public cloud resources to use for each task -- and where those resources are physically located -- can have a serious effect on workloads' recoverability.

First, know where the failover or replication target is located. Public cloud providers typically operate a global footprint of data centers, grouped into geographic regions. This enables a business to place data or operate workloads on a global scale. But it also adds regulatory complexity and technical issues, such as latency.

Replicating large data sets over great distances adds significant latency, which slows the replication process and can reduce workload performance. Recovering replicated data from a distant region can also be time-consuming. In some cases, placing workloads in a distant public cloud region might also violate regulatory or governance requirements.

The best practice here is to keep workload protection targets as close as possible to the primary data center and its users.

Ensure adequate bandwidth and connectivity

Bandwidth is crucial for data-intensive workload protection tasks, such as replication, snapshots and backups. This is typically a WAN issue, and IT teams should address it with adequate, reliable internet connectivity.

When a business has more bandwidth, it can move more data to the cloud in less time -- but more bandwidth also costs more money. When enterprises use the public cloud for workload protection, they need to balance time demands and data volumes with network bandwidth costs. In some cases, it might be necessary to move less data or take more time for a replication, snapshot, backup or other data protection task.
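The tradeoff between data volume, bandwidth and time can be estimated with simple arithmetic. The following is a rough sketch, not a sizing tool: the function name and the `efficiency` factor (the share of nominal bandwidth actually available to the job) are assumptions made for illustration.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move data_gb over a bandwidth_mbps link.

    efficiency is an assumed fraction of nominal bandwidth that the job
    really gets after protocol overhead and competing traffic.
    """
    if data_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("expected data_gb >= 0, bandwidth_mbps > 0, 0 < efficiency <= 1")
    megabits = data_gb * 8 * 1000  # decimal gigabytes -> megabits
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def fits_window(data_gb: float, bandwidth_mbps: float, window_hours: float) -> bool:
    """Return True if the estimated transfer finishes inside the window."""
    return transfer_hours(data_gb, bandwidth_mbps) <= window_hours
```

Under these assumptions, moving 450 GB over a 100 Mbps link at 80% efficiency takes about 12.5 hours, so it would not fit an eight-hour overnight window. The options are the ones described above: buy more bandwidth, move less data or allow more time.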

To improve the consistency and reliability of available WAN bandwidth, use a direct or dedicated network connection between on-premises data centers and public cloud regions. Services such as AWS Direct Connect and Azure ExpressRoute can provide these direct connections, which establish a dedicated network circuit to the cloud provider's data center for more consistent traffic handling.

Consider the operational order

Any replication or failover process requires a logical order or sequence. It might seem appealing to trigger a replication for every workload at the same time or fail over every workload to the cloud at the same time, but that can strain network bandwidth and delay those processes at critical times.

Instead, space out workload protection processes in a way that can mitigate network loading and prioritize tasks. For example, space out replications so that each server or group is handled on a different schedule, and let that schedule reflect the importance of the associated workloads -- critical workloads should be replicated more often than noncritical ones.
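A staggered schedule of this kind might look like the sketch below. The tier names, the replication intervals and the five-minute gap are hypothetical values chosen for illustration; real numbers depend on recovery objectives and available bandwidth.

```python
from dataclasses import dataclass

# Hypothetical replication intervals, in minutes, for each importance tier.
INTERVALS = {"critical": 15, "important": 60, "noncritical": 240}


@dataclass(frozen=True)
class Slot:
    workload: str
    interval_min: int  # how often the workload replicates
    offset_min: int    # delay before its first run


def stagger(workloads: dict[str, str], gap_min: int = 5) -> list[Slot]:
    """Give each workload its tier's interval and a distinct first-start offset.

    Workloads in more frequent tiers start first. Only the first starts are
    spread out; later runs can still coincide, so a real scheduler would
    also cap the number of concurrent jobs.
    """
    order = sorted(workloads, key=lambda name: (INTERVALS[workloads[name]], name))
    return [
        Slot(name, INTERVALS[workloads[name]], index * gap_min)
        for index, name in enumerate(order)
    ]
```

Calling `stagger({"web": "noncritical", "db": "critical", "mail": "important"})` places `db` first on a 15-minute cycle, then `mail`, then `web`, each starting five minutes after the previous one.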

What's your address again?

When a workload fails over to the public cloud, it receives a new public IP address that differs from the address it used on premises. Users of web servers and enterprise applications still expect to reach those workloads at the old address, so IT teams need to carefully repoint IP addresses so that users can find and use workloads after failover occurs.

There are numerous ways to handle IP address migration. Typically, it involves a Domain Name System (DNS) service that can make the change when a fault is detected. Cloud providers offer their own DNS services, such as Amazon Route 53 from AWS, while third-party services, like DNS Made Easy, can also automate the process.
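As an illustration of the DNS repointing step, the sketch below builds the kind of change batch that Amazon Route 53 accepts for updating an A record. It only constructs the request data and does not call AWS. The helper name, the hostname, the IP address and the 60-second TTL are placeholders, and the fault detection that would trigger the change is assumed to exist elsewhere.

```python
import ipaddress


def failover_change_batch(record_name: str, failover_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53-style change batch that points record_name at failover_ip.

    A short TTL (assumed here to be 60 seconds) limits how long resolvers
    keep serving the old address after the change.
    """
    ipaddress.IPv4Address(failover_ip)  # raises ValueError on a malformed address
    return {
        "Comment": "Repoint to cloud failover target",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or overwrite it if present
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": failover_ip}],
                },
            }
        ],
    }
```

With the AWS SDK for Python, a dictionary like this would typically be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, along with the hosted zone ID. Managed failover routing policies and health checks can make the same switch without custom code.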

Failovers should follow the same staggered, priority-based approach as replications: The most critical systems should fail over and power on first, followed by small groups of other systems, ordered by their importance to the business. Dependencies matter, too. For example, it's important to fail over database servers before failing over the applications that rely on them.
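One way to derive such an order is a topological sort over workload dependencies, sketched here with Python's standard `graphlib` module. The workload names are hypothetical, and the sketch orders by dependency only; business priority would still need to decide which systems within a group come first.

```python
from graphlib import TopologicalSorter


def failover_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group workloads into waves so each one starts after its dependencies.

    depends_on maps a workload to the workloads that must be running first.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are up
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as running to unlock the next one
    return waves


# Hypothetical example: two applications share a database.
dependencies = {
    "orders-app": {"orders-db"},
    "reporting": {"orders-db"},
    "web-frontend": {"orders-app"},
}
```

For this example, `failover_waves(dependencies)` yields three waves: the database alone, then `orders-app` and `reporting` together, then `web-frontend`.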

Weigh the cloud data costs

Generally, the cost of cloud resources in a failover scenario is acceptable compared to the potential revenue and regulatory costs of being offline. But replication, snapshots, backup, archival storage and other data protection strategies in the cloud can deliver sticker shock.

Since public cloud providers charge for storage based on capacity and egress traffic, replication to the public cloud should include stringent data lifecycle management that controls what is stored, in which storage class and for how long. To limit storage retention and reduce costs, delete old or unneeded data from the cloud. For data that is committed to long-term retention, such as archives, use storage services designed and priced for long-term infrequent access, such as Amazon S3 Glacier and Google Cloud Storage Nearline.
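A lifecycle policy of this kind boils down to sorting stored copies by age. The sketch below is a minimal illustration; the 30-day and 365-day thresholds are hypothetical, and in practice the cloud provider's own lifecycle rules would usually enforce the policy.

```python
from datetime import date


def lifecycle_plan(
    snapshot_dates: list[date],
    today: date,
    archive_after_days: int = 30,   # assumed threshold for moving to archive storage
    delete_after_days: int = 365,   # assumed retention limit
) -> dict[str, list[date]]:
    """Sort snapshots into keep, archive and delete buckets by age in days."""
    plan: dict[str, list[date]] = {"keep": [], "archive": [], "delete": []}
    for taken in sorted(snapshot_dates):
        age_days = (today - taken).days
        if age_days >= delete_after_days:
            plan["delete"].append(taken)   # past retention: stop paying for it
        elif age_days >= archive_after_days:
            plan["archive"].append(taken)  # rarely read: cheaper storage class
        else:
            plan["keep"].append(taken)     # recent: keep in fast storage
    return plan
```

Running the plan on a schedule keeps stored capacity, and therefore the monthly bill, from growing without limit.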

Test the environment regularly

It's vital to test a workload protection process to verify that it actually works.

For replications, this might simply involve recovering a snapshot to a VM in a test environment. Similarly, IT staff can manually invoke a failover and verify that the cold, warm or hot deployment operates as expected.
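Part of such a test can be automated. One possible check, offered here as an assumption rather than a prescribed method, is to compare a SHA-256 digest of a restored file with the digest of the original. The function names are illustrative, and the comparison is only meaningful if the original has not changed since the snapshot was taken, or if its digest was recorded at snapshot time.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(source: Path, restored: Path) -> bool:
    """Return True if the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

A matching digest shows the data survived the round trip. It does not show that the application starts and serves users, which is why a manual or scripted failover test is still needed.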
