From simple scripts to Ansible orchestration and infrastructure as code, IT automation continues to proliferate in the enterprise. And many of the software tool sets that IT teams use to achieve this automation are hosted in the cloud.
As with any SaaS tool a business relies on for success, there should be a plan for if, or when, cloud-based IT automation software fails. When an IT system goes offline, policies and procedures normally exist to handle the loss. Disaster recovery environments and processes spin up resources and handle failover. Staff is trained to use backup systems and methods until core systems return online.
However, IT automation isn't a single system, but rather a set of instructions that spans dozens of systems and works together in extremely intricate ways. The creation of a SaaS disaster recovery plan for IT automation software isn't easy -- but it's possible, and definitely recommended.
The documentation challenge with IT automation
One of the biggest mistakes is a belief that, should an outage occur, on-site personnel can easily replicate cloud-based IT automation tools and processes with manual efforts. Automation and infrastructure as code are complex routines that evolve over time to their current state -- which means they often lack proper documentation.
There might be some documentation that details how IT automation processes started, but that information is obsolete if staff doesn't update the documentation regularly. It isn't the cloud vendor's responsibility to document how organizations use its automation software; the vendor only documents how the tool works, not what users do with it. This problem escalates if the IT personnel who managed the automation tool no longer work for the company; new staff must try to determine how and why things were done with no context or history.
Store and audit playbooks
The first step to prepare a SaaS disaster recovery plan for IT automation tools is to download a copy of the automation playbooks. However, this approach only provides insight into how automated processes occur, not why -- which is just as important for IT teams to understand, particularly when it comes to complex, automated systems. What's more, even if the playbooks are on site, the automation engine itself remains in the cloud. While it's highly unlikely that the team has the on-site capabilities to run the downloaded playbooks, it's still helpful to have them for reference in the event of an outage.
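As a minimal sketch of how a team might keep track of those downloaded copies, the Python below builds a checksum manifest of a local playbook directory and compares two manifests to show which playbooks changed between downloads. It assumes the playbooks have already been exported to local disk as .yml or .yaml files; the function names and directory layout are hypothetical and not part of any vendor's tooling.

```python
import hashlib
from pathlib import Path


def build_manifest(playbook_dir):
    """Return one {path, sha256, bytes} entry per local playbook copy."""
    root = Path(playbook_dir)
    entries = []
    # Assumption: playbooks are YAML files ending in .yml or .yaml.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in {".yml", ".yaml"}:
            data = path.read_bytes()
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "sha256": hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })
    return entries


def changed_paths(old_manifest, new_manifest):
    """List paths added, removed or modified between two manifests."""
    old = {e["path"]: e["sha256"] for e in old_manifest}
    new = {e["path"]: e["sha256"] for e in new_manifest}
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Saving the manifest alongside each download gives staff a quick way to confirm, during an outage, that the reference copies on site match the last known state of the cloud-hosted playbooks.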
Second, perform audits of the most used and critical IT automation scripts and playbooks. This serves two purposes: It ensures there is current documentation of at least the most important processes, and it enables IT teams to look at the overall automation workflow and identify any steps that staff could perform manually, if needed. IT teams might not be able to do everything the playbook can do -- and certainly not as quickly or efficiently -- but this is the start of a contingency plan.
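One way to decide where such an audit should start is sketched below in Python. It ranks playbooks so that business-critical ones come first, then the most frequently run, and flags those whose documentation is missing or older than a chosen age. The record fields, the 180-day threshold and the top-10 cutoff are illustrative assumptions; real usage figures would have to come from whatever run history the automation tool exposes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class PlaybookRecord:
    # Hypothetical fields, filled in by hand or from the tool's run history.
    name: str
    runs_last_90_days: int
    business_critical: bool
    docs_last_reviewed: Optional[date]  # None means no documentation exists


def audit_queue(records, today, max_doc_age_days=180, top_n=10):
    """Return names of top-ranked playbooks with missing or stale documentation."""
    # Critical playbooks sort first; within each group, most-run first.
    ranked = sorted(
        records,
        key=lambda r: (not r.business_critical, -r.runs_last_90_days),
    )
    cutoff = today - timedelta(days=max_doc_age_days)
    return [
        r.name
        for r in ranked[:top_n]
        if r.docs_last_reviewed is None or r.docs_last_reviewed < cutoff
    ]
```

The output is only a starting list for the audit; deciding which steps in each flagged playbook could be done manually still requires people who understand why the process exists.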
The goal of this plan is to be able to operate core services until the cloud-based automation engine returns online. Treat this plan the same as other DR plans: It's not designed to replace the automation tool, but to keep the business operating until services are restored. This means more manual steps, longer lead times and possible mistakes can occur. To mitigate these issues, be transparent about outage response and resolution plans, so users understand that, while the contingency process is not ideal, its benefits far outweigh the drawbacks.
Watch out for lock-in
To shape any SaaS disaster recovery plan, understand how the cloud vendor allows for migration. As with other cloud offerings, importing information into cloud-based IT automation software is often much easier than extracting it. Research and verify that the vendor allows for migration before any purchase.