Virtual disaster recovery is one of those things that must work right the first time. It is often the go-to option for quickly restoring services before falling back to tape recovery. While tape technology is reliable and still widely used for backups, recovering large amounts of data from physical media is rarely appealing compared with the virtual alternative.
Unfortunately, many administrators make avoidable errors and oversights when working with virtual DR. Below are some of the common traps that administrators fall into when it comes to virtual disaster recovery -- from networking issues to human errors -- and tips on how to dodge them.
Applications need services, too
Just about every application organizations use today depends on some form of directory services, as well as other core services such as IP address and domain name services. It is key to ensure that every resource an application depends on can be failed over and runs correctly afterward. How that is accomplished depends on the DR configuration -- for example, whether the organization uses a cold site or a hot site. Frequently, the reason a DR test fails is a configuration gap or a missing dependent system.
Another common issue is that, when systems are brought up, there is no predefined startup order in place. Just about all virtual disaster recovery tools support configurable boot orders to ensure that failed-over servers start in the correct sequence. For example, they can ensure that the database back end comes up before the front-end web servers or application servers. This feature is included in most recovery products at no additional cost, and DR teams should take advantage of it if they are not already.
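The boot-order problem is, at heart, a dependency-ordering problem. The sketch below shows one way to derive a safe startup sequence from declared dependencies; the VM names and tiers are hypothetical examples, not output from any particular DR tool.

```python
from graphlib import TopologicalSorter

# Hypothetical service dependencies: each VM maps to the VMs
# that must already be running before it boots.
dependencies = {
    "web-frontend": {"app-server"},
    "app-server": {"database", "directory-services"},
    "database": {"dns"},
    "directory-services": {"dns"},
    "dns": set(),
}

# static_order() yields a sequence in which every VM appears
# after all of its dependencies -- a safe boot order.
boot_order = list(TopologicalSorter(dependencies).static_order())
print(boot_order)
```

Declaring the dependencies explicitly, rather than hard-coding a sequence, means the order stays correct as servers are added or removed from the protected group.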
Systems evolve over time as additional products are brought in to support business requirements. If those additions live outside the protected virtual machines, they become a gap: admins must account for the changes in the disaster recovery plan. This is one of many reasons DR teams should conduct periodic testing and validation of virtual environments, to ensure they work as expected when a real disaster occurs.
In simple environments, failing over to the DR site is easy. In environments with complex networks, things become more difficult. Make sure that, once a system has failed over, it can reach the services it needs on the necessary ports. Even a simple multi-tier application breaks if its front end can't talk to the database.
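A basic reachability check after failover can be scripted in a few lines. This is a minimal sketch; the endpoint hostnames and ports are hypothetical placeholders for whatever the application actually depends on.

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical post-failover checklist of service endpoints.
endpoints = [
    ("db.dr.example.com", 5432),    # database back end
    ("ldap.dr.example.com", 389),   # directory services
    ("dns.dr.example.com", 53),     # name resolution
]

for host, port in endpoints:
    status = "reachable" if check_port(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {status}")
```

Running a checklist like this immediately after failover surfaces firewall and routing gaps long before users start reporting a broken application.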
To perform a failover in the virtual environment, the DR management servers must themselves be available. If they are not, failover becomes difficult -- if not impossible. To mitigate this risk, ensure those servers are secure. If a CryptoLocker-style ransomware event occurs, it is critical that it does not reach the DR management server.
People and process
Humans are often the weak link in the chain. The act of failing over servers is relatively straightforward, but bringing the application back up at the DR site often requires more involvement.
For example, failing over to a new network address usually means the application's underlying configuration needs changing. But who in the organization takes care of this? A failover runbook is important and often overlooked. It may sound intimidating, but a runbook is just a list of tasks, when they are performed and by whom. It also helps avoid missing an important step when bringing a system back up in DR.
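A runbook doesn't need special tooling; even a simple structured list makes ownership explicit and checkable. The steps, owners and timings below are invented examples to illustrate the shape, not a recommended sequence.

```python
# A minimal runbook sketch: each step records what is done, when, and by whom.
runbook = [
    {"step": 1, "task": "Declare disaster and notify stakeholders", "owner": "DR lead", "when": "T+0"},
    {"step": 2, "task": "Fail over protected VMs at the DR site", "owner": "Virtualization admin", "when": "T+15m"},
    {"step": 3, "task": "Update application connection strings for DR addresses", "owner": "App admin", "when": "T+30m"},
    {"step": 4, "task": "Validate service health and sign off", "owner": "App owner", "when": "T+45m"},
]

def unowned_steps(runbook):
    """Flag steps with no named owner -- exactly the gap that stalls a real failover."""
    return [s["step"] for s in runbook if not s.get("owner")]

print("Steps missing an owner:", unowned_steps(runbook))
```

Keeping the runbook in a machine-readable form like this also makes it trivial to lint before every DR test, so an unassigned task is caught in review rather than mid-disaster.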
Another aspect of people and processes is that having a single administrator who does all the DR work is setting the company up for a single point of failure. Malware and disasters don't respect employees' vacation plans. This is relatively easy to fix by making sure that there is redundancy in the role of DR admin.
In a recovery scenario, administrators may not have access to much of the information they normally rely on. To ensure all critical information is available before a disaster happens, answer any outstanding configuration questions ahead of time and keep the answers with the DR documentation.
The remediation for many of these issues is to perform a full DR test at least annually. Testing helps ensure the proper network ports are open when it is time to perform a real disaster recovery. It is far easier to find and resolve these issues in a test scenario than when the pressure is on and the system needs to be up as quickly as possible.
All the documentation needs to be kept up to date, as does the configuration management database that holds the system configurations for the environment. Administrators do not want to be troubleshooting stale data while trying to perform a failover.