Don't make your disaster recovery planning process even harder than it is by trying to do too much or cutting corners. Careful planning is your best bet for a successful recovery.
At the start of the new year, many IT folks (and perhaps a few business managers) resolve to take steps to prevent avoidable interruption events and to cope with interruptions that simply can't be avoided. In short, they decide to get serious about data protection and disaster recovery planning for business IT operations.
Why the disaster recovery planning process can be so tough
Disaster recovery (DR) planning is a complex and time-consuming task when done properly, which helps to explain why, for the past few years, surveys have shown the number of companies with continuity plans on the decline. In one annual PricewaterhouseCoopers study, the share of companies with DR plans fell from roughly 50% of those previously surveyed to approximately 39% last year. The companies that actually test their plans are usually only a fraction of those that claim to have one, raising further concerns about the actual preparedness of firms with documented, but untested, plans.
Planning activity has also dropped off because of misperceptions about its necessity and value. It may seem intuitively obvious that "doing more with less" means "doing more with computers," and that downsizing staff increases dependency on the uninterrupted operation of automation resources while reducing tolerance for even a short-term interruption. Yet the connection between these insights and the need to ensure that automation is resilient and continuous isn't being made.
Money is also a hurdle, as it always is. Managers can always think of ways to invest money so that it makes more money for the organization -- an option that's generally preferred to spending money on a continuity capability that may never need to be used. With some economic uncertainty in today's marketplace, this normal preference to focus spending on initiatives with revenue-producing potential is even more distorted, often at the expense of initiatives focused solely on risk prevention.
DR is an investment
Common sense regarding the need to allocate budget, resources and time to the DR planning process may also be diminished by the marketecture and hype around technologies such as server virtualization, data deduplication, clouds and so on.
Over the past few years, vendors have spent considerable effort trying to convince users that a side benefit of those technologies is improved or increased protection for data and operations. "High availability trumps disaster recovery," according to one server virtualization hypervisor vendor's brochure. "Tape Sucks. Move On" was emblazoned on bumper stickers distributed at trade shows by a dedupe appliance vendor. "Clouds deliver Tier 1 data protection," claimed a service provider's PowerPoint. These statements suggest that disaster recovery planning is old school, replaced by resiliency and availability capabilities built into new products or services. However, most of these claims are downright false or, at least, only true with lots of caveats.
1. Don't think high availability equals DR. Perhaps the first and most important mistake to avoid when building a disaster avoidance and recovery capability is to believe vendor hype about the irrelevancy of DR planning. While improvements might be made in high-availability (HA) technology, this changes nothing about the need for continuity planning. HA has always been part of the spectrum of alternatives for accomplishing a recovery from a disaster event. However, the use of HA strategies has always been constrained by budget: HA (failover between clustered components) tends to be much more expensive than the alternatives, and is inappropriate for workloads and data that don't need to be made available continuously. For most companies, only about 10% of workloads actually fall into the "always on" category.
2. Don't try to make all applications fit one DR approach. A second common mistake in planning, and one closely related to the first mistake, is to try to apply a one-size-fits-all data protection strategy. Just as failover clustering isn't appropriate for all workloads, not all data requires disk-to-disk replication over distance, disk-to-disk mirroring, continuous data replication via snapshots or some other method. The truth is that most data can be effectively backed up and restored from tape. Using disk for everything, including backup data, may seem less complex, but it tends to be far more costly and far less resilient. Given the numerous threats to disk storage, the problems with vendor hardware lock-ins for inter-array mirroring and replication, the costs of WANs and their susceptibility to latency and jitter, and many other factors, disk-to-disk data protection may not be sufficient to protect your irreplaceable information assets. At a minimum, tape will provide resiliency and portability that disk lacks. Think "defense in depth."
3. Don't try to back up everything. Expecting all your data protection needs to be included in a single backup process is another common mistake. The truth is that a lot of your data, perhaps as much as 40% to 70%, is a mix of archival-quality bits -- important but non-changing data that should be moved off production storage and into an archive platform -- and dreck (duplicate and contraband data that should be eliminated from your repository altogether). Only approximately 30% of the storage you have today requires frequent backup or replication to capture day-to-day changes; the other 70% requires very infrequent backing up, if at all. You can take a lot of cost out of data protection and shave precious hours off recovery times if you segregate the archive data from the production data. Doing so will also reclaim space on your expensive production storage environment, bending the cost curve on annual storage capacity expansion and possibly saving enough money to pay for the entire data protection capability that you field.
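The effect of segregating archive data in item 3 can be illustrated with simple arithmetic. The sketch below is not a sizing method from the article: the 100 TB capacity and the 2 TB-per-hour backup throughput are hypothetical figures chosen only for illustration, and the 30% active fraction is the approximate figure quoted above.

```python
def backup_hours(capacity_tb: float, fraction: float, throughput_tb_per_hour: float) -> float:
    """Hours needed to back up the given fraction of total capacity."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    return capacity_tb * fraction / throughput_tb_per_hour


# Hypothetical environment, for illustration only.
TOTAL_TB = 100.0          # assumed production capacity
THROUGHPUT_TB_HR = 2.0    # assumed sustained backup throughput
ACTIVE_FRACTION = 0.30    # approximate share of data that changes day to day

everything = backup_hours(TOTAL_TB, 1.0, THROUGHPUT_TB_HR)
active_only = backup_hours(TOTAL_TB, ACTIVE_FRACTION, THROUGHPUT_TB_HR)

print(f"Back up everything:  {everything:.1f} hours")
print(f"Back up active data: {active_only:.1f} hours")
```

Under these assumptions the frequent backup job shrinks in proportion to the data moved into the archive; the same proportion applies to restore time for that data set.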
4. Don't overlook data that's not stored centrally. This mistake is forgetting about outlying data repositories. Not all important data is centralized in an enterprise SAN or some complex of scale-out network-attached storage (NAS) boxes. Mission-critical data may exist in branch offices, desktop PCs, laptops, tablets and, increasingly, smartphones. Recent surveys by TechTarget's Storage Media Group reveal that even before the rise of the bring-your-own-device (BYOD) era, companies weren't doing a very good job of including branch offices or PC networks in their data protection processes. In another study published this year, 46% of 211 European companies admitted they had never backed up user client devices successfully and that BYOD looms on the horizon as a huge exposure to data loss. You need to rectify this gap and may find it possible to do so with a cloud backup service, provided you do your homework and select the right backup cloud.
5. Don't mismanage data and infrastructure. Another mistake DR planning newcomers often make is ignoring the root causes of disaster: poor management of data and infrastructure. Lack of data management, or rather the failure to classify data according to priority of restore (based on what business workflow the data supports), is a huge cost accelerator in the disaster recovery planning process. Absent knowledge of which data is important, all data needs to be protected with expensive techniques. As for infrastructure, you can't protect what you can't see. The failure to field any sort of infrastructure monitoring and reporting capability means that you can't respond proactively to burgeoning failure conditions in equipment or plumbing, inviting disaster. These gaps can be addressed by deploying data classification tools (and archiving) to manage data better, and resource management tools to manage infrastructure better. And, with respect to infrastructure management, tell your equipment vendors that you will no longer be purchasing their gear if it can't be managed using the infrastructure management software you've selected. That will also have the effect of driving some cost out of your normal IT operations.
6. Don't try to duplicate equipment configurations at the recovery site. No. 6 in our countdown of DR preparation mistakes is developing a plan that replicates full production equipment configurations in the recovery environment. Given that only a subset of applications and data typically needs to be re-instantiated following a disruptive event, you don't need to design a recovery environment that matches your normal production environment on a one-for-one basis. Minimum equipment configurations (MECs) help reduce the cost of the DR environment and simplify testing. Often, there's also an opportunity to make use of server virtualization technology to host applications in the recovery environment that you may not entrust to a virtual server under normal circumstances. Testing is key to making the transition, whether from physical host to MEC host, or physical to virtual.
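Classifying data by restore priority, as item 5 recommends, can be as simple as a catalog that ties each data set to the business workflow it supports. The sketch below is one possible shape for such a catalog, not a description of any classification tool; the `DataSet` fields, the data set names and the protection labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str              # hypothetical data set identifier
    workflow: str          # business workflow the data supports
    restore_priority: int  # 1 = restore first
    protection: str        # protection method assigned to this class


def restore_order(datasets):
    """Return data sets sorted by restore priority, then by name."""
    return sorted(datasets, key=lambda d: (d.restore_priority, d.name))


# Hypothetical catalog, for illustration only.
catalog = [
    DataSet("hr-archive", "HR records", 3, "tape backup"),
    DataSet("order-db", "order entry", 1, "replication"),
    DataSet("file-shares", "office documents", 2, "tape backup"),
]

ordered = restore_order(catalog)
for d in ordered:
    print(f"{d.restore_priority}: {d.name} ({d.workflow}) -> {d.protection}")
```

With a catalog like this, only the highest-priority classes need the expensive techniques; everything else can be assigned a cheaper method.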
7. Don't forget to fortify your WAN connections. Vesting too much confidence in WANs and underestimating the negative impact they can have on recovery timeframes is in the No. 7 slot on our list of DR planning process mistakes. WANs are services that must be properly sized and configured, and that must perform at peak efficiency to facilitate data restoration or to support remote access to applications either at a company-owned facility or in a cloud hosting environment. Regardless of the service-level agreement promised by your cloud host or cloud backup service provider, your actual experience depends on the WAN. Don't forget about providing redundancy (a supplemental WAN service supplied via an alternative point of presence) in case your primary WAN is taken out by the same disaster that claims your production environment. And also keep in mind that your WAN-connected remote recovery facility or backup data store should be at least 80 kilometers from your production site and data as a hedge against both sites being disabled by a disaster with a broad geographical footprint. Most metropolitan-area networks that provide lower cost, high-bandwidth multiprotocol label switching (MPLS) connections do NOT provide sufficient separation to survive hurricanes, dirty bombs or other big footprint disasters.
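The 80-kilometer guideline in item 7 can be checked against site coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km to estimate great-circle distance. The function names and sample coordinates are hypothetical, and straight-line distance is only a first check: it says nothing about shared power grids, flood plains or network paths.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
MIN_SEPARATION_KM = 80.0   # minimum separation suggested in the text


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(production, recovery, minimum_km: float = MIN_SEPARATION_KM) -> bool:
    """True if the two (lat, lon) sites are at least minimum_km apart."""
    return great_circle_km(*production, *recovery) >= minimum_km


# Hypothetical coordinates, for illustration only.
production_site = (0.0, 0.0)
recovery_site = (0.0, 1.0)
print(f"{great_circle_km(*production_site, *recovery_site):.1f} km apart")
print("Meets guideline:", far_enough(production_site, recovery_site))
```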
8. Don't put too much trust in a cloud provider. While not yet as prominent as some of the aforementioned potential pitfalls, our eighth mistake is placing too much trust in a cloud service provider to deliver disaster application hosting or post-disaster data restoration. If you're using an online backup provider, for example, you've probably moved data to the backup cloud in a trickling fashion over time. You might be surprised how much data has amassed at the service provider, and the length of time and the amount of resources that would be required to transfer it back to a recovery environment. Remember: Moving 10 TB over a T1 network takes at least 400-odd days. Alternatively, if your plan is to operate applications at a cloud infrastructure provider, using the latter as a "hot site" for example, then be sure to visit the cloud provider's facility in person. In the 1970s, when hot site facilities were first introduced, there was a guy selling subscriptions to a non-existent hot site who, once his scam was discovered, retired to a non-extradition country before he could be arrested. At a minimum, if you plan to use a cloud to host your recovery environment, make sure that it actually has all the bells and whistles listed in the brochure, including that Tier-1 data center.
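The T1 figure in item 8 can be checked with simple arithmetic. The sketch below assumes the nominal T1 line rate of 1.544 Mbit/s and decimal terabytes; the 0.8 efficiency factor is a hypothetical allowance for protocol overhead and contention, not a measured value. At full line rate the result is roughly 600 days, which is consistent with the "at least 400-odd days" lower bound quoted above.

```python
def transfer_days(data_bytes: float, link_bits_per_sec: float, efficiency: float = 1.0) -> float:
    """Days needed to move data_bytes over a link at the given usable fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


T1_BPS = 1_544_000     # nominal T1 line rate in bits per second
TEN_TB = 10 * 10**12   # 10 TB in bytes, decimal terabytes (assumption)

ideal = transfer_days(TEN_TB, T1_BPS)
realistic = transfer_days(TEN_TB, T1_BPS, efficiency=0.8)  # 0.8 is hypothetical

print(f"At full line rate: {ideal:.0f} days")
print(f"At 80% efficiency: {realistic:.0f} days")
```

Running the same function against the actual volume held by a backup provider and the actual bandwidth available to the recovery site gives a quick sanity check on whether a restore over the wire is realistic at all.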
9. Don't let app designs foil DR. This mistake is procedural: planners need to stop accepting the notion that DR planning is a passive activity -- that you're dealt some cards and are required to play the hand as it was dealt. For business continuity capabilities to be fully realized, resiliency and recoverability should be built into applications and infrastructure from the outset. However, few DR-savvy folks have been given seats at the tables where applications are designed and infrastructures are specified. This must change going forward. Put bluntly, bad design choices are being made right now that will hamper some company's recovery efforts in the future, including the platforming of applications and data in proprietary server hypervisors or storage platforms, coding applications using insecure functions, employing so much caching that significant amounts of critical data will be lost if an interruption occurs and so on. If DR planners can get involved early on, better design choices can be made and IT can be made much more recoverable at a much lower cost.
10. Don't forget to follow the money. Management holds the purse strings, so it could be a big mistake if you don't make the case for your DR plan based on business value rather than technical terms. You need to show management that you're doing everything possible to drive cost out of the continuity capability without sacrificing plan efficacy. You also need to emphasize investment risk reduction and improved productivity enabled by the plan, thereby providing a full business value case. Only then will you have a chance of overcoming the natural reluctance of management to spend money on a capability that in the best of circumstances will never be used.
For the record, the greatest expense in DR planning isn't the cost of data protection, application re-instantiation or network re-routing; it's the long-tail cost of testing. So, try to build a capability that can be tested as part of day-to-day operations. That eases the burden on formal test schedules, which should serve as logistical rehearsals rather than as tests of whether data can be restored.
About the author:
Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.