Data deduplication has gained a lot of traction in the data backup and recovery world over the past few years. When integrating data deduplication products into your IT disaster recovery strategy, there are a number of things you need to consider. Should you use source dedupe or target dedupe? Which one is better for disaster recovery? If you are backing up your data to tape and disk, are there other issues to keep in mind? W. Curtis Preston, independent backup expert and executive editor, discusses source and target deduplication, how these approaches differ, what that means from a disaster recovery perspective and more in this Q&A.
Using data deduplication products and disaster recovery -- Table of contents:
>> What are the different approaches to data deduplication?
>> Is one more appropriate for disaster recovery?
>> What if you are using disk and tape as part of your strategy?
>> Can you outline the concerns around data deduplication to tape?
What are the different approaches to data deduplication?
I separate them into two very broad categories: source deduplication and target deduplication. Target deduplication has gotten most of the traction and press. Data Domain, which EMC acquired for $2.5 billion, makes a target deduplication product, and there are several others like it. Basically, you use your regular data backup software, so the full amount of data you are backing up must still be transferred across the network. Only when it reaches the target, the storage you are backing up to, is it deduplicated.
Source deduplication is very different. It is software that runs on the server being backed up, communicates with a special backup server, and deduplicates data at the source. If a file or portion of a file has already been backed up and transferred over the network, it is never sent again.
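To make the difference concrete, here is a minimal Python sketch of the two approaches. It is an illustration under stated assumptions, not how any particular product works: the `ChunkStore` class, the tiny fixed 4-byte chunk size, and the function names are all hypothetical, and the small amount of network traffic needed to exchange fingerprints in the source case is ignored.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (an assumption; products may chunk differently)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest."""
    return hashlib.sha256(chunk).hexdigest()


class ChunkStore:
    """Hypothetical backup target that keeps one copy of each unique chunk."""

    def __init__(self):
        self.by_hash = {}

    def has(self, digest: str) -> bool:
        return digest in self.by_hash

    def put(self, chunk: bytes) -> None:
        self.by_hash[fingerprint(chunk)] = chunk

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.by_hash.values())


def target_dedupe_backup(data: bytes, store: ChunkStore) -> int:
    """Send every chunk over the network; the target deduplicates on arrival.

    Returns the number of bytes that crossed the network.
    """
    sent = 0
    for chunk in split_chunks(data):
        sent += len(chunk)   # every chunk is transferred
        store.put(chunk)     # duplicates collapse at the target
    return sent


def source_dedupe_backup(data: bytes, store: ChunkStore) -> int:
    """Fingerprint chunks on the client and send only chunks the store lacks.

    Returns the number of chunk bytes that crossed the network
    (fingerprint traffic is ignored in this sketch).
    """
    sent = 0
    for chunk in split_chunks(data):
        if not store.has(fingerprint(chunk)):
            sent += len(chunk)
            store.put(chunk)
    return sent


# Two nightly backups of a file in which only the last chunk changed.
day1 = b"AAAABBBBCCCCDDDD"
day2 = b"AAAABBBBCCCCEEEE"

target_store, source_store = ChunkStore(), ChunkStore()
target_sent = target_dedupe_backup(day1, target_store) + target_dedupe_backup(day2, target_store)
source_sent = source_dedupe_backup(day1, source_store) + source_dedupe_backup(day2, source_store)

print("target dedupe: bytes sent =", target_sent, "bytes stored =", target_store.stored_bytes())
print("source dedupe: bytes sent =", source_sent, "bytes stored =", source_store.stored_bytes())
```

In this toy model both approaches end up storing the same unique chunks; the difference is only in how much data has to cross the network to get there.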
Is one approach more appropriate than the other when using deduplication as part of your disaster recovery strategy? For example, if it is a priority to restore data really quickly can one approach outperform the other?
It's certainly not cut and dried; you can't say that one is always more appropriate than the other for disaster recovery. However, I will say that one is better than the other for certain types of disaster recovery strategies. For example, if we are talking about a remote server at a location that does not have IT personnel, source deduplication is the most appropriate option. When you need to perform a recovery, you can restore the data to a server at the backup site and then ship the server out to the remote site, eliminating the need to restore data over a WAN.
When we get into a larger remote data center or a central data center, that's where the differences come into play. While source deduplication is great for saving bandwidth, it has limitations with larger data sets. At first glance, these limitations might not seem that bad. For example, if I told you that a backup system running source deduplication could only restore at 250 MBps, and you are a small- to medium-sized business (SMB), you'd say "well gee, that's fine." But if you are a larger company with, say, 20 TB of data, you are looking at a restore that takes the better part of a day at best (about 22 hours at a sustained 250 MBps) and possibly much longer. So for single terabytes of data or less, source deduplication will work fine. For more data than that, you should be looking at target deduplication.
Having said that, I'm not trying to say that all source deduplication is slow, and all target deduplication is fast. Those are just general rules of thumb that apply.
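The restore-window arithmetic above is easy to check for your own environment. The following Python sketch is a back-of-the-envelope calculation only; it assumes decimal units (1 TB = 1,000,000 MB) and a restore rate that is sustained for the whole job with no overhead, so real restores will take longer. The function names are illustrative.

```python
def restore_hours(data_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to restore data_tb terabytes at a sustained rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("restore rate must be positive")
    total_mb = data_tb * 1_000_000  # decimal terabytes to megabytes
    return total_mb / rate_mb_per_s / 3600


def required_rate_mb_per_s(data_tb: float, window_hours: float) -> float:
    """Sustained MB/s needed to restore data_tb terabytes within window_hours."""
    if window_hours <= 0:
        raise ValueError("restore window must be positive")
    return data_tb * 1_000_000 / (window_hours * 3600)


for size_tb in (1, 5, 20):
    print(f"{size_tb:>3} TB at 250 MB/s: {restore_hours(size_tb, 250):.1f} hours")

print(f"20 TB in an 8-hour window needs about {required_rate_mb_per_s(20, 8):.0f} MB/s")
```

Running the numbers in both directions, from restore rate to hours and from a recovery time objective to the rate you need, is a quick way to see whether a given deduplication system fits the amount of data you must bring back.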
What if you are using disk and tape as part of your strategy?
Just complexity, really. Say you have one set of servers that are backed up to tape and never go to disk, and another set of servers that are backed up to deduplicated disk, which you replicate to your disaster recovery site. With the tape, you have to go the traditional route of moving those tapes offsite. If you are doing that, you then have to encrypt those tapes.
Deduplication allows you to not worry about that. Many people replicate deduplicated data without encryption, because the bits and pieces of data going across the wire are unintelligible. It's not a file; it's a chunk of a file. Or if you do want to encrypt, you can encrypt just the pipe. You can use standard VPN technologies which have been around for decades. The data is encrypted in flight but not in your data center and not in your remote site. Just one less thing to worry about.
Can you outline the concerns around data deduplication to tape?
Almost all of the data deduplication options on the market today require data to be "re-duped," that is, expanded back to its full size, when it is sent to tape, so you gain no benefit from the deduplication. That has led some people to ask, "Why don't we dedupe to tape?" My first reaction to that was very negative because, depending on your architecture, you could need several tapes to restore a single file. The bits and bytes of any given file arrive over time, and since you are copying to tape over time, you might need 20 tapes to restore a file that's been backed up over, say, 20 days.
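The 20-tape scenario can be illustrated with a small Python model. This is a deliberately simplified sketch, not a description of any vendor's tape format: it assumes each day's backup goes to its own tape, that only chunks never seen before are written, and that chunk identifiers such as `c0-v1` are made up for the example.

```python
def tapes_needed(daily_versions: list) -> int:
    """Count the distinct tapes required to restore the latest version of a file.

    daily_versions holds one list of chunk IDs per daily backup of the same file.
    Model assumption: day N's new chunks are written to tape N, and a chunk that
    was already written on an earlier day is never written again.
    """
    first_tape = {}
    for day, version in enumerate(daily_versions, start=1):
        for chunk_id in version:
            first_tape.setdefault(chunk_id, day)  # remember where the chunk first landed
    latest = daily_versions[-1]
    return len({first_tape[chunk_id] for chunk_id in latest})


# A 20-chunk file backed up for 20 days, with one chunk changing each day.
current = [f"c{i}-v1" for i in range(20)]
changing_file = [list(current)]
for day in range(2, 21):
    current[day - 2] = f"c{day - 2}-d{day}"  # chunk rewritten on this day
    changing_file.append(list(current))

# The same file backed up for 20 days without ever changing.
static_file = [[f"c{i}-v1" for i in range(20)] for _ in range(20)]

print("tapes needed, file changing daily:", tapes_needed(changing_file))
print("tapes needed, unchanged file:     ", tapes_needed(static_file))
```

The point of the model is that the tape count for a restore depends on how the file's chunks were spread across backups over time, not on the size of the file itself.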
There's only one company trying deduplication to tape today: CommVault. And they are in complete agreement with me that recovery from deduplicated tape should not be your primary method for disaster recovery. For disaster recovery, they would want you to have deduplicated disk, and they sell a target deduplication product for this purpose. They would tell you that deduplication to tape is strictly for long-term retention.