Dedupe dos and don'ts: Data deduplication technology best practices
Learn best practices for implementing data deduplication into your backup system from backup expert W. Curtis Preston.
Data deduplication can significantly reduce the amount of disk needed to store your data backups, but some things you do in your data center can actually work against deduplication. And there are other things that don't necessarily hurt your deduplication system, but aren't a good idea anyway.
This article will explain those things you should (or shouldn't) be doing if you are performing data deduplication.
This first section applies only to target deduplication systems. These include dedupe appliances and software-based target dedupe in your backup software (e.g., CommVault Simpana and Symantec Corp. NetBackup PureDisk Media Server Option).
Don't perform more full backups just to get a better deduplication ratio. Some customers have been told by their target dedupe sales engineers to perform nightly full backups to increase their dedupe ratios. Please don't do this. If you perform more frequent full backups, do it because it makes your recoveries better, or makes your DBAs sleep better. (DBAs have always had a trust issue with incremental backups.) Don't do it just to get a better deduplication ratio.
Do consider increasing your backup retention period on disk. Once you have your first set of backups on disk, additional backups sent to that same deduped system will take up far less space than they would on tape. So if you're already storing 30 days on disk and 60 days on tape, consider storing all 90 days on disk. You'll be surprised at how little disk additional backups take up, assuming you're getting a good dedupe ratio.
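To get a feel for the arithmetic, here is a rough sketch in Python. It assumes a deliberately simple model that is not taken from any vendor: the first full backup is stored whole, and each later day adds only that day's new, unique data. The function name, the 10 TB backup size and the 1% daily change rate are all made-up illustration values, so plug in your own numbers.

```python
def deduped_disk_tb(full_backup_tb, daily_new_fraction, days_retained):
    """Estimate deduped disk use under a simplified, assumed model.

    The first backup is stored in full; every later day adds only
    daily_new_fraction of the full backup size as new unique data.
    """
    if days_retained < 1:
        return 0.0
    return full_backup_tb * (1 + daily_new_fraction * (days_retained - 1))


# Hypothetical site: 10 TB full backup, 1% new unique data per day
thirty_days = deduped_disk_tb(10, 0.01, 30)
ninety_days = deduped_disk_tb(10, 0.01, 90)

print(f"30 days on disk: {thirty_days:.1f} TB")
print(f"90 days on disk: {ninety_days:.1f} TB")
print(f"Extra disk for 60 more days: {ninety_days - thirty_days:.1f} TB")
```

Under those assumptions, tripling the retention period adds well under one full backup's worth of disk. Your real numbers depend on your change rate and on the dedupe ratio you actually get.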
Don't use multiplexing if you're backing up to a virtual tape library (VTL). Some people carry over this practice from physical tape to virtual tape, and this can have devastating effects on your dedupe ratio. Even systems that can de-multiplex the data (FalconStor, Sepaton) recommend that you turn it off, as they are just wasting computing cycles de-multiplexing the data -- cycles that could otherwise be used to dedupe your data faster. Instead of multiplexing 40 backups to 10 virtual tape drives, create 40 virtual tape drives and turn off multiplexing.
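A toy Python model shows why multiplexing hurts. It is only a sketch: it assumes fixed-size 4 KB dedupe chunks and 1 KB multiplexing blocks (real products differ, and many use variable-size chunks), and it stands in for four backup clients with seeded random data. The same data is backed up on two nights, once with one stream per client and once interleaved, with the interleave order changing between nights the way it does when clients deliver data at different speeds.

```python
import hashlib
import random

CHUNK = 4096      # assumed fixed dedupe chunk size
MUX_BLOCK = 1024  # assumed multiplexing block size


def dedupe_ratio(streams):
    """Bytes received divided by bytes of unique chunks actually stored."""
    seen = set()
    total = stored = 0
    for data in streams:
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total / stored


def multiplex(clients, order):
    """Interleave MUX_BLOCK-sized pieces of each client in the given order."""
    out = bytearray()
    for offset in range(0, len(clients[0]), MUX_BLOCK):
        for idx in order:
            out += clients[idx][offset:offset + MUX_BLOCK]
    return bytes(out)


rng = random.Random(0)
clients = [bytes(rng.getrandbits(8) for _ in range(64 * 1024)) for _ in range(4)]

# Two nights of identical data, one virtual tape drive per client
separate = dedupe_ratio(clients + clients)

# Same two nights, multiplexed to one drive; arrival order differs per night
muxed = dedupe_ratio([
    multiplex(clients, [0, 1, 2, 3]),
    multiplex(clients, [3, 2, 1, 0]),
])

print(f"one stream per drive: {separate:.1f}:1")
print(f"multiplexed:          {muxed:.1f}:1")
```

In this model the second night dedupes completely when each client has its own stream, while the multiplexed copy shares no chunks with the night before, even though the underlying data never changed.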
Source and target deduplication
Don't obsess over your deduplication ratio, period. You should closely examine this number when you are comparing multiple products. See who gets the better dedupe ratio when you send the same data to each system. But once your system is installed, try to not overanalyze this number, especially when it's first installed. Dedupe ratios are always low at first and grow over time. You should periodically check to see if there has been a significant, negative change in your dedupe ratio, as it may indicate that something is wrong.
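If you want that periodic check to be more than a gut feeling, something like the following Python sketch would do. The function names and the 20% threshold are arbitrary choices for illustration, and the sample ratios are invented; your dedupe system's own reporting is where the real numbers come from.

```python
def dedupe_ratio(logical_bytes, stored_bytes):
    """Ratio of data sent to the system versus disk actually consumed."""
    return logical_bytes / stored_bytes


def significant_drop(ratios, threshold=0.20):
    """True if the latest ratio sits well below the average of earlier ones.

    threshold is an assumed tolerance (20% here), not an industry standard.
    """
    if len(ratios) < 2:
        return False
    *earlier, latest = ratios
    baseline = sum(earlier) / len(earlier)
    return latest < baseline * (1 - threshold)


# Invented weekly readings: a low start that grows over time is normal
growing = [4.0, 7.5, 9.0, 10.5]
# A sudden fall after a stable period deserves a closer look
dropped = [9.0, 10.0, 11.0, 6.0]

print(significant_drop(growing))
print(significant_drop(dropped))
```

The check only flags a fall relative to the system's own history, which matches the advice above: ignore the absolute number once the system is installed, and watch for a change in direction.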
Don't encrypt data before the dedupe system sees it. For example, do not back up files protected by the Windows Encrypting File System (EFS) to a dedupe system and expect to see anything other than 1:1 as your dedupe ratio. Dedupe systems look for patterns and encryption systems get rid of patterns -- ergo, no dedupe.
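You can see the effect with a small Python experiment. Everything here is a stand-in: the "cipher" is a toy SHA-256 counter-mode keystream written only for this illustration (do not use it to protect anything), the key and nonces are made up, and the dedupe side assumes fixed-size 4 KB chunks. The point is only that the same data, encrypted with a fresh nonce each night, no longer looks the same to the dedupe system.

```python
import hashlib

CHUNK = 4096  # assumed fixed dedupe chunk size


def keystream(key, nonce, length):
    """Toy SHA-256 counter-mode keystream. Illustration only, not real crypto."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(data, key, nonce):
    """XOR the data with the toy keystream."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))


def dedupe_ratio(backups):
    """Bytes received divided by bytes of unique chunks actually stored."""
    seen = set()
    total = stored = 0
    for data in backups:
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total / stored


data = b"customer record 0001\n" * 20000  # repetitive, human-style data
key = b"hypothetical-key"

# Two nights of the same data, sent in the clear
plain = dedupe_ratio([data, data])

# Two nights of the same data, encrypted with a different nonce each night
encrypted = dedupe_ratio([
    toy_encrypt(data, key, b"night-1"),
    toy_encrypt(data, key, b"night-2"),
])

print(f"plain:     {plain:.1f}:1")
print(f"encrypted: {encrypted:.1f}:1")
```

The encrypted copies come out at 1:1 because no two chunks match, which is exactly the result the tip warns you to expect.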
Don't compress data before the dedupe system sees it -- for two reasons. The first is that all dedupe systems compress after they dedupe, so you are accomplishing nothing. The second reason is that compression can "scramble" the data, creating difficulties for the dedupe system when looking for patterns. (Note: CommVault's dedupe system allows you to encrypt and compress your backups once they're fingerprinted, and that should not impact your dedupe ratio.)
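The "scrambling" is easy to try for yourself. The Python sketch below is a toy, not a model of any product: it assumes fixed-size 4 KB dedupe chunks, builds a made-up text file from a handful of words, and changes a single byte for the second night. It then compares the dedupe ratio when the two nights arrive raw with the ratio when each night is compressed with zlib first.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # assumed fixed dedupe chunk size


def dedupe_ratio(backups):
    """Bytes received divided by bytes of unique chunks actually stored."""
    seen = set()
    total = stored = 0
    for data in backups:
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total / stored


# Made-up text-like file built from a small vocabulary (seeded for repeatability)
rng = random.Random(0)
words = [b"backup", b"restore", b"tape", b"disk",
         b"dedupe", b"ratio", b"vault", b"stream"]
night1 = b" ".join(rng.choice(words) for _ in range(60000))

# Second night: the same file with one byte changed in place
edited = bytearray(night1)
edited[100] ^= 0x01
night2 = bytes(edited)

# Sent raw, only the chunk holding the edit is new
raw = dedupe_ratio([night1, night2])

# Compressed before the dedupe system sees it
compressed = dedupe_ratio([zlib.compress(night1), zlib.compress(night2)])

print(f"raw:              {raw:.2f}:1")
print(f"compressed first: {compressed:.2f}:1")
```

In the raw case the one-byte edit costs a single new chunk. Compressed first, the edit changes the compressed stream from that point on, so the two nights tend to share few chunks, if any.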
Do learn what data doesn't dedupe very well and consider not deduping it. With most dedupe systems, data that is created by a human (e.g., Office documents, database entries) dedupes well. Data that is automatically created by a computer doesn't dedupe well. Photos, video, audio, imaging and seismic data are all examples of data that doesn't dedupe very well. Consider storing it on non-deduped storage. (Some dedupe systems can turn off dedupe on certain sets of data.)
Do read the best practices documentation for your particular dedupe systems and follow their suggestions. The suggestions in this article should apply to most (if not all) dedupe systems, but your particular product may have some idiosyncrasies you should be aware of when using it.
Do test multiple dedupe systems before buying one of them. There are some really good products out there, but there are also products with some real limitations. Only by comparing multiple products side by side do some of these limitations rear their ugly heads.
Do test copying data from your dedupe system to tape if you plan to do that. This is one of the areas that separate the men from the boys, so to speak.
Don't believe a vendor that tells you that none of the dedupe products can stream your tape drive at its maximum speed. The list of products that can may be short, but it does exist. Some of the products, unfortunately, can only stream drives that are five to six years old.
Test everything. Believe nothing. Read the documentation and follow its advice, and all should be right in dedupe land.
About this author: W. Curtis Preston (a.k.a. "Mr. Backup"), Executive Editor and Independent Backup Expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the books "Backup and Recovery" and "Using SANs and NAS."