Data deduplication has dramatically improved the value proposition of disk-based data protection as well as WAN-based remote- and branch-office backup consolidation and disaster recovery (DR) strategies. It identifies duplicate data and removes the redundancies, reducing the total amount of data transferred and stored.
Some deduplication approaches operate at the file level, while others go deeper to examine data at a sub-file, or block, level. Determining uniqueness at either the file or block level will offer benefits, though results will vary. The differences lie in the amount of reduction each produces and the time each approach takes to determine what's unique.
Also commonly referred to as single-instance storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved and subsequent copies are replaced with a "stub" that points to the original file.
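The store-once, point-thereafter logic described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `FileStore` class and its method names are hypothetical, and a SHA-256 hash of the file contents stands in for the "attributes" that a real product would check against its index.

```python
import hashlib


class FileStore:
    """Toy single-instance store: one saved copy per unique file content."""

    def __init__(self):
        self.index = {}    # fingerprint -> file bytes (the single stored instance)
        self.catalog = {}  # file name -> fingerprint (the "stub" or pointer)

    def backup(self, name, data):
        """Store data only if it is new; return True when a copy was written."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = data
        # Every backed-up name gets a pointer, whether or not data was written.
        self.catalog[name] = fingerprint
        return is_new

    def restore(self, name):
        """Follow the pointer back to the single stored instance."""
        return self.index[self.catalog[name]]


store = FileStore()
report = b"quarterly results"
first = store.backup("alice/report.ppt", report)   # unique: stored
second = store.backup("bob/report.ppt", report)    # duplicate: pointer only
```

After both backups the store holds one copy of the data and two pointers to it.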
Block-level data deduplication operates at the sub-file level. As its name implies, the file is typically broken down into segments -- chunks or blocks -- that are examined for redundancy vs. previously stored information.
The most popular approach for determining duplicates is to assign an identifier to a chunk of data, using a hash algorithm, for example, that generates an ID or "fingerprint" for that block that is, for practical purposes, unique. The ID is then compared with a central index. If the ID exists, the data segment has been processed and stored before, so only a pointer to the previously stored data needs to be saved. If the ID is new, the block is unique: the ID is added to the index and the chunk is stored.
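As a rough sketch of that fingerprint-and-index loop, the Python below splits data into fixed-size blocks and stores each distinct block once. The function name `dedupe_blocks`, the 8-byte block size and the use of SHA-256 are assumptions chosen for readability; real products use much larger blocks and their own index structures.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical; tiny so the example is easy to follow


def dedupe_blocks(data, index, block_size=BLOCK_SIZE):
    """Store unseen blocks in index and return the list of IDs for data."""
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        block_id = hashlib.sha256(block).hexdigest()  # the "fingerprint"
        if block_id not in index:
            index[block_id] = block   # new ID: store the unique chunk
        pointers.append(block_id)     # known or new, keep only a pointer
    return pointers


index = {}
pointers = dedupe_blocks(b"AAAAAAAABBBBBBBBAAAAAAAA", index)
# Three blocks were examined, but only two distinct blocks were stored.
```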
The size of the chunk to be examined varies from vendor to vendor. Some have fixed block sizes, while others use variable block sizes (and to make it even more confusing, a few allow end users to vary the size of the fixed block). Fixed blocks could be 8 KB or maybe 64 KB -- the smaller the chunk, the more likely it is to be identified as redundant. This, in turn, means greater reductions because less data is stored. The main drawback of fixed blocks appears when a file is modified: because the deduplication product cuts blocks at the same fixed offsets as in the last inspection, data inserted or removed in the file shifts everything downstream of the change, so the blocks that follow no longer line up with the ones stored before and redundant segments might go undetected.
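The shifting problem is easy to reproduce. In the sketch below, which uses made-up test data, a hypothetical 64-byte block size and SHA-256 fingerprints, overwriting one byte in place disturbs a single block, while inserting one byte at the front of the file leaves no block matching what was stored before.

```python
import hashlib


def fixed_fingerprints(data, block_size=64):
    """Fingerprint each fixed-size block of data, cut at fixed offsets."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


original = bytes(range(256)) * 4           # 1,024 bytes, i.e. 16 blocks
overwritten = b"\xff" + original[1:]       # one byte changed, same length
inserted = b"\x00" + original              # one byte added at the front

known = set(fixed_fingerprints(original))

# Count the blocks of each modified file that are already in the index.
matches_after_overwrite = sum(fp in known for fp in fixed_fingerprints(overwritten))
matches_after_insert = sum(fp in known for fp in fixed_fingerprints(inserted))
```

Here `matches_after_overwrite` is 15 of 16 blocks, while `matches_after_insert` is 0 because every block boundary now falls one byte later in the content.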
Variable-sized blocks help increase the odds that a common segment will be detected even after a file is modified. This approach finds natural patterns or break points that might occur in a file and then segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The tradeoff? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.
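One way to picture break points that follow the content is the toy chunker below. It is a simplified sketch and not a specific vendor's algorithm: the window size, the divisor and the plain byte-sum test are all hypothetical choices, and production systems typically use a proper rolling hash with minimum and maximum chunk sizes.

```python
import hashlib
import random

WINDOW = 16    # bytes examined at each position (hypothetical value)
DIVISOR = 64   # yields chunks of very roughly 64 bytes on random data


def variable_chunks(data):
    """Cut data wherever the trailing WINDOW bytes match a break-point test."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The test looks only at the WINDOW bytes before `end`, never at the
        # offset itself, so break points move along with the content.
        if sum(data[end - WINDOW:end]) % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # whatever is left after the last cut
    return chunks


def fingerprints(chunks):
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks}


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"\x00" + original             # one byte inserted at the front

original_chunks = variable_chunks(original)
modified_chunks = variable_chunks(modified)
unseen = fingerprints(modified_chunks) - fingerprints(original_chunks)
```

Because each break point depends only on nearby bytes, the one-byte insertion can disturb at most the first chunk or two; every later chunk of `modified` is identical to a chunk of `original`, so `unseen` stays very small.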
The differences between file- and block-level deduplication go beyond just how they operate. There are advantages and disadvantages to each approach.
File-level approaches can be less efficient than block-based deduplication:
- A change within the file causes the whole file to be saved again. A file, such as a PowerPoint presentation, can have something as simple as the title page changed to reflect a new presenter or date -- this will cause the entire file to be saved a second time. Block-based deduplication would only save the changed blocks between one version of the file and the next.
- Reduction ratios may be only 5:1 or less, whereas block-based deduplication has been shown to reduce stored capacity by ratios in the 20:1 to 50:1 range.
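To put those ratios in perspective, the short calculation below converts a reduction ratio into the percentage of capacity saved. The helper names and the 10 TB figure are illustrative assumptions, not measurements.

```python
def percent_saved(ratio):
    """Capacity saved, in percent, for a reduction ratio of ratio:1."""
    return round(100 * (1 - 1 / ratio), 1)


def stored_tb(protected_tb, ratio):
    """Disk capacity consumed after deduplication, in TB."""
    return protected_tb / ratio


for ratio in (5, 20, 50):
    print(f"{ratio}:1 -> {percent_saved(ratio)}% saved, "
          f"{stored_tb(10, ratio)} TB stored per 10 TB protected")
```

A 5:1 ratio already removes 80% of the data; moving to 20:1 or 50:1 raises the savings to 95% and 98%, which cuts the capacity still consumed from 2 TB to 0.5 TB or 0.2 TB in this example.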
File-level approaches can be more efficient than block-based data deduplication:
- Indexes for file-level deduplication are significantly smaller, so less computational time is needed to determine duplicates. Backup performance is, therefore, less affected by the deduplication process.
- File-level processes require less processing power due to the smaller index and reduced number of comparisons. Therefore, the impact on the systems performing the inspection is less.
- The impact on recovery time is low. Block-based deduplication will require "reassembly" of the chunks based on the master index that maps the unique segments and pointers to unique segments. Since file-based approaches store unique files and pointers to existing unique files, there is less to reassemble.
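The reassembly step mentioned in the last point can be sketched as follows. The names `store_as_blocks` and `restore_from_blocks`, the 4-byte block size and the in-memory dictionary standing in for the master index are all hypothetical; the point is only that a block-level restore performs one index lookup per block, where a file-level restore follows a single pointer.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def fingerprint(data):
    return hashlib.sha256(data).hexdigest()


def store_as_blocks(data, block_index):
    """Store unique blocks and return the ordered list of IDs for the file."""
    block_ids = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        block_id = fingerprint(block)
        block_index.setdefault(block_id, block)  # keep the first copy only
        block_ids.append(block_id)
    return block_ids


def restore_from_blocks(block_ids, block_index):
    """Reassemble a file: one index lookup per block, in the original order."""
    return b"".join(block_index[block_id] for block_id in block_ids)


block_index = {}
payload = b"ABCDABCDWXYZ"
block_ids = store_as_blocks(payload, block_index)
restored = restore_from_blocks(block_ids, block_index)
lookups_needed = len(block_ids)   # three lookups for this 12-byte file
```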
About this author:
Lauren Whitehouse is an analyst with Enterprise Strategy Group and covers data protection technologies. Lauren is a 20-plus-year veteran in the software industry, formerly serving in marketing and software development roles.