What you will learn in this tip: Data deduplication’s rise from niche product to widely accepted technology is reflected by the growing number of vendors now offering deduplication either as a standalone product or as a feature in their backup products. This tip reviews the benefits of deduplication methods and offers some insight on how these methods apply to specific environments.
File, block and variable length segment dedupe
To identify duplicates, data can be examined in different ways depending on the technology used. For example, file-level dedupe, also called single-instance storage (SIS), will identify identical files, store a single copy and replace subsequent identical copies with a pointer to the unique version stored. Examples of file-level deduplication include the email programs Novell Inc.’s GroupWise and Microsoft Corp.’s Exchange (although SIS isn’t supported in Exchange 2010). EMC Corp. also provides file-level deduplication on its storage arrays, including Clariion, Celerra and its new VNX series.
The disadvantage of file-level deduplication is its lack of granularity and inability to provide sub-file level dedupe. That means even the smallest change in a file makes it a totally new file that will be stored in full. File-level dedupe is useful in email environments, where the same attachment is often sent to multiple recipients at once, or in unstructured data storage environments with low change rates. However, it is not practical in a structured data environment where entire large files such as databases are constantly changing.
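The behavior described above can be illustrated with a minimal sketch of single-instance storage. The SingleInstanceStore class, the file names and the use of SHA-256 as the fingerprint are assumptions made for illustration only; they do not represent how any of the products named here are implemented.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level dedupe: one stored copy per unique file content."""

    def __init__(self):
        self.blobs = {}     # fingerprint -> file bytes (stored once)
        self.pointers = {}  # file name -> fingerprint of its content

    def put(self, name, data):
        fingerprint = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(fingerprint, data)
        self.pointers[name] = fingerprint

    def get(self, name):
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly report " * 1000

# The same attachment delivered to two mailboxes is stored once.
store.put("alice/report.doc", attachment)
store.put("bob/report.doc", attachment)
after_duplicates = store.stored_bytes()

# A one-byte change produces a different fingerprint, so the whole file is stored again.
store.put("carol/report.doc", attachment + b"!")
after_edit = store.stored_bytes()

print(after_duplicates, after_edit)
```

In this sketch the two identical attachments cost the space of one, while the one-byte edit roughly doubles the space used, which is the granularity problem in miniature.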
Vendors address the lack of granularity inherent to file-level dedupe by breaking data into smaller “chunks” such as fixed blocks or variable-length segments. Greater data reduction can be achieved by storing only unique data segments and creating pointers to all others that are identical. CommVault Systems Inc., FalconStor Software Inc. and NetApp Inc. are examples of vendors that leverage block-level dedupe, while EMC’s Data Domain and Avamar, along with products from Sepaton Inc., are based on variable-length byte segments. The knock against block-level dedupe is that a block offset, such as data inserted near the start of a file, would shift every subsequent block boundary in a given data set, requiring all new blocks to be stored because they are no longer duplicates. This situation is alleviated with variable-length segment dedupe, but the technology is more complex and resource-intensive. Sub-file dedupe (block-level or variable-length segments) is commonly used in backup environments where multiple backup versions of files often contain few changes.
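The offset problem can be made concrete with a small sketch that chunks the same data two ways, then prepends a single byte and counts how many chunks would have to be stored again. The constants BLOCK, WINDOW and DIVISOR, the helper names and the use of CRC32 over a small window to pick segment boundaries are all illustrative assumptions; production implementations typically use a true rolling hash and are considerably more sophisticated.

```python
import hashlib
import random
import zlib

BLOCK = 64     # fixed block size in bytes (illustrative)
WINDOW = 16    # bytes examined when looking for a segment boundary
DIVISOR = 64   # a boundary is declared roughly once every 64 bytes


def fixed_blocks(data):
    """Cut at fixed offsets: boundaries depend on position only."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def variable_segments(data):
    """Cut wherever the last WINDOW bytes hash to a chosen value,
    so boundaries depend on content rather than position."""
    segments, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            segments.append(data[start:end])
            start = end
    if start < len(data):
        segments.append(data[start:])  # trailing remainder
    return segments


def new_chunks(old, new, chunker):
    """Count chunks of `new` that are not already stored from `old`."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    return sum(hashlib.sha256(c).digest() not in known for c in chunker(new))


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"X" + original  # one byte inserted at the front

fixed_new = new_chunks(original, shifted, fixed_blocks)
variable_new = new_chunks(original, shifted, variable_segments)
print("fixed blocks to store again:", fixed_new)
print("variable segments to store again:", variable_new)
```

With fixed blocks, the inserted byte shifts every boundary, so essentially every block looks new. With content-defined boundaries, only the segment or two around the insertion changes, at the cost of hashing a window at every byte position.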
In-line vs. post-process dedupe (out-of-band)
Dedupe is referred to as being in-line (also called in-band) when data is analyzed for duplicates while it is being written to the storage media. In contrast, post-process (or out-of-band) dedupe takes place after the data has been written to disk. The benefit of post-process dedupe is that it does not affect write performance, but it requires enough disk space to accommodate the entire data set until deduplication (and therefore reduction) can take place during off-peak hours. On the other hand, in-line dedupe provides the immediate space-saving benefits of data reduction but is more resource-intensive, which can impact write performance. The decision is a trade-off between immediate storage space savings and performance, but that performance impact is less of a factor as the technology improves. In-line dedupe products include offerings from FalconStor and Sepaton, as well as EMC’s Data Domain and IBM Corp.’s ProtecTIER (formerly Diligent), while NetApp uses post-process dedupe.
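The capacity side of this trade-off can be sketched with a toy model that tracks peak and final disk usage for each approach. The function names, the 4 KB block size and the sample backup stream are assumptions chosen for illustration; the model ignores metadata overhead and does not measure the write-path cost of hashing.

```python
import hashlib


def digest(block):
    return hashlib.sha256(block).digest()


def inline_ingest(blocks):
    """Dedupe in the write path: duplicate blocks never land on disk."""
    store, used, peak = {}, 0, 0
    for block in blocks:
        key = digest(block)          # hashing happens before the write completes
        if key not in store:
            store[key] = block
            used += len(block)
        peak = max(peak, used)
    return peak, used


def post_process_ingest(blocks):
    """Write everything first, then dedupe in a later pass (e.g. off-peak)."""
    landing = list(blocks)                       # raw writes, no hashing yet
    peak = sum(len(block) for block in landing)  # the full data set must fit
    store = {}
    for block in landing:                        # deferred dedupe pass
        store.setdefault(digest(block), block)
    final = sum(len(block) for block in store.values())
    return peak, final


backup = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096, b"C" * 4096]

inline_peak, inline_final = inline_ingest(backup)
post_peak, post_final = post_process_ingest(backup)
print("in-line      peak/final bytes:", inline_peak, inline_final)
print("post-process peak/final bytes:", post_peak, post_final)
```

Both approaches end with the same amount of stored data in this model; the difference is that post-process must temporarily hold the full, unreduced data set.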
Source vs. target
Depending on the technology implemented, deduplication can take place at the source (the sending system) or at the target (the receiving system). This distinction is specific to backup environments, which are typically based on a client/server (or sender/receiver) model. Source dedupe requires dedupe-capable software on the backup client and a dedupe-aware backup server. This means some changes to an existing backup environment will be required. On the other hand, target dedupe usually requires no change since the deduplication-capable target device is seen as just another disk storage array or virtual tape library (VTL) by the backup server. Source dedupe is used in an effort to reduce the amount of data sent over the network when remote offices are backing up to a central office. The trade-off is that source dedupe impacts performance on the client side, extending the duration of the backup, and dedupe is limited to duplicate data at the client level only, regardless of how many backup clients share identical data.
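A rough model of the two approaches shows where the savings appear. In the sketch below, the function names, the client names and the sample data are hypothetical, and the per-client index in source_dedupe is an assumption that mirrors the client-level limitation described above; individual products may behave differently.

```python
import hashlib


def digest(chunk):
    return hashlib.sha256(chunk).digest()


def source_dedupe(clients):
    """Each client dedupes against its own index before sending."""
    sent = stored = 0
    for chunks in clients.values():
        seen = set()                  # per-client index only (assumption)
        for chunk in chunks:
            key = digest(chunk)
            if key not in seen:
                seen.add(key)
                sent += len(chunk)    # only unique chunks cross the network
                stored += len(chunk)  # no cross-client dedupe in this model
    return sent, stored


def target_dedupe(clients):
    """Clients send everything; the target dedupes across all clients."""
    sent, index = 0, {}
    for chunks in clients.values():
        for chunk in chunks:
            sent += len(chunk)        # every chunk crosses the network
            index.setdefault(digest(chunk), chunk)
    stored = sum(len(chunk) for chunk in index.values())
    return sent, stored


os_image = [b"kernel" * 100, b"libs" * 100]   # identical on both clients
clients = {
    "branch-1": os_image + [b"docs-1" * 100] * 2,
    "branch-2": os_image + [b"docs-2" * 100] * 2,
}

source_sent, source_stored = source_dedupe(clients)
target_sent, target_stored = target_dedupe(clients)
print("source dedupe sent/stored bytes:", source_sent, source_stored)
print("target dedupe sent/stored bytes:", target_sent, target_stored)
```

Under these assumptions, source dedupe sends less data over the network, while target dedupe stores less because it also removes the operating system data duplicated across clients.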
Appliance vs. software
Other possible considerations include picking between appliance-based and software-based dedupe. Deduplication appliances are typically integrated with an existing environment without requiring much change. This is the case when configuring a backup server to write to a dedupe-capable storage array (e.g., EMC Data Domain). On the other hand, dedupe software usually requires changes to your environment, especially when migrating from a basic backup product to one that is dedupe-capable.
Competing manufacturers claim that appliance-based dedupe creates hardware vendor lock-in because of a dependency on proprietary storage or appliances. However, software-based dedupe can also be considered vendor lock-in, as the deduplication capability is dependent on the specific software platform.
Vendors like IBM and NetApp offer gateway appliances providing the ability to store deduplicated data to supported third-party storage, but for all intents and purposes, hardware- and software-based dedupe offerings are proprietary.
There are many benefits of deduplication, but choosing the right dedupe approach requires careful consideration of your backup environment.
About this author: Pierre Dorion is the data center practice director and a senior consultant with Long View Systems Inc. in Phoenix, Ariz., specializing in the areas of business continuity and DR planning services and corporate data protection.