What are some new data deduplication techniques?

Copy data management is just one technology that builds on recent innovations in backup deduplication to combat data sprawl and manage snapshots.

Chris Evans

Published: 28 Nov 2016

Data deduplication is a storage capacity optimization technique that identifies repeated sets of data in a data stream and eliminates them, retaining a single copy on physical media. Metadata and pointers track each logical data instance and map it to the physical copy. Data deduplication techniques first took hold in the backup space, where multiple, repeated full backups of a server or virtual machine deduplicate heavily because they either contain the same unchanged data or are based on a single master image.
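The relationship between logical instances, pointers and the single physical copy can be illustrated with a minimal sketch. It assumes fixed-size 4 KB chunks and SHA-256 fingerprints; the `DedupStore` class and its method names are hypothetical and do not reflect any particular vendor's implementation.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupStore:
    """Toy deduplicating store: one physical copy per unique chunk."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (the physical copy)
        self.objects = {}  # backup name -> ordered list of fingerprints (pointers)

    def write(self, name, data):
        pointers = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen before.
            self.chunks.setdefault(fp, chunk)
            pointers.append(fp)
        self.objects[name] = pointers

    def read(self, name):
        # Rebuild the logical data by following the pointers.
        return b"".join(self.chunks[fp] for fp in self.objects[name])

    def physical_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())
```

In this sketch, writing two full backups that differ in a single chunk consumes only one extra chunk of physical space, which is why repeated full backups deduplicate so well.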

Data deduplication techniques have figured heavily in products that resolve sprawl issues, such as copy data management (CDM) platforms. These products let customers use data for purposes other than data protection, such as creating test/development copies of production data. Customers save by reusing static data, which is now typically stored on spinning media in powered-up deduplication appliances. CDM vendors have built in technology that delivers backups efficiently and also serves secondary requirements, including managing thousands of application snapshots.

Another innovation in data deduplication techniques is "pre-duplication," where the client deduplicates data before sending it across the network. While this concept appeared almost 10 years ago at PureDisk, a company acquired by Symantec, vendors today are integrating it into their platforms. Hewlett Packard Enterprise did so with its 3PAR array, aiming to improve deduplication using snapshots and to reduce the amount of data crossing the network. The deduplication process is also being distributed, which makes it possible to scale out backups without the bottleneck of managing all the hash values in a single process.
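A rough sketch of both ideas follows: the client fingerprints its chunks, asks which ones the back end is missing and sends only those, while the hash index is split across several shards instead of living in a single process. The names `ShardedIndex` and `client_backup`, the 4 KB chunk size and the shard-routing rule are all assumptions made for illustration, not a description of any product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


class ShardedIndex:
    """Toy back end whose hash index is partitioned across several shards."""

    def __init__(self, shard_count=4):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard(self, fp):
        # Route each fingerprint to a shard by its leading hex digits.
        return self.shards[int(fp[:8], 16) % len(self.shards)]

    def missing(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self._shard(fp)]

    def store(self, fp, chunk):
        self._shard(fp)[fp] = chunk


def client_backup(data, index):
    """Deduplicate on the client; return the number of bytes actually sent."""
    chunks = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunks[fingerprint(chunk)] = chunk
    sent = 0
    # Only fingerprints cross the network first; chunk data follows if needed.
    for fp in index.missing(list(chunks)):
        index.store(fp, chunks[fp])
        sent += len(chunks[fp])
    return sent
```

In this sketch, a second backup that adds one new chunk to previously protected data sends only that chunk, and each shard handles lookups for its own slice of the fingerprint space.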

Backup vendors are now also looking at the data itself and building in intelligent deduplication based on application content rather than on simple block-level identification. File-level dedupe can identify files that can be single-instanced, including file attachments in backups of email systems. Again, these processes make the backup client more intelligent, reducing the workload on the network and the back-end deduplication engine.
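File-level single instancing of email attachments can be sketched as follows, assuming attachments are compared by a SHA-256 hash of their full content. The `single_instance` function and the message tuple layout are hypothetical, chosen only to illustrate the idea.

```python
import hashlib


def single_instance(messages):
    """Keep one stored copy per distinct attachment.

    messages: iterable of (message_id, attachment_name, attachment_bytes).
    Returns (blobs, refs), where blobs maps a content hash to the bytes and
    refs lists (message_id, attachment_name, content_hash) per message.
    """
    blobs = {}
    refs = []
    for message_id, name, content in messages:
        digest = hashlib.sha256(content).hexdigest()
        # Identical content is stored once, whatever the file name.
        blobs.setdefault(digest, content)
        refs.append((message_id, name, digest))
    return blobs, refs
```

In this sketch, the same attachment sent to many mailboxes, even under different file names, is stored once and each message keeps only a reference to it.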

Next Steps

Everything you need to know about backup deduplication

Recovery in place as a data backup strategy

Strengthen flash performance with data reduction techniques
