global deduplication

Global deduplication is a method of preventing redundant data when backing up data to multiple deduplication devices. This method may involve backing up to more than one target deduplication appliance or, in the case of source deduplication, backing up data on multiple clients.

With global deduplication, when data is sent from one node to another, the second node recognizes that the first node already has a copy of the data and does not make an additional copy. This is more efficient than single-node deduplication, which only deduplicates data sets residing on that node.
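
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `GlobalIndex` and `Node` classes, the node names and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(chunk).hexdigest()


class GlobalIndex:
    """Hypothetical shared index: fingerprint -> name of the node holding the chunk."""

    def __init__(self):
        self.locations = {}


class Node:
    """Hypothetical backup node that consults the shared index before storing."""

    def __init__(self, name, index):
        self.name = name
        self.index = index
        self.stored = {}  # fingerprint -> chunk bytes physically kept on this node

    def write(self, chunk: bytes) -> bool:
        """Store the chunk unless some node already holds it; return True if stored."""
        fp = fingerprint(chunk)
        if fp in self.index.locations:
            return False  # another node has it, so no additional copy is made
        self.stored[fp] = chunk
        self.index.locations[fp] = self.name
        return True


index = GlobalIndex()
node_a = Node("a", index)
node_b = Node("b", index)

wrote_a = node_a.write(b"quarterly-report")
wrote_b = node_b.write(b"quarterly-report")
print(wrote_a, wrote_b, len(node_b.stored))  # True False 0
```

With single-node deduplication, each node would keep its own index, and the second write would store a second copy.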

Large data centers use multiple backup targets, so global deduplication is often the preferred deduplication technology there: it removes redundant copies of data across all targets. In an organization with a high volume of data, other forms of deduplication could result in a bottleneck.

Today, all major data backup software applications and targets with dedupe capabilities offer global deduplication. Vendors include Arcserve, Asigra, Carbonite, Cohesity, Commvault, Dell EMC, Druva, ExaGrid, HybriStor, Rubrik, Veeam Software and Veritas.

Pros and cons

Global deduplication makes the data deduplication process more effective by increasing the data deduplication ratio, which is the ratio of protected capacity to the actual physical capacity stored. This helps to reduce the required capacity of disk or tape systems used to store backup data.
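
As a worked example of the ratio defined above, the following sketch computes the deduplication ratio and the corresponding space savings. The function names and the figures (100 TB protected, 10 TB physically stored) are illustrative assumptions, not measurements.

```python
def dedup_ratio(protected: float, stored: float) -> float:
    """Ratio of protected capacity to physical capacity actually stored."""
    if stored <= 0:
        raise ValueError("stored capacity must be positive")
    return protected / stored


def space_savings(protected: float, stored: float) -> float:
    """Fraction of protected capacity that did not need to be stored."""
    if protected <= 0:
        raise ValueError("protected capacity must be positive")
    return 1 - stored / protected


# Hypothetical figures: 100 TB of protected data occupying 10 TB on disk.
print(f"{dedup_ratio(100, 10):.0f}:1")    # 10:1
print(f"{space_savings(100, 10):.0%}")    # 90%
```

Note that the savings flatten out as the ratio grows: moving from 10:1 to 20:1 raises the savings only from 90% to 95%.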

Global dedupe also enables high availability and load balancing thanks to the technology's ability to manage multiple devices efficiently. It also enables greater flexibility in data retention policies, such as various storage policies for different data types stored in the same library.

However, because it is complex and operates at such a large scale, global deduplication makes little sense in smaller organizations. Target and source deduplication typically work better in smaller, less complex environments.

While global deduplication can help minimize the amount of data stored and increase upload speeds, in some cases it may put data integrity at risk. Because a data block shared by many users is saved only once, corruption of that single instance means every user who relies on it experiences the data loss.
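
The shared-block risk can be made concrete with a small sketch. The `store` and `backups` structures, the backup names and the SHA-256 fingerprints are assumptions for illustration; real products typically add their own integrity checks and replication.

```python
import hashlib

store = {}    # fingerprint -> chunk bytes (the single physical copy)
backups = {}  # backup name -> ordered list of fingerprints it references


def add_backup(name, chunks):
    """Record a backup, storing each distinct chunk only once."""
    fps = []
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)
        fps.append(fp)
    backups[name] = fps


def damaged_backups():
    """Return the backups that reference a chunk whose content no longer matches its fingerprint."""
    bad = {fp for fp, chunk in store.items()
           if hashlib.sha256(chunk).hexdigest() != fp}
    return sorted(name for name, fps in backups.items() if bad.intersection(fps))


add_backup("alice", [b"os-image", b"alice-docs"])
add_backup("bob", [b"os-image", b"bob-docs"])

# Simulate corruption of the one stored copy of the shared chunk.
shared = hashlib.sha256(b"os-image").hexdigest()
store[shared] = b"os-imagX"

print(damaged_backups())  # ['alice', 'bob']
```

A single damaged block affects both backups, which is why verifying stored blocks against their fingerprints matters more as sharing increases.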

In addition, as data storage grows, it can become more difficult for backup methods to locate and restore files.

Global deduplication and cloud backup

With the advent of cloud-based backup, implementing global deduplication is becoming a good way to reduce expenses. Cloud data protection is intended to reduce the costs associated with storing on-site data; however, the cost of moving large amounts of data, meeting bandwidth requirements and providing adequate security can add up.

Global deduplication can help save money by deduplicating data across all devices and better utilizing storage space. The less data an organization has to store, the more money it can save on storage hardware. Less stored data also means less data to back up.

For geographically dispersed companies and those with remote users, global deduplication can help speed up cloud backups. With globally deduplicated data, each subsequent user accessing a backup benefits from previous instances of deduplication. To save on bandwidth, an organization can deploy first to those with access to better bandwidth, then to remote users who will receive the already deduplicated data.

Global dedupe vs. other forms of deduplication

Deduplication comes in many forms, each suited to different environments. Depending on the size of the organization and the amount of data at hand, global deduplication may not be the best option.

Local deduplication evaluates data redundancy on a device before the data is backed up and stored in the cloud. While global deduplication works across all devices, each device in a local deduplication environment performs dedupe for just that one device.

Because it works off a single deduplication index, global deduplication often has a better reduction rate. However, because the data is more easily accessible, local deduplication can result in better performance.
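
The difference in reduction rate can be illustrated by counting how many chunks end up stored under each approach. The device names and chunk contents below are made up for the example, and SHA-256 fingerprints are an assumed choice.

```python
import hashlib


def unique_chunks(chunks):
    """Return the set of fingerprints for an iterable of chunks."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


# Hypothetical data held on three devices; "os" and "app" repeat across devices.
devices = {
    "device_1": [b"os", b"app", b"data-1"],
    "device_2": [b"os", b"app", b"data-2"],
    "device_3": [b"os", b"data-3", b"data-3"],
}

total = sum(len(chunks) for chunks in devices.values())

# Local: each device keeps its own index, so cross-device duplicates survive.
local_stored = sum(len(unique_chunks(chunks)) for chunks in devices.values())

# Global: one index covers every device.
global_stored = len(unique_chunks(c for chunks in devices.values() for c in chunks))

print(total, local_stored, global_stored)  # 9 8 5
```

Local deduplication only removes the repeated chunk within `device_3`, while the single global index also removes the copies of `os` and `app` shared between devices.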

Per-job deduplication is a method of dedupe that works within one backup job at a time. If there is a large quantity of data to be backed up -- and static data to be archived -- it can be better to use per-job deduplication because it does not have to compare each job against all of the data in the system.

Compression is another form of data reduction and is often used alongside deduplication. Compression shrinks data by encoding it more compactly with an algorithm, which can substantially decrease the amount of storage a file takes up. Unlike deduplication, compression typically works within a single file rather than across blocks of data shared between files.
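
A short sketch shows how the two techniques differ and how they can be combined. It uses Python's standard `zlib` and `hashlib` modules; the file contents are invented, and whole-file fingerprints are used only to keep the example small.

```python
import hashlib
import zlib

# Two hypothetical files with identical, highly repetitive contents.
file_a = b"log line: backup completed\n" * 200
file_b = file_a

raw_total = len(file_a) + len(file_b)

# Compression alone: each file shrinks, but both copies are still stored.
compressed_total = len(zlib.compress(file_a)) + len(zlib.compress(file_b))

# Deduplication alone: identical content is stored once, uncompressed.
store = {hashlib.sha256(f).hexdigest(): f for f in (file_a, file_b)}
dedup_total = sum(len(f) for f in store.values())

# Both: deduplicate first, then compress what remains.
both_total = sum(len(zlib.compress(f)) for f in store.values())

print(raw_total, compressed_total, dedup_total, both_total)
```

Compression removes redundancy inside each file, deduplication removes the redundant second copy, and applying both yields the smallest footprint in this example.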

Inline deduplication removes duplicate data as it is sent to the backup target, so data is processed and passed just once. Inline deduplication can also reduce the recovery point objective and recovery time objective, as data is available immediately after it is processed.

Post-processing or asynchronous deduplication is the primary alternative to inline deduplication. It analyzes and removes redundant data after the data has been backed up to the target. Because deduplication does not slow down ingest, the initial backup can complete faster than with inline dedupe, but organizations will need to have the storage available to hold all of the duplicated data before it is processed.
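
The storage trade-off between the two approaches can be sketched as follows. Both functions and the sample stream are hypothetical; the "peak" value simply counts how many chunks the target holds at its fullest point.

```python
import hashlib


def fp(chunk: bytes) -> str:
    """Fingerprint a chunk with SHA-256 (an assumed choice of hash)."""
    return hashlib.sha256(chunk).hexdigest()


def inline_backup(chunks):
    """Deduplicate while ingesting; return (final store, peak chunks held)."""
    store = {}
    for chunk in chunks:
        store.setdefault(fp(chunk), chunk)  # duplicates never land on the target
    return store, len(store)


def post_process_backup(chunks):
    """Land all data first, then deduplicate; return (final store, peak chunks held)."""
    landing = list(chunks)  # everything is written as-is during the backup
    peak = len(landing)
    store = {}
    for chunk in landing:   # a later pass removes the redundant copies
        store.setdefault(fp(chunk), chunk)
    return store, peak


stream = [b"a", b"b", b"a", b"a", b"c", b"b"]
inline_store, inline_peak = inline_backup(stream)
post_store, post_peak = post_process_backup(stream)

print(inline_peak, post_peak, inline_store == post_store)  # 3 6 True
```

Both approaches end with the same three unique chunks, but the post-processing target must temporarily hold all six.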

Because global deduplication is a capability rather than a processing method, it can be used in either an inline or a post-processing deduplication system.

This was last updated in February 2018
