Inline deduplication is the removal of redundancies from data before or as it is being written to a backup device. Compared with post-process deduplication, inline deduplication reduces the capacity needed on the backup disk targets, because redundant data is never written to them. However, inline deduplication can slow down the overall data backup process, because inline deduplication devices sit in the data path between the servers and the backup disk systems.
Data deduplication was originally developed for backup data as a way to reduce storage overhead. It was incorporated into primary storage when flash storage arrived in the enterprise, as a way of reducing the cost premium of solid-state drives (SSDs) over hard disk drives (HDDs).
Inline deduplication is more popular than post-process deduplication for primary storage on flash arrays. Inline dedupe reduces the amount of data written to the drives, which, in turn, reduces wear on the drives. Inline deduplication was considered a selling point for early successful all-flash arrays, such as EMC's XtremIO and Pure Storage's FlashArray.
How deduplication works
There are two primary deduplication methods. One method breaks data into small chunks and assigns each chunk a unique hash identifier. As data is being written, algorithms check whether the chunk's hash identifier already exists in storage. If it does, the new copy is not stored, and a reference to the existing chunk is recorded instead.
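The hash-based method described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `ChunkStore` class, the fixed 4 KB chunk size, the choice of SHA-256 and the file names are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; products may use other or variable sizes


class ChunkStore:
    """Hypothetical in-memory store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # hash identifier -> chunk bytes
        self.files = {}   # file name -> ordered list of hash identifiers

    def write(self, name, data):
        """Store data under name and return how many new chunks were written."""
        recipe = []
        new_chunks = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # hash not seen before, so store the chunk
                self.chunks[digest] = chunk
                new_chunks += 1
            recipe.append(digest)  # a duplicate is recorded as a reference only
        self.files[name] = recipe
        return new_chunks

    def read(self, name):
        """Rebuild the original data from the stored chunk references."""
        return b"".join(self.chunks[d] for d in self.files[name])


store = ChunkStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
block_c = b"C" * CHUNK_SIZE

first = store.write("monday.bak", block_a + block_b)
second = store.write("tuesday.bak", block_a + block_b + block_c)
print(first, second, len(store.chunks))  # prints: 2 1 3
```

The second backup contains three chunks, but only one of them is new, so only that chunk consumes additional capacity.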
The second method used by deduplication technology vendors is delta differencing technology, which checks new data against existing stored data at the byte level.
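As a rough illustration of byte-level comparison, the sketch below stores a new version of some data as a delta against an existing version. It uses `difflib.SequenceMatcher` from the Python standard library purely for convenience; the `make_delta` and `apply_delta` helpers and the sample data are assumptions for the example, and production systems use their own, far more scalable algorithms.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe new as copies from old plus the literal bytes that differ."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))   # bytes already in storage
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # bytes that must be stored
        # "delete" means the bytes exist only in old, so nothing is recorded
    return ops


def apply_delta(old, ops):
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)


old = b"backup of customer database, version 1"
new = b"backup of customer database, version 2 (patched)"

delta = make_delta(old, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")

print(apply_delta(old, delta) == new)  # prints: True
print(literal_bytes, "of", len(new), "bytes had to be stored as new data")
```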
Deduplication is most often a feature built into data backup products, like those from Cohesity, Commvault, Dell EMC, IBM, Veritas and Rubrik. The technology is usually bundled with other data backup efficiency tools, such as data compression, and data recovery tools, such as replication.
Data compression reduces the size of the data set to be stored and is often used in conjunction with deduplication. Compression uses an algorithm to determine whether the string of bits that defines a particular piece of data can be expressed with a smaller string. The efficiency of compression is expressed as a ratio of the original data size to the compressed size, such as 2:1 or 5:1. With a 2:1 compression ratio, for example, a data set of 100 GB would take up only 50 GB in storage.
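The ratio arithmetic can be checked with a short Python sketch. The helper names `compression_ratio` and `stored_size` are hypothetical, and `zlib` from the standard library stands in for whatever compression algorithm a given product uses; the sample data is deliberately repetitive, so its ratio is not representative of real backups.

```python
import zlib


def compression_ratio(original_size, compressed_size):
    """Ratio of original size to compressed size, e.g. 2.0 for 2:1."""
    return original_size / compressed_size


def stored_size(original_size, ratio):
    """Capacity consumed after compression at the given ratio."""
    return original_size / ratio


print(stored_size(100, 2))  # 100 GB at 2:1 -> prints 50.0
print(stored_size(100, 5))  # 100 GB at 5:1 -> prints 20.0

# Measure the ratio actually achieved on a highly repetitive sample.
data = b"backup record " * 10_000
compressed = zlib.compress(data)
ratio = compression_ratio(len(data), len(compressed))
print(f"{ratio:.1f}:1")
```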
While compression algorithms were first developed for tape storage, data deduplication came about to improve disk-based backup.
How inline deduplication works
Whether it uses the hash identifier method or the byte-level comparison method, inline deduplication checks new data that is ready to be sent to storage against data that already exists in storage and doesn't store any of the redundant data it discovers. With the hash method, the inline deduplication software runs algorithms that automatically compute the identifying hashes for incoming data and check them against the hashes of stored data for a match. If there is no match, the data is stored.
Since the process happens during the transfer of data to backup storage, it can create a performance issue that post-process deduplication avoids. In the early days of deduplication, this was a significant concern, but modern processors and memory handle the increased workload of inline deduplication well enough that the performance hit to the backup system is usually small.
Inline deduplication benefits and downsides
Inline deduplication can be either source- or target-based. Source-based deduplication takes place on the host, before the data transfer to the storage target begins. Target-based deduplication takes place at the storage target where the data is stored.
Inline deduplication requires less storage space than post-process deduplication. With post-process deduplication, the data is written to storage before deduplication happens. That requires enough storage space to hold the entire data set, including redundancies. With inline deduplication, no temporary storage space is needed for redundant data.
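One practical consequence of where deduplication runs is how much data crosses the connection to the storage target. The sketch below models that under simplifying assumptions: the `Target` class and both backup functions are hypothetical, chunks are 1 KB, and the small cost of the hash lookups that a source-based client sends to the target is ignored.

```python
import hashlib


def digest(chunk):
    return hashlib.sha256(chunk).hexdigest()


class Target:
    """Hypothetical storage target that counts the bytes it receives."""

    def __init__(self):
        self.chunks = {}
        self.bytes_received = 0

    def has(self, chunk_digest):
        return chunk_digest in self.chunks

    def receive(self, chunk):
        self.bytes_received += len(chunk)
        # The target keeps only one copy of each unique chunk.
        self.chunks.setdefault(digest(chunk), chunk)


def backup_target_based(chunks, target):
    """Every chunk is transferred; duplicates are dropped at the target."""
    for chunk in chunks:
        target.receive(chunk)


def backup_source_based(chunks, target):
    """The host checks each hash first and transfers only unknown chunks."""
    for chunk in chunks:
        if not target.has(digest(chunk)):
            target.receive(chunk)


chunks = [b"A" * 1024, b"A" * 1024, b"A" * 1024, b"B" * 1024]

target_side = Target()
backup_target_based(chunks, target_side)

source_side = Target()
backup_source_based(chunks, source_side)

print(target_side.bytes_received, len(target_side.chunks))  # prints: 4096 2
print(source_side.bytes_received, len(source_side.chunks))  # prints: 2048 2
```

Both approaches end up storing the same two unique chunks; in this toy model they differ only in how many bytes were transferred to get there.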
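The capacity difference can be illustrated with a small simulation. It is a sketch under stated assumptions: the function names are hypothetical, chunks are identified by SHA-256, and the post-process case models a single backup landing on otherwise empty storage.

```python
import hashlib


def inline_capacity(chunks):
    """Bytes consumed when duplicates are dropped before they are written."""
    seen = set()
    stored = 0
    for chunk in chunks:
        chunk_id = hashlib.sha256(chunk).digest()
        if chunk_id not in seen:
            seen.add(chunk_id)
            stored += len(chunk)
    # Capacity only ever grows here, so the final size is also the peak.
    return stored


def post_process_capacity(chunks):
    """Return (peak, final) bytes when deduplication runs after the write."""
    peak = sum(len(chunk) for chunk in chunks)  # everything lands on disk first
    unique = {hashlib.sha256(chunk).digest(): chunk for chunk in chunks}
    final = sum(len(chunk) for chunk in unique.values())
    return peak, final


chunks = [b"A" * 1024, b"B" * 1024, b"A" * 1024, b"A" * 1024]

print(inline_capacity(chunks))        # prints: 2048
print(post_process_capacity(chunks))  # prints: (4096, 2048)
```

Both approaches finish at the same size, but the post-process target must be provisioned for the larger peak.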
Since inline deduplication happens before or during data transfer, it can create a data bottleneck, slowing down the transfer process. Post-process deduplication enables a more rapid backup, because the data isn't touched for deduplication until it reaches storage.
Inline deduplication in primary storage
The increased use of flash-based primary storage has led to a rise in the use of inline deduplication in primary storage. Benefits include:
- reduced capacity demands for primary storage;
- fewer redundancies before data is sent to backup;
- cost savings, since solid-state storage is still more expensive than disk; and
- less wear of flash in SSDs, which can extend the life of the drives.
Vendors such as NetApp's SolidFire division, Nimbus Data and Pure Storage offer flash-based primary storage products with inline deduplication built in.
Data Domain and inline deduplication
Data Domain was a pioneer in inline data deduplication. The company sold hardware appliances that used its deduplication software to provide an alternative to tape backups. Storage giant EMC acquired Data Domain for $2.1 billion in 2009 after a bidding war with NetApp. EMC integrated Data Domain products into its existing line of deduplication offerings. When Dell acquired EMC in 2016, the newly named Dell EMC continued to sell and update the Data Domain product line.
EMC also bought another deduplication pioneer, Avamar, which sold dedupe software. Today, Dell EMC Avamar is part of the vendor's Data Protection Suite of software. Avamar performs source-based dedupe, while Data Domain is target-based.