Data compression vs. deduplication
Compression and deduplication both have a role to play when it comes to improving the backup process and cutting storage costs.
Backup administrators rely on efficient processes and economical use of storage space. Compression and deduplication are two related but distinct techniques that can help.
Backing up files is critical, and creating copies of data is a major part of that. This can lead to backup processes congesting the network or causing slow access to resources. With continued focus on availability metrics like recovery time objectives, strong data management capabilities are essential to keep extraneous copies of data from slowing down performance.
There are two primary data reduction approaches administrators use: data compression and deduplication. Also used in file server storage and general data management, compression and deduplication can help maintain more efficient backup processes. While compression reduces the size of files by eliminating redundant information, deduplication replaces that information with pointers to a single source.
This article will cover more about how compression and deduplication work, their advantages and disadvantages, and use cases for both methods.
What is data compression?
Data compression encodes data to reduce its size. The general approach removes redundant or unneeded information to reduce the file size. The result is a more efficient use of storage capacity and network bandwidth.
There are two types of compression: lossy and lossless. Lossy compression permanently discards some data, which can reduce quality but yields a higher compression ratio. Lossless compression discards no information, so the original data can be restored exactly, but it typically achieves a lower compression ratio than lossy compression.
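The lossless case can be illustrated with a short Python sketch using the standard library's zlib module. The sample payload and the compression level are assumptions chosen for illustration, not values from any particular backup product.

```python
import zlib

# Hypothetical sample: a repetitive, log-style payload that compresses well.
original = b"2024-01-01 backup job completed successfully\n" * 1000

compressed = zlib.compress(original, level=6)  # level 6 is an assumed middle-of-the-road setting
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
assert restored == original

ratio = len(original) / len(compressed)
print(f"original={len(original)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}:1")
```

The round trip returns exactly the bytes that went in, which is the property backups depend on.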
Data compression offers several advantages to administrators, including the following:
- Saving storage space, reducing costs.
- Speeding up network file transfers.
- Improving the performance of backup jobs and restore operations.
- Optimizing data management.
While data compression might offer performance improvements and high space savings, it has its downsides. For one, compression is a CPU-intensive activity, which means it can slow systems while it runs. Compressed data can also become corrupted, which could damage mission-critical files. Finally, it can be difficult to predict the savings associated with compression because they depend heavily on the data being compressed.
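A rough way to see why savings are hard to predict is to measure the ratio on different kinds of data. The sketch below is illustrative only: the helper name compression_ratio and both samples are assumptions, with random bytes standing in for content that is already compressed or encrypted.

```python
import os
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size (higher means more savings)."""
    return len(data) / len(zlib.compress(data, level))

samples = {
    "repetitive text": b"status=OK host=server01 job=nightly\n" * 2000,
    "random bytes": os.urandom(70_000),  # stands in for encrypted or already-compressed data
}

for name, data in samples.items():
    print(f"{name}: {compression_ratio(data):.2f}:1")
```

The repetitive sample shrinks dramatically, while the random sample does not shrink at all, even though both consume CPU time to process.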
What is data deduplication?
Data deduplication also reduces or removes redundant information, but differently from compression. It replaces redundant information with pointers to a single data source rather than using multiple copies. Like compression, deduplication offers benefits such as storage savings and increased backup efficiency.
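The pointer idea can be sketched as a toy Python store in which each unique chunk is kept once and every file is just an ordered list of chunk hashes. The class name DedupStore, the fixed 4 KiB chunk size and the file names are all hypothetical; real products differ, and some use variable-size chunks.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size

class DedupStore:
    """Toy store: unique chunks are kept once; files are lists of chunk hashes (pointers)."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes
        self.files = {}   # file name -> ordered list of hashes

    def put(self, name: str, data: bytes) -> None:
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if not seen before
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name: str) -> bytes:
        """Rebuild a file by following its pointers."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
report = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
store.put("monday.bak", report)
store.put("tuesday.bak", report + b"C" * CHUNK_SIZE)  # only one chunk is new

logical = len(store.get("monday.bak")) + len(store.get("tuesday.bak"))
print(f"logical={logical} bytes, stored={store.stored_bytes()} bytes")
```

Five chunks are written across the two files, but only three distinct chunks are kept, and both files can still be rebuilt in full.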
Administrators configure deduplication to happen at either the source or the target. With source deduplication, the deduplication process occurs before data is sent to the storage repository. With target deduplication, the process occurs at the storage target.
For example, an administrator could configure target deduplication for backup jobs stored in the cloud, offloading the processor performance hit to cloud-based resources rather than local servers, and preventing users from feeling the effects.
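The difference between the two placements comes down to where hashing happens and how much data crosses the network. The following Python sketch models both under simplifying assumptions: the function names, the 4 KiB chunk size and the in-memory dictionary standing in for the storage target are hypothetical, and the small cost of sending hashes is ignored.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size

def chunk_hashes(data: bytes):
    """Split data into fixed-size chunks and return (sha256 hex digest, chunk) pairs."""
    return [
        (hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest(), data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def backup_source_side(data: bytes, target_index: dict) -> int:
    """Source deduplication: hash locally, send only chunks the target lacks. Returns bytes sent."""
    sent = 0
    for digest, chunk in chunk_hashes(data):  # hashing cost is paid on the source
        if digest not in target_index:
            target_index[digest] = chunk
            sent += len(chunk)
    return sent

def backup_target_side(data: bytes, target_index: dict) -> int:
    """Target deduplication: send everything; the target hashes and drops duplicates. Returns bytes sent."""
    for digest, chunk in chunk_hashes(data):  # hashing cost is paid on the target
        target_index.setdefault(digest, chunk)
    return len(data)

data = b"X" * CHUNK_SIZE * 8 + b"Y" * CHUNK_SIZE * 2  # 10 chunks, only 2 distinct

repo_for_source_run, repo_for_target_run = {}, {}
print("source-side bytes sent:", backup_source_side(data, repo_for_source_run))
print("target-side bytes sent:", backup_target_side(data, repo_for_target_run))
```

Both runs end with the same two chunks stored. The source-side run sends less data but spends CPU on the machine being backed up, while the target-side run sends everything and moves the CPU work to the storage side.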
Depending on the file type, deduplication can dramatically reduce storage requirements. When Microsoft first integrated deduplication with Windows Server, it reported space savings ranging from 30% to 95% for files such as user documents and virtualization libraries.
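To put that range in concrete terms, the arithmetic can be sketched in a few lines of Python. The 10 TB data set is a hypothetical figure chosen for illustration, and stored_size_gb is an illustrative helper, not part of any product.

```python
def stored_size_gb(logical_gb: float, savings_percent: float) -> float:
    """Physical capacity needed after deduplication at a given savings percentage."""
    return logical_gb * (100 - savings_percent) / 100

logical_gb = 10_000  # hypothetical 10 TB data set
for pct in (30, 95):
    print(f"{pct}% savings -> {stored_size_gb(logical_gb, pct):,.0f} GB stored")
```

At the low end of the range, the same data needs 7,000 GB of physical capacity; at the high end, it needs 500 GB.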
In addition to storage cost savings, benefits of deduplication include the following:
- Reducing the amount of data for backup jobs, causing them to take less time.
- Reducing storage space required for backup jobs.
- Reducing network utilization due to smaller backups.
However, like data compression, deduplication has its challenges. It is CPU-intensive, deduplicated data is not immune to corruption, and despite estimates, it can be difficult to predict associated cost savings. Additionally, managing deduplication is a complex task and the method has limited effectiveness on some file formats.
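One way to get a rough feel for how much a data set will deduplicate is to count repeated chunks before committing to a design. The Python sketch below is only an estimate under stated assumptions: the helper name dedup_savings_percent and the fixed 4 KiB chunk size are hypothetical, and random bytes are used as a stand-in for formats that deduplicate poorly, such as compressed or encrypted content, which is an assumption about which formats are affected.

```python
import hashlib
import os

def dedup_savings_percent(data: bytes, chunk_size: int = 4096) -> float:
    """Percentage of fixed-size chunks that duplicate an earlier chunk."""
    total = 0
    unique = set()
    for i in range(0, len(data), chunk_size):
        unique.add(hashlib.sha256(data[i:i + chunk_size]).digest())
        total += 1
    return 100 * (total - len(unique)) / total if total else 0.0

# 90 identical blocks plus 10 random ones, loosely imitating a sparse disk image.
vm_like = (b"\x00" * 4096) * 90 + os.urandom(4096 * 10)
# 100 random blocks, standing in for compressed or encrypted content.
media_like = os.urandom(4096 * 100)

print(f"vm_like: {dedup_savings_percent(vm_like):.0f}%")
print(f"media_like: {dedup_savings_percent(media_like):.0f}%")
```

The first sample deduplicates heavily and the second barely at all, which is why estimates made without sampling the actual data can be far off.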
Use cases for compression vs. deduplication
When it comes to compression and deduplication, backup administrators do not have to choose just one. Depending on the types of files being reduced, organizations might use compression for some and deduplication for others. However, administrators must be aware that combining the two techniques can significantly impact CPU performance and might also reduce write throughput on storage devices. Careful planning and the right hardware can help mitigate both concerns.
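Combining the two techniques can be sketched in Python as deduplicating first and then compressing only the unique chunks. That ordering is an assumption made for illustration, as are the function name dedup_then_compress, the 4 KiB chunk size and the CSV-like sample data.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # hypothetical chunk size

def dedup_then_compress(data: bytes) -> dict:
    """Deduplicate fixed-size chunks, then compress each unique chunk. Returns hash -> compressed chunk."""
    store = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # hashing cost is paid for every chunk
        if digest not in store:
            store[digest] = zlib.compress(chunk)    # compression cost is paid once per unique chunk
    return store

block = b"customer_id,order_id,amount\n" * 200  # hypothetical CSV-like content
data = block[:CHUNK_SIZE] * 50                   # the same 4 KiB chunk repeated 50 times

store = dedup_then_compress(data)
physical = sum(len(c) for c in store.values())
print(f"logical={len(data)} bytes, unique chunks={len(store)}, physical={physical} bytes")
```

Every chunk is hashed and every unique chunk is compressed, so the CPU cost of both techniques is paid on the write path, which is the trade-off the paragraph above describes.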
Of the two options, most people are already familiar with data compression in some form, even if it's just emailing or downloading ZIP files. Deduplication is not as well known, and it often occurs behind the scenes. For example, administrators can configure deduplication to run during off-peak hours, unnoticed by end users.
Compression is best used for the following:
- Individual files rather than full partitions or volumes.
- Files like images, multimedia and databases.
- Efficient network transmissions, such as large file downloads.
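For the individual-file case, a minimal Python sketch using the standard library's zipfile module is shown below. The file name and contents are hypothetical, and the archive is built in memory so the example touches no real paths.

```python
import io
import zipfile

# Hypothetical file name and contents.
name = "report.csv"
contents = b"date,host,status\n" + b"2024-01-01,server01,OK\n" * 5000

# Write a single-file ZIP archive into an in-memory buffer using DEFLATE.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(name, contents)

# Reopen the archive to inspect sizes and verify the contents.
with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo(name)
    restored = archive.read(name)

print(f"{name}: {info.file_size} bytes -> {info.compress_size} bytes in the archive")
```

The smaller archive is what would be stored or sent over the network, and reading it back yields the original file contents.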
Deduplication is suited to the following:
- Storage containing lots of redundant information, such as backup or virtual machine image repositories.
- Cloud storage and large file servers.
- Optimizing backup processes and reducing costs.
Damon Garn owns Cogspinner Coaction and provides freelance IT writing and editing services. He has written multiple CompTIA study guides, including the Linux+, Cloud Essentials+ and Server+ guides, and contributes extensively to Informa TechTarget, The New Stack and CompTIA Blogs.