Data compression and deduplication have humble beginnings as optional add-ons for capacity-challenged companies, but in the modern era, they are must-have features for almost all storage systems. Together, they have helped to usher in an era of high-performance flash storage that maintains at least a semblance of pricing sanity. Without them, it's unlikely the flash revolution would have been nearly as pervasive.
The impact of these technologies, however, can be hard to pin down. The capacity savings that both compression and deduplication deliver depend heavily on the underlying data being managed. What is sometimes overlooked is that they can have both a positive and a negative effect on storage performance.
On the negative front, both compression and deduplication can require significant CPU resources in order to work their magic. For deduplication, as blocks of data are fed into the storage system, each one is fingerprinted, typically by computing a cryptographic hash, and checked against a master fingerprint table known as the hash table. If the fingerprint already exists, the data block is already stored on the system and isn't written again. If the fingerprint doesn't exist, the block is written as usual and its fingerprint is added to the table.
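The fingerprint-and-lookup flow can be sketched in a few lines of Python. This is a simplified illustration, not how any particular array implements it: it uses SHA-256 as the fingerprint function and a plain dictionary as the hash table, and the `store_block` helper, `hash_table` and `storage` names are invented for the example.

```python
import hashlib

def store_block(block: bytes, hash_table: dict, storage: list) -> bool:
    """Write a block only if its fingerprint is new; return True if a
    physical write actually happened."""
    # Fingerprint the incoming block (illustrative choice: SHA-256).
    fingerprint = hashlib.sha256(block).hexdigest()
    if fingerprint in hash_table:
        # Duplicate: the data already exists, so skip the physical write.
        return False
    # New data: record where the block lives, then write it.
    hash_table[fingerprint] = len(storage)
    storage.append(block)
    return True

table, disk = {}, []
store_block(b"hello world", table, disk)  # new block: written
store_block(b"hello world", table, disk)  # duplicate: write skipped
```

After both calls, only one copy of the block occupies `disk`, which is exactly the capacity saving deduplication promises. The dictionary lookup is also where the extra CPU work and write latency come from.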
This fingerprint-checking process can impose some increased latency on storage write operations -- although, with advances in modern storage systems, this is practically negligible. The deduplication engine itself requires CPU cycles to process the lookup, so it may have some effect on other storage operations. That said, modern CPUs are multicore behemoths with cycles to spare. This was more of a consideration in the past, when processors were far less efficient than those in use today.
From there, whether deduplication is a positive or a negative from a performance perspective becomes murkier and depends on the data itself. If the data contains a lot of duplication, deduplication can be a net positive, since you can skip more write operations entirely. Write operations tend to be the slowest variety, so the fewer you have to perform, the smaller the performance hit you take. For regular workloads, you simply have to make a judgment call: Is the negligible performance impact outweighed by the capacity savings you experience?
Compression is a similar story. You reduce storage capacity consumption at the expense of some CPU cycles. In this case, the underlying data types matter a lot. If you are trying to store data that's already compressed, like certain video and photo formats, compression won't yield any further savings, so you'll just be wasting CPU cycles chasing gains that won't come to fruition. On compressible data, though, you need to make the same assessment as with deduplication: Is the potential CPU impact worth the capacity gain?
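You can see this effect directly with Python's built-in `zlib` module: repetitive data shrinks dramatically, while feeding already-compressed output back through the compressor yields nothing (the result is actually slightly larger, because of format overhead). This is a rough sketch with a deliberately repetitive sample payload, not a benchmark of any real storage system.

```python
import zlib

# Highly repetitive data compresses extremely well.
text = b"storage " * 1000          # 8,000 bytes of repeating content
compressed_once = zlib.compress(text)

# The compressed output is nearly random, so a second pass gains nothing;
# zlib's header and checksum overhead make it slightly bigger instead.
compressed_twice = zlib.compress(compressed_once)

print(len(text), len(compressed_once), len(compressed_twice))
```

Running something like this against a sample of your own data is a cheap way to estimate whether compression is worth enabling for a given workload.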
The best way to determine whether compression and deduplication are right for you is to test them against your own workloads and measure their impact in terms of both cost and performance.