Data dedupe software comes of age

Dedupe appliances were the big dog for a long time, but software-based products have come on strong more recently, offering many useful capabilities, often at a lower cost than an appliance.

Data deduplication has become one of the most important topics in data backup and recovery because it offers simplification and cost savings for a relatively modest investment.

Perhaps most crucially, just about all backup software products now include deduplication as a feature (Hewlett-Packard Data Protector is a rare exception), making deduplication easily accessible.

The advantages of data dedupe software vs. an appliance

Lauren Whitehouse, analyst at Enterprise Strategy Group, offers a long list of pluses associated with data dedupe software:

  • Deduplication policies are integrated in overall backup policies so there is no need to set deduplication policies in a separate interface; the software offers a single point of management.
  • Deduplication in backup software allows deduplication to happen closer to the source of data (at the production system or at the backup media server). Deduplication processing can then be distributed across the environment instead of taking place at a consolidation point (such as an appliance).
  • Global deduplication is more likely to be available in data dedupe software.
  • Backup software understands the actual data, so it is content-aware. Appliances sit on the receiving end of the backup data stream and are not content-aware unless the appliance vendor reverse engineers the backup format, she said. “Content-awareness allows deduplication software to understand where the natural pattern breaks are in the data stream so they can drive higher deduplication ratios,” she added.
  • Any actions on deduplicated data are tracked by the backup catalog, which streamlines recovery. Copies made via the replication features of appliances can't be tracked unless the user is running Symantec NetBackup or Symantec Backup Exec with OpenStorage (OST) technology and the appliance supports OST.
  • Scaling deduplication is usually easier, unless the appliance itself scales seamlessly, as do the grid-architecture appliances from ExaGrid Systems, NEC and Sepaton Inc.
  • Licensing varies, but deduplication in software can be more cost effective, especially if it's a no-charge feature.
  • Disk vendor selection flexibility is greater since software uses existing disk and the user can pick any vendor's storage system.
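
The “natural pattern breaks” idea in the list above can be illustrated with a minimal Python sketch of content-defined chunking plus a hash-indexed chunk store. This is a toy under stated assumptions, not the method of any product named in this article: the function names (split_chunks, dedupe, restore), the simple shift-and-add checksum and the chunk-size parameters are all hypothetical and chosen only for readability.

```python
import hashlib


def split_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Cut data into variable-size chunks at content-defined boundaries."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy rolling checksum (an assumption for illustration):
        # older bytes shift out of the 32-bit window as new ones arrive.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        # A boundary is declared when the checksum matches a bit pattern,
        # so cut points follow the content rather than fixed offsets.
        at_pattern_break = size >= min_size and (rolling & mask) == 0
        if at_pattern_break or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def dedupe(streams):
    """Store each unique chunk once; keep a per-stream list of chunk digests."""
    store = {}    # SHA-256 digest -> chunk bytes
    recipes = []  # one list of digests per input stream
    for data in streams:
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe):
    """Rebuild a stream from its recipe of chunk digests."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    monday = bytes((i * 7 + i // 13) % 251 for i in range(4000))
    tuesday = b"NEW" + monday  # same data with a small insertion up front
    store, recipes = dedupe([monday, tuesday])
    stored = sum(len(chunk) for chunk in store.values())
    print("bytes backed up:", len(monday) + len(tuesday))
    print("bytes stored:   ", stored)
    assert restore(store, recipes[1]) == tuesday
```

Because the cut points depend on the bytes themselves, an insertion near the start of a file tends to disturb only nearby chunks instead of shifting every fixed-size block after it. Running the same chunking on the client or media server is also what lets only new chunks cross the network.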

The advantages of a dedupe appliance

On the other hand, Whitehouse said there are some specific advantages to dedupe appliances. For instance, with an appliance, data deduplication is performed on systems optimized specifically for data deduplication processing. For some kinds of workloads, deduplication performance can be improved this way. Likewise, integration is “often a bit easier,” since the appliance just requires you to set policy configurations, whereas software-based deduplication requires configuration of the media server to provide the processing power needed. 

Of course, appliances also eliminate loads on production servers and can deduplicate data for any backup system environment. “If a site has more than one backup solution and a single deduplication strategy is desired, this would be the way to go,” she said.

David Russell, an analyst at Gartner, has come to similar conclusions but he definitely sees a trend among clients toward dedupe software. For instance, in surveys conducted at recent Gartner conferences, of those planning to implement dedupe, 42 percent said they would use a software approach -- the highest proportion ever recorded by Gartner, and a sharp rise from “the low twenties” a year earlier, he said.

“The thinking is that with software, they can go and buy a high power server and install the code for a cost that is below that for an appliance,” said Russell. Furthermore, noted Russell, “with an appliance you can’t expand in the future without having to worry about what specific appliance model you will need or whether the vendors offer a gateway to the targeted device.”

While Russell endorses the trend, he does see some problems with software-based deduplication. For instance, much depends on how it is deployed. “I have seen organizations that aren’t sure how to size this and build the infrastructure -- if you undersize your disk resources in terms of both the amount of space and the type of disk, that can also degrade the performance of software-based deduplication,” said Russell. “People will blame the software when they are really doing unreasonable things like trying to run dedupe on a server that’s already very busy running Exchange,” he said.

“In other words, the software has advantages but it also gives people the opportunity to shoot themselves in the foot,” he said.

One solution, he said, is to implement appliances for demanding dedupe challenges such as a large database, while using software for lighter and more manageable dedupe work. “Big objects like databases could bog down a server running dedupe, whereas an appliance is optimized for that kind of work,” he said.

Consulting company goes with CommVault Simpana

Paul Slager, director of information systems at LWG Consulting in Northbrook, IL, recently opted for the software approach. The company, which has 16 locations in the U.S. and four globally, focuses on technical disaster consulting for post-disaster issues such as data recovery, mostly on behalf of insurers. It runs many virtual machines and backs up over relatively slow WAN links. As a result, “client-side dedupe and global dedupe and compression were issues we considered,” said Slager.

Slager said he was looking for a backup solution that had good deduplication capability. “Our company takes millions of very high-resolution files that we have to store on our file servers for 12 years -- we wanted to dedupe and compress as much as possible,” he said. He initially looked at IBM Corp. Tivoli Storage Manager, CommVault, and EMC Corp. Avamar and Data Domain, and decided the dedupe software option was the cheapest. The software option also allowed him to use his own storage. “With a storage appliance, you usually get a cost-per-gigabyte or terabyte that is pretty high even for basic SATA and, for Fibre Channel drives, the vendors tend to jack the price up quite a bit more,” he said.

Ultimately, he selected CommVault Simpana 9.0, which “seemed to have it all.” Slager said that in addition to its dedupe capabilities, he valued the “simple things,” such as Simpana’s ability to suspend and resume a backup. That was helpful because of the company’s slow WAN. “Now, if one server goes down I don’t have to restart the whole backup,” he explained.

Slager said he thinks his experience with CommVault was also helped by preparation. “At the start, I laid out a framework and collected a list of requirements including how it would affect our storage footprint, how it would fit with disaster recovery plans and our recovery point and recovery time objectives,” he said.

Slager said he went into the process realizing that getting high dedupe ratios would be difficult with any product. However, he was hopeful his virtual machines, databases and Exchange data would dedupe well. “I actually ran the EMC assessment tool for Avamar to see what our dedupe ratios would be and to figure out how much backup storage we needed to procure,” he said. That tool yielded an estimate of 5 TB of space for the initial seed. Results with CommVault beat that estimate: he said the initial seed in the CommVault deployment turned out to be only 3.2 TB. And dedupe ratios have been as good as or better than he expected.

So far, Slager said he has logged the following ratios in service:

  • Virtual machines running on VMware have a dedupe ratio of 84%, “which is really good,” he said.
  • Microsoft Exchange 2010 backups have a dedupe ratio of 68%
  • Physical Windows servers have a dedupe ratio of 77%
  • File shares with all of the image files have a dedupe ratio of 11%
  • Microsoft SQL database backups have a dedupe ratio of 64%
  • Microsoft SharePoint database and documents have a dedupe ratio of 57%

Those numbers seem to be in accord with what analysts see. For example, Whitehouse said although vendor claims are “all over the map” from 500:1 to a “more normal” 20:1, Enterprise Strategy Group research found that most end-users using deduplication cite a ratio of between 10:1 and 20:1, while a smaller percentage have seen higher ratios of 30:1.
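
Slager’s figures are given as percentages, while the analyst figures are given as N:1 ratios, so it can help to put both on one scale. The sketch below assumes the percentages mean space saved (the fraction of backup data that did not have to be stored); the article does not say so explicitly, and the function names are hypothetical. Under that assumption, an 84% saving works out to 6.25:1, and a 20:1 ratio corresponds to a 95% saving.

```python
def savings_to_ratio(percent_saved: float) -> float:
    """Convert a space-savings percentage into an N:1 dedupe ratio.

    Assumes percent_saved is the share of data NOT stored (0 <= p < 100).
    """
    if not 0 <= percent_saved < 100:
        raise ValueError("percent_saved must be in [0, 100)")
    return 100.0 / (100.0 - percent_saved)


def ratio_to_savings(ratio: float) -> float:
    """Convert an N:1 dedupe ratio into a space-savings percentage."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 100.0 * (ratio - 1.0) / ratio


if __name__ == "__main__":
    # Percentages as reported in the list above.
    reported = {
        "VMware virtual machines": 84,
        "Exchange 2010": 68,
        "Physical Windows servers": 77,
        "File shares (images)": 11,
        "SQL databases": 64,
        "SharePoint": 57,
    }
    for workload, percent in reported.items():
        print(f"{workload}: {percent}% saved is about {savings_to_ratio(percent):.2f}:1")
    for ratio in (10, 20, 30):
        print(f"{ratio}:1 is about {ratio_to_savings(ratio):.1f}% saved")
```

The relationship is nonlinear: each additional point of savings near 100% moves the ratio far more than a point near 50%, which is one reason ratio figures quoted by different parties can look so far apart.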

Russell said the fact that software takes care of deduplication at the source will likely mean more growth for that approach. “If you told a network administrator you could reduce traffic by 95 percent through dedupe they would be very impressed,” he said.

However, Russell also said the respective advantages of the appliance and software approaches will persist and may eventually yield a hybrid approach to deduplication. For instance, he noted, EMC has hinted at combining Avamar, which does dedupe, with a Data Domain target appliance. Thus, he said, it would be wise for decision makers to hedge their bets so they will be able to take advantage of a hybrid approach if one emerges in a few years.

About this author: Alan Earls is a frequent contributor to SearchDataBackup.
