Five questions to ask in a data deduplication project

When scoping a customer's data deduplication project, explore five key questions with the customer to determine the best approach. Learn what to ask and why each question matters.

By Martha Young, Contributor

Service provider takeaway: Service providers should explore five important questions with customers interested in implementing data deduplication, to help gain a better understanding of project scope.

Data deduplication is as much a business consideration as it is a technical concern. From a business perspective, deduplicating data adds value in three ways: it improves in-line performance and data integrity, enhancing the value of a business's intellectual property; it reduces the time required for backup and recovery, an important consideration for customers looking at business continuity and disaster recovery solutions; and it reduces the cost of physical storage, including hardware acquisition, management and administration, and energy consumption.

With technology budgets coming under intense scrutiny, data deduplication is an obvious area worth investing in and implementing for a near-term return on investment.

There are several considerations your customers must take into account when first investigating data deduplication options. Here are the questions to be explored.

What types of files need to be stored?

In today's business world, users are generating vast amounts of intellectual property across a wide variety of media. Firms need to address the unique file storage requirements for voice, video, data, electronic mail, instant messaging, mobile computing and other types of files. File type is important in the data deduplication equation because it can indicate differences in file size. For instance, a streaming video file requires substantially more storage than an email document and, consequently, more bandwidth to transfer to storage. If a service provider is supporting a lot of video, a localized solution will make more economic sense.
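
The link between file size and bandwidth can be made concrete with simple arithmetic. The sketch below is illustrative only: the file sizes and the 10 Mbps link speed are hypothetical figures, not measurements, and the calculation ignores protocol overhead and link contention.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Idealized transfer time: payload bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


# Hypothetical figures chosen only to show the difference in scale.
EMAIL_BYTES = 75 * 1024          # a 75 KB email message
VIDEO_BYTES = 2 * 1024 ** 3      # a 2 GB video file
WAN_LINK_BPS = 10_000_000        # a 10 Mbps WAN link

email_time = transfer_seconds(EMAIL_BYTES, WAN_LINK_BPS)
video_time = transfer_seconds(VIDEO_BYTES, WAN_LINK_BPS)

print(f"email: {email_time:.2f} s, video: {video_time / 60:.1f} min")
```

Under these assumptions the video takes minutes where the email takes a fraction of a second, which is why video-heavy customers tend to favor keeping the data local.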

How long do the files need to be stored?

The answer to this question depends on the regulations your customer must comply with. Data storage and accessibility regulations include the Sarbanes-Oxley (SOX) Act, the Health Insurance Portability and Accountability Act (HIPAA) and the Gramm-Leach-Bliley Act (GLBA). In general, there is a mountain of regulations requiring data backup, recovery, accessibility and security. Each regulation has its own framework and objectives that your customers must be able to meet. If all of the varieties of communication need to be stored in excess of 50 years, then data deduplication is mandatory, if only from a manageability and retrieval perspective.

Where will data deduplication be conducted?

There are only two places where deduplication can be conducted: at the source or in a storage appliance. Data deduplication at the source offers two key benefits: it reduces the amount of disk space needed to store the backups, and it reduces the network bandwidth required to back up a given set of data. The drawback to deduplication at the source is the impact on the server, since the process consumes a significant number of compute cycles on each server.
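
To illustrate why source-side deduplication saves both disk space and bandwidth at the cost of CPU work, here is a minimal sketch that assumes fixed-size chunks identified by SHA-256 digests. It is not how any particular product works; real systems often use variable-size chunking and persistent indexes, and the function names and sample data here are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}    # digest -> chunk bytes (unique chunks only)
    recipe = []   # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Hashing every chunk is the CPU cost paid on the source server.
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique chunks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical backup stream: one 4 KB block repeated 100 times, plus one unique block.
backup = b"A" * 4096 * 100 + b"B" * 4096
store, recipe = deduplicate(backup)
unique_bytes = sum(len(chunk) for chunk in store.values())

print(f"original: {len(backup)} bytes, unique chunks: {unique_bytes} bytes")
```

Only the unique chunks and the short recipe would need to cross the network, which is where the bandwidth saving comes from.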

Some companies have opted to transfer the compute cycle requirements to a storage appliance and conduct their data deduplication there. This eliminates the agent footprint and the CPU cycle impact on the server, but it does add another device or set of devices to the network that will need to be monitored, maintained and managed.

When deciding where deduplication should take place, it's important to consider the geographical distribution of the company. For a customer with numerous branch offices, it makes economic sense to deduplicate on a local level and reduce the overall impact on the WAN. For a customer that leverages a data center, deduplicating within an appliance makes sense because it allows the customer to continue using existing backup methods and procedures while reducing the server performance impact.

Which deduplication approach is preferred: software-based or hardware-based?

Data deduplication can be performed using either a software-based solution or a hardware-based solution. A software-based solution enables companies to eliminate data redundancy directly at the source. As noted, a software-based solution does carry the burden of installing an agent on each server, as well as a substantial CPU cycle impact. Software-based solutions are relatively inexpensive to deploy compared with hardware solutions, but they do require ongoing maintenance to keep the clients and agents up to date. A software-based solution would be ideal in small and medium-sized businesses (SMBs), as well as within large enterprises that are geographically distributed.

Deduplication appliances, on the other hand, are ideal for a data center environment. An appliance solution offloads the transactional processing and the resulting CPU impact from the server. Deduplication appliances have a reputation for high performance and scalability, but companies considering an appliance-based solution need to consider the bigger-picture impact on bandwidth utilization as well as increased network complexity. A hardware-based data deduplication solution is optimized for the data center environment: In addition to offloading server CPU cycles, an appliance in the data center can be integrated with other storage platforms to maximize storage usage.

Will data be encrypted and, if so, when?

When it comes to encryption, compression and data deduplication, the order of execution is critical. Compression eliminates redundancy in files (thereby reducing file size). Deduplication eliminates redundant files. Encryption converts data into what appears to be a random data stream. If a company encrypts its data prior to transmission, it may become impossible to compress or deduplicate it, which would unnecessarily inflate the amount of storage required, as well as the associated costs. To optimize your customer's storage infrastructure, advise them to compress, deduplicate and then encrypt their files. Following this order of operations means that compression and deduplication must take place at the server, and the data must then be encrypted before it is transmitted.
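
The effect of ordering can be demonstrated with a short experiment. The sketch below uses zlib for compression and a toy keystream cipher as a stand-in for real encryption. The toy cipher is an assumption made so the example runs with the standard library alone; it is not secure and must not be used to protect data. The sample data and key are hypothetical.

```python
import hashlib
import os
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy stand-in for a real cipher (NOT secure): XOR with a SHA-256 counter keystream."""
    nonce = os.urandom(16)  # a fresh random nonce per call, as real ciphers typically use
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


# Hypothetical, highly redundant file contents.
data = b"quarterly report " * 5000
key = b"example-key"

compress_then_encrypt = toy_encrypt(zlib.compress(data), key)
encrypt_then_compress = zlib.compress(toy_encrypt(data, key))

print(f"original:              {len(data)} bytes")
print(f"compress then encrypt: {len(compress_then_encrypt)} bytes")
print(f"encrypt then compress: {len(encrypt_then_compress)} bytes")

# Two encryptions of identical data no longer match, so a deduplication
# engine comparing content would treat them as different.
print("ciphertexts identical:", toy_encrypt(data, key) == toy_encrypt(data, key))
```

Compressing first shrinks the redundant data dramatically, while compressing after encryption leaves it essentially at full size, and identical inputs stop looking identical once encrypted.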

As companies seek to achieve data storage and retrieval regulatory compliance at the lowest possible cost, these five questions should be addressed during the data deduplication decision process. And once a solution is chosen, you should help your clients evaluate whether the implementation will meet their business goals and objectives.

About the author
Martha Young is co-founder and CEO of Nova Amber LLC, a business consulting company specializing in business process virtualization. She has co-authored three books on virtual business processes: The Case for Virtual Business Processes, The Virtual Worker's Handbook and iExec Enterprise Essentials Companion Guide.
