Protecting petabytes: Best practices for big data backup

How do you protect massive data sets? Learn the best practices and products used for big data backup and disaster recovery.

[This story was updated February 2013] One of the issues created by holding voluminous data sets (also called "big data") in your storage environment is how to protect that data.

Petabyte-size data stores can play havoc with backup windows, and traditional backup is not designed to handle millions of small files. The good news is that not all big data information needs to be backed up in the traditional way.

Nick Kirsch, chief technology officer for EMC’s Isilon scale-out NAS platform, said it helps to be more intelligent about the data you have when doing big data backups. He advises that, before you consider how to protect your data, you should take a good look at what data actually needs protection. Machine-generated data – report data from a database, for instance – can often be reproduced more easily than it can be backed up and recovered.

You may need a larger secondary storage system, additional bandwidth and longer backup windows as you try to protect big data stores.

Compare the cost of protecting your data with the cost of regenerating it. Kirsch said that, in many instances, the source data will need to be protected, but data produced by post-acquisition processing may be cheaper to regenerate than to protect.
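Kirsch’s protect-versus-regenerate trade-off can be sketched as a back-of-envelope calculation. All figures and function names below are illustrative assumptions, not from the article:

```python
# Back-of-envelope comparison: protect derived data, or regenerate it on demand?
# Every number here is an illustrative assumption, not a figure from the article.

def yearly_protection_cost(tb, cost_per_tb_year):
    """Cost of backing up and retaining a derived data set for a year."""
    return tb * cost_per_tb_year

def expected_regeneration_cost(compute_cost, p_loss_per_year):
    """Expected yearly cost of re-running the pipeline after a loss."""
    return compute_cost * p_loss_per_year

protect = yearly_protection_cost(tb=500, cost_per_tb_year=30)                   # $15,000
regen = expected_regeneration_cost(compute_cost=8_000, p_loss_per_year=0.05)    # $400

print("protect" if protect < regen else "regenerate")  # prints "regenerate"
```

Under these (hypothetical) numbers, re-running the pipeline after a rare loss is far cheaper than paying to protect the derived data every year; the source data, which cannot be regenerated, still gets backed up.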

Data protection

For protection against user or application error, Ashar Baig, a senior analyst and consultant with the Taneja Group, said snapshots can help with big data backups.

Baig also recommends a local disk-based system to handle first-level data-recovery problems quickly and simply. “Look for a solution that provides you an option for local copies of data so that you can do local restores, which are much faster,” he said. “Having a local copy, and having an image-based technology to do fast, image-based snaps and replications, does speed it up and takes care of the performance concern.”
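The fast, space-efficient local copies Baig describes can be approximated at small scale with hard-link snapshots, the same idea behind `rsync --link-dest`. A simplified sketch, not any vendor’s implementation (it ignores deletions, symlinks and permissions):

```python
import os
import shutil

def snapshot(source, prev_snap, new_snap):
    """Create a space-efficient local snapshot of `source` in `new_snap`.

    Files unchanged since the previous snapshot are hard-linked to it,
    so they consume no new disk space; only modified files are copied.
    Simplified sketch of the hard-link snapshot technique.
    """
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(new_snap, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(new_snap, rel, name)
            old = os.path.join(prev_snap, rel, name) if prev_snap else None
            if old and os.path.exists(old) and \
               os.path.getmtime(old) >= os.path.getmtime(src):
                os.link(old, dst)        # unchanged: share the blocks
            else:
                shutil.copy2(src, dst)   # changed: physically copy it
```

Because each snapshot is a complete directory tree, a local restore is just a file copy back from the most recent snapshot, which is what makes this style of first-level recovery so fast.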

If you’re shopping for a new backup system for big data, Baig suggested taking into account your current backup equipment and software.

“Anything you buy [for big data] has to be a bolt-on technology to your existing [systems],” Baig elaborated. “That’s real world. That’s what admins live and breathe.”

Jeff Echols, senior director of product and solution marketing for backup software vendor CommVault, said his big data customers are using or investigating tape systems and cloud providers for offsite data protection. Some keep legacy tape systems because of their low cost or existing infrastructure, but shift them to an archival role rather than using them as the primary backup system.

Faster scanning needed

One of the issues big data backup systems face is the file system scan required at the start of every job. Legacy data protection systems scan the file system each time a backup job runs, and again each time an archiving job runs. For the file systems in big data environments, this can be prohibitively time-consuming.

“The way the backup guys have always done it is they’ve had to scan that file system every time they’re going to run a backup,” said CommVault’s Echols. “If it’s a full backup, if it’s an incremental backup, there is still a scanning process here that you’ve got to finish. That scan time is getting to be the killer in the whole operation.”
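The per-job scan Echols describes boils down to statting every file to find what changed. A minimal sketch of why even an incremental backup pays the cost of the full walk:

```python
import os

def incremental_candidates(mount_point, last_backup_time):
    """Find files changed since the last backup the way legacy backup
    software does: by statting every file in the file system.

    Even if only a handful of files changed, the walk still touches
    every directory entry, which is why scan time dominates on file
    systems with hundreds of millions of files. Simplified sketch.
    """
    changed = []
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) > last_backup_time:
                    changed.append(path)
            except OSError:
                continue  # file vanished mid-scan
    return changed
```

The cost of the scan is proportional to the total file count, not to the amount of changed data, which is exactly the mismatch big data environments expose.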

CommVault’s answer to the scanning issue in its Simpana data protection software is its OnePass feature. According to CommVault, OnePass is an object-level converged process for collecting backup, archiving and reporting data. The data is collected and moved off the primary system to a ContentStore virtual repository, where the data protection operations are completed.

Once an initial full scan is complete, the CommVault software places an agent on the file system to track changes for incremental backups, making the process even more efficient.
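The agent-based approach amounts to a change journal: instead of rescanning, the incremental job reads a log of paths the agent recorded as they changed. A toy illustration of the idea, not CommVault’s actual implementation (real agents hook OS facilities such as Linux inotify or the NTFS USN journal):

```python
class ChangeJournal:
    """Toy change journal: a resident agent records modified paths as
    they change, so an incremental backup reads the journal instead of
    rescanning the whole file system. Hypothetical sketch only.
    """

    def __init__(self):
        self._changed = set()

    def record(self, path):
        """Called by the agent each time a file is modified."""
        self._changed.add(path)

    def drain(self):
        """Called by the incremental backup job: returns the changed
        paths and resets the journal for the next cycle."""
        changed, self._changed = self._changed, set()
        return sorted(changed)

journal = ChangeJournal()
journal.record("/data/reports/q1.csv")
journal.record("/data/reports/q2.csv")
print(journal.drain())  # the incremental backs up only these two files
```

The incremental’s work is now proportional to the number of changed files rather than the total file count, which is what removes the scan from the critical path.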

Echols said he has also heard from customers about snapshot and replication techniques, but at some point you have to move data off the primary system. Archiving or deleting data reduces the load on the primary system and protects compliance data.

The Research Computing and Cyberinfrastructure (RCC) group at Penn State University found another way to speed scanning. The group installed a solid-state storage array to scan hundreds of millions of files faster, according to PSU systems administrator Michael Fenn.

PSU’s RCC uses IBM’s General Parallel File System (GPFS) connected to a Dell PowerVault MD2000 storage array. GPFS separates the data from the metadata and designates distinct LUNs for each.

Fenn said scanning all those files slowed backups to a crawl, so he moved the file system’s metadata to a Texas Memory Systems RamSan-810 flash storage array. Before that, he was over-provisioning about 200 15,000 rpm SAS drives to get the metadata backed up overnight; that arrangement cut the backup window from 12 to 24 hours down to about six hours. Switching to the flash system reduced it further, to about an hour.

RCC backs up to tape, using IBM’s Tivoli Storage Manager.

“[GPFS] has to look in metadata to find out where the blocks of data are, and check every single file in the file system to see if it was modified since the last time it was backed up,” Fenn said. “Our backups were taking between 12 and 24 hours to run, mostly due to having to scan all those files.”

He said a single RamSan-810 can hit 150,000 IOPS. Running two in a redundant pair increases that to 300,000 IOPS. “We went from 20,000 IOPS to 300,000 IOPS,” Fenn said. “That meant the metadata scan was no longer the limiting factor in our backups.”
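Fenn’s numbers make the scan-time arithmetic concrete. Assume each file needs at least one metadata I/O during the backup walk (an assumption; real scans may need several per file):

```python
# Rough scan-time estimate for RCC's backup walk.
# Assumption: at least one metadata I/O per file; the real count is likely higher.
FILES = 150_000_000

for iops in (20_000, 300_000):
    seconds = FILES / iops
    print(f"{iops:>7} IOPS -> {seconds / 3600:.1f} hours per metadata op per file")
```

At 20,000 IOPS that is roughly two hours per metadata operation per file, so a walk needing several operations per file lands in the 12-to-24-hour range Fenn describes; the 15x jump to 300,000 IOPS shrinks the same work proportionally, taking the scan off the critical path.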

Fenn said RCC backs up about 150 million user-generated files, a small percentage of the total file count, which includes both machine-generated and user-created files.

“That’s a lot of files to scan,” he said. “Some of the data can be regenerated. Users know this file system will be backed up, other file systems won’t be backed up. We have a scratch file system with a couple of million files that we don’t back up. When people put files on that system, they know they might lose it.”

Fenn also places quotas on the file systems that get backed up so “people have to think about what really needs to get backed up.” 
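RCC’s policy of backing up some file systems and declaring others expendable can be expressed as a simple scoping rule. The mount points and flags below are hypothetical, used only to illustrate the approach:

```python
def backup_scope(file_systems):
    """Decide which file systems go into the nightly backup, RCC-style:
    scratch space is excluded outright, and everything else is included
    only if it is flagged for backup. Names and flags are hypothetical.
    """
    return [fs["mount"] for fs in file_systems
            if fs.get("backed_up", True) and not fs.get("scratch", False)]

systems = [
    {"mount": "/home",    "backed_up": True, "quota_gb": 100},
    {"mount": "/scratch", "scratch": True},   # users know they may lose this
    {"mount": "/archive", "backed_up": True, "quota_gb": 500},
]
print(backup_scope(systems))  # prints ['/home', '/archive']
```

Combined with quotas on the protected file systems, this keeps the backed-up file count, and therefore the scan time, bounded.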

Casino doesn’t want to gamble on backups

Pechanga Resort & Casino in Temecula, Calif., went live with a cluster of 50 EMC Isilon X200 nodes in February to back up data from its surveillance cameras. The casino has 1.4 PB of usable Isilon storage to keep the data, which is critical to operations because the casino must shut down all gaming operations if its surveillance system is interrupted.

“In gaming, we’re mandated to have surveillance coverage,” said Michael Grimsley, director of systems for Pechanga Technology Solutions Group. “If surveillance is down, all gaming has to stop.”

If a security incident occurs, the IT team pulls footage from the X200 nodes, moves it to WORM-compliant storage and backs it up with NetWorker software to EMC Data Domain DD860 deduplication target appliances. The casino doesn’t need tape for WORM capability because WORM is part of Isilon’s SmartLock software.

“It’s mandatory that part of our storage includes a WORM-compliant section,” Grimsley said. “Any time an incident happens, we put that footage in the vault. We have policies in place so it’s not deleted.”

The casino keeps 21 days’ worth of video on Isilon before recording over the video.
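A rolling 21-day retention window like the casino’s can be sketched as a pruning job. This is a hypothetical illustration only: a real surveillance recorder overwrites in place, and vaulted (WORM) footage lives elsewhere and is never touched:

```python
import os
import time

RETENTION_DAYS = 21  # per the casino's stated policy; paths are hypothetical

def prune_expired_video(video_dir, now=None):
    """Delete surveillance clips older than the retention window.

    Sketch of a rolling 21-day retention policy. Footage moved to the
    WORM vault after an incident would live outside `video_dir` and is
    never subject to this job.
    """
    now = now or time.time()
    cutoff = now - RETENTION_DAYS * 86400
    removed = []
    for name in os.listdir(video_dir):
        path = os.path.join(video_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Keeping the retention job separate from the vault is what lets ordinary footage expire on schedule while incident footage remains immutable.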

Grimsley said he is looking to expand the backup for the surveillance camera data. He’s considering adding a bigger Data Domain device to do day-to-day backup of the data. “We have no requirements for day-to-day backup, but it’s something we would like to do,” he said.

Another possibility is adding replication to a DR site so the casino can recover quickly if the surveillance system goes down.

Scale-out systems can help

Another option for solving the performance and capacity issues is a scale-out backup system, similar to scale-out NAS but built for data protection. You add nodes with additional performance and capacity resources as the amount of protected data grows.

“Any backup architecture, especially for the big data world, has to balance the performance and the capacity properly,” said Jeff Tofano, Sepaton Inc.’s chief technology officer. “Otherwise, at the end of the day, it’s not a good solution for the customer and is a more expensive solution than it should be.”

Sepaton’s S2100-ES2 modular virtual tape library (VTL) was built for data-intensive large enterprises. According to the company, its 64-bit processor nodes back up data at up to 43.2 TB per hour, regardless of the data type, and can store up to 1.6 PB. You can add up to eight performance nodes per cluster as your needs require, and add disk shelves for capacity.

The S2100-DS3 was built for branch-office data protection, with replication back to the enterprise system or to a disaster recovery (DR) site. It backs up data at up to 5.4 TB per hour and offers remote backup, deduplication, replication and restore management.

Both Sepaton systems also include Secure Erasure technology for the auditable destruction of VTL cartridges to free up disk capacity as data retention requirements expire.

Protecting a big data environment requires new thinking about how to use old tools, and considering new technologies that will keep pace with your data growth. Finding ways to reduce the data you must protect and scaling your protection environment are some of the keys to making sure your critical data is safe from simple and catastrophic system failures.

(Senior news director Dave Raffo contributed to this story).
