Qumulo Core data-aware arrays help researchers manage sequencer data

Carnegie Institution for Science's embryology department manages research data with Qumulo Core data-aware NAS storage, taking advantage of advanced analytics.

For scientific researchers, a storage array can be as much of a tool of the trade as a microscope or data sequencer.

That is the way Mahmud Siddiqi sees it. Siddiqi, microscopy facility manager for the department of embryology at the Carnegie Institution for Science, spends most of his time managing microscopes and the rest dealing with storage. He sees a connection between those duties.

"As much as I train people how to use microscopes, at this stage it's impossible to do science if you don't know how to manage data," Siddiqi said. "You have to know data handling, how to do backups, or sooner or later you will cry."

The embryology department is based in Baltimore and is part of the Washington, D.C.-based Carnegie Institution, which has won Nobel Prizes for genetics research.

The department has 3.5 full-time IT people, with Siddiqi splitting his time between IT infrastructure and microscopy support. The team's primary storage system handles regular business records, documents and microscope image data generated by researchers.

The image data is the most taxing on the storage system. Carnegie's researchers frequently access data from DNA sequencer instruments, and that data also requires long-term retention. Imaging data ranges from millions of kilobyte-sized files to a small number of files in the hundreds of gigabytes.

The department switched over from EMC Isilon to Qumulo Core hybrid arrays in mid-2016. It acquired four Qumulo Core QC208 4U hybrid arrays with a total of 800 TB of raw capacity -- 550 TB usable -- and about 100 TB of capacity was occupied when the systems went into production. Siddiqi said all data has been replicated off the Isilon arrays, although he expects to repurpose them for non-mission-critical data.

Qumulo Core data analytics helps manage storage effectively

He said advanced data analytics made Qumulo Core stand out from Isilon and other contenders.

"We have between eight and 12 individual labs at any time, and we are constantly accessing sequencing data," he said. "Qumulo will tell us how much storage resources each lab is using. It shows if there is a pattern of access, how we can make workloads faster and if there is a way we can increase efficiency by devoting resources to any one area. We don't do chargeback and give everyone a bill, but Qumulo helps increase our awareness of costs."

Siddiqi said his team retains a good chunk of its scientific data indefinitely. "There is a subset of sequencing data that we expect to keep forever from which we can reconstruct all other data if required," he said. "We're planning to keep that base raw data as long as we can keep it."

Cost played big role in selection

The cloud is an option for long-term archiving, but Siddiqi said the financials don't add up for his team. "We've looked at the cloud but always concluded that it won't be practical for our needs," he said. "The amount of data we would need to upload is quite large, and if we ever need to get it back, it would be so expensive that it isn't worth it."

Siddiqi said he was happy with Isilon storage but felt it was no longer cost-effective. The Isilon platform had matured, so he explored newer technologies -- object storage and Qumulo's data-aware storage -- as well as old NAS mainstays, such as storage based on IBM's Spectrum Scale General Parallel File System.

"Isilon was fast enough and easy to set up," he said. "When you have a 15-node cluster you'll have trouble somewhere along the line, and we had to replace a disk here and there. But generally speaking, the Isilon was nice to us. We would have been perfectly happy to acquire another Isilon, except budgetary restrictions made that difficult.

"There weren't many new features being released on Isilon," Siddiqi said. "We talked to EMC, and it would've been a modern version of the Isilon we already had. And it was going to be expensive."

Software-only versus bundled appliance

The final choice came down to Scality RING object storage and Qumulo. Scality's software-only approach would provide greater flexibility in buying and scaling hardware, but Qumulo's appliance model required less work during implementation.

A consultant recommended Qumulo to Siddiqi, who knew of the startup because its founders were early Isilon engineers. He talked to Qumulo representatives when the vendor launched in 2015. A year later, he found Qumulo made interesting additions to Core, particularly in analytics. He liked that Qumulo shipped on a dedicated appliance, even if he was intrigued by the idea of implementing Scality on hardware of his choice.

"As much fun as it would be to spec our own system, it would be a lot of work and we all have our actual jobs," he said. "There was a strong push for buying an appliance where the software would come from the same vendor and we wouldn't have to worry about it. As much as we like Scality, we ended up going the appliance and filer route with Qumulo.

"If we had more time and someone dedicated to the care and feeding of Scality, it would've been a much harder decision. In the back of my mind I think about putting Scality on the old Isilon hardware. But I should not be thinking about that because I have other things to do."

Siddiqi said he looks forward to several additions to Qumulo Core. He said he is "eagerly awaiting" greater snapshot capability. New Qumulo clusters support snapshots, but they are still unavailable on older clusters. He is also looking for SMB 3 support and more Ethernet ports on the box. He would also like to expand from four to six nodes to take advantage of more advanced erasure coding.

"When you go to six nodes, the [data protection] repertoire expands," he said of Core's erasure coding schemes.
