As data grows in volume and importance, the way companies handle their data also must change.
Several recent surveys show how this is happening. One change is that many businesses have already begun making the jump from data backup to data reuse. Another is that they are taking steps to better index and classify data as it grows into the petabyte range.
A survey by IT analyst firm Enterprise Strategy Group (ESG) called "The Evolution from Data Backup to Data Intelligence" highlighted the data reuse trend. The web-based survey asked 359 North American IT professionals with data protection responsibilities about secondary data reuse; 42% said they were already using secondary data for cybersecurity testing, and another 41% said they use it for application development. Respondents also identified those same areas as having high potential for positively impacting the business.
One of the key findings was there was no universally agreed-upon definition for the term "data management." Christophe Bertrand, senior analyst at ESG and the author of the study, said IT professionals understand "data management" as something they want to do, even if they don't agree on what exactly that entails. Less than one-quarter -- 22% -- said the term referred to optimizing data storage and access, 19% said it was about implementing compliance and privacy processes and 11% said it was about data classification.
"People have a vague recognition of 'data management' as a term, but it's not actually clear for everyone," Bertrand said.
Bertrand said companies know there's a "data management chasm" they need to cross to gain broader business outcomes from their backup data. But the term "data management" has become muddled, making it harder for them to make that leap. Bertrand said this is why he proposed the term "intelligent data management" as a way to describe the category and give a clearer definition of what the technology can accomplish.
Leaping over the chasm
The ESG survey found that many businesses recognize the benefits of clearing that chasm: 29% said reusing secondary data would allow them to achieve greater business agility, 28% said they could lower operational costs and 25% said it would increase their resilience to cyberattacks. Other reported benefits included improved business uptime and being better positioned to handle rapid data growth.
Secondary data reuse for test/dev has been a recognized trend in the backup industry that some vendors have already jumped onto. Actifio has made a conscious effort to sell its copy data management capabilities to the test/dev crowd, even though backup remains its primary use case. Cohesity released an add-on to its flagship DataPlatform called Agile Dev and Test, which provides clean copies of data for test/dev teams to work with.
The survey also identified a strong connection between data protection and data reuse, as 59% of the respondents said data reuse was an extension of their data protection strategy. Another 21% said it was a replacement for data protection.
Bertrand said many backup vendors now have "data management" somewhere in their marketing taglines. He said the recent Veeam buyout shows just how much activity and investor confidence there is in the backup market.
"People are now saying, 'I want to do more,'" Bertrand said of organizations' reaction to data protection.
"There's still some education to be done. People don't see or understand the ROI," Bertrand said.
Unstructured data growth presents classification challenges
People certainly have more data, particularly unstructured data. Data protection vendor Igneous Systems sponsored a survey in which AWS re:Invent attendees were asked about their unstructured data management pain points. Conducted by Connect Marketing in December 2019, the study, titled "Rise of the Data Economy," found that 60% of 157 respondents said they managed more than one billion files. Not surprisingly, 70% of the respondents said managing unstructured data was difficult at that scale.
The study pointed at growth in data-intensive industries such as autonomous vehicles and genetic sequencing as one reason behind the explosion in unstructured data. In the DNA sequencing example, a single machine can generate up to two terabytes per run. The study estimates the industry will sequence 2 billion genomes by 2025, which translates to about 200 exabytes of unstructured data.
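As a rough sanity check of that projection, the 200-exabyte figure implies on the order of 100 GB of sequencing data per genome. That per-genome average is an assumption inferred from the article's numbers, not a figure the study states:

```python
# Back-of-the-envelope check of the 200 EB projection.
# Assumption (not from the article): ~100 GB of raw data per sequenced genome.
GENOMES_BY_2025 = 2_000_000_000   # 2 billion genomes
GB_PER_GENOME = 100               # hypothetical average, in gigabytes
GB_PER_EXABYTE = 1_000_000_000    # 10^9 GB per exabyte (decimal units)

total_exabytes = GENOMES_BY_2025 * GB_PER_GENOME / GB_PER_EXABYTE
print(total_exabytes)  # → 200.0
```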
"Things were not built in the previous generation to scale that high," said Christian Smith, vice president of product at Igneous Systems.
The ESG study found that 62% of respondents classify data through data protection tools, but Smith said tagging petabytes of data only during the backup process is not enough. He believes data classification should be more pervasive, and that data management at a massive scale requires full visibility -- a "dashboard of everything you have."
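A "dashboard of everything you have" starts with a pervasive metadata index that exists outside the backup window. As a minimal sketch (the field names here are hypothetical, not Igneous' actual schema), such an index could be built by walking a file tree and recording classification-relevant metadata for every file:

```python
import os
from pathlib import Path

def build_index(root: str) -> list[dict]:
    """Walk a directory tree and record basic metadata for each file.

    A real classification system would layer content-aware tags on top
    of this; size, age and type alone already support simple queries.
    """
    index = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            p = Path(dirpath) / name
            st = p.stat()
            index.append({
                "path": str(p),
                "size_bytes": st.st_size,
                "mtime": st.st_mtime,           # last-modified timestamp
                "extension": p.suffix.lower(),  # crude file-type tag
            })
    return index
```

With the index in hand, finding archive candidates becomes a query (for example, all files of a given type untouched for a year) rather than a crawl of primary storage.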
Quantum Spatial, a geospatial data analytics firm with offices throughout the United States, became an Igneous customer in January 2019 because its previous manual way of tracking storage ingest was unsustainable. Travis Spurley, a senior system engineer at Quantum Spatial based in the company's Portland, Oregon, office, manages about 10 petabytes of data across primary and secondary storage and sees data grow by about 40 terabytes per week. All of this data is unstructured and includes lidar readings, topographic and water depth readings, and aerial photos taken from company aircraft.
Spurley said tracking all that data was a meticulous and tedious process. An admin would use reports generated by the primary storage platform and look at the differences from week to week to tie storage usage to ongoing survey projects. The inefficiency arose when projects ended: data would sit in primary storage longer than needed before it could be identified for archiving. With up to a dozen projects running concurrently, Spurley said the extra storage costs quickly piled up.
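The manual week-over-week comparison Spurley describes can be sketched as a simple report diff. The report format here is a hypothetical one (a mapping of project name to bytes used), since the article doesn't describe the actual reports:

```python
def weekly_growth(last_week: dict[str, int],
                  this_week: dict[str, int]) -> dict[str, int]:
    """Compare two weekly storage reports; return per-project growth in bytes.

    Projects missing from a report are treated as using zero bytes.
    """
    projects = set(last_week) | set(this_week)
    return {p: this_week.get(p, 0) - last_week.get(p, 0) for p in projects}

# Hypothetical example: one project grew, one went quiet (an archive
# candidate), and one is new this week.
last = {"coastal-lidar": 4_000, "river-bathymetry": 9_000}
now = {"coastal-lidar": 4_500, "river-bathymetry": 9_000, "aerial-survey": 1_200}
growth = weekly_growth(last, now)
stale = [p for p, delta in growth.items() if delta == 0]  # ["river-bathymetry"]
```

The drudgery isn't the arithmetic; it's generating the reports, keeping the project-to-path mapping current and acting on the "stale" list by hand, which is what an automated index removes.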
Spurley said it was ultimately a problem of data lifecycle management leading to unnecessary costs. "Automation and efficiency are the keys to any business," he said.
Since deploying Igneous, Quantum Spatial has had a better grasp on how it handles data lifecycle, data protection and archiving, thanks to its scalable index. Instead of archiving on external hard drives, the company now uses the public cloud. Spurley said Igneous' indexing eventually let him make his organization comfortable with deleting data on-premises once a copy was confirmed to be safely stored in the cloud.
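The confirm-then-delete workflow Spurley describes can be sketched as a checksum comparison before any local deletion. This is a minimal illustration, not Igneous' implementation; the `cloud_checksums` mapping stands in for whatever checksum lookup the object store actually provides:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def safe_delete(path: Path, cloud_checksums: dict[str, str]) -> bool:
    """Delete the local file only if the cloud copy's checksum matches.

    `cloud_checksums` is a stand-in for querying the object store.
    Returns True if the file was verified and deleted, False otherwise.
    """
    expected = cloud_checksums.get(str(path))
    if expected is not None and sha256_of(path) == expected:
        path.unlink()
        return True
    return False  # no verified cloud copy; keep the local data
```

The design point is that deletion is gated on verification, so a failed or partial upload can never orphan the only copy of a dataset.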
Spurley said now that the core indexing component is in place and he has the pervasive data classification that Bertrand described, he is looking forward to applying intelligence to it. He said an analytics capability would let him calculate costs of each project, which would give him useful business insight.
Bertrand said he agrees that intelligent data management requires pervasive classification of data. If data isn't tagged and searchable, then it's not context- or content-aware. The survey found that only 35% of respondents classify all their data and 29% classify only unstructured data. Organizations that classify all their data reported higher confidence when it came to business decisions and compliance audits.
Interestingly, the ESG survey identified the cloud as one of the biggest roadblocks to pervasive data classification. Fifty-seven percent of respondents said the use of public cloud computing services such as SaaS applications and IaaS providers has made it more difficult to classify data. The data generated by these off-premises sources sits in separate silos that are often out of reach of traditional backup. Because 62% of respondents who do classify their data do so through their data protection tools, this is a real challenge.
Finally, lack of ROI and a resistance to change are other reasons why organizations aren't willing to adopt intelligent data management products. One-quarter of respondents said they don't see the benefits of enabling data management, 23% said they didn't want to change their primary storage and 22% said they didn't want to change their current backup deployments.
Bertrand said silos inside organizations can make it hard to see the benefits of intelligent data management. As an example, he said a DevOps engineer might realize that access to clean test data leads to faster development cycles, while someone working in data protection might not.