Big data as it is now evolving requires a different mindset within the data management profession, an expert panelist told attendees last week at DAMA International's Enterprise Data World 2014 conference in Austin, Texas. Fast-growing data volumes and feeds turn the status quo "on its head," according to data manager Robert Abate.
"Enterprise information management has always been about making sure you get the right data to the right people at the right time. That included things like master data, reference data, data quality and data governance," said Abate, who is global director for enterprise information management and analytics at paper and consumer products giant Kimberly-Clark Corp.
But the speed, variety and size of the data now being gathered change things, he said. Data scientists tend to emphasize masses of data and are less concerned about the quality of any specific piece of it. Data schemas, once the linchpin of data design, now sometimes give way to schema-less or schema-on-the-fly architectures.
"Big data and the data science associated with analytics is about getting as much data as you can into a 'data lake' [or, an undifferentiated storage layer where data is staged] and then running algorithms against it," he said. In Abate's estimation, data quality is no longer the most important factor in data management.
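To make the schema-on-the-fly idea concrete, here is a minimal Python sketch, not anything Abate or Kimberly-Clark described. It assumes the "data lake" is just a list of raw JSON strings, and the names `RAW_LAKE` and `read_with_schema` are hypothetical. The point it illustrates is that records are stored as they arrive and a schema is applied only when someone reads them.

```python
import json

# Hypothetical "data lake": raw records stored as-is, with no shared schema.
RAW_LAKE = [
    '{"user": "a1", "amount": "19.99", "channel": "web"}',
    '{"user": "b2", "amount": 5, "likes": 12}',
    '{"handle": "@c3", "text": "great product"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only the fields named in `schema`, casts each one, and uses
    None for any field that is missing or cannot be cast.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            try:
                row[name] = cast(record[name])
            except (KeyError, TypeError, ValueError):
                row[name] = None
        rows.append(row)
    return rows

# The same raw data can be read through different schemas by different users.
sales_view = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
social_view = read_with_schema(RAW_LAKE, {"handle": str, "text": str})
```

Nothing is rejected at load time in this sketch; gaps show up as `None` in whichever view asks for a field the record never had.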
One version of the truth, multiplied
Increasingly, the quest for quality data will center on surrounding metadata like that found in data dictionaries and repositories, rather than the data itself, one veteran data architect said. As far as the growing mass of raw data goes, some flexibility is in order, according to Ray McGlew, who is a data management consultant with the MatchPoint Consulting Group and a member of the DAMA International Board of Directors.
Distinguishing among different types of data and prioritizing them has always been part of the data management profession, he said. A slew of new data types makes that flexibility even more important.
"There are key things such as bank transactions where you have to have one version of the truth. But with some other transactions you have to be more flexible or more aware of the nature of the data," said McGlew, pointing to Twitter and Facebook data as examples.
What in fact is needed is to pull together a single version of the definition of that data -- the metadata, McGlew said.
There are cases where a "customer" is defined as an individual with an identified account in, say, a bank -- but now that may appear along with the "customer" derived from information from a social media site.
"One's more amorphous than the other," McGlew said. "You're not going to get a single number on that person but you are going to get a series of numbers." In other words, you get trend analysis -- not one number.
Jumping without a schema?
The new big data architectures -- the ones that more than occasionally cast off data schemas and blueprints -- can loom as a disruptive challenge for today's data shops. It is still too early to say just how those shops will proceed.
"We are just setting out on that," said Ian Wood, director of data analysis at Fidelity Investments, who co-led a conference session that recommended putting master data management processes ahead of off-the-shelf master data management tools when bringing consistency to corporate data.
Given his group's structured environment, he said, the road to schema-less and schema-on-the-fly methods may be hard. But a journey has begun.
"We have met with business users and we gathered what the future requirements are. We've met with our technology development teams and gathered what their thoughts are on this. But I don't honestly know what direction we are going to go into yet," Wood said.
He did say, however, that whatever approach his group adopts would likely take organizational structures and procedures into account, rather than "just jumping in with a technology solution."
It's more than just Hadoop
Among themes emerging from the EDW 2014 event was the notion that big data is much more than the data found in the bubbling cauldron called Hadoop. For example, Abate said he considers Hadoop a mere "file store."
It's just a file store until it becomes better integrated with SQL-style database technology, said Abate, who helped lead big data and business intelligence efforts for Wal-Mart Stores Inc.'s Sam's Club operations before recently moving to Kimberly-Clark.
"When the 'Impalas' come out, that can make Hadoop into a database -- that is a value," he said, referring to an emerging class of SQL-on-Hadoop tools that are exemplified by the Cloudera Inc. Impala SQL query engine that runs in Hadoop.
Even when traditional data warehouses are used, the invasion of diverse data may call for a new approach to data quality, he said. There is a new dichotomy at work, in which an able data scientist can work with large amounts of data while understanding that the data's quality may not be absolute.
"You tend to run queries that ignore some information so you can find outliers easily. You look to see if you can find patterns," Abate said. "You will effectively do data quality on the fly."
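One way to picture "data quality on the fly" is a query that skips unusable values as it runs and then flags outliers in what remains. The Python sketch below is an illustration under stated assumptions, not Abate's method: the function names are hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with a 3.5 cutoff) is simply one common choice.

```python
from statistics import median

def to_number(value):
    """Return value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def query_with_quality_on_the_fly(records, field, threshold=3.5):
    """Skip unusable rows at query time, then flag outliers among the rest."""
    usable = []
    skipped = 0
    for record in records:
        number = to_number(record.get(field))
        if number is None:
            skipped += 1  # ignore the bad value instead of cleansing it first
        else:
            usable.append(number)
    if not usable:
        return {"skipped": skipped, "usable": 0, "outliers": []}

    center = median(usable)
    mad = median([abs(v - center) for v in usable])
    if mad == 0:
        outliers = []  # no spread to measure against, so flag nothing
    else:
        # Modified z-score: 0.6745 * |x - median| / MAD
        outliers = [v for v in usable
                    if 0.6745 * abs(v - center) / mad > threshold]
    return {"skipped": skipped, "usable": len(usable), "outliers": outliers}

# Hypothetical raw feed with missing, malformed and extreme values.
feed = [
    {"amount": "10"}, {"amount": "11"}, {"amount": "9"}, {"amount": "10.5"},
    {"amount": "n/a"}, {"amount": None}, {"amount": "500"}, {},
]
result = query_with_quality_on_the_fly(feed, "amount")
```

The source data is never corrected here; the quality decision is made per query, which is the trade-off the quote describes.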
That idea may provoke controversy among some data professionals while sounding familiar to others. Over many years, practitioners like those at Enterprise Data World have found new ways to govern, curate and steward data. In the face of more big data, the goals of data quality and flexibility may align in still newer ways.