Unstructured data analysis is critical, but difficult

Without the right data analysis and preparation tools, it can be hard to tell whether a big data implementation is simply producing garbage out.

Big data is easy, right? All you have to do is build a Hadoop cluster, link it to all of your databases, hire a good data scientist who knows MySQL and you're set, right? Wrong. Many enterprises have taken exactly that approach, and virtually none of them have succeeded. Instead, they have spent considerable sums to find out the hardware and software are the easy parts of big data. What's hard is the data in big data.

By that, I mean you may have lots of business data and access to many other sources of public data, but that doesn't mean it is usable data. "What about this notion of unstructured data analysis?" you ask. "Wasn't big data all about using unstructured data along with structured data?"

Yes, but unstructured data analysis does not mean analysis of uncurated data. Data, to be usable, must be mapped to a common data lake. It must be cleansed of unusable garbage that is either irrelevant or in error. Without data preparation, the old adage of garbage in, garbage out is in full force. However, with big data and the associated analytics, it is often impossible to tell if you have garbage out simply because the analysis is so complex and the data lake so vast. In fact, where big data implementations have gone awry, it is the delivery of incomplete or inaccurate conclusions that has sealed their fate. All it takes is one flawed recommendation based on a big data analysis to destroy trust in the technology.
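To make the idea of preparation concrete, here is a minimal Python sketch of the two steps described above: mapping records from different sources onto common field names, then dropping records that are irrelevant or in error. The field names (`cust_id`, `amt` and so on), the alias table and the rejection rules are all hypothetical, chosen only for illustration; real rules depend entirely on the data at hand.

```python
import math
from typing import Iterable, Optional

# Hypothetical mapping from source-specific field names to a common schema.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "amt": "amount",
    "Amount": "amount",
}


def normalize(record: dict) -> dict:
    """Rename source-specific fields to the common schema names."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def clean(record: dict) -> Optional[dict]:
    """Return a cleaned record, or None if the record is unusable."""
    rec = normalize(record)

    # Irrelevant: a record that cannot be tied to a customer is dropped.
    raw_id = rec.get("customer_id")
    customer_id = "" if raw_id is None else str(raw_id).strip()
    if not customer_id:
        return None

    # In error: the amount is missing, not numeric, or not a finite number.
    try:
        amount = float(rec.get("amount"))
    except (TypeError, ValueError):
        return None
    if not math.isfinite(amount):
        return None

    return {"customer_id": customer_id, "amount": amount}


def prepare(records: Iterable[dict]) -> tuple:
    """Clean every record; return the usable ones and a count of rejects."""
    kept, rejected = [], 0
    for record in records:
        cleaned = clean(record)
        if cleaned is None:
            rejected += 1
        else:
            kept.append(cleaned)
    return kept, rejected


# Illustrative input from three imaginary sources with inconsistent fields.
raw = [
    {"cust_id": " 42 ", "amt": "19.99"},
    {"CustomerID": "7", "Amount": "n/a"},
    {"customer_id": "", "amount": "5"},
    {"customer_id": "9", "amount": 3},
]
kept, rejected = prepare(raw)
print(len(kept), "usable records,", rejected, "rejected")
```

Counting the rejects, rather than silently discarding them, is a deliberate choice in this sketch: a sudden jump in the reject count is one of the few early signs that a new data source is feeding garbage in.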

Enterprises work to overcome analysis paralysis

Companies contemplating big data must first assume their data is in serious need of preparation. This can be a time-intensive and expensive undertaking. And it isn't done once: It is done every time a new data set is introduced into the data lake. Stratecast surveys indicate as much as 60% to 80% of a business analyst's time can be spent simply cleaning data or interpreting results that are delivered by queries to a data lake.

Is there any way to reduce this overhead? Yes, there are utilities that can be used to manage and automate data cleansing. IBM, for example, has invested heavily in data-cleansing technology and has several applications in its analytical suite. Another company that has taken a lead in data cleansing is Paxata. Paxata's approach is built on a data preparation platform that front-ends analysis and ensures analytics are only applied to curated data, regardless of who in the organization is submitting the queries.

In any case, without a fanatical focus on data integrity and unstructured data analysis, big data implementations are unlikely to return much value to the organization. With a clean data set, however, big data and advanced analytics can provide a significant competitive advantage -- one that can translate into higher revenues and reduced cost of operations. The key, though, is to start with the data, then think big data.

This was last published in November 2016
