Data quality processes continue to become more prominent in organizations, often as part of data governance programs. For many companies, the growing interest in quality is commensurate with an increased need to ensure that analytics data is trustworthy and usable.
That's especially true with data quality for big data; more data usually means more data problems -- including data usability issues that should be part of the quality discussion.
One of the main challenges of effective data quality management is articulating what quality really means to a company. Commonly cited dimensions of data quality include accuracy, consistency, timeliness and conformity.
But there are many different lists of dimensions, and even some common terms carry different meanings from list to list. As a result, relying solely on a particular list, without an underlying foundation for what you're trying to accomplish, is a simplistic approach to data quality.
This challenge becomes more acute with big data. In Hadoop clusters and other big data systems, data volumes are exploding and data variety is increasing. An organization might accumulate data from numerous sources for analysis -- for example, transaction data from different internal systems, clickstream logs from e-commerce sites and streams of data from social networks.
Time to rethink things
In addition, the design of big data platforms exacerbates the potential problems. A company might create data in on-premises servers, syndicate it to cloud databases and distribute filtered data sets to systems at remote sites. This new world of processing data creates issues that aren't covered in conventional lists of data quality dimensions.
To compensate, reexamine what is meant by quality in the context of a big data analytics environment. Too often, people equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values and objects that aren't accurate or up to date.
But the task of managing data quality for big data is also likely to include measures designed to help data scientists and other analysts figure out how to effectively use what they have. In other words, they must transition from simply generating a black-and-white specification of good versus bad data to supporting a spectrum of data usability.
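To make the idea of a usability spectrum concrete, here is a minimal sketch, assuming a few hypothetical quality measurements (completeness, validity, freshness) are already available for a data set. Instead of a binary good/bad verdict, the checks are combined into a weighted score that analysts can use to judge fitness for purpose:

```python
from dataclasses import dataclass

@dataclass
class DataSetProfile:
    """Hypothetical per-data-set measurements, each a fraction from 0.0 to 1.0."""
    completeness: float  # share of non-null values in required fields
    validity: float      # share of values passing format/range checks
    freshness: float     # 1.0 = updated within the expected window

def usability_score(profile: DataSetProfile,
                    weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Blend discrete checks into a graded usability score
    rather than a black-and-white pass/fail result."""
    w_complete, w_valid, w_fresh = weights
    return round(
        w_complete * profile.completeness
        + w_valid * profile.validity
        + w_fresh * profile.freshness,
        3,
    )

profile = DataSetProfile(completeness=0.95, validity=0.8, freshness=0.5)
print(usability_score(profile))  # prints 0.8
```

The weights here are placeholders; in practice they would reflect how much each dimension matters to the analytics use case at hand.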
The usability side of data quality
A focus on usability is aimed at boosting the degree to which validated data can contribute to actionable analytics results while reducing the risk of the data being misinterpreted or used improperly. Let's look at some aspects of efforts to improve big data usability that could be incorporated into data quality initiatives.
Findability. No matter how clean data sets stored in a data lake are, they won't be usable if analysts don't know they exist and can't access them. The ability to find relevant information can be increased if there's a well-organized data catalog that includes qualitative descriptions of what the data sets contain, where they're located and how to access the data.
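The catalog idea above can be sketched as a searchable collection of metadata entries. This is a toy in-memory version with hypothetical data set names and storage paths; a real deployment would use a dedicated catalog service, but the principle of keyword search over descriptions, names and tags is the same:

```python
# Minimal in-memory data catalog: each entry records what a data set
# contains, where it lives and how it is tagged for discovery.
catalog = [
    {
        "name": "clickstream_2023",
        "description": "Raw clickstream logs from the e-commerce site",
        "location": "s3://datalake/raw/clickstream/2023/",  # hypothetical path
        "tags": {"web", "events", "raw"},
    },
    {
        "name": "orders_curated",
        "description": "Cleansed order transactions from internal systems",
        "location": "s3://datalake/curated/orders/",  # hypothetical path
        "tags": {"transactions", "curated"},
    },
]

def find_data_sets(keyword: str) -> list:
    """Return catalog entries whose name, description or tags match the keyword."""
    kw = keyword.lower()
    return [
        entry for entry in catalog
        if kw in entry["name"].lower()
        or kw in entry["description"].lower()
        or kw in {tag.lower() for tag in entry["tags"]}
    ]

for entry in find_data_sets("clickstream"):
    print(entry["name"], "->", entry["location"])
```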
Conformance to reference models. A reference data model provides a logical description of data entities, such as customers, products and suppliers. But unlike a conventional relational model, it abstracts the attributes and properties of the entity and its relationships, allowing for mappings among different physical representations in SQL, XML, JSON and other formats. When accumulating, processing and disseminating entity data among a cohort of source and target systems, conforming to a reference model helps ensure a consistent representation and semantic interpretation of data.
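The steps above can be sketched in code: define one canonical set of attributes for an entity, map each source system's column names onto it, and render the canonical record in whichever physical format a target system needs. The entity, attribute names and source mapping below are illustrative assumptions:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical logical reference model for a "customer" entity: one set of
# canonical attribute names, independent of any physical format.
CUSTOMER_MODEL = ("customer_id", "name", "email")

def crm_row_to_canonical(row: dict) -> dict:
    """Map a source system's column names onto the reference model."""
    mapping = {"cust_no": "customer_id", "full_name": "name", "mail": "email"}
    return {mapping[col]: value for col, value in row.items() if col in mapping}

def customer_to_json(record: dict) -> str:
    """Render a canonical customer record as JSON."""
    return json.dumps({attr: record[attr] for attr in CUSTOMER_MODEL})

def customer_to_xml(record: dict) -> str:
    """Render the same canonical record as XML."""
    root = ET.Element("customer")
    for attr in CUSTOMER_MODEL:
        ET.SubElement(root, attr).text = str(record[attr])
    return ET.tostring(root, encoding="unicode")

row = {"cust_no": "C-42", "full_name": "Ada Lovelace", "mail": "ada@example.com"}
canonical = crm_row_to_canonical(row)
print(customer_to_json(canonical))
print(customer_to_xml(canonical))
```

Because every source and target maps through the same canonical attributes, the semantic interpretation of a "customer" stays consistent no matter which physical representation a given system uses.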
Data synchronization. In many big data environments, data sets are likely to be replicated between different platforms. As opposed to simply generating data extracts for particular users, replication synchronizes data among all the replicas. When a change is made to one copy, the modification is propagated to the others. An automated workflow of that sort helps enforce consistency and uniformity on shared data, thereby increasing its usability.
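The propagation behavior described above can be illustrated with a toy replicated store, under the simplifying assumptions that all replicas are reachable and writes never conflict. A change applied to one copy is pushed to every connected peer so all replicas converge to the same state:

```python
# Toy sketch of replica synchronization: a change made to one copy
# is propagated to the others so shared data stays consistent.
class ReplicatedStore:
    def __init__(self, name: str):
        self.name = name
        self.data: dict = {}
        self.peers: list = []

    def connect(self, other: "ReplicatedStore") -> None:
        """Register two stores as replication peers of each other."""
        self.peers.append(other)
        other.peers.append(self)

    def put(self, key, value) -> None:
        """Apply a change locally, then propagate it to every peer."""
        self._apply(key, value)
        for peer in self.peers:
            peer._apply(key, value)

    def _apply(self, key, value) -> None:
        # Local write only; does not re-propagate, avoiding loops.
        self.data[key] = value

on_prem = ReplicatedStore("on_prem")
cloud = ReplicatedStore("cloud")
on_prem.connect(cloud)

on_prem.put("customer:42:email", "ada@example.com")
print(cloud.data)  # the on-premises change is visible in the cloud replica
```

Real replication workflows must also handle conflicts, ordering and unreachable replicas, which this sketch deliberately omits.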
Making data identifiable. The precision of big data analytics applications can be enhanced by making it easier to accurately identify entity data so relevant data sets from different sources can be linked together for analysis. Therefore, ensuring that data is identifiable across its entire lifecycle -- from creation or capture to ingestion, integration, processing and production use -- should be a core facet of data quality for big data.
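As a small illustration of why identifiability matters, here is a sketch that links records from two hypothetical sources on a shared, stable entity identifier. Records that carry the identifier can be joined for analysis; records that lack a match simply drop out:

```python
# Sketch: linking records from two sources on a stable entity identifier,
# so analyses can combine data that originated in different systems.
crm_records = [
    {"customer_id": "C-42", "segment": "enterprise"},
    {"customer_id": "C-43", "segment": "smb"},
]
web_events = [
    {"customer_id": "C-42", "page_views": 120},
    {"customer_id": "C-99", "page_views": 7},  # no matching CRM record
]

def link_on_id(left: list, right: list, key: str = "customer_id") -> list:
    """Inner-join two record lists on a shared identifier."""
    index = {record[key]: record for record in right}
    return [
        {**left_rec, **index[left_rec[key]]}
        for left_rec in left
        if left_rec[key] in index
    ]

linked = link_on_id(crm_records, web_events)
print(linked)  # only C-42 appears in both sources
```

If the identifier were lost or mangled at any point in the lifecycle, the join above would silently miss matches, which is exactly the usability risk the text describes.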
Automated tools. AI and self-service analytics tools bolster the ability to democratize data and ensure its quality, governance and security. Implementing AI or analytics tools can help boost data quality and increase trust in the data as it spreads throughout the organization.
Self-service BI tools allow people to get the results they need when they need them, without having to wait for reports or analysis from someone else. These tools, bolstered by natural language processing, are continually evolving to become more user friendly and to help democratize data across an organization. When reports and analysis are needed, follow best practices for data visualization to communicate findings clearly and concisely, making it easier for people to get the answers they need.
Reconsidering how data quality is framed in a big data environment to incorporate data usability measures broadens the scope of data quality processes and priorities. Alongside traditional data profiling and cleansing procedures, a usability-driven focus on data preparation, cataloging and curation offers a bigger-picture view of big data quality for organizations that are expanding their repositories -- and their use -- of analytics data.