Sergej Khackimullin - Fotolia

Tip

Make data usability a priority on data quality for big data

To help make big data analytics applications more effective, IT teams must augment conventional data quality processes with measures aimed at improving data usability for analysts.

David Loshin

By

David Loshin, Knowledge Integrity Inc.

Published: 17 Oct 2022

Data quality processes continue to become more prominent in organizations, often as part of data governance programs. For many companies, the growing interest in quality is commensurate with an increased need to ensure that analytics data is trustworthy and useable.

That's especially true with data quality for big data; more data usually means more data problems -- including data usability issues that should be part of the quality discussion.

One of the main challenges of effective data quality management is articulating what quality really means to a company. What are commonly referred to as the dimensions of data quality include accuracy, consistency, timeliness and conformity.

But there are many different lists of dimensions, and even some common terms have different meanings from list to list. As a result, solely relying on a particular list without having an underlying foundation for what you're looking to accomplish is a simplistic approach to data quality.

This challenge becomes more acute with big data. In Hadoop clusters and other big data systems, data volumes are exploding and data variety is increasing. An organization might accumulate data from numerous sources for analysis -- for example, transaction data from different internal systems, clickstream logs from e-commerce sites and streams of data from social networks.

Time to rethink things

In addition, the design of big data platforms exacerbates the potential problems. A company might create data in on-premises servers, syndicate it to cloud databases and distribute filtered data sets to systems at remote sites. This new world of processing data creates issues that aren't covered in conventional lists of data quality dimensions.

To compensate, reexamine what is meant by quality in the context of a big data analytics environment. Too often, people equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values and objects that aren't accurate or up to date.

But the task of managing data quality for big data is also likely to include measures designed to help data scientists and other analysts figure out how to effectively use what they have. In other words, they must transition from simply generating a black-and-white specification of good versus bad data to supporting a spectrum of data usability.

Different aspects of data usability

The usability side of data quality

A focus on usability is aimed at boosting the degree to which validated data can contribute to actionable analytics results while reducing the risk of the data being misinterpreted or used improperly. Let's look at some aspects of efforts to improve big data usability that could be incorporated into data quality initiatives.

Findability. No matter how clean data sets stored in a data lake are, they won't be usable if analysts don't know they exist and can't access them. The ability to find relevant information can be increased if there's a well-organized data catalog that includes qualitative descriptions of what the data sets contain, where they're located and how to access the data.

Too often, people equate the concept of data quality with discrete notions such as data correctness or currency, putting in place processes to fix data values and objects that aren't accurate or up to date.

Conformance to reference models. A reference data model provides a logical description of data entities, such as customers, products and suppliers. But unlike a conventional relational model, it abstracts the attributes and properties of the entity and its relationships, allowing for mappings among different physical representations in SQL, extensible markup language, JSON and other formats. When accumulating, processing and disseminating entity data among a cohort of source and target systems, conforming to a reference model helps ensure a consistent representation and semantic interpretation of data.

Data synchronization. In many big data environments, data sets are likely to be replicated between different platforms. As opposed to simply generating data extracts for particular users, replication synchronizes data among all the replicas. When a change is made to one copy, the modification is propagated to the others. An automated workflow of that sort helps enforce consistency and uniformity on shared data, thereby increasing its usability.

Making data identifiable. The precision of big data analytics applications can be enhanced by making it easier to accurately identify entity data so relevant data sets from different sources can be linked together for analysis. Therefore, ensuring that data is identifiable across its entire lifecycle -- from creation or capture to ingestion, integration, processing and production use -- should be a core facet of data quality for big data.

Automated tools. AI and self-service analytics tools bolster the ability to democratize data and ensure its quality, governance and security. The implementation of AI or analytics tools can help boost the quality of data and improve the data trust when spreading it throughout the organizations.

Self-service BI tools allow people to get the results they need when they need them without having to wait for reports or analysis from someone else. These efforts, bolstered by natural language processing, are continually evolving to become more user friendly and help democratize data across an organization. When reports and analysis are needed, follow best practices for data visualization to communicate those findings clean and concisely to make it easier for people to get the answers they need.

Reconsidering how data quality is framed in a big data environment to incorporate data usability measures broadens the scope of data quality processes and priorities. Alongside traditional data profiling and cleansing procedures, a usability-driven focus on data preparation, cataloging and curation offers a bigger-picture view of big data quality for organizations that are expanding their repositories -- and their use -- of analytics data.

Next Steps

Top trends in big data for 2023 and beyond

Top big data tools and technologies to know about

Dig Deeper on Data governance

Search Business Analytics

What makes an effective data science team structure?
Data science team structures vary in strength, and their success depends on how roles and leadership align with business goals to...
Synthetic data vs. real data for predictive analytics
Synthetic data helps simulate rare events and meet privacy compliance, while real data preserves natural variability needed to ...
7 predictive analytics skills to improve simulation modeling
Predictive analytics skills such as statistical analysis, data preprocessing and model evaluation can help data professionals ...

Search AWS

Compare Datadog vs. New Relic for IT monitoring in 2024
Compare Datadog vs. New Relic capabilities including alerts, log management, incident management and more. Learn which tool is ...
AWS Control Tower aims to simplify multi-account management
Many organizations struggle to manage their vast collection of AWS accounts, but Control Tower can help. The service automates ...
Break down the Amazon EKS pricing model
There are several important variables within the Amazon EKS pricing model. Dig into the numbers to ensure you deploy the service ...

Search Content Management

How to remove digital signatures from a PDF
Digital signatures let organizations execute and secure agreements, but users can remove them if they need to reformat documents ...
The top 10 RFP response software
As B2B organizations grow, the RFP response process can become too time-consuming for manual workflows. Top tools, such as Loopio...
CRM vs. CMS: How they differ and how to integrate them
CMSes and CRM systems serve different purposes, but together, they can help organizations improve customer data management and ...

Search Oracle

Click-to-launch tools pull apps through Oracle Cloud Infrastructure marketplace
Oracle has made it easier for customers to choose and launch third-party software onto its cloud. Now, the question is whether ...
Willis develops app to put a personal touch back in voluntary benefits
Part two of a two-part article: Willis uses PeopleSoft 9.1 to bring back the personal feel to automated insurance selection for ...
Willis develops app for real-time voluntary benefit selection
Part one of a two-part article: Willis uses PeopleSoft 9.1 to create real-time automated insurance selection for voluntary ...

Search SAP

AI tackles customizations in SAP clean core migrations
New tools from Lemongrass and Kyndryl are designed to find custom code and figure out what to do with it when moving to a clean ...
SAP agrees to allow Celonis data access until case resolved
SAP agrees to allow Celonis customers to access data from its systems as their legal battle continues, but customers will be best...
Grow with SAP fuels Phoenix Global's digital transition
Phoenix Global implemented S/4HANA Cloud via Grow with SAP to replace outdated systems, digitize manual processes and enable AI ...

Close