Data quality can be a paradox. It is both simple in its essentials and tricky in its specifics. A data quality management program brings discipline to those specifics so that downstream processes can rely on the data.
Data quality refers to the condition where data is accurate, consistent, complete and up to date. However, the idea of quality also depends on context. Different tasks or applications require different kinds of data and, consequently, different standards of quality.
The fitness of data for one application doesn't automatically translate to its value in another scenario. For example, a list of customer names and addresses may be of good quality for marketing campaigns, but not good enough for the purpose of tracking customer sales history.
There is no single baseline of quality that works for every use. A data set of credit card transactions, complete with cancelled transactions and verification errors, may be just too messy for sales analysis, but that messiness is exactly what the fraud analysis team wants to see.
The most accurate assessment of data quality is simply this: Is the data fit for the current purpose?
Let's look at some practical steps for a data quality management process. The goal isn't universally perfect data. Instead, aim for processes that can deliver high-quality, reliable data across the enterprise. The following five key steps form a comprehensive data quality management process.
Step 1. Data quality assessment
This is the starting point. Within the organization, invested parties, from business units to IT to the chief data officer, should understand the current state of data in the system. The data management team should check for errors, duplicates or missing entries. Evaluate the accuracy, consistency and completeness of the data collected. Use techniques like data profiling to examine data closely to understand its content and structure. This step sets the foundation for all following data quality processes.
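As a rough sketch of what an automated profiling check might look like, the following hypothetical example counts duplicate rows and missing values in a list of record dicts. The `profile` helper and the sample customer data are illustrative assumptions, not part of any specific tool:

```python
from collections import Counter

def profile(records, required_fields):
    """Summarize duplicate rows and missing values in a list of record dicts."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for rec in records:
        key = tuple(sorted(rec.items()))  # a row's full content as a hashable key
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        for field in required_fields:
            if not rec.get(field):        # absent or empty counts as missing
                missing[field] += 1
    return {"rows": len(records), "duplicates": duplicates, "missing": dict(missing)}

customers = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},   # exact duplicate
    {"name": "Grace", "email": ""},                # missing email
]
report = profile(customers, required_fields=["name", "email"])
# report -> {'rows': 3, 'duplicates': 1, 'missing': {'email': 1}}
```

A report like this gives every invested party the same concrete picture of the data's current state before any strategy work begins.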
Step 2. Data quality strategy development
Create a data quality strategy that outlines the methods and procedures to improve and maintain data quality. It's a blueprint that defines the use cases for data, the quality needed for each use case, and the rules for data collection, storage and processing. Decide on the tools and technologies to use, which may range from internally written scripts to feature-rich data quality tools. This is also the time to outline how to handle errors or discrepancies when they arise.
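One lightweight way to make such a strategy concrete is to encode the per-use-case rules as data rather than burying them in scripts. This hypothetical sketch assumes two use cases and a made-up `min_completeness` threshold for each:

```python
# Hypothetical per-use-case quality rules: each use case names the fields it
# needs and the minimum completeness rate it can tolerate.
QUALITY_RULES = {
    "marketing": {
        "required_fields": ["name", "address"],
        "min_completeness": 0.90,
    },
    "sales_history": {
        "required_fields": ["customer_id", "order_id", "amount"],
        "min_completeness": 0.99,
    },
}

def rule_for(use_case):
    """Look up the quality rule for a use case, failing loudly if undefined."""
    if use_case not in QUALITY_RULES:
        raise KeyError(f"No data quality rule defined for use case: {use_case}")
    return QUALITY_RULES[use_case]
```

Keeping the rules in one declarative place means both internally written scripts and commercial tools can enforce the same standards.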
Step 3. Initial data cleansing
In this step, organizations take the first action to improve data. Clean, prepare and correct the data to remove inaccuracies identified during the assessment stage. Data cleansing activities include removing duplicate entries, attempting to complete missing data and rectifying inconsistencies between data sets. Begin the process of data quality management from the best possible starting place.
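A minimal sketch of those three cleansing activities, assuming records arrive as dicts and that safe default values are known for some fields (the `cleanse` helper and sample rows are illustrative):

```python
def cleanse(records, defaults):
    """Trim stray whitespace, fill missing fields and drop exact duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        # Rectify a common inconsistency: stray whitespace around values.
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        # Attempt to complete missing data from known defaults.
        for field, value in defaults.items():
            if not rec.get(field):
                rec[field] = value
        key = tuple(sorted(rec.items()))
        if key not in seen:  # keep only the first copy of a duplicate row
            seen.add(key)
            cleaned.append(rec)
    return cleaned

raw = [
    {"name": " Ada ", "country": ""},
    {"name": "Ada", "country": "UK"},
]
cleansed = cleanse(raw, defaults={"country": "UK"})
# After trimming and filling defaults both rows match, so one record remains.
```

Note the order matters: normalizing and completing records first lets the duplicate check catch rows that only looked different because of formatting noise.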
Step 4. Data quality implementation
Now, put the strategic plan into effect and apply the data quality strategy to improve the way data is handled across the organization. The aim is to integrate data quality rules and standards into everyday business processes. Organizations should train teams on the new data quality practices. This may require modifications to existing workflows to include data quality checks. Done correctly, data quality management should become a self-correcting, continuous process.
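As one possible shape for such an embedded check, this hypothetical ingestion step validates each record against required fields and routes failures to a review queue instead of silently loading them; the function names are assumptions for illustration:

```python
def validate_record(rec, required_fields):
    """Return a list of quality problems; an empty list means the record passes."""
    return [f"missing {field}" for field in required_fields if not rec.get(field)]

def ingest(records, required_fields):
    """A workflow step with an embedded quality gate: accept clean records and
    route problem records to a review queue for correction."""
    accepted, review_queue = [], []
    for rec in records:
        problems = validate_record(rec, required_fields)
        if problems:
            review_queue.append({"record": rec, "problems": problems})
        else:
            accepted.append(rec)
    return accepted, review_queue

accepted, queued = ingest(
    [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": ""}],
    required_fields=["id", "email"],
)
# One record is accepted; the other is queued with the reason 'missing email'.
```

Because every bad record carries its list of problems, the review queue itself becomes feedback that keeps the process self-correcting.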
Step 5. Data quality monitoring
The final step is to monitor this ongoing process. Data quality management isn't a one-time event. Organizations need to regularly track and review data quality to ensure they consistently maintain standards. Regular audits, reports and dashboard reviews provide visibility into the quality of data over time.
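A monitoring job can be as simple as recording a quality metric each period and flagging any period that falls below the agreed standard. This sketch assumes a completeness metric and made-up quarterly scores:

```python
def completeness(records, field):
    """Share of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for rec in records if rec.get(field))
    return filled / len(records)

def monitor(history, threshold):
    """Return the periods whose recorded score fell below the standard."""
    return [period for period, score in history if score < threshold]

# Scores a scheduled job might have recorded for the 'email' field over time.
history = [("2024-Q1", 0.97), ("2024-Q2", 0.95), ("2024-Q3", 0.88)]
alerts = monitor(history, threshold=0.90)
# alerts -> ['2024-Q3']
```

The same per-period scores can feed the dashboards and audit reports that give stakeholders visibility over time.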
A good data quality management process is all about understanding current data, creating a plan for continuous quality in the workflow and monitoring the process carefully. Data quality isn't about achieving perfection. It's about ensuring data is fit for purpose.
AI poses new challenges
AI and machine learning (ML) are at the top of many agendas. This area of technology adds a new layer of complexity to data quality.
In AI and ML, the quality of data used for training the models can make or break the project. The model's performance depends on the accuracy, completeness and freedom from bias of the training data. If the training data is flawed, the model will produce flawed results.
The volume of data is another challenge. Models often require vast amounts of high-quality data for training. It can be a significant task to ensure the quality of such large data sets.
Organizations may need to collect and prepare data specifically for AI and ML projects. They might have to gather new data, transform existing data or augment data to make it more suitable for AI and ML use cases. Attention should be paid to avoid bias and ensure diversity in the data. Existing data sets may not be complete enough, or diverse enough, to deliver the results needed for the future.
As part of this, implement specific validation checks for AI and ML training data. This might require that data teams check the data for bias, ensure diversity or verify that the data correctly represents the problem space the AI or ML model is expected to address.
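One crude but concrete validation of this kind is a class-balance check on the training labels, since a heavily skewed label distribution is a common source of model bias. The `label_balance` helper and the 3x tolerance are illustrative assumptions:

```python
from collections import Counter

def label_balance(labels, max_ratio=3.0):
    """Check whether any class outnumbers another by more than max_ratio,
    a crude proxy for the imbalance that can bias a trained model."""
    counts = Counter(labels)
    most = max(counts.values())
    least = min(counts.values())
    return {"counts": dict(counts), "balanced": most / least <= max_ratio}

# A fraud-detection training set where legitimate transactions dominate.
labels = ["fraud"] * 5 + ["legit"] * 95
check = label_balance(labels)
# check['balanced'] -> False: 95 vs. 5 far exceeds the 3x tolerance.
```

Checks like this don't replace domain review of whether the data represents the problem space, but they catch the most mechanical gaps before training begins.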