6 dimensions of data quality boost data performance
Generate accurate data analysis and predictions by mastering the six dimensions of data quality -- accuracy, consistency, validity, completeness, uniqueness and integrity.
Artificial intelligence and machine learning can generate quality predictions and analysis, but the models must first be trained on high-quality data, and that starts with the six dimensions of data quality.
The old adage of computer programming -- garbage in, garbage out -- is just as applicable to today's AI systems as it was to traditional software. Data quality means different things in different contexts, but, in general, good quality data is reliable, accurate and trustworthy.
"Data quality also refers to the business' ability to use data for operational or management decision-making," said Musaddiq Rehman, principal in the digital, data and analytics practice at Ernst & Young.
In the past, ensuring the quality of data meant a team of human beings would fact-check data records, but as the size and number of data sets increase, this approach becomes less practical and harder to scale.
Many companies are starting to use automated tools, including AI, to help with the problem.
By the end of this year, 60% of organizations will leverage machine-learning-enabled data quality technology in order to reduce the need for manual tasks, according to Gartner. To get the most out of these data quality tools, organizations should master the six dimensions of data quality, which helps ensure effective data performance.
Data accuracy is about whether the data in company systems matches the real world or another verifiable source.
"For an accuracy metric to provide valuable insights, there is typically the need for reference data to verify its accuracy," said Rehman.
For example, vendor data could be checked against a third-party data supplier database, or invoice amounts typed into a bookkeeping system could be checked against the paper documents.
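A check against reference data can be sketched in a few lines of Python. The example below is a minimal illustration, not a description of any vendor's tool: the function name `accuracy_report`, the invoice numbers and the amounts are all hypothetical, and it assumes both sources can be loaded as dictionaries keyed by the same record ID.

```python
def accuracy_report(system_records, reference_records, fields):
    """Compare system data to a trusted reference, field by field.

    Returns (share of checked fields that match, list of mismatches).
    Each mismatch is (record_id, field, system_value, reference_value).
    """
    mismatches = []
    checked = 0
    for record_id, reference in reference_records.items():
        record = system_records.get(record_id)  # None if the record is missing
        for field in fields:
            checked += 1
            actual = record.get(field) if record is not None else None
            expected = reference.get(field)
            if actual != expected:
                mismatches.append((record_id, field, actual, expected))
    rate = 1.0 if checked == 0 else (checked - len(mismatches)) / checked
    return rate, mismatches


# Hypothetical data: amounts typed into bookkeeping vs. the paper invoices.
bookkeeping = {"INV-001": {"amount": "120.00"}, "INV-002": {"amount": "89.50"}}
paper_invoices = {"INV-001": {"amount": "120.00"}, "INV-002": {"amount": "98.50"}}

rate, mismatches = accuracy_report(bookkeeping, paper_invoices, ["amount"])
print(rate, mismatches)
```

Here the transposed digits in the second invoice show up as a mismatch, and the accuracy rate gives a simple metric to track over time.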
The biggest bane when it comes to accuracy is human data entry, whether it's employees or customers themselves doing the typing.
"One letter separates the postal abbreviations for Alabama, Alaska and Arkansas," said Doug Henschen, vice president and principal analyst at Constellation Research. "One digit difference in an address or phone number makes the difference in being able to connect to a customer."
Even with all the recent progress in digitizing back-end systems and improving customer-facing interfaces, systems are still vulnerable to errors, he said. Good user interface design can help a lot here.
For example, many customer-facing address forms have a built-in address checker to confirm that an address does, in fact, exist. Similarly, credit card numbers and email addresses can be checked at the time of manual entry.
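As an illustration of entry-time checks, the Python sketch below tests a card number with the Luhn checksum, which payment card numbers use to catch typing errors, and applies a deliberately loose syntax test to an email address. The function names and the 12-to-19-digit length range are assumptions made for this example. Passing either check shows only that the value is well formed, not that the card or mailbox exists.

```python
import re

CARD_PATTERN = re.compile(r"[0-9]{12,19}")               # assumed length range
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose syntax check only


def luhn_valid(card_number: str) -> bool:
    """Return True if the number passes the Luhn checksum."""
    cleaned = card_number.replace(" ", "").replace("-", "")
    if not CARD_PATTERN.fullmatch(cleaned):
        return False
    total = 0
    # Walk the digits from the right, doubling every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_email(text: str) -> bool:
    """Return True if the text has the basic shape name@domain.tld."""
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None


print(luhn_valid("4111 1111 1111 1111"))  # a widely used test number
print(luhn_valid("4111 1111 1111 1112"))  # same number with one digit mistyped
print(looks_like_email("pat@example.com"))
```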
The latest technology on this front is the customer data platform.
"CDPs are primarily designed to resolve identities and tie together information associated with one person to create a single customer record," Henschen said.
But CDPs can also help to ensure accuracy and keep records up to date as customers change jobs, get married and divorced, move, or get new email addresses. Most data quality tools offer functionality to validate addresses and perform other standard accuracy checks.
They can also be used to profile data, so if someone enters something unexpected, the tool sends an alert. IBM, SAP, Ataccama, Informatica and other leaders in Gartner's Magic Quadrant for data quality solutions offer AI-powered data quality rule creation with a self-learning engine.
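One simple form of profiling is to learn the "shape" of values already in a field and flag new entries whose shape is rare or unseen. The Python sketch below is a hypothetical illustration of that idea only, not how any of the vendors named above implement it. The function names, the sample ZIP codes and the 5% threshold are assumptions.

```python
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to its pattern: digits become 9, letters become A."""
    pattern = []
    for char in value:
        if char.isdigit():
            pattern.append("9")
        elif char.isalpha():
            pattern.append("A")
        else:
            pattern.append(char)
    return "".join(pattern)


def build_profile(values):
    """Count how often each shape appears in historical data."""
    return Counter(shape(value) for value in values)


def is_unexpected(value, profile, min_share=0.05):
    """Flag a value whose shape is rare or absent in the profile."""
    total = sum(profile.values())
    if total == 0:
        return True  # nothing to compare against yet
    return profile[shape(value)] / total < min_share


# Hypothetical history of ZIP codes already in the system.
profile = build_profile(["35004", "99501", "72201", "35203"])
print(is_unexpected("71601", profile))  # five digits, like the history
print(is_unexpected("9950l", profile))  # a letter typed in place of a digit
```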
Unfortunately, despite the new technologies coming on board, the accuracy problem is getting worse, not better, according to a survey of nearly 900 data experts released in September by data quality vendor Talend.
For example, the percentage of respondents who said their data was up to date has fallen dramatically over the past year. Only 28% rated their data "very good" in timeliness -- down from 57% in 2021.
The percentage of respondents who rated their data "very good" on accuracy fell as well, from 46% in 2021 to 39% this year.
Consistency means data across all systems reflects the same information and stays in sync across the enterprise.
Consistency can also be a measure of data format-related anomalies, which can be difficult to test and require planned testing across multiple data sets, Rehman said. Different business stakeholders may need to get involved and create a set of standards that would apply for all the data sets, regardless of what business unit they originated in.
"For example, I have changed my address in an organization's database," he said. "That should be reflected across all downstream applications that they support."
Consistency can also extend to external sources.
"A marketing provider may use vendor data from a source, but once we change any of the data records mentioned in the marketing provider database, it will not reflect in that provider's source," Rehman said.
Ensuring consistency can be difficult to do manually, but data quality tools can significantly improve it. Automated systems can correlate data across different data sets or ensure that formats are consistent with company standards.
However, consistency has gotten worse over the past year, according to the Talend survey. In 2021, 40% of respondents rated their data "very good" on consistency. This year, only 32% did the same.
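To make the cross-system check concrete, the Python sketch below compares one field for the same record ID across several systems and reports the records where the systems disagree. The system names, record IDs and addresses are hypothetical, and the normalization step (trimming whitespace and ignoring case) stands in for whatever format standard the business agrees on.

```python
def normalize(value):
    """Apply an assumed company standard: trim, collapse spaces, ignore case."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def find_inconsistencies(systems, field):
    """Return {record_id: {system_name: value}} where systems disagree.

    `systems` maps a system name to a dict of record_id -> record.
    """
    values_by_id = {}
    for system_name, records in systems.items():
        for record_id, record in records.items():
            values_by_id.setdefault(record_id, {})[system_name] = normalize(
                record.get(field)
            )
    return {
        record_id: values
        for record_id, values in values_by_id.items()
        if len(set(values.values())) > 1
    }


# Hypothetical snapshots of two systems holding the same customers.
systems = {
    "crm": {"c1": {"address": "12 Oak St"}, "c2": {"address": "5 Elm Rd"}},
    "billing": {"c1": {"address": "12  oak st "}, "c2": {"address": "7 Pine Ave"}},
}
conflicts = find_inconsistencies(systems, "address")
print(conflicts)
```

In this sample, customer `c1` differs only in spacing and capitalization, so it is treated as consistent, while `c2` has a genuinely different address in billing and is reported.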
Invalid data could throw off any AI trained on that data set, so companies should create a set of systematic business rules to assess validity, Rehman said.
Birth dates are composed of a month, a day and a year. Social Security numbers are nine digits long. U.S. phone numbers begin with a three-digit area code. Unfortunately, in most cases, it's not as simple as deciding on a format for a birth date.
"In many cases business input is required to understand what the required standards are," he said. "These standards may evolve over time and should be monitored on a recurring basis."
Data quality tools designed to ensure accuracy and consistency can also ensure the data is valid. Informatica, for example, offers an API to validate addresses for all countries, formats and languages.
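Business rules of this kind can be written down as small, testable functions. The Python sketch below is a minimal example under stated assumptions: birth dates are stored as `YYYY-MM-DD` and must fall between 1900 and today, and U.S. phone numbers have ten digits with an area code that starts with 2 through 9. The field names, the 1900 cutoff and the accepted phone punctuation are illustrative choices that a real project would settle with business input.

```python
import re
from datetime import date, datetime

PHONE_PATTERN = re.compile(r"\(?[2-9]\d{2}\)?[-. ]?\d{3}[-. ]?\d{4}")


def valid_birth_date(text: str) -> bool:
    """Accept real calendar dates from 1900 up to today (assumed rule)."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return False  # wrong format or impossible date such as Feb. 30
    return date(1900, 1, 1) <= parsed <= date.today()


def valid_us_phone(text: str) -> bool:
    """Accept ten-digit numbers such as (205) 555-0100 or 205-555-0100."""
    return PHONE_PATTERN.fullmatch(text.strip()) is not None


# Hypothetical rule set: field name -> validation function.
RULES = {"birth_date": valid_birth_date, "phone": valid_us_phone}


def validate(record):
    """Return the names of fields that are missing or break a rule."""
    invalid = []
    for field, rule in RULES.items():
        value = record.get(field)
        if not isinstance(value, str) or not rule(value):
            invalid.append(field)
    return invalid


print(validate({"birth_date": "1985-02-30", "phone": "(205) 555-0100"}))
```

Keeping the rules in one table makes it easier to review and update them as standards change.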
Completeness doesn't necessarily mean that every single field is filled in, Rehman said.
"For example, an employee's first name and last name are mandatory, but middle name is optional," he said. "So a record can be considered complete even if a middle name is not available."
Once a company has determined which fields are optional and which aren't, data quality tools can validate the information at point of entry, send up alerts, or use correlation with other data sets to fill in the gaps.
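The mandatory-versus-optional rule translates directly into code. The Python sketch below follows Rehman's example, with first and last name required and middle name optional. The field names, function names and sample records are hypothetical.

```python
MANDATORY_FIELDS = ("first_name", "last_name")  # assumed from the example above
OPTIONAL_FIELDS = ("middle_name",)              # listed for documentation only


def missing_mandatory(record):
    """Return mandatory fields that are absent, None or blank."""
    return [
        field
        for field in MANDATORY_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def is_complete(record) -> bool:
    """A record is complete when no mandatory field is missing."""
    return not missing_mandatory(record)


def completeness_rate(records) -> float:
    """Share of records that are complete (1.0 for an empty list)."""
    if not records:
        return 1.0
    return sum(is_complete(record) for record in records) / len(records)


employees = [
    {"first_name": "Ada", "middle_name": None, "last_name": "Lovelace"},
    {"first_name": "Alan", "middle_name": "Mathison", "last_name": "  "},
]
print(completeness_rate(employees))
print(missing_mandatory(employees[1]))
```

The first record counts as complete despite the empty middle name, while the second is flagged because its last name is blank.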
According to the Talend survey, completeness has also gotten worse over the past year. Only 41% of respondents rated their data "very good" on completeness in 2021. This year, that number dropped to 35%.
Most companies have multiple overlapping sets of data. Even in the case of single data sets, records can be accidentally added more than once.
"There can be clients with five different addresses and no possibility to know which one is correct," said Rehman. "There can be a few vendors with almost the same name in a single database. Customer records may be identical with just minor variations."
Data quality tools can help correlate data across disparate data sets to come up with a single source of truth or flag records for manual review if automated deduplication is too risky or difficult.
Matching, linking and merging data are built-in capabilities of most of the major data quality tools. Some are rules-based or use algorithms or metadata to help with the challenge. More recently, tools are using machine learning to make the process faster and more accurate.
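A rules-based version of matching can be sketched with Python's standard library. The example below normalizes vendor names, scores each pair with `difflib.SequenceMatcher` and returns pairs above a similarity threshold as candidates for manual review rather than merging them automatically. The vendor names and the 0.85 threshold are assumptions for illustration. Comparing every pair also becomes slow on large data sets, so production tools typically narrow the candidates first.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace."""
    stripped = re.sub(r"[^a-z0-9 ]", "", name.casefold())
    return " ".join(stripped.split())


def similarity(a: str, b: str) -> float:
    """Score two names from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(names, threshold=0.85):
    """Return pairs of names similar enough to need a human look."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical vendor list with one near-duplicate.
vendors = ["Acme Supplies Inc.", "ACME Suplies Inc", "Globex Corporation"]
print(likely_duplicates(vendors))
```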
Even if data is consistent, complete, unique and accurate, it doesn't always stay that way. It's touched by different people and moves through different systems.
"Data integrity ensures that all enterprise data can be traced and connected," Rehman said.
Data integrity also affects relationships. If an employee accidentally changes a customer's identification number, then all the associated transaction records might become orphans.
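The orphan scenario is easy to detect after the fact. The Python sketch below lists transactions whose customer ID no longer matches any customer record. The record layout and IDs are hypothetical. In a relational database, a foreign key constraint can prevent this kind of change from happening in the first place.

```python
def find_orphans(transactions, customers):
    """Return transactions that point to a customer ID that does not exist."""
    known_ids = {customer["customer_id"] for customer in customers}
    return [
        transaction
        for transaction in transactions
        if transaction["customer_id"] not in known_ids
    ]


# Hypothetical data: customer C-200 was accidentally re-keyed as C-201.
customers = [{"customer_id": "C-100"}, {"customer_id": "C-201"}]
transactions = [
    {"transaction_id": "T-1", "customer_id": "C-100"},
    {"transaction_id": "T-2", "customer_id": "C-200"},
    {"transaction_id": "T-3", "customer_id": "C-200"},
]
orphans = find_orphans(transactions, customers)
print([transaction["transaction_id"] for transaction in orphans])
```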
Data integrity is also a key aspect of data governance and regulatory compliance. There are big risks to a company if there are unauthorized changes to customer financial or medical records.
Even if data passes all the data quality checks, it might be answering the wrong question.
That happened with one project that Juan Orlandini, chief architect and distinguished engineer at Insight, worked on for a retail company.
"The client thought that the majority of loss was at the self-checkout lanes -- and was intentional," he said.
An AI system was used to detect deliberate theft by the customers using those lanes. On further investigation, the real problem was that older customers weren't accustomed to using the scanners and were simply making honest mistakes.
"So we realized that we had a bad user experience," Orlandini said.
Once that was fixed, the AI system began giving very different results.
"There was still loss," he said. "But it wasn't as significant or pervasive as the retailer thought."
Unfortunately, automated systems, even those powered by the smartest AI, cannot recognize issues that require a deep understanding of how the human world works. There are plenty of experts currently working on the problem of general artificial intelligence.
But until that problem is solved, there's still a role for humans in solving the data quality challenge.