Data quality is critical for any organization that relies on data to run its operations, and maintaining it starts with mitigating the data quality challenges that lead to inaccurate or misleading analytics results.
Seventy-seven percent of 500 information services and data professionals said they had issues with data quality, and 91% said that data quality issues were affecting company performance, according to a survey conducted earlier this summer by Pollfish on behalf of open source data tool Great Expectations.
Poor data quality directly costs the average organization $12.9 million a year, Gartner estimated last year. It also increases the complexity of data ecosystems and leads to poor decision-making.
Data professionals are spending 40% of their time checking data quality, according to a survey released last month by Wakefield Research on behalf of data firm Monte Carlo.
But companies are working on the problem -- 88% of companies were already investing in data quality solutions or planned to invest in the next six months, according to the survey.
In the past, companies worked with less data or relied on layers of manual processes, in which they could catch and remediate errors, or make decisions without relying on analytics.
But, today, companies are pivoting to becoming data-driven enterprises and are heavily investing in automation, analytics and AI. These systems all require high-quality data.
"If you have good data coming in, then good models come out," said Daniela Moody, former VP of AI at Arturo, a company that uses AI to help insurance companies speed up their underwriting, quoting, claims and policy renewals.
Arturo built models to detect features like driveways, swimming pools and basketball courts based on aerial photos to determine the number and type of buildings and to evaluate roof conditions. For the models to be accurate, the data sets must be accurate, comprehensive and representative. The company works with commercial imagery providers who fly aerial surveys to capture the images, then labels the images manually.
"We have at least five sets of eyes on any single image," she said. "We don't want to miss anything or mislabel anything."
The company also has a team dedicated to data quality and is investing heavily in that area, like many other organizations. That's just the start of the data quality journey. The following are four data quality challenges organizations must overcome.
1. Missing data
One significant data challenge is filling the gaps in data sets.
"On average, we need several thousand of each type of image," Moody said. "We need to account for variables in how that roof or object is seen, dealing with shadows, dealing with overhanging trees."
If not enough data is available to train a model on a particular type of roof problem or building feature, the company looks at transfer learning or augments it with synthetic data.
In addition to having accurate and sufficient data, the data also has to be properly balanced. For example, if 1% of the average property is water and 99% is land, then a model that predicts land all the time has 99% accuracy overall -- but it is completely useless at identifying water.
"You have to mitigate for this in your training data set," she said. "Otherwise, you end up with models that don't predict in a certain category."
Most of the time spent developing an AI model goes into honing that training data set, she said.
Identifying potential issues requires domain experts, she said. In addition, testing the models against customer use cases must be done before putting them into production to make sure they're picking up everything the client needs them to.
"Data quality is an effort that has to be rooted in technology," she said. "But it's also customer-led."
2. Inconsistent data
Another issue many organizations face is inconsistent data, because it is pulled in from multiple systems in different formats. Even if each individual record is accurate on its own, the lack of coherent standards can make it unusable.
For example, if a particular customer is listed in two different ways, its sales could be counted as belonging to two different customers.
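A minimal sketch of the fix is to normalize names into a canonical key before aggregating. The normalization rules and sample customers below are illustrative assumptions; real master-data matching is usually handled by a dedicated tool.

```python
import re

def normalize_customer(name: str) -> str:
    """Collapse common formatting differences so one customer maps to one key."""
    key = name.casefold().strip()
    key = re.sub(r"[.,]", "", key)                     # drop punctuation
    key = re.sub(r"\b(inc|llc|ltd|corp)\b", "", key)   # drop legal suffixes
    return re.sub(r"\s+", " ", key).strip()            # collapse whitespace

# Two spellings of the same customer coming from different systems.
sales = [("Acme, Inc.", 100), ("ACME Inc", 250), ("Globex Corp", 75)]

totals: dict[str, int] = {}
for customer, amount in sales:
    key = normalize_customer(customer)
    totals[key] = totals.get(key, 0) + amount

print(totals)  # both Acme spellings roll up to a single customer
```

Without the canonical key, Acme's $350 in sales would be split across two "different" customers.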
"In corporations and industries, different people use different tools to source their data," said Uyi Stewart, chief data and technology officer at Data.org, a nonprofit backed by Mastercard Center for Inclusive Growth and The Rockefeller Foundation. "As a result, you have a cacophony."
Companies typically take one of two approaches to solving this problem, he said.
The first is extract, load and transform (ELT). In this approach, the data is extracted from multiple sources, loaded into a single repository, and then merged or transformed while it's in that repository.
A second approach, extract, transform and load (ETL), extracts the data and then does the transformations in a middle layer before it's sent on to the final repository. This is particularly useful when there's sensitive data that needs to be sanitized before it's sent to public cloud storage, Stewart said.
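The middle-layer sanitization Stewart describes can be sketched as a tiny ETL pipeline. The field names, the sample record and the pseudonymization rule are illustrative assumptions, not any particular vendor's pipeline.

```python
import hashlib

def extract() -> list[dict]:
    # Stand-in for reading records from source systems.
    return [{"patient_id": "P-1001", "diagnosis": "flu", "age": 34}]

def transform(records: list[dict]) -> list[dict]:
    # Middle layer: pseudonymize identifiers before anything leaves
    # the trusted environment.
    sanitized = []
    for record in records:
        record = dict(record)  # don't mutate the source copy
        record["patient_id"] = hashlib.sha256(
            record["patient_id"].encode()
        ).hexdigest()[:12]
        sanitized.append(record)
    return sanitized

def load(records: list[dict], repository: list) -> None:
    repository.extend(records)  # stand-in for writing to cloud storage

warehouse: list[dict] = []
load(transform(extract()), warehouse)
print(warehouse[0]["patient_id"])  # hashed, not the raw identifier
```

Because the transform runs before the load, the raw identifier never reaches the shared repository, which is the point of doing ETL rather than ELT for sensitive data.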
"This is very popular in the pharma industry because of the nature of their data," he said.
3. Inaccurate data
Inaccurate data can be costly, said Tomas Kratky, CEO at Manta, a data lineage platform. Businesses can lose potential business opportunities and incur direct costs in finding and fixing the mistakes.
"Additionally, you should factor in measuring the number of incidents or reported data quality issues by data users -- similar to bug reporting in software development -- and the time needed to discover and fix an incident," he said.
Companies should try to shift data quality work as far left as possible. That means, when a particular problem shows up repeatedly, looking early in the process and trying to figure out why the error is happening in the first place.
There might be a problem with the UI, for example, or a bad connection to another system.
4. Duplicate data
Data can be duplicated for a number of reasons.
There might be a UI issue, for example, where someone adds a new record, even though there's one already in the system, because the new record is entered in a slightly different way.
Duplicate records can also result from consistency issues.
That can happen if information is stored in multiple places, said Jayesh Chaurasia, analyst at Forrester Research, a research and consulting provider.
"For example, you can have two different contact numbers for the same customer because customer service updated the information but sales did not," he said.
Depending on how customers are correlated, that may result in duplicate records for the same customer.
Duplicate records could also be due to pipeline errors -- say, if there's a problem reading a record, it's read twice and both versions are saved.
Duplicate records can result in inaccurate reporting, customer service problems and additional costs of storing duplicate records. It also takes time and effort to find the duplicates and resolve the problem.
Fortunately, finding duplicate data is an easier task to automate than some of the other data quality issues on this list. If two records are identical or nearly identical, one is most likely a duplicate.
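A simple form of that automation can be sketched with the standard library's difflib: score each pair of records for similarity and flag pairs above a threshold. The sample records and the 0.9 threshold are illustrative assumptions to tune per data set.

```python
from difflib import SequenceMatcher

records = [
    "Jane Smith, 12 Oak St, Springfield",
    "Jane  Smith, 12 Oak Street, Springfield",  # near-identical entry
    "Carlos Ruiz, 9 Elm Ave, Shelbyville",
]

def similarity(a: str, b: str) -> float:
    # Ratio of matching characters, ignoring case differences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag any pair above the similarity threshold as a likely duplicate.
THRESHOLD = 0.9
likely_dupes = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if similarity(records[i], records[j]) >= THRESHOLD
]
print(likely_dupes)  # the two Jane Smith rows
```

Pairwise comparison is quadratic, so at scale the usual design choice is to block records first (for example, by ZIP code) and only compare within blocks.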
Again, tracking the issue to its source benefits the company in the long run.
Of all these issues, inaccurate data is an even bigger problem than inconsistent data and more difficult to solve, Stewart said.
"You can't get away from the fact that you have to establish a ground truth," he said.
In some cases, this requires manual fact-checking. Employees can call and verify information, or domain experts can review it. Crowdsourcing is also an option for some types of data quality issues, but there are technological solutions as well, Stewart said. For example, once there's enough core data in place, new records can be cross-correlated against multiple data sets.
"But, by no means, should you go into cross-correlation without establishing that core data set first," he said.
Another approach is to look at the predictions that are coming out of your data. If the data is bad, then the predictions will be wrong.
"And, if you have a good model, that means that the data is probably good," Stewart said.
This metric can enable a company to gauge the progress it's making in data quality efforts and to identify areas that still need work.
Not all data issues are equally important to a company's bottom line. Many mistakes or inconsistencies are minor and have no effect on operations. Trying to fix every single mistake everywhere would be prohibitively expensive and might even be impossible.
"Before you go in and do all the cleaning, you have to understand what you are looking for," Stewart said. "Don't waste time on minor things that don't make an impact."