The rise of self-service BI tools enabled people outside of IT to analyze data and create data visualizations and dashboards on their own. That was terrific when the data was ready for analysis, but it turned out that most of the effort in creating BI applications involved data preparation. It still does -- and numerous challenges complicate the data preparation process.
Increasingly, those challenges are faced by business analysts, data scientists, data engineers and other non-IT users. That's because software vendors have also developed self-service data preparation tools. Those tools enable BI users and data science teams to perform the required data preparation tasks for analytics and data visualization projects. But they don't eliminate data prep's inherent complexities.
Why is effective data preparation important?
In the modern enterprise, an explosion of data is available to analyze and act upon to improve business operations. But the data used in analytics applications is often gathered from various sources, both internal and external. Most likely, it is formatted in different ways and contains errors, typos and other data quality issues. Some of it may be irrelevant to the work at hand.
As a result, the data must be curated to achieve the levels of cleanliness, consistency, completeness, currency and context needed for the planned analytics uses. That makes proper data preparation crucial. Without it, BI and analytics initiatives are unlikely to produce the desired outcomes.
This article is part of
Data preparation has to be done within reasonable limits. As Winston Churchill said, "Perfection is the enemy of progress." The goal is to make the data fit for its intended purpose without getting stuck on analysis paralysis or endlessly striving to create perfect data. But it can't be neglected or left to chance.
To succeed, it's important to understand the challenges that data preparation presents and how to overcome them. Many data preparation challenges could be bundled together under the data quality label, but it's useful to differentiate them into more specific issues to help identify, fix and manage the problems. With that in mind, here are seven challenges to be prepared for.
1. Inadequate or nonexistent data profiling
Data analysts and business users should never be surprised by the state of the data when doing analytics -- or worse, have their decisions be affected by faulty data that they were unaware of. Data profiling, one of the core steps in the data preparation process, should prevent that from happening. But there are different reasons why it may not do so, including the following scenarios:
- The people who gather and prepare the data assume it's valid because it was already being used in reports or spreadsheets. As a result, they don't fully profile the data. However, unknown to them, things like SQL queries, views, custom code or macros are manipulating the data, which masks underlying problems in the data set.
- Someone who collects a large volume of data only profiles a sample data set because of the time it would take to do the full one. However, data anomalies might not be picked up in the sample data.
- Custom-coded SQL queries or spreadsheet functions used to profile data aren't comprehensive enough to find all of the anomalies or other problems in the data.
How to overcome this challenge. Solid data profiling needs to be the starting point in the data preparation process. Data preparation tools can help with that: They include comprehensive data profiling functionality to examine the completeness, cleanliness and consistency of data sets in source systems and then in target ones as part of data curation. Done well, data profiling provides the information needed to identify and address many of the data issues listed in the subsequent challenges.
2. Missing or incomplete data
A common data quality issue is fields or attributes with missing values, such as nulls or blanks, zeros that represent a missing value rather than the number 0, or an entire field missing in a delimited file. The data preparation questions raised by these missing values are whether they indicate that there is an error in the data and, if they do, how should that error be handled. Can a valid value be substituted in? If not, should the record (or row) with the error be deleted, or kept but flagged to show there's an error?
If they aren't addressed, missing values and other forms of incomplete data may adversely affect business decisions driven by analytics applications that use the data. They can also cause data load processes that aren't designed to handle such occurrences to fail. That often results in a scramble to figure out what went wrong and undermines confidence in the data preparation process itself.
How to overcome this challenge. First, you need to do data profiling to identify data that's missing or incomplete. Then determine what should be done based on the planned use case for the data and implement the agreed-upon error handling processes, a task that can also be done with a data preparation tool.
3. Invalid data values
Invalid values are another common data quality issue. They include misspellings, other typos, duplicate entries and outliers, such as wrong dates or numbers that aren't reasonable given the data's context. These errors can be created even in modern enterprise applications with data validation features and then end up in curated data sets.
If the number of invalid values in a data set is small, they may not have a significant impact on analytics applications. But more frequent errors may result in faulty analysis of the data.
How to overcome this challenge. The tasks to find and fix invalid data are similar to the ones for handling missing values: Profile the data, determine what to do when errors are encountered and then implement functions to address them. In addition, data profiling should be done on an ongoing basis to identify new errors. This is a data preparation challenge where perfection is not likely to be attained -- some errors will inevitably slip through, but the intent should be to do whatever it takes to keep them from adversely affecting analytics-driven decisions.
4. Name and address standardization
One more data quality issue that complicates data preparation is inconsistency in the names and addresses of people, businesses and places. This type of inconsistency involves legitimate variations of that data, not misspellings or missing values. But if not caught when preparing the data, such inconsistencies can prevent BI and analytics users from getting a complete view of customers, suppliers and other entities.
Examples of name and address inconsistencies include the following:
- a shortened first name or nickname vs. a person's full name, such as Fred in one data field and Frederick in another;
- middle initial vs. middle name;
- differences in prefixes and suffixes, such as Ms. vs. Ms, Mr. vs. Mister or Ph.D. vs. PhD;
- spelled-out vs. abbreviated place data, such as Boulevard/Blvd, suite/ste or New York/NY;
- acronyms vs. full business names, such as BMW and Bayerische Motoren Werke for the German automaker; and
- businesses alternately being listed with and without Inc., Co., Corp., LLC and other business suffixes at the end of their name.
How to overcome this challenge. The source data schemas must be examined to determine what name and address fields are included, and then the data profiled to identify the scope of the inconsistencies. Once you've done that, the following are the three optimal ways to standardize the data:
- create customized standardization processes using a data preparation tool's string handling functionality;
- use a data prep tool's pre-built name and address standardization features; or
- use a tool from a software vendor that specializes in name and address standardization, ideally one that integrates with your data preparation tool.
5. Inconsistent data across enterprise systems
Inconsistent data also is often encountered when multiple data sources are needed for analytics. In this instance, the data may be correct within each source system, but the inconsistency becomes a problem when data from different sources is combined. It's a pervasive challenge for the people who do data preparation, especially in large enterprises.
How to overcome this challenge. When the data inconsistency is the result of an attribute such as an ID field having different data types or values in different systems, data conversions or cross-reference mapping can be used for a relatively easy fix. However, when it occurs because business rules or data definitions are different across the source systems, analysis must be done to determine data transformations that can be implemented while preparing the data.
6. Data enrichment
One of the key steps in creating the business context needed for analytics is enriching data. Examples of data enrichment measures include the following:
- calculating business metrics and KPIs;
- filtering data based on business rules applicable to the planned analytics;
- augmenting data with additional internal or external sources; and
- deriving additional data in an existing data set.
But enriching data isn't an easy task. Deciding what needs to be done in a data set is often complicated, and the required data enrichment work can be a time-consuming procedure.
How to overcome this challenge. Data enrichment should start with a strong understanding of the business needs and goals for analytics applications. That will make it easier to identify the business metrics, KPIs, augmented data and other enrichments required to meet those needs, and then to define things like filters, business rules and calculations to generate the enriched data.
7. Maintaining and expanding data prep processes
Although data scientists and other analysts perform many ad hoc tasks, the more impactful data preparation work they do inevitably becomes a recurring process that then expands in scope as the resulting analytics becomes more and more valuable. But organizations often encounter problems with that, especially if they're using custom-coded data preparation methods.
For example, what happens and why in a data preparation process is typically known only by the person who created it if there's no documentation of the process or of data lineage and where data is used. The dependency on such individuals requires them to spend increasingly more time on these processes and makes it hard to sustain the data preparation work when they leave the organization.
In addition, when changes or enhancements to a data preparation process are needed, bolting on new code makes the process more precarious and difficult to maintain.
How to overcome this challenge. Data preparation tools can help you avoid these traps and achieve long-term, sustained success in preparing data. They provide productivity and maintenance benefits such as pre-built connectors to data sources, collaboration capabilities, data lineage and where-used tracking and automated documentation, often with graphical workflows.
Final thoughts on data preparation and its challenges
To succeed at data preparation, it's imperative that you first understand what data is needed for an analytics application and the associated business context. Once the relevant data has been gathered from source systems, the key steps in preparing it include the following:
- data profiling to identify data quality and consistency issues;
- data cleansing to resolve those issues; and
- data transformation and enrichment to apply the necessary business context for the analytics.
As you go through those steps, do what's appropriate and possible in a reasonable way, especially in cleansing the data. Keep in mind that perfection often isn't attainable or may not be worth the cost to achieve -- and that it really can be the enemy of progress on data preparation.