Data preparation in machine learning: 6 key steps data validation

data cleansing (data cleaning, data scrubbing)

What is data cleansing?

Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then changing, updating or removing data to correct them. Data cleansing improves data quality and helps provide more accurate, consistent and reliable information for decision-making in an organization.

Data cleansing is a key part of the overall data management process and one of the core components of data preparation work that readies data sets for use in business intelligence (BI) and data science applications. It's typically done by data quality analysts and engineers or other data management professionals. But data scientists, BI analysts and business users may also clean data or take part in the data cleansing process for their own applications.

Data cleansing vs. data cleaning vs. data scrubbing

Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed as an element of data cleansing that specifically involves removing duplicate, bad, unneeded or old data from data sets.

Data scrubbing also has a different meaning in connection with data storage. In that context, it's an automated function that checks disk drives and storage systems to make sure the data they contain can be read and to identify any bad sectors or blocks.

Why is clean data important?

Business operations and decision-making are increasingly data-driven, as organizations look to use data analytics to help improve business performance and gain competitive advantages over rivals. As a result, clean data is a must for BI and data science teams, business executives, marketing managers, sales reps and operational workers. That's particularly true in retail, financial services and other data-intensive industries, but it applies to organizations across the board, both large and small.

If data isn't properly cleansed, customer records and other business data may not be accurate and analytics applications may provide faulty information. That can lead to flawed business decisions, misguided strategies, missed opportunities and operational problems, which ultimately may increase costs and reduce revenue and profits. IBM estimated that data quality issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that's still widely cited.

What kind of data errors does data scrubbing fix?

Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid, incompatible and corrupt data. Some of those problems are caused by human error during the data entry process, while others result from the use of different data structures, formats and terminology in separate systems throughout an organization.

The types of issues that are commonly fixed as part of data cleansing projects include the following:

  • Typos and invalid or missing data. Data cleansing corrects various structural errors in data sets. For example, that includes misspellings and other typographical errors, wrong numerical entries, syntax errors and missing values, such as blank or null fields that should contain data.
  • Inconsistent data. Names, addresses and other attributes are often formatted differently from system to system. For example, one data set might include a customer's middle initial, while another doesn't. Data elements such as terms and identifiers may also vary. Data cleansing helps ensure that data is consistent so it can be analyzed accurately.
  • Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or merges them through the use of deduplication measures. For example, when data from two systems is combined, duplicate data entries can be reconciled to create single records.
  • Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be relevant to analytics applications and could skew their results. Data cleansing removes redundant data from data sets, which streamlines data preparation and reduces the required amount of data processing and storage resources.

What are the steps in the data cleansing process?

The scope of data cleansing work varies depending on the data set and analytics requirements. For example, a data scientist doing fraud detection analysis on credit card transaction data may want to retain outlier values because they could be a sign of fraudulent purchases. But the data scrubbing process typically includes the following actions:

  1. Inspection and profiling. First, data is inspected and audited to assess its quality level and identify issues that need to be fixed. This step usually involves data profiling, which documents relationships between data elements, checks data quality and gathers statistics on data sets to help find errors, discrepancies and other problems.
  2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and inconsistent, duplicate and redundant data is addressed.
  3. Verification. After the cleaning step is completed, the person or team that did the work should inspect the data again to verify its cleanliness and make sure it conforms to internal data quality rules and standards.
  4. Reporting. The results of the data cleansing work should then be reported to IT and business executives to highlight data quality trends and progress. The report could include the number of issues found and corrected, plus updated metrics on the data's quality levels.

The cleansed data can then be moved into the remaining stages of data preparation, starting with data structuring and data transformation, to continue readying it for analytics uses.

Commonly used data quality metrics
These metrics can be used to measure data quality levels in connection with data cleansing efforts.

Characteristics of clean data

Various data characteristics and attributes are used to measure the cleanliness and overall quality of data sets, including the following:

  • accuracy
  • completeness
  • consistency
  • integrity
  • timeliness
  • uniformity
  • validity

Data management teams create data quality metrics to track those characteristics, as well as things like error rates and the overall number of errors in data sets. Many also try to calculate the business impact of data quality problems and the potential business value of fixing them, partly through surveys and interviews with business executives.

The benefits of effective data cleansing

Done well, data cleansing provides the following business and data management benefits:

Key benefits of data cleansing
Data cleansing provides these benefits to organizations.
  • Improved decision-making. With more accurate data, analytics applications can produce better results. That enables organizations to make more informed decisions on business strategies and operations, as well as things like patient care and government programs.
  • More effective marketing and sales. Customer data is often wrong, inconsistent or out of date. Cleaning up the data in customer relationship management and sales systems helps improve the effectiveness of marketing campaigns and sales efforts.
  • Better operational performance. Clean, high-quality data helps organizations avoid inventory shortages, delivery snafus and other business problems that can result in higher costs, lower revenues and damaged relationships with customers.
  • Increased use of data. Data has become a key corporate asset, but it can't generate business value if it isn't used. By making data more trustworthy, data cleansing helps convince business managers and workers to rely on it as part of their jobs.
  • Reduced data costs. Data cleansing stops data errors and issues from further propagating in systems and analytics applications. In the long term, that saves time and money, because IT and data management teams don't have to continue fixing the same errors in data sets.

Data cleansing and other data quality methods are also a key part of data governance programs, which aim to ensure that the data in enterprise systems is consistent and gets used properly. Clean data is one of the hallmarks of a successful data governance initiative.

Data cleansing challenges

Data cleansing doesn't lack for challenges. One of the biggest is that it's often time-consuming, due to the number of issues that need to be addressed in many data sets and the difficulty of pinpointing the causes of some errors. Other common challenges include the following:

  • deciding how to resolve missing data values so they don't affect analytics applications;
  • fixing inconsistent data in systems controlled by different business units;
  • cleaning up data quality issues in big data systems that contain a mix of structured, semistructured and unstructured data;
  • getting sufficient resources and organizational support; and
  • dealing with data silos that complicate the data cleansing process.

Data cleansing tools and vendors

Numerous tools can be used to automate data cleansing tasks, including both commercial software and open source technologies. Typically, the tools include a variety of functions for correcting data errors and issues, such as adding missing values, replacing null ones, fixing punctuation, standardizing fields and combining duplicate records. Many also do data matching to find duplicate or related records.

Tools that help cleanse data are available in a variety of products and platforms, including the following:

  • specialized data cleaning tools from vendors such as Data Ladder and WinPure;
  • data quality software from vendors such as Datactics, Experian, Innovative Systems, Melissa, Microsoft and Precisely;
  • data preparation tools from vendors such as Altair, DataRobot, Tableau, Tibco Software and Trifacta;
  • data management platforms from vendors such as Alteryx, Ataccama, IBM, Informatica, SAP, SAS, Syniti and Talend;
  • customer and contact data management software from vendors such as Redpoint Global, RingLead, Synthio and Tye;
  • tools for cleansing data in Salesforce systems from vendors such as Cloudingo and Plauti; and
  • open source tools, such as DataCleaner and OpenRefine

Learn how strong data governance policies can help organizations prevent data silos and ensure better quality data.

This was last updated in January 2022

Continue Reading About data cleansing (data cleaning, data scrubbing)

Dig Deeper on Data governance

Business Analytics
Content Management