
The difference between data cleansing and data transformation
Data cleansing corrects errors to improve data quality, while data transformation changes data format and structure to support analytics processes.
Data drives modern business, but the terminology surrounding it is sprawling and often confusing. From data preparation and wrangling to aggregation and beyond, the diverse techniques and their vocabulary can sound like an alphabet soup.
But many of these processes are critical to maintaining data quality and fueling data science and analytics. Data cleansing and transformation are two of the most important -- and most frequently misunderstood.
Understanding the differences between data cleansing and data transformation, as well as how each is used, is essential to managing data effectively and producing reliable insights.
What is data cleansing?
Data cleansing is the process of identifying and correcting errors within a data set. It begins with assessing data quality and flagging issues such as incorrect, inconsistent or incomplete information. The next step is to correct these errors so that the data is accurate and reliable.
The primary goal of data cleansing, also known as data cleaning, is to improve data quality. This is essential to data management, as inaccurate data can lead to inaccurate insights and, in turn, flawed decision-making. High-quality data, by contrast, enables data scientists, analysts and business users to make more informed, data-driven decisions that can positively affect the bottom line.
Cleansing data focuses on improving the following dimensions of data quality:
- Accuracy, or how well the data reflects the attribute being described.
- Completeness, or the extent to which a data set contains all the values it is expected to have.
- Consistency, or the uniformity of data across systems and sources.
- Integrity, or the validity of relationships across various data entities.
- Uniqueness, or the lack of duplicate or redundant data.
- Validity, or the percentage of data that conforms to its defined syntax.
Although data cleansing can also include shoring up other attributes, its core purpose is to resolve errors and prepare data for consumption.
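Several of these dimensions can be expressed as simple ratios. The following is a minimal sketch, assuming a small set of hypothetical records and a deliberately simple email pattern. The field names, the sample values and the three helper functions are illustrative and are not part of any specific tool.

```python
import re

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com", "country": "US"},
    {"id": 2, "email": None, "country": "US"},
    {"id": 3, "email": "not-an-email", "country": "us"},
    {"id": 1, "email": "ana@example.com", "country": "US"},  # duplicate of the first row
]

# A deliberately simple pattern for illustration, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows):
    """Share of rows that are distinct when compared on all fields."""
    distinct = {tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows}
    return len(distinct) / len(rows)


def validity(rows, field, pattern):
    """Share of populated values that match the expected syntax."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.match(v)) for v in values) / len(values)


print(completeness(records, "email"))
print(uniqueness(records))
print(validity(records, "email", EMAIL_PATTERN))
```

Scores like these only flag where problems exist. Deciding how to correct them, such as whether "us" should become "US", is the cleansing step itself.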
What is data transformation?
Data transformation is the process of converting data from one format or structure to another. For example, raw data often needs to be converted into a format that databases and applications can recognize.
Like data cleansing, data transformation is often part of data preparation and management processes, including data integration and migration. The process is necessary because organizations collect information in many formats from diverse data sources. In order for that data to be stored and analyzed effectively, it needs to be structured and organized. Data transformation unifies data types into consistent formats, making information usable and accessible.
Data transformation also plays a role in improving data quality. Converting data into consistent, reliable formats supports broader access and enables faster, more accurate data analytics.
Transformation process
Data transformation is the "transform" stage of the extract, transform and load (ETL) or extract, load and transform (ELT) integration process. Although ETL and ELT perform these steps in a different order, both processes involve three key phases:
- Extraction. Relevant data is collected from various sources, often through profiling and mapping.
- Transformation. Extracted data is converted into the desired structure using techniques like data normalization, enrichment and scrubbing.
- Loading. The data is loaded into a target system, such as a database or data warehouse, for processing and analysis. In ELT, this step happens before transformation.
Choosing between ETL and ELT largely depends on how, when and where the transformation should occur.
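The three phases can be sketched in a few lines of Python. This is a minimal illustration that assumes an in-memory CSV string as the source and an in-memory SQLite database as the target. The `orders` table, its columns and the sample values are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns and values are made up for illustration.
RAW_CSV = """order_id,amount,currency
1001, 19.99 ,usd
1002,5.00,USD
1003,12.50,Usd
"""


def extract(text):
    """Extraction: read raw rows from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert types and standardize formats."""
    return [
        (int(r["order_id"]), float(r["amount"].strip()), r["currency"].strip().upper())
        for r in rows
    ]


def load(rows, conn):
    """Loading: write the transformed rows into a target table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)  # ETL order: transform before loading
total = conn.execute("SELECT ROUND(SUM(amount), 2) FROM orders").fetchone()[0]
print(total)
```

In an ELT variant of this sketch, the raw rows would be loaded first and the type conversion and standardization would run inside the target system, for example as SQL statements.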
Transformation techniques
Data transformation can use several techniques, which can be grouped by function.
Structural formatting and integration techniques include the following:
- Data aggregation involves gathering data from different sources and summarizing it in a new form.
- Data mapping connects or matches data fields from one source to another.
- Data encoding converts data into a specific format, usually numerical, to make it easier to process.
- Data normalization makes sure data elements use the same scale or format to represent all data consistently across fields.
- Data categorization and classification organizes data into new, more relevant groups.
- Data combination and integration merges, blends and unifies data elements.
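As a rough illustration of three of these techniques, the sketch below applies mapping, encoding and normalization to a handful of hypothetical rows. The field names, the `FIELD_MAP` dictionary and the choice of min-max scaling are assumptions made for the example. Other encodings and scaling methods are equally valid.

```python
# Hypothetical source rows; field names and values are illustrative only.
source_rows = [
    {"cust_nm": "Ana", "tier": "gold", "spend": 200.0},
    {"cust_nm": "Ben", "tier": "silver", "spend": 50.0},
    {"cust_nm": "Cy", "tier": "gold", "spend": 125.0},
]

# Data mapping: match source field names to the target schema's names.
FIELD_MAP = {"cust_nm": "customer_name", "tier": "tier", "spend": "spend"}
mapped = [{FIELD_MAP[k]: v for k, v in row.items()} for row in source_rows]

# Data encoding: convert a categorical field into integer codes.
tier_codes = {tier: code for code, tier in enumerate(sorted({r["tier"] for r in mapped}))}
for row in mapped:
    row["tier_code"] = tier_codes[row["tier"]]

# Data normalization: rescale spend to the 0-1 range (min-max scaling).
# This assumes the field has at least two distinct values, so high != low.
low = min(r["spend"] for r in mapped)
high = max(r["spend"] for r in mapped)
for row in mapped:
    row["spend_scaled"] = (row["spend"] - low) / (high - low)

for row in mapped:
    print(row)
```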
Data quality improvement techniques include the following:
- Data deduplication involves identifying and eliminating redundant copies of data to ensure uniqueness.
- Data revision involves modifying or updating data within a data set to ensure accuracy and consistency over time.
- Data scrubbing -- a term often used interchangeably with data cleansing -- is the process of identifying and correcting errors in data.
- Data imputation replaces missing data with new, estimated values.
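Deduplication and imputation are straightforward to sketch. The example below assumes hypothetical sensor readings, treats rows with the same `sensor` and `day` as duplicates, and fills gaps with the mean of the observed values. Mean imputation is only one of several possible strategies, and it is chosen here for brevity.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"sensor": "a", "day": 1, "temp": 20.0},
    {"sensor": "a", "day": 1, "temp": 20.0},   # exact duplicate
    {"sensor": "a", "day": 2, "temp": None},   # missing value
    {"sensor": "a", "day": 3, "temp": 24.0},
]


def deduplicate(rows, key_fields):
    """Keep the first row seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows, field):
    """Replace missing values with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


clean = impute_mean(deduplicate(readings, ["sensor", "day"]), "temp")
print(clean)
```

Deduplicating before imputing keeps repeated rows from skewing the estimated value.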
Analytical preparation and enhancement techniques include the following:
- Data discovery and profiling involves identifying relevant data, interpreting its structure and characteristics, and determining how it should be transformed.
- Data discretization converts continuous values into a smaller number of discrete intervals, or bins, to make the data easier to analyze.
- Data filtering narrows down and segments the most relevant data from a larger data set.
- Data generalization abstracts detailed data sets into higher-level categories to make them easier to understand.
- Data separation and splitting divides a larger data set into two or more subsets, often to test data against a sample.
- Data smoothing removes noise or outliers from a data set to identify patterns and trends more easily.
- Data supplementation and enrichment adds relevant information from external sources, such as third-party data sets, to create a more comprehensive data set.
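A few of these techniques are sketched below on a hypothetical series of daily order counts. The bin edges, the three-value window and the 30% holdout fraction are arbitrary choices for illustration. Note that a moving average dampens spikes rather than deleting them outright.

```python
import random

# Hypothetical daily order counts.
daily_orders = [10, 12, 40, 11, 13, 12, 14]


def discretize(value, edges, labels):
    """Data discretization: map a continuous value to a labeled bin."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


def moving_average(values, window):
    """Data smoothing: average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def split(values, test_fraction, seed):
    """Data splitting: shuffle a copy, then divide it into two subsets."""
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


bins = [discretize(v, edges=[12, 20], labels=["low", "medium", "high"]) for v in daily_orders]
smoothed = moving_average(daily_orders, window=3)
train, test = split(daily_orders, test_fraction=0.3, seed=0)

print(bins)
print(smoothed)
print(train, test)
```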
Once data is transformed, it must undergo validation and review to make sure it meets requirements and does not contain anomalies or errors.
Differences between data cleansing and data transformation
Data cleansing and data transformation serve different purposes in the data preparation process:
- Data cleansing focuses on identifying, correcting and eliminating errors in data sets.
- Data transformation involves converting data into a specific format.
Data cleansing is often performed at the beginning of the data transformation process to make sure the data set is accurate and free of errors. Data transformation, on the other hand, tends to involve a broader range of techniques. The goal is to restructure data so that it is easier to use for analysis, reporting or interpretation. While data transformation can include fixing errors, as data cleansing does, it also involves augmenting and expanding existing data to create a more comprehensive data set.
Common examples and uses for each
Data cleansing tasks can include identifying inconsistencies in a data set, automatically fixing simple errors, removing invalid or redundant data, and highlighting missing values. Examples of data cleansing include the following:
- Deleting duplicate customer information within a spreadsheet.
- Verifying that dates adhere to the proper format, such as MM/DD/YYYY.
- Flagging phone numbers that are missing a digit.
- Removing irrelevant data, such as HTML tags.
- Fixing basic spelling mistakes.
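Several of the cleansing tasks above can be sketched with the Python standard library. The customer rows, the 10-digit phone rule and the helper names below are assumptions made for the example, and real data would likely need more careful rules.

```python
import re
from datetime import datetime

# Hypothetical customer rows; all values are invented for illustration.
customers = [
    {"name": "Ana Diaz", "signup": "03/14/2024", "phone": "555-867-5309", "note": "<b>VIP</b>"},
    {"name": "Ana Diaz", "signup": "03/14/2024", "phone": "555-867-5309", "note": "<b>VIP</b>"},
    {"name": "Ben Okoye", "signup": "2024-03-15", "phone": "555-867-530", "note": "new"},
]


def has_valid_date(text):
    """Check that a date string parses as month/day/year.

    strptime also accepts unpadded values such as 3/14/2024.
    """
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


def has_ten_digits(phone):
    """Check that a phone number contains exactly 10 digits (an assumed rule)."""
    return len(re.sub(r"\D", "", phone)) == 10


def strip_html(text):
    """Remove simple HTML tags; a real HTML parser is safer for complex markup."""
    return re.sub(r"<[^>]+>", "", text)


# Delete exact duplicates while preserving order.
seen, deduped = set(), []
for row in customers:
    key = tuple(row.values())
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Clean fields and flag, rather than silently fix, values that fail a check.
cleaned = [
    {
        **row,
        "note": strip_html(row["note"]),
        "date_ok": has_valid_date(row["signup"]),
        "phone_ok": has_ten_digits(row["phone"]),
    }
    for row in deduped
]

for row in cleaned:
    print(row)
```

Flagging the bad date and the short phone number, instead of guessing at a correction, leaves the decision to someone who can verify the right value.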
Data transformation can convert data between formats, restructure or combine data sets, or enhance data for downstream use. Examples of data transformation include the following:
- Aggregating finance data to represent monthly averages.
- Smoothing data to create a clear customer purchasing trend line.
- Splitting data into a sample set and a control set for experimentation.
- Revising data in real time to update inventory with current volume.
- Enhancing data for use in training machine learning models.
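The first example, aggregating finance data into monthly averages, might look like the following sketch. The revenue figures are hypothetical, and the code assumes dates arrive as ISO-formatted strings so that the first seven characters identify the year and month.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily revenue figures keyed by ISO date.
daily_revenue = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 300.0),
    ("2024-02-03", 250.0),
    ("2024-02-17", 150.0),
    ("2024-02-28", 230.0),
]

# Group each amount under its year-month, then average each group.
by_month = defaultdict(list)
for date, amount in daily_revenue:
    by_month[date[:7]].append(amount)

monthly_average = {month: mean(amounts) for month, amounts in by_month.items()}
print(monthly_average)
```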
When it's better to use one vs. the other
Use data cleansing when the goal is to improve specific dimensions of data quality, such as accuracy, consistency, uniqueness and validity. The process is designed to catch and fix errors, remove redundancies, and correct inaccurate information.
Use data transformation to improve data usability. This might mean changing data formats, expanding data sets, summarizing data or breaking a data set down into more granular detail. It's typically a more complex process than data cleansing, with many more use cases.
While there is some overlap, data cleansing, scrubbing and deduplication are all generally considered part of the broader data transformation process. Data transformation can include a variety of additional techniques, while data cleansing is more narrowly focused on fixing and removing errors. For example, data cleansing might only remove redundant information, whereas data transformation might combine entries into a larger data set to add more context and make the data richer.
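The contrast in that last example can be shown side by side. In this minimal sketch, the order rows and the `region_by_zip` lookup table are hypothetical stand-ins for an internal data set and an external source.

```python
# Hypothetical order rows and a hypothetical external lookup table.
orders = [
    {"order_id": 1, "zip": "10001"},
    {"order_id": 1, "zip": "10001"},  # redundant copy
    {"order_id": 2, "zip": "94105"},
]
region_by_zip = {"10001": "Northeast", "94105": "West"}

# Cleansing: only remove the redundant row; no new information is added.
cleansed = list({row["order_id"]: row for row in orders}.values())

# Transformation: combine each order with external data to add context.
enriched = [{**row, "region": region_by_zip.get(row["zip"], "unknown")} for row in cleansed]

print(cleansed)
print(enriched)
```

The cleansed output has fewer rows but the same fields, while the enriched output gains a field that was not in the original data.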
Jacob Roundy is a freelance writer and editor with more than a decade of experience in a variety of tech topics, such as data centers, business intelligence and sustainability.