data transformation
What is data transformation?
Data transformation is the process of converting data from one format, such as a database file, XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a cleansed, validated and ready-to-use format. Data transformation is crucial to data management processes that include data integration, data migration, data warehousing and data preparation.
The process of data transformation can also be referred to as extract/transform/load (ETL). The extraction phase involves identifying and pulling data from the various source systems that create data and then moving the data to a single repository. Next, the raw data is cleansed, if needed. It's then transformed into a target format that can be fed into operational systems or into a data warehouse, a date lake or another repository for use in business intelligence and analytics applications. The transformation may involve converting data types, removing duplicate data and enriching the source data.
Data transformation is crucial to processes that include data integration, data management, data migration, data warehousing and data wrangling.
It is also a critical component for any organization seeking to leverage its data to generate timely business insights. As the volume of data has proliferated, organizations must have an efficient way to harness data to effectively put it to business use. Data transformation is one element of harnessing this data, because -- when done properly -- it ensures data is easy to access, consistent, secure and ultimately trusted by the intended business users.
What are the key steps in data transformation?
The process of data transformation, as noted, involves identifying data sources and types; determining the structure of transformations that need to occur; and defining how fields will be changed or aggregated. It includes extracting data from its original source, transforming it and sending it to the target destination, such as a database or data warehouse. Extractions can come from many locations, including structured sources, streaming sources or log files from web applications.
Data analysts, data engineers and data scientists are typically in charge of data transformation within an organization. They identify the source data, determine the required data formats and perform data mapping, as well as execute the actual transformation process before moving the data into appropriate databases for storage and use.
Their work involves five main steps:
- data discovery, in which data professionals use data profiling tools or profiling scripts to understand the structure and characteristics of the data and also to determine how it should be transformed;
- data mapping, during which data professionals connect, or match, data fields from one source to data fields in another;
- code generation, a part of the process where the software code required to transform the data is created (either by data transformation tools or the data professionals themselves writing script);
- execution of the code, where the data undergoes the transformation; and
- review, during which data professionals or the business/end users confirm that the output data meets the established transformation requirements and, if not, address and correct any anomalies and errors.
These steps fall in the middle of the ETL process for organizations that use on-premises warehouses. However, scalable cloud-based data warehouses have given rise to a slightly different process called ELT for extract, load, transform; in this process, organizations can load raw data into data warehouses and then transform data at the time of use.
What are the benefits and challenges of data transformation?
Organizations across the board need to analyze their data for a host of business operations, from customer service to supply chain management. They also need data to feed the increasing number of automated and intelligent systems within their enterprise.
To gain insight into and improve these operations, organizations need high-quality data in formats compatible with the systems consuming the data.
Thus, data transformation is a critical component of an enterprise data program because it delivers the following benefits:
- higher data quality;
- reduced number of mistakes, such as missing values;
- faster queries and retrieval times;
- less resources needed to manipulate data;
- better data organization and management; and
- more usable data, especially for advanced business intelligence or analytics.
The data transformation process, however, can be complex and complicated. The challenges organizations face include the following:
- high cost of transformation tools and professional expertise;
- significant compute resources, with the intensity of some on-premises transformation processes having the potential to slow down other operations;
- difficulty recruiting and retaining the skilled data professionals required for this work, with data professionals some of the most in-demand workers today; and
- difficulty of properly aligning data transformation activities to the business's data-related priorities and requirements.
Reasons to do data transformation
Organizations must be able to mine their data for insights in order to successfully compete in the digital marketplace, optimize operations, cut costs and boost productivity. They also require data to feed systems that use artificial intelligence, machine learning, natural language processing and other advanced technologies.
To gain accurate insights and to ensure accurate operations of intelligent systems, organizations must collect data and merge it from multiple sources and ensure that integrated data is high quality.
This is where data transformation plays the star role, by ensuring that data collected from one system is compatible with data from other systems and that the combined data is ultimately compatible for use in the systems that require it. For example, databases might need to be combined following a corporate acquisition, transferred to a cloud data warehouse or merged for analysis.
Examples of data transformation
There are various data transformation methods, including the following:
- aggregation, in which data is collected from multiple sources and stored in a single format;
- attribute construction, in which new attributes are added or created from existing attributes;
- discretization, which involves converting continuous data values into sets of data intervals with specific values to make the data more manageable for analysis;
- generalization, where low-level data attributes are converted into high-level data attributes (for example, converting data from multiple brackets broken up by ages into the more general "young" and "old" attributes) to gain a more comprehensive view of the data;
- integration, a step that involves combining data from different sources into a single view;
- manipulation, where the data is changed or altered to make it more readable and organized;
- normalization, a process that converts source data into another format to limit the occurrence of duplicated data; and
- smoothing, which uses algorithms to reduce "noise" in data sets, thereby helping to more efficiently and effectively identify trends in the data.
Data transformation tools
Data professionals have a number of tools at their disposal to support the ETL process. These technologies automate many of the steps within data transformation, replacing much, if not all, of the manual scripting and hand coding that had been a major part of the data transformation process.
Both commercial and open source data transformation tools are available, with some options designed for on-premises transformation processes and others catering to cloud-based transformation activities.
Moreover, some data transformation tools are focused on the data transformation process itself, handling the string of actions required to transform data. However, other ETL tools on the market are part of platforms that offer a broad range of capabilities for managing enterprise data.
Options include IBM InfoSphere, DataStage, Matillion, SAP Data Services and Talend.