Data quality management tools can automate many of the processes required to ensure data remains fit for purpose across analytics, data science and machine learning use cases. Organizations must find the tool best suited to assess existing data pipelines, identify quality bottlenecks and automate remediation steps.
Processes associated with ensuring data quality include profiling data, tracking data lineage and cleaning data. Overall, it might make sense to identify a tool or set of tools that aligns with the enterprise's existing data pipeline workflows. In the short run, it is sometimes helpful to identify specific gaps or challenges in the data quality process.
"It's best to focus first on which tools and systems need to be replaced," said Jeff Brown, team lead for BI projects at Syntax, a managed service provider.
This starts by working with teams to determine which changes will do the most to improve the organization's data-driven culture.
Top considerations include overall cost, ability to audit, ease of setting up policies and standards, amount of training required to use the tool, whether the tool can scale to keep up with increasing and evolving data sources, and ease of use, said Terri Sage, CTO of 1010data, a provider of analytical intelligence to the financial, retail and consumer markets.
Tool similarities and differences
Each data quality tool has its own set of capabilities and workflows. Most tools include features for profiling, cleansing and standardizing data.
Data profiling, measurement and visualization capabilities help teams understand the format and values of the collected data set. These tools will point out outliers and mixed formats. Data profiling serves as an early quality filter in the data analytics pipeline.
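As a rough illustration of what a profiler reports, the following Python sketch reduces each value to a format pattern to surface mixed formats and applies an interquartile-range rule to surface outliers. The helper names (`shape`, `profile_formats`, `find_outliers`), the sample values and the 1.5 x IQR threshold are assumptions made for this example, not features of any tool covered here.

```python
import re
import statistics
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to a format pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_formats(values):
    """Count how many values share each format pattern."""
    return Counter(shape(v) for v in values)


def find_outliers(numbers, k=1.5):
    """Return values that fall outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in numbers if x < low or x > high]


# Hypothetical sample columns: one date is in a different format,
# and one amount is far from the rest.
dates = ["2023-01-05", "2023-02-11", "03/04/2023", "2023-05-20"]
amounts = [10, 12, 11, 13, 12, 11, 500]

formats = profile_formats(dates)
outliers = find_outliers(amounts)
print(formats)
print(outliers)
```

A pattern that appears only once in a column that is otherwise uniform, such as the slash-separated date here, is usually the first thing a profiling report draws attention to.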
"You can learn a lot about the quality of your data by using the right profiling tools," said Christophe Antoine, vice president of global solutions engineering at Talend, an open source data integration platform.
Data standardization capabilities help identify inconsistent data formats, values, names and outliers in the current data sets. Then, they can apply a standardization, such as an address validator, pattern check, formatter or database of synonyms, to the data pipeline to always return a standardized value. This is helpful when the same data is input in different ways.
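A minimal Python sketch of two of these techniques, a synonym lookup and a pattern-based formatter, might look like the following. The synonym table, the function names and the choice of a ten-digit North American phone format are all assumptions made for illustration; commercial tools ship far larger reference data sets.

```python
import re

# Hypothetical synonym table mapping common spellings to one standard code.
STATE_SYNONYMS = {
    "ca": "CA", "calif.": "CA", "california": "CA",
    "ny": "NY", "n.y.": "NY", "new york": "NY",
}


def standardize_state(raw):
    """Return the standard state code, or None if the value is unknown."""
    return STATE_SYNONYMS.get(raw.strip().lower())


def standardize_phone(raw):
    """Return the phone number as NNN-NNN-NNNN, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading country code
    if len(digits) != 10:
        return None                           # pattern check failed
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(standardize_state(" California "))
print(standardize_phone("(415) 555-0100"))
print(standardize_phone("+1 415.555.0100"))
```

Returning `None` for values that cannot be standardized, rather than guessing, leaves the decision about rejects to a later cleansing or review step.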
Data cleansing capabilities can help fix structural issues, remove outliers, fill in missing data fields and ensure mandatory fields are correctly filled. Data cleansing can be expensive, given the extra hours and tooling involved. Look for tools that can fill in any data cleansing gaps in existing workflows, Sage said.
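To make two of these steps concrete, the Python sketch below fills missing optional fields with defaults and separates out records whose mandatory fields are still empty. The `cleanse` function, the field names and the default values are hypothetical and chosen only for this example.

```python
def is_blank(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def cleanse(records, mandatory, defaults):
    """Fill defaults, then split records into clean and rejected lists."""
    clean, rejected = [], []
    for record in records:
        row = dict(record)                    # do not mutate the input
        for field, default in defaults.items():
            if is_blank(row.get(field)):
                row[field] = default          # fill a missing optional field
        missing = [f for f in mandatory if is_blank(row.get(f))]
        if missing:
            rejected.append((row, missing))   # keep the reason for review
        else:
            clean.append(row)
    return clean, rejected


# Hypothetical input: one record lacks a country, one lacks an email.
records = [
    {"id": 1, "email": "a@example.com", "country": ""},
    {"id": 2, "email": None, "country": "DE"},
]
clean, rejected = cleanse(records, mandatory=["id", "email"],
                          defaults={"country": "UNKNOWN"})
print(clean)
print(rejected)
```

Keeping rejected records together with the list of missing fields, instead of dropping them silently, preserves an audit trail for whoever has to repair the source.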
Parsing capabilities help to decompose data into its component parts. This can help track the cause of data quality issues and flag downstream data sets when problems arise in a data pipeline.
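As a simple sketch of decomposition, the Python example below splits a single-line address into named components with a regular expression. The pattern assumes a "street, city, ST 12345" layout, which is an assumption made for this illustration; real parsers handle many more layouts and locales.

```python
import re
from typing import Optional

# Assumed layout: "<street>, <city>, <2-letter state> <5-digit ZIP>"
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)


def parse_address(raw: str) -> Optional[dict]:
    """Return the address components, or None if the layout does not match."""
    match = ADDRESS_PATTERN.match(raw)
    return match.groupdict() if match else None


parsed = parse_address("12 Main St, Springfield, IL 62701")
print(parsed)
print(parse_address("Springfield IL"))
```

Once a value is split into components, each part can be validated or standardized on its own, and a failed parse points to the specific record that needs attention.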
Monitoring capabilities track data quality metrics. When these capabilities detect a problem, they can alert data management teams to investigate sooner, while the issue is still easier to address.
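One common metric is completeness, the share of records in which a field is populated. The Python sketch below computes it for a batch and produces an alert message when it falls below a threshold. The function names, the thresholds and the choice to treat an empty batch as fully complete are assumptions for this example; a production monitor would track more metrics and route alerts to a real notification channel.

```python
def completeness(records, field):
    """Return the fraction of records with a non-empty value for the field."""
    if not records:
        return 1.0  # assumption: an empty batch raises no alert
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def check_thresholds(records, thresholds):
    """Return one alert message per field whose completeness is too low."""
    alerts = []
    for field, minimum in thresholds.items():
        score = completeness(records, field)
        if score < minimum:
            alerts.append(
                f"{field}: completeness {score:.0%} is below {minimum:.0%}"
            )
    return alerts


# Hypothetical batch: every record has an id, but half lack an email.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]
alerts = check_thresholds(batch, {"id": 1.0, "email": 0.9})
print(alerts)
```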
Here are seven of the top data quality management tools:
Ataccama One Data Quality Suite
Ataccama was founded in 2007 to commercialize the data preparation tools developed in house by Adastra, a data analytics and AI platform. Ataccama specializes in turning raw data into reusable data products that can support various AI, analytics and operational tasks across a company. A big aspect of this is ensuring data quality for large data pipelines through self-driving capabilities for automated data classification, automated data quality and access policy documentation.
Freemium data quality tools include Ataccama Data Quality Analyzer and Ataccama One Profiler. The company has also developed an extensive set of AI tools for detecting data anomalies, creating new data quality rules and automating various data quality processes. Although these tools can streamline many aspects of data quality, enterprises might need to do additional work to support processes that operate outside of existing workflows.
IBM InfoSphere Information Server for Data Quality
Over the years, IBM has developed extensive data quality tools to complement its various enterprise database offerings. Current tools include IBM InfoSphere Information Server for Data Quality for end-to-end quality, IBM InfoSphere QualityStage for data cleansing and standardization, and IBM Watson Knowledge Catalog for AI metadata management. The InfoSphere tools are part of IBM's legacy offerings.
Watson Knowledge Catalog is a newer offering that promises to streamline the quality aspects of AI and machine learning workflows within a single platform. This approach promises to help harmonize workflows that traditionally spanned multiple data science, ModelOps and data management tools.
Informatica Data Quality
Informatica has a long history of improving data transformation and data quality. Its current lineup includes a range of data quality tools, including Informatica Data Quality, Informatica Cloud Data Quality, Informatica Axon Data Governance, Informatica Data Engineering Quality and Informatica Data as a Service.
Each tool focuses on different aspects of data quality. This breadth of offerings enables the ensemble of tools to support diverse use cases. Its extensive cloud capabilities also provide a way for enterprises to ensure data quality as they migrate to hybrid or cloud-native data management tools. However, its cloud tools still offer fewer features than the on-premises versions.
In addition, Informatica simplifies data quality as part of modern AI and machine learning workflows. The company also supports a rich offering of metadata management, data cataloging and data governance capabilities aligned with its data quality capabilities.
Precisely
Precisely is the modern name for a legacy data transformation tools provider that has been around since the late 1960s. The company started life as Whitlow Computer Systems to focus on high-speed data transformation; it was renamed Syncsort in the 1980s and then Precisely in 2020. Each rebranding has reflected a shift in industry needs and new technology.
It recently acquired Infogix and its data quality, governance and metadata capabilities. Current data quality offerings include Precisely Trillium, Precisely Spectrum Quality and Precisely Data360. One big strength is an extensive set of geocoding and spatial data standardization and enrichment capabilities. It is worth noting that these tools came from separate acquisitions. As a result, enterprises might need to budget for additional integration work on workflows that span them.
SAP Data Intelligence
SAP is the established leader in ERP software, a data-heavy application. It has acquired or developed a variety of data quality capabilities to enhance its core platform. Current data quality offerings include SAP Information Steward, SAP Data Services and SAP Data Intelligence Cloud.
The company has undergone many significant platform shifts for its core offerings, with the development of the S/4HANA intelligent ERP platform. It is now undergoing a similar shift toward next-generation data quality capabilities with SAP Data Intelligence Cloud. This newer offering centralizes access to data quality capabilities across on-premises and cloud environments. It supports data integration, governance, metadata management and data pipelines.
Several third-party applications also enhance SAP's core data quality offerings. Teams might need to consider these third-party tools to improve data quality, particularly when working with data outside the SAP platform.
SAS Data Quality
SAS has long reigned as a leading analytics tools provider. Work on its first analytics tools began in the mid-1960s, and those tools have continued to evolve as the company extended them to manage almost every aspect of the data preparation pipeline, including data quality.
Its core data quality offering is SAS Data Quality. It works in concert with a bevy of complementary tools, such as SAS Data Management, SAS Data Loader, SAS Data Governance and SAS Data Preparation. Its data quality tools are baked into the SAS Viya platform for AI, analytics and data management for the cloud at no additional cost. This helps streamline the data quality aspects of various data science workflows.
The company also offers SAS Quality Knowledge Base, which provides several data quality functions such as extraction, pattern analysis and standardization. Real-time quality enhancements included in the SAS Event Stream Processing service can perform various data cleansing tasks on data streams from IoT devices, operations or third-party sources.
Talend Data Fabric
Talend was founded in 2006 as an open source data integration company. The company developed an extensive library of tools for data integration, data preparation, application integration and master data management. Current freemium data quality offerings include Talend Open Studio for Data Quality and Talend Data Preparation Free Desktop. Other offerings include Talend Data Catalog and Talend Data Fabric.
Talend Data Catalog helps automate various aspects of data inventory and metadata management. Talend Data Fabric helps streamline data quality processes as part of automated data pipelines with support for data preparation, data enrichment and data quality monitoring. These tools are often used in conjunction with other data analytics and data science tools.