Many organizations face a growing sprawl of data across various databases and other repositories in on-premises systems, cloud services and IoT infrastructure. That makes data management more challenging, and BI and data analytics initiatives are less effective if data scientists, other data analysts and business users can't find relevant data and understand what it means. "Organizations are drowning in data yet starving for insights," said Priya Iragavarapu, vice president of the Center of Data Excellence at consulting firm AArete.
Data catalogs can provide a unified view of all the data assets in an enterprise. The idea of a catalog has been around since the early days of relational databases, when IT teams wanted to keep track of how data sets were linked, joined and transformed across SQL tables. Modern data catalog tools inventory data and collect metadata about it from a wider variety of data stores, including data lakes, data warehouses, NoSQL databases and cloud object storage.
They're also commonly integrated with data governance software to help organizations keep pace with changing regulatory compliance requirements and other aspects of governance programs. In addition, the tools are evolving to take advantage of natural language queries, machine learning and other AI functionality. In 2021, consulting firm Gartner replaced data catalog with augmented data cataloging and metadata management as the term used in its Hype Cycle reports on emerging technologies.
Regardless of what it's called, the data catalog market is growing quickly. Market research firm IDC forecasts that worldwide data catalog revenues will total $338 million this year, with a compound annual growth rate of 16.8% from 2020 to 2025.
Early data catalogs required custom scripts to crawl data and capture metadata. But newer tools can do that automatically and dynamically sense data attributes, types and profiles. Iragavarapu also recommended looking for data catalog software that supports user input, a business glossary and data visualization, among other capabilities. "A robust data catalog solution should not just merely show metadata but should also allow users to take actions from that insight," she said.
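To make the idea of a custom crawling script concrete, here is a minimal sketch of one in Python. It assumes a SQLite database stands in for the data source, and the function name crawl_metadata and the sample tables are hypothetical, invented only for illustration; it is not how any product in this list works.

```python
import sqlite3

def crawl_metadata(conn):
    """Return {table: [(column, declared_type), ...]} for every user table."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog

# Hypothetical sample schema, created in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
catalog = crawl_metadata(conn)
for table, columns in catalog.items():
    print(table, columns)
```

A script like this has to be rewritten for each type of data store, which is the maintenance burden that automated crawlers in newer tools are meant to remove.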
Here, in alphabetical order, are details on 16 popular data catalog tools that may be able to help your organization tame its metadata management challenges and make data more readily accessible and understandable to end users.
1. Alation Data Catalog
Alation was founded in 2012 and launched its first products in 2015. The company's flagship data catalog software uses AI, machine learning, automation and natural language processing techniques to simplify data discovery, automatically create business glossaries and power its core Behavioral Analysis Engine, which analyzes data usage patterns with an eye toward streamlining data stewardship, data governance and query optimization. The engine indexes various data sources and uses pattern recognition to generate popularity rankings, usage recommendations and other insights.
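As a rough illustration of how usage patterns can be turned into popularity rankings, the Python sketch below counts how often each table is referenced in a query log. It is not Alation's engine: the regular expression, the function name popularity_ranking and the sample queries are all assumptions, and a simple pattern like this would miss subqueries, quoted identifiers and other SQL constructs.

```python
import re
from collections import Counter

# Naive pattern: the identifier that follows FROM or JOIN (an assumption,
# not a real SQL parser).
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def popularity_ranking(query_log):
    """Rank tables by how many times they are referenced in the log."""
    counts = Counter()
    for sql in query_log:
        counts.update(name.lower() for name in TABLE_REF.findall(sql))
    return counts.most_common()

# Hypothetical query log.
query_log = [
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.cid = c.id",
    "select total from sales.orders",
    "SELECT name FROM sales.customers",
    "SELECT * FROM sales.orders",
]
ranking = popularity_ranking(query_log)
print(ranking)
```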
Alation, which also offers a data governance application, bills its overall combination of capabilities as a "data intelligence" platform. In that vein, Alation Data Catalog includes guided navigation and various collaboration features. For example, it can automatically identify data stewards or other subject matter experts to answer questions about data sets, and users can create wiki articles and searchable conversations and subscribe to get automatic notifications when data sets or articles are updated.
Other key features in the Alation tool include the following:
- the ability to flag data health issues and define enterprise data governance policies;
- pre-built connectors to various data sources, plus an Open Connector SDK for building custom ones; and
- a built-in SQL editor that can be used as an alternative to natural language search.
2. Alex Augmented Data Catalog
Alex Solutions is a newer data catalog and metadata management provider founded in 2016. The company architected its data catalog software to take advantage of AI and machine learning techniques. Alex Augmented Data Catalog helps automate the process of discovering data assets and then bringing them into a consolidated catalog. The tool supports various types of structured, semistructured and unstructured data. The company has also created a marketplace for metadata connectors that help capture and tune metadata for specific types of business requirements or industry needs.
In addition, Alex automates various aspects of data governance and data quality within the data catalog tool. For example, data governance managers can create policies, assign data stewards and keep track of data pipeline processes from a central console.
Alex Augmented Data Catalog also provides the following features:
- Google-like natural language search and query capabilities;
- a marketplace of plug-and-play metadata connectors to popular data sources; and
- built-in automation for populating and enriching metadata in data catalogs.
3. Ataccama Data Catalog
Ataccama, which was founded in 2008, offers a data catalog tool as a core component of Ataccama One, a consolidated platform that supports data governance and management functions automated through the use of AI. Ataccama Data Catalog can catalog data from databases, data lakes, file systems and other sources; it comes with connectors for a variety of popular on-premises and cloud data platforms.
The data catalog includes capabilities that help automate data discovery and change detection. The tool can also automate data quality assessments and detect and flag data anomalies, and it can be plugged into business process management workflows to automate data policy enforcement. It supports workflows spanning a diverse set of roles in organizations, including data stewards, data engineers, business users, data analysts and system owners.
Ataccama Data Catalog also includes the following features:
- a focus on data quality improvement through continuous quality monitoring and data cleansing;
- built-in data profiling, data classification, data lineage, relationship discovery and metadata management capabilities; and
- functions for configuring workflows, user permissions and custom metadata.
4. Atlan Data Catalog
Atlan is one of the newest data catalog vendors, having first hit the market with its tool in 2018. It positions the product as a third-generation data catalog that's built on design principles borrowed from GitHub, Slack and other end-user tools. In particular, Atlan Data Catalog is designed to support easy collaboration, with the ability to seamlessly integrate common data workflows.
For example, data teams can highlight issues that need to be addressed in an actionable way from within the data catalog tool. It supports contextual discussions in Slack chats that can take advantage of a reverse metadata feature, and individual users can create Jira requests to report issues while exploring data sets.
The software also includes the following features to help simplify integration with common data sources and data quality tools:
- open APIs that enable fully customizable ingestion of metadata;
- programmable bots to help automate tasks through custom machine learning and data science algorithms; and
- a plugin marketplace with connectors to various data tools and platforms.
5. AWS Glue Data Catalog
AWS Glue Data Catalog is the persistent metadata store in AWS Glue, a fully managed extract, transform and load (ETL) service offered by AWS. The data catalog enables data management teams to store, annotate and share metadata for use in ETL integration jobs when they create data warehouses or data lakes on the AWS cloud platform. It offers functionality similar to that of the metastore repository in Apache Hive, a popular open source data warehouse tool, and is compatible with it. In some cases, organizations can also integrate the AWS data catalog as an external metastore for Hive data.
Users can share access to AWS Glue Data Catalog across an organization using their AWS Identity and Access Management credentials. The data catalog tool can also help enforce data governance requirements by tracking changes to schemas and data access controls. In addition, it supports data processes that span different AWS services, including AWS Lake Formation, Amazon Athena, Amazon Redshift, Amazon EMR and more.
Other features offered by the AWS software include the following:
- the ability to write scripts to automatically crawl repositories and capture information on schemas and data types;
- improved visibility, control and governance of data assets across various AWS data services; and
- a settings page in the AWS Glue management console for changing permissions and other data catalog properties.
6. Boomi Data Catalog and Preparation
Boomi Data Catalog and Preparation is part of the company's AtomSphere Platform, a portfolio of tools that also supports data integration, master data management and other functions. It combines a data catalog with data preparation capabilities: Organizations can use the catalog to create a consolidated business glossary of metadata to track data sets, processing jobs and workflow schedules, then run a data prep recommendation engine to automatically cleanse, enrich, normalize and transform data.
The catalog tool includes connectors to more than 1,000 endpoints, including more than 200 applications. IT and data management teams can also create data pipelines to automate workflows for analytics, machine learning and AI processes, and a set of data governance and security features can be used to enhance controls across different applications and business processes.
Boomi Data Catalog and Preparation also includes the following capabilities:
- support for natural language queries and personalized searches;
- the ability to deploy and run the software in the cloud, on premises or in hybrid environments; and
- collaboration features, such as the ability to rate and comment on data and to ask data stewards for access to required data sets.
7. Collibra Data Catalog
Collibra started as a company in 2008 and offers a Data Intelligence Cloud platform that's centered on Collibra Data Catalog. Its data catalog capabilities support an extensive set of automated features for data discovery and classification using a proprietary machine learning algorithm; data curation, also powered by machine learning; and data lineage. The data catalog tool also supports graph-based metadata management techniques that help provide information on data quality and lineage to users.
Collibra Data Catalog includes pre-built integrations for ingesting metadata from various data stores, as well as commonly used business applications, BI platforms and data science tools. It also provides embedded data governance capabilities, guided data stewardship features and granular controls for enforcing data security and privacy protections, all in a single console.
In addition, the Collibra software offers the following features:
- a business glossary to standardize terminology, plus automated data governance workflows and dashboards;
- collaboration capabilities, including crowdsourced feedback on data assets through ratings, reviews and comments; and
- a "data shopping experience" that enables users to search for relevant data without requiring any SQL coding.
8. Data.world
Data.world is a cloud-native data catalog tool offered as a SaaS platform. The company, which was founded in 2015, touts its fast pace of releasing new features, with more than 1,000 individual product updates per year. It's known for a knowledge graph approach that provides a semantically organized view of enterprise data assets and their associated metadata across disparate systems. That's designed to make it easier for business and analytics users to find relevant data and understand its context.
In April 2022, Data.world added a new suite of data catalog functions powered by knowledge graphs to simplify the use of its platform. Called Eureka, the suite includes a set of automations to help deploy and manage data catalogs; an Action Center dashboard that provides metrics, alerts and recommendations; and an Answers feature that's meant to improve search results in catalogs. A Eureka Explorer function that creates a visual map of data assets and relationships is also planned.
Other notable features in the Data.world software include the following:
- collaboration capabilities to help streamline workflows and enable knowledge sharing between data producers and users;
- the ability to automatically organize, aggregate and present metadata in a format for easy use and sharing between collaborators; and
- support for both virtualized and federated access to data, with built-in data governance controls.
9. Erwin Data Catalog
The first Erwin software was created in 1983 for data modeling; over the years, the product line went through several acquisitions and is now owned by Quest Software. It also has evolved to support additional capabilities, including this data catalog tool that was developed as part of a broader platform launched in 2017 to support different aspects of data governance.
Erwin Data Catalog by Quest, as the software is formally known, automatically harvests, catalogs and curates metadata. It also includes components for data mapping, reference data management, data lifecycle management, data quality integration and other functions. Standard data connectors can ingest data from common databases, and optional ones can be added for streaming data, cloud applications, BI environments and other data sources. In addition, the data catalog software can be used together with companion data literacy and data quality tools in Erwin Data Intelligence, a suite that includes all three.
Erwin Data Catalog also provides the following features:
- a management dashboard that can be used to view and analyze data catalog attributes;
- an impact analysis function for assessing the potential effects of changes in a catalog; and
- end-to-end data lineage information that's automatically generated down to the column level and shows data flows and transformations.
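Impact analysis over column-level lineage can be thought of as a graph traversal. The Python sketch below is a generic illustration, not Erwin's implementation: the lineage dictionary, the column names and the function downstream_impact are hypothetical, and the search is a plain breadth-first walk from the changed column to everything derived from it.

```python
from collections import deque

def downstream_impact(lineage, changed_column):
    """List every column reachable downstream of changed_column."""
    impacted = []
    seen = {changed_column}
    queue = deque([changed_column])
    while queue:
        current = queue.popleft()
        for target in lineage.get(current, []):
            if target not in seen:
                seen.add(target)
                impacted.append(target)
                queue.append(target)
    return impacted

# Hypothetical lineage: each key feeds the columns in its list.
lineage = {
    "crm.customers.email": ["staging.customers.email"],
    "staging.customers.email": [
        "warehouse.dim_customer.email",
        "warehouse.dim_customer.email_domain",
    ],
    "warehouse.dim_customer.email_domain": ["reports.signups_by_domain.domain"],
}
impacted = downstream_impact(lineage, "crm.customers.email")
print(impacted)
```

The seen set keeps the walk from looping if the lineage graph contains cycles.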
10. Google Cloud Data Catalog
Google Cloud Data Catalog is a fully managed data discovery and metadata management service that works across cloud and on-premises data sources. Its UI is designed to enable both data professionals and business users to search a catalog through natural language queries and tag data at scale. The tool has built-in integrations with Google's BigQuery, Pub/Sub and Cloud Storage data services; it's also integrated with the company's Identity and Access Management and Cloud Data Loss Prevention services to support data security and compliance management as part of data governance initiatives.
The data catalog software is provided as a serverless service, which eliminates infrastructure setup and management aspects for users. In addition to the UI, it supports cataloging of data assets via custom APIs. The tool can store both technical metadata and business metadata, such as tags and templates for them, as well as file set schemas from the Cloud Storage service and custom metadata types.
The following features are also included in Google Cloud Data Catalog:
- automatic synchronization of technical metadata;
- support for automated tagging of sensitive data; and
- a unified view of data across both cloud and on-premises systems.
11. IBM Watson Knowledge Catalog
IBM Watson Knowledge Catalog is a metadata repository that was designed from the ground up to support AI, machine learning and other analytics workflows. It works with the company's underlying InfoSphere Information Governance Catalog to help organizations discover and govern data across cloud and on-premises sources. The Watson tool can catalog various data and analytics assets, including machine learning models and structured, unstructured and semistructured data types.
It supports intelligent cataloging and data discovery, which can be driven by automated search recommendations. The tool also features a self-service portal and automated data governance functions, including active policy management capabilities, role-based access control and dynamic masking of sensitive data. It can be deployed in the cloud, on premises or as a fully managed service on the IBM Cloud Pak for Data platform.
IBM Watson Knowledge Catalog also offers the following features:
- the ability to create a common business glossary as a foundation for data governance efforts;
- a set of more than 30 connectors to both IBM and external data sources; and
- tracking of data lineage, data quality scores and data governance workflow history.
12. Informatica Enterprise Data Catalog
Informatica, which was founded in 1993 to focus on data integration tools, has since expanded its product portfolio to provide a broad set of data management technologies, including this data catalog tool. Using an engine driven by machine learning algorithms, Informatica Enterprise Data Catalog can automatically scan, ingest and classify data from systems across an organization, as well as multi-cloud platforms, BI tools, ETL workflows and third-party metadata catalogs.
Automated data curation features also use AI and machine learning for domain discovery and to identify similarities between data sets and associate business terms with technical metadata. Data lineage capabilities track the movement of data through systems and data preparation and transformation pipelines, with the ability to do impact analysis on changes to data assets. Pre-built reports and dashboards can also be used to analyze data usage and enrichment, plus collaboration levels among users.
Other features provided by the Informatica data catalog tool include the following:
- data quality tracking capabilities to view data profiling statistics and data quality rules, scorecards and metrics;
- a Google-like semantic search function for finding relevant data sets in a catalog; and
- a knowledge graph that's designed to help users identify relationships between data assets.
13. Lumada DataOps Data Catalog
In 2017, Hitachi consolidated its data management, analytics and storage technologies into Hitachi Vantara, a new subsidiary. Lumada DataOps, the rebranded line of data management and analytics products offered by Hitachi Vantara, now includes this tool, which the company gained by acquiring data catalog vendor Waterline Data in 2020. The data catalog software extends metadata management capabilities to support mainstream databases, emerging IoT data infrastructure and other data sources.
It uses machine learning and AI to automatically populate data catalogs and apply tags to data. AI technology also drives self-service data discovery through a metadata-based search function designed to identify dark data that might be missed by manual tagging. To aid in data governance, the software can also automatically identify, tag and secure sensitive data and track metadata that's needed for regulatory compliance.
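Automated tagging of sensitive data is often explained with a rule-based baseline. The Python sketch below uses that simpler technique, regular expressions applied to sampled values, rather than the machine learning the vendor describes. The patterns, the 80% threshold, the function name tag_sensitive_columns and the sample data are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; production classifiers are far more thorough.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def tag_sensitive_columns(samples, min_share=0.8):
    """Tag a column when at least min_share of its non-null values match."""
    tags = {}
    for column, values in samples.items():
        present = [str(v) for v in values if v is not None]
        if not present:
            continue
        for tag, pattern in PATTERNS.items():
            hits = sum(1 for v in present if pattern.match(v))
            if hits / len(present) >= min_share:
                tags.setdefault(column, []).append(tag)
    return tags

# Hypothetical sampled values per column.
samples = {
    "contact": ["ana@example.com", "li@example.org", None],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "notes": ["called twice", "prefers email"],
}
tags = tag_sensitive_columns(samples)
print(tags)
```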
Lumada DataOps Data Catalog also provides the following features:
- a collaboration hub that enables teams to exchange insights through comments, data ratings and threaded conversations;
- data lineage capabilities, including the ability to find hidden links between data assets; and
- a related crowdsourcing function to help ensure that catalog users choose the best data for their needs.
14. Microsoft Purview Data Catalog
This tool is part of Microsoft Purview, a data governance, compliance and risk management cloud service introduced in April 2022, when the company rebranded and expanded an Azure Purview product line that had become available just seven months earlier. Officially, the data catalog software replaces Azure Data Catalog, an older technology.
Microsoft Purview Data Catalog provides an enterprise-level business glossary that eliminates the need to use Excel-based data dictionaries. Users can search the catalog for data in familiar business and technical terms and view interactive data lineage visualizations. The data catalog tool runs on top of Microsoft Purview Data Map, a companion metadata management product that collects metadata, configures it in a graph structure and handles data classification and labeling of sensitive data.
Other features provided by Microsoft Purview Data Catalog include the following:
- data curation capabilities, such as business glossary management functions and automated tagging of data assets with glossary terms;
- a cloud-based service for registering data sources and then storing and indexing their metadata; and
- the ability for catalog users to enrich the metadata by adding descriptions, tags and annotations.
15. Oracle Cloud Infrastructure Data Catalog
Oracle Cloud Infrastructure Data Catalog, or OCI Data Catalog for short, was designed to complement Oracle's own technology ecosystem. The metadata management cloud service creates an inventory of data assets and a business glossary for users. It can automatically harvest metadata from Oracle data stores and a set of other popular data sources in both cloud and on-premises systems, using either an on-demand or a schedule-based approach.
OCI Data Catalog also uses fuzzy matching algorithms and AI and machine learning techniques to help data stewards and other data experts curate and enrich metadata. The tool recommends links between the terms and categories in a business glossary and data entities and attributes to make it easier for catalog users to find relevant data.
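The general idea of fuzzy matching between glossary terms and data attributes can be sketched with Python's standard difflib module. This is not Oracle's algorithm: the similarity measure (SequenceMatcher.ratio), the 0.6 threshold, the function name recommend_links and the sample terms and column names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and replace underscores so names compare more fairly."""
    return name.lower().replace("_", " ").strip()

def recommend_links(glossary_terms, column_names, threshold=0.6):
    """Suggest the best-matching glossary term for each column, if any."""
    links = {}
    for column in column_names:
        scored = [
            (SequenceMatcher(None, normalize(column), normalize(term)).ratio(), term)
            for term in glossary_terms
        ]
        score, term = max(scored)
        if score >= threshold:
            links[column] = term
    return links

# Hypothetical glossary terms and technical column names.
links = recommend_links(
    ["Customer Name", "Order Date", "Postal Code"],
    ["cust_name", "order_dt", "qty"],
)
print(links)
```

In this sketch, a column with no sufficiently similar term, such as qty, is left unlinked for a data steward to review rather than being matched to the closest weak candidate.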
The Oracle data catalog software also includes the following features:
- data discovery capabilities that enable users to search for data by technical metadata names, business glossary terms and tags;
- integration with the Oracle Cloud Infrastructure Events service to distribute notifications about the status of metadata harvesting processes; and
- the ability to use the data catalog's metastore as a central metadata repository for data lakes in Oracle's OCI Data Flow service, which runs Apache Spark workloads.
16. OvalEdge
OvalEdge was founded in 2013 and provides a data catalog tool with consolidated data governance capabilities. The company touts its namesake software's ease of use and affordability, claiming its total cost of ownership is 50% lower on average vs. other data catalog tools. The OvalEdge tool crawls various databases, data lake platforms, BI and analytics systems, and custom applications to index metadata, then uses AI and machine learning algorithms to automatically organize and catalog data based on tags, usage statistics and other markers.
A data profiling function automatically generates statistical summaries of data sets, and data relationships can be marked by embedded algorithms or manual inputs. The integrated data governance capabilities support common business glossary terminology, data classification, data quality rules, data access controls and other measures.
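As a hedged illustration of what a statistical summary of a data set might contain, the Python sketch below profiles each column of a small list of records. It is not OvalEdge's profiling function: the chosen statistics, the function names profile_column and profile_rows, and the sample rows are assumptions.

```python
from statistics import mean

def profile_column(values):
    """Summarize one column: counts for all types, range and mean if numeric."""
    non_null = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Only add numeric statistics when every non-null value is a number.
    if numeric and len(numeric) == len(non_null):
        profile.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return profile

def profile_rows(rows):
    """Profile every column that appears in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {c: profile_column([row.get(c) for row in rows]) for c in columns}

# Hypothetical sample records.
rows = [
    {"order_id": 1, "total": 20.0, "region": "EU"},
    {"order_id": 2, "total": 35.5, "region": None},
    {"order_id": 3, "total": 20.0, "region": "US"},
]
summary = profile_rows(rows)
for column, stats in summary.items():
    print(column, stats)
```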
OvalEdge also includes the following features:
- a set of self-service tools designed for different groups of users;
- collaboration through a built-in chat function and the ability to send links with details about data via Slack or email; and
- alerts to notify end users about data changes, such as metadata modifications or an increase in the size of a data set.
Open source data catalog software
Organizations can also consider various open source data catalog tools. Many of them were developed by enterprises trying to build a more efficient and effective technology to help address their own data cataloging challenges. Some of the top open source options include the following tools:
- Amundsen. This data discovery and metadata engine was created by Lyft to help increase the productivity of data scientists and other users in its complex data infrastructure. The ride-sharing company released the tool as an open source technology in 2019.
- Apache Atlas. The Atlas software includes data catalog, metadata management and data governance features. It was started by former big data platform vendor Hortonworks, initially for use in Hadoop clusters, and was handed off to the Apache Software Foundation in 2015.
- DataHub. LinkedIn's data team created this metadata search and discovery tool to help internal users understand the context of data, rearchitecting and expanding on an earlier tool called WhereHows. DataHub became open source in 2020.
- Metacat. This federated metadata discovery and exploration tool was created by Netflix to simplify data discovery, data preparation and data science workflows in its big data environment. The technology was made open source in 2018.