Many organizations face a growing sprawl of data across various databases and other repositories in on-premises systems, cloud services and IoT infrastructure. That makes data management more challenging, and BI and data analytics initiatives are less effective if data scientists, other data analysts and business users can't find relevant data and understand what it means. "Organizations are drowning in data yet starving for insights," said Priya Iragavarapu, vice president of the Center of Data Excellence at consulting firm AArete.
Data catalogs can provide a unified view of all the data assets in an enterprise. The idea of a catalog has been around since the early days of relational databases, when IT teams wanted to keep track of how data sets were linked, joined and transformed across SQL tables. Modern data catalog tools inventory data and collect metadata about it from a wider variety of data stores, including data lakes, data warehouses, NoSQL databases, cloud object storage and more.
They're also commonly integrated with data governance software to help organizations keep pace with changing regulatory compliance requirements and other aspects of governance programs. In addition, the tools are evolving to take advantage of natural language queries, machine learning and other AI functionality. In recognition of that, consulting firm Gartner now calls them "augmented data cataloging and metadata management solutions."
In its 2022 report on emerging data management technologies, Gartner said cataloging tools are at the "early mainstream" level of maturity and estimated that they're currently being used by 5% to 20% of the potential user base. Gartner gave the tools a "High" rating for potential benefits to organizations but said it could take another two to five years for them to become fully mature.
Early data catalogs required custom scripts to crawl data and capture metadata. But newer tools can do that automatically and dynamically sense data attributes, types and profiles. Iragavarapu also recommended looking for data catalog software that supports user input, a business glossary and data visualization, among other capabilities. "A robust data catalog solution should not just merely show metadata but should also allow users to take actions from that insight," she said.
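To make the idea of a custom crawl script concrete, here is a minimal sketch of what one might look like, assuming a SQLite database as the data source. The function name `crawl_sqlite_metadata` and the demo tables are hypothetical and are not taken from any of the tools covered here; commercial catalogs automate this kind of step across many more source types.

```python
import sqlite3


def crawl_sqlite_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical in-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

catalog = crawl_sqlite_metadata(conn)
for table, columns in catalog.items():
    print(table, columns)
```

A script like this captures only technical metadata (table names, column names and declared types); business context such as glossary terms still has to be added by people or by the automated features described below.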
Here, in alphabetical order, are details on 18 popular data catalog tools that may be able to help your organization tame its metadata management challenges and make data more readily accessible and understandable to end users.
1. Alation Data Catalog
Alation was founded in 2012 and launched its first products in 2015. The company's flagship data catalog software uses AI, machine learning, automation and natural language processing techniques to simplify data discovery, automatically create business glossaries and power its core Behavioral Analysis Engine, which analyzes data usage patterns with an eye toward streamlining data stewardship, data governance and query optimization. The engine indexes various data sources and uses pattern recognition to generate popularity rankings, usage recommendations and other insights.
Alation, which also offers a data governance application, bills its overall combination of capabilities as a "data intelligence" platform. In that vein, Alation Data Catalog includes guided navigation and various collaboration features. For example, it can automatically identify data stewards or other subject matter experts to answer questions about data sets, and users can create wiki articles and searchable conversations. They can also subscribe to get automatic notifications when data sets or articles are updated.
Other key features in the Alation tool include the following:
- the ability to flag data health issues and define enterprise data governance policies;
- prebuilt connectors to various data sources, plus an Open Connector Framework SDK for building custom ones; and
- a built-in SQL editor that can be used as an alternative to natural language search.
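As a rough illustration of how usage patterns can be turned into popularity rankings, the sketch below counts how often each table is referenced in a query log. It is a simplified stand-in, not Alation's Behavioral Analysis Engine: the regular expression, the `rank_tables` function and the sample queries are all assumptions made for this example.

```python
import re
from collections import Counter

# Naive pattern for table names that follow FROM or JOIN (an assumption for this
# sketch; real SQL parsing is considerably more involved).
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def rank_tables(query_log):
    """Rank tables by the number of queries that reference them."""
    counts = Counter()
    for query in query_log:
        # Count each table at most once per query.
        tables = {name.lower() for name in TABLE_REF.findall(query)}
        counts.update(tables)
    # Most-referenced first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query log.
query_log = [
    "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM sales.orders",
    "select email from sales.customers",
    "SELECT * FROM sales.orders WHERE total > 100",
]

ranking = rank_tables(query_log)
for table, hits in ranking:
    print(table, hits)
```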
2. Alex Augmented Data Catalog
Alex Solutions is a newer data catalog and metadata management provider founded in 2016. The company architected its data catalog software to take advantage of AI and machine learning techniques. Alex Augmented Data Catalog helps automate the process of discovering data assets and then bringing them into a consolidated catalog, with support for various types of structured, semistructured and unstructured data. The tool also includes a set of collaboration features for things such as data sharing and curation.
In addition, Alex automates various aspects of data governance and data quality within the data catalog tool. For example, data governance managers can create policies, assign data stewards and keep track of data pipeline processes from a central console.
Alex Augmented Data Catalog also provides the following features:
- Google-like natural language search and query capabilities;
- a marketplace of plug-and-play metadata connectors to popular data sources; and
- built-in automation for populating and enriching metadata in data catalogs.
3. Ataccama Data Catalog
Ataccama, which was founded in 2008, offers a data catalog tool as a core component of Ataccama One, a consolidated platform that supports data governance and management functions automated through the use of AI. Ataccama Data Catalog can catalog data from databases, data lakes, file systems and other sources; it comes with connectors for a variety of popular on-premises and cloud data platforms.
The data catalog includes capabilities that help automate data discovery and change detection. The tool can also automate data quality assessments and detect and flag data anomalies, and it can be plugged into business process management workflows to automate data policy enforcement. It supports workflows spanning a diverse set of roles in organizations, including data stewards, data engineers, business users, data analysts and system owners.
Ataccama Data Catalog also includes the following features:
- a focus on data quality improvement through continuous quality monitoring and data cleansing;
- built-in data profiling, data classification, data lineage, relationship discovery and metadata management capabilities; and
- functions for configuring workflows, user permissions and custom metadata.
4. Atlan Data Discovery & Catalog
Atlan is one of the newest data catalog vendors, having first hit the market with its tool in 2018. It positions the product as a third-generation data catalog that's built on design principles borrowed from GitHub, Slack and other end-user tools. In particular, Atlan Data Discovery & Catalog is designed to support easy collaboration, with the ability to seamlessly integrate common data workflows.
For example, data teams can highlight issues that need to be addressed in an actionable way from within the data catalog tool. It supports contextual discussions in Slack chats that can take advantage of a reverse metadata feature, and individual users can create Jira requests to report issues while exploring data sets.
The software also includes the following features to help simplify integration with common data sources and data quality tools:
- open APIs that enable fully customizable ingestion of metadata;
- programmable bots to help automate tasks through custom machine learning and data science algorithms; and
- a plugin marketplace with connectors to various data tools and platforms.
5. AWS Glue Data Catalog
AWS Glue Data Catalog is the persistent metadata store in AWS Glue, a fully managed extract, transform and load (ETL) service offered by AWS. The data catalog enables data management teams to store, annotate and share metadata for use in ETL integration jobs when they create data warehouses or data lakes on the AWS cloud platform. It offers functionality similar to, and is compatible with, the metastore repository in Apache Hive, a popular open source data warehouse tool. In some cases, organizations can also integrate the AWS data catalog as an external metastore for Hive data.
Users can share access to AWS Glue Data Catalog across an organization using their AWS Identity and Access Management (IAM) credentials. The data catalog tool helps enforce data governance requirements by tracking changes to schemas and data access controls. In addition, it supports data processes that span different AWS services, including AWS Lake Formation, Amazon Athena, Amazon Redshift, Amazon EMR and more. AWS Glue Data Catalog can also be used to populate business data catalogs in Amazon DataZone, a separate data management service scheduled for a preview release in early 2023.
Other features offered by the AWS software include the following:
- the ability to write scripts to automatically crawl repositories and capture information on schemas and data types;
- improved visibility, control and governance of data assets across various AWS data services; and
- a settings page in the AWS Glue management console for changing permissions and other data catalog properties.
6. Boomi Data Catalog and Preparation
Boomi Data Catalog and Preparation is part of the company's AtomSphere Platform, a portfolio of tools that also supports data integration, master data management and other functions. It combines a data catalog with data preparation capabilities: Organizations can use the catalog to create a consolidated business glossary of metadata to track data sets, processing jobs and workflow schedules, then run a data prep recommendation engine to automatically cleanse, enrich, normalize and transform data.
The catalog tool includes connectors to more than 1,000 endpoints, including more than 200 applications. IT and data management teams can also create data pipelines to automate workflows for analytics, machine learning and AI processes, and a set of data governance and security features can be used to enhance controls across different applications and business processes.
Boomi Data Catalog and Preparation also includes the following capabilities:
- support for natural language queries and personalized searches;
- the ability to deploy and run the software in the cloud, on premises or in hybrid environments; and
- collaboration features, such as the ability to rate and comment on data and to ask data stewards for access to required data sets.
7. Collibra Data Catalog
Collibra started as a company in 2008 and offers a Data Intelligence Cloud platform that's centered on Collibra Data Catalog. Its data catalog capabilities support an extensive set of automated features for data discovery and classification using a proprietary machine learning algorithm; data curation, also powered by machine learning; and data lineage. The data catalog tool also supports graph-based metadata management techniques that help provide information on data quality and lineage to users.
Collibra Data Catalog includes prebuilt integrations for ingesting metadata from various data stores, as well as commonly used business applications, BI platforms and data science tools. It also provides embedded data governance capabilities, guided data stewardship features and granular controls for enforcing data security and privacy protections, all in a single console.
In addition, the Collibra software offers the following features:
- a business glossary to standardize terminology, plus automated data governance workflows and dashboards;
- collaboration capabilities, including crowdsourced feedback on data assets through ratings, reviews and comments; and
- a "data shopping experience" that enables users to search for relevant data without requiring any SQL coding.
8. Data.world
Data.world is a cloud-native data catalog tool offered as a SaaS platform by a vendor with the same name. The company, which was founded in 2015, claims that it releases more than 1,000 individual product updates per year. It's known for a knowledge graph approach that provides a semantically organized view of enterprise data assets and their associated metadata across disparate systems. That's designed to make it easier for business and analytics users to find relevant data and understand its context.
In 2022, Data.world added a new suite of data catalog functions powered by knowledge graphs to simplify the use of its platform. Called Eureka, the suite includes a set of automations to help deploy and manage data catalogs; an Action Center dashboard that provides metrics, alerts and recommendations; an Answers feature that's meant to improve search results in catalogs; and an Explorer Lineage function that creates a visual map of data assets and relationships.
Other notable features in the Data.world software include the following:
- collaboration capabilities to help streamline workflows and enable knowledge sharing between data producers and users;
- the ability to automatically organize, aggregate and present metadata in a format for easy use and sharing between collaborators; and
- support for both virtualized and federated access to data, with built-in data governance controls.
9. Erwin Data Catalog
The first Erwin software was created in 1983 for data modeling; over the years, the product line went through several acquisitions and is now owned by Quest Software. It also has evolved to support additional capabilities, including this data catalog tool that was developed as part of a broader platform launched in 2017 to support different aspects of data governance.
Erwin Data Catalog by Quest, as the software is formally known, automatically harvests, catalogs and curates metadata. It also includes components for data mapping, reference data management, data lifecycle management, data quality integration and other functions. Standard data connectors can ingest data from common databases, and optional ones can be added for streaming data, cloud applications, BI environments and more data sources. In addition, the data catalog software can be used together with companion data literacy and data quality tools in Erwin Data Intelligence, a suite that includes all three.
Erwin Data Catalog also provides the following features:
- a management dashboard that can be used to view and analyze data catalog attributes;
- an impact analysis function for assessing the potential effects of changes in a catalog; and
- end-to-end data lineage information that's automatically generated down to the column level and shows data flows and transformations.
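Impact analysis over column-level lineage can be thought of as a graph traversal: start at the column that is about to change and follow the data flows downstream. The sketch below shows that general idea with a breadth-first search. The lineage map, column names and `downstream_impact` function are hypothetical and do not reflect how Erwin Data Catalog stores or queries lineage.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed as its value.
LINEAGE = {
    "crm.customers.email": ["staging.customers.email"],
    "staging.customers.email": ["warehouse.dim_customer.email"],
    "warehouse.dim_customer.email": [
        "reports.churn.contact_email",
        "reports.campaign.recipient",
    ],
    "erp.orders.total": ["warehouse.fact_orders.total"],
}


def downstream_impact(lineage, changed_column):
    """Return every column downstream of changed_column, in breadth-first order."""
    seen = {changed_column}  # guards against cycles in the lineage graph
    impacted = []
    queue = deque([changed_column])
    while queue:
        current = queue.popleft()
        for target in lineage.get(current, []):
            if target not in seen:
                seen.add(target)
                impacted.append(target)
                queue.append(target)
    return impacted


impacted = downstream_impact(LINEAGE, "crm.customers.email")
for column in impacted:
    print(column)
```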
10. Google Cloud Data Catalog
Google Cloud Data Catalog is a fully managed data discovery and metadata management service that works across cloud and on-premises data sources. It's designed to enable both data professionals and business users to search a catalog through natural language queries and tag data at scale. The tool has built-in integrations with Google's BigQuery, Pub/Sub, Dataproc Metastore and Cloud Storage data services. It's also integrated with the company's IAM and Cloud Data Loss Prevention services to support data security and compliance management as part of data governance initiatives.
The data catalog software is provided as a serverless service, which eliminates infrastructure setup and management aspects for users. It supports cataloging of data assets and access to other functionality via the UI in Google's Dataplex data fabric environment, a CLI and sets of APIs and client libraries. The tool can store both technical metadata and business metadata, such as tags and tag templates. File set schemas from the Cloud Storage service and custom metadata types can be stored as well.
The following features are also included in Google Cloud Data Catalog:
- automatic synchronization of technical metadata;
- support for automated tagging of sensitive data; and
- a unified view of data across both cloud and on-premises systems.
11. IBM Watson Knowledge Catalog
IBM Watson Knowledge Catalog is a metadata repository that was designed from the ground up to support AI, machine learning and other analytics workflows. It works with the company's underlying InfoSphere Information Governance Catalog to help organizations discover and govern data across cloud and on-premises sources.
The Watson tool can catalog various data and analytics assets, including machine learning models and structured, unstructured and semistructured data types. It supports intelligent cataloging and data discovery, which can be driven by automated search recommendations. The tool also features a self-service portal and automated data governance functions, including active policy management capabilities, role-based access control and dynamic masking of sensitive data. It can be deployed in the cloud, on premises or as a fully managed service on the IBM Cloud Pak for Data platform.
IBM Watson Knowledge Catalog also offers the following features:
- the ability to create a common business glossary as a foundation for data governance efforts;
- a set of more than 30 connectors to both IBM and external data sources; and
- tracking of data lineage, data quality scores and data governance workflow history.
12. Informatica Enterprise Data Catalog
Informatica, which was founded in 1993 to focus on data integration tools, has since expanded its product portfolio to provide a broad set of data management technologies, including this data catalog tool. Using an engine driven by machine learning algorithms, Informatica Enterprise Data Catalog can automatically scan, ingest and classify data from systems across an organization, as well as multi-cloud platforms, BI tools, ETL workflows and third-party metadata catalogs.
Automated data curation features also use AI and machine learning for domain discovery and to identify similarities between data sets and associate business terms with technical metadata. Data lineage capabilities track the movement of data through systems and data preparation and transformation pipelines, with the ability to do impact analysis on changes to data assets. Prebuilt reports and dashboards can also be used to analyze data usage and enrichment, plus collaboration levels among users.
Other features provided by the Informatica data catalog tool include the following:
- data quality tracking capabilities to view data profiling statistics and data quality rules, scorecards and metrics;
- a Google-like semantic search function for finding relevant data sets in a catalog; and
- a knowledge graph that's designed to help users identify relationships between data assets.
13. Lumada Data Catalog
In 2017, Hitachi consolidated its data management, analytics and storage technologies into Hitachi Vantara, a new subsidiary. Lumada DataOps, the rebranded line of data management and analytics products offered by Hitachi Vantara, includes the data catalog tool originally developed by Waterline Data, which the Vantara unit acquired in 2020. The software extends metadata management capabilities to support mainstream databases, emerging IoT data infrastructure and other data sources.
Lumada Data Catalog uses machine learning and AI to automatically populate data catalogs and apply tags to data. AI technology also drives self-service data discovery through a metadata-based search function designed to identify dark data that might be missed by manual tagging. To aid in data governance, the software can also automatically identify, tag and secure sensitive data and track metadata that's needed for regulatory compliance.
Lumada Data Catalog also provides the following features:
- a collaboration hub that enables teams to exchange insights through comments, data ratings and threaded conversations;
- data lineage capabilities, including the ability to find hidden links between data assets; and
- a related crowdsourcing function to help ensure that catalog users choose the best data for their needs.
14. Microsoft Purview Data Catalog
This tool is part of Microsoft Purview, a data governance, compliance and risk management cloud service introduced in April 2022, when the company rebranded and expanded an Azure Purview product line that became available just seven months earlier. The data catalog software officially replaces Azure Data Catalog, an older Microsoft technology.
Microsoft Purview Data Catalog provides an enterprise-level business glossary that eliminates the need to use Excel-based data dictionaries. Users can search the catalog for data in familiar business and technical terms and view interactive data lineage visualizations. The data catalog tool runs on top of Microsoft Purview Data Map, a companion metadata management product that collects metadata, configures it in a graph structure and handles data classification and labeling of sensitive data.
Other features provided by Microsoft Purview Data Catalog include the following:
- data curation capabilities, such as business glossary management functions and automated tagging of data assets with glossary terms;
- a cloud-based service for registering data sources and then storing and indexing their metadata; and
- the ability for catalog users to enrich the metadata by adding descriptions, tags and annotations.
15. Oracle Cloud Infrastructure Data Catalog
Oracle Cloud Infrastructure Data Catalog, or OCI Data Catalog for short, was designed to complement Oracle's own technology ecosystem. The metadata management cloud service creates an inventory of data assets and a business glossary for users. It can automatically harvest metadata from Oracle data stores and a set of other popular data sources in both cloud and on-premises systems, using either an on-demand or a schedule-based approach.
OCI Data Catalog also uses fuzzy matching algorithms and AI and machine learning techniques to help data stewards and other data experts curate and enrich metadata. The tool recommends links between the terms and categories in a business glossary and data entities and attributes to make it easier for catalog users to find relevant data.
The Oracle data catalog software also includes the following features:
- data discovery capabilities that enable users to search for data by technical metadata names, business glossary terms and tags;
- integration with the Oracle Cloud Infrastructure Events service to distribute notifications about the status of metadata harvesting processes; and
- the ability to use the data catalog's metastore as a central metadata repository for data lakes in Oracle's OCI Data Flow service, which runs Apache Spark workloads.
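To give a sense of how fuzzy matching can suggest links between glossary terms and data attributes, the following sketch uses Python's standard `difflib` module to propose a business term for each column name. It is an illustration of the general technique only; the glossary terms, column names, similarity cutoff and `suggest_term` function are assumptions and are not drawn from OCI Data Catalog.

```python
import difflib

# Hypothetical business glossary terms.
GLOSSARY_TERMS = ["customer email", "order total", "signup date"]


def suggest_term(column_name, terms, cutoff=0.6):
    """Suggest the closest glossary term for a column name, or None if nothing is close."""
    # Normalize snake_case column names so they resemble glossary phrasing.
    normalized = column_name.lower().replace("_", " ")
    matches = difflib.get_close_matches(normalized, terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical technical column names harvested from a data source.
columns = ["cust_email", "order_ttl", "xyz_flag"]
suggestions = {col: suggest_term(col, GLOSSARY_TERMS) for col in columns}
for col, term in suggestions.items():
    print(col, "->", term)
```

In practice, suggestions like these would typically be presented to a data steward for confirmation rather than applied automatically.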
16. OvalEdge
OvalEdge was founded in 2013 and provides a data catalog tool with consolidated data governance capabilities. The company touts its namesake software's ease of use and affordability, claiming its total cost of ownership is 50% lower on average vs. other data catalog tools. The OvalEdge tool crawls various databases, data lake platforms, BI and analytics systems, and custom applications to index metadata, then uses AI and machine learning algorithms to automatically organize and catalog data based on tags, usage statistics and other markers.
A data profiling function automatically generates statistical summaries of data sets, and data relationships can be marked by embedded algorithms or manual inputs. The integrated data governance capabilities support common business glossary terminology, data classification, data quality rules, data access controls and other measures.
OvalEdge also includes the following features:
- a set of self-service tools designed for different groups of users;
- collaboration through a built-in chat function and the ability to send links with details about data via Slack or email; and
- alerts to notify end users about data changes, such as metadata modifications or an increase in the size of a data set.
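A data profiling function of the kind described above generally boils down to computing summary statistics per column. The sketch below shows one simple way to do that with Python's standard library. The `profile_column` function, the chosen statistics and the sample rows are hypothetical and are not OvalEdge's implementation.

```python
import statistics


def profile_column(values):
    """Summarize one column: row count, null count, distinct count and numeric stats."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Add min, max and mean only when every non-null value is numeric.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        profile.update(
            min=min(numeric), max=max(numeric), mean=statistics.mean(numeric)
        )
    return profile


# Hypothetical data set.
rows = [
    {"order_id": 1, "total": 20.0, "region": "east"},
    {"order_id": 2, "total": 35.5, "region": "west"},
    {"order_id": 3, "total": None, "region": "east"},
    {"order_id": 4, "total": 20.0, "region": None},
]

profiles = {col: profile_column([row[col] for row in rows]) for col in rows[0]}
for col, profile in profiles.items():
    print(col, profile)
```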
17. Qlik Catalog
Qlik was founded in 1993 as a BI and analytics vendor. In recent years, it added various data management capabilities through a series of acquisitions, including the 2018 purchase of Podium Data, a startup vendor that offered data preparation, data quality and data catalog functionality. Qlik has consolidated the data management technologies into Qlik Data Integration, a platform that includes Qlik Catalog and several other tools designed to support reliable data delivery for analytics uses.
Qlik Catalog provides a repository for accessing data from across the organization and smart data cataloging features to help users find data and incorporate it into BI and analytics workflows. It also includes data governance functions to help enterprises maintain compliance with data privacy laws and internal usage policies as they launch self-service BI models for business users. In addition, the software can aid teams in assessing the utility of different data sources for new analytics applications.
The following features are also built into Qlik Catalog:
- a browser-based GUI to ease access to the tool's functions and its metadata repository and services;
- metadata management for raw data and subsequent data transformations, with the ability to exchange the metadata with other data catalogs and applications; and
- the ability to create and apply business rules as data is ingested -- for example, to automatically protect personally identifiable information, find duplicate data or identify changes in data quality levels.
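Business rules applied at ingestion time can be pictured as small functions that each record passes through before it is stored. The sketch below applies two such rules: it sets aside exact duplicate records and masks email addresses as a simple example of protecting personally identifiable information. The regular expression, function names and sample records are assumptions for illustration and do not describe Qlik Catalog's rule engine.

```python
import re

# Simplified email pattern (an assumption; it will not cover every valid address).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def ingest(records):
    """Apply ingestion rules; return (accepted_records, duplicate_records)."""
    seen = set()
    accepted, duplicates = [], []
    for record in records:
        # Rule 1: detect exact duplicates on the raw record, before masking.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates.append(record)
            continue
        seen.add(key)
        # Rule 2: mask email addresses in every string field.
        accepted.append({
            field: mask_emails(value) if isinstance(value, str) else value
            for field, value in record.items()
        })
    return accepted, duplicates


# Hypothetical incoming records.
records = [
    {"id": 1, "note": "Contact jane.doe@example.com for details"},
    {"id": 2, "note": "No contact info"},
    {"id": 1, "note": "Contact jane.doe@example.com for details"},
]

accepted, duplicates = ingest(records)
print(accepted)
print(len(duplicates), "duplicate record(s) set aside")
```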
18. Tableau Catalog
Tableau pioneered the field of self-service BI and interactive data analysis after it was founded in 2003. Like Qlik, it expanded into data management technologies before being acquired by Salesforce in 2019. Tableau Catalog is part of Tableau Data Management, an add-on module for Tableau's analytics platform. The catalog tool is designed to help build trust in data and improve data discovery in organizations with Tableau installations.
Tableau Catalog automatically ingests information about Tableau data sets into a centralized repository. The tool also includes data lineage and impact analysis features that can help Tableau teams better understand data relationships and how changes to data sets or pipelines will affect analytics processes. It also supports features like data quality warnings and contextual metadata to give business users the information they need to validate data sets for analytics uses.
Other features in Tableau Catalog include the following:
- a set of APIs to ingest metadata from other applications for analysis in Tableau;
- integration with enterprise data catalogs through Tableau APIs or prebuilt connections from other catalog vendors; and
- the ability to alert end users directly in analytics results when data quality changes.
Open source data catalog software
Organizations can also consider various open source data catalog tools. Many of them were developed by enterprises trying to build a more efficient and effective technology to help address their own data cataloging challenges. Some of the top open source options include the following tools:
- Amundsen. This data discovery and metadata engine was created by Lyft to help increase the productivity of data scientists and other users in its complex data infrastructure. The ride-sharing company released the tool as an open source technology in 2019.
- Apache Atlas. The Atlas software includes data catalog, metadata management and data governance features. It was started by former big data platform vendor Hortonworks, initially for use in Hadoop clusters, and was handed off to the Apache Software Foundation in 2015.
- DataHub. LinkedIn's data team created this metadata search and discovery tool to help internal users understand the context of data, rearchitecting and expanding on an earlier tool called WhereHows. DataHub became open source in 2020.
- Metacat. This federated metadata discovery and exploration tool was created by Netflix to simplify data discovery, data preparation and data science workflows in its big data environment. The technology was made open source in 2018.