How to build a data catalog: 10 key steps
A data catalog helps business and analytics users explore data assets, find relevant data and understand what it means. Here are 10 important steps for building one.
Building a data catalog is an important initiative for many IT and data management teams, often done in conjunction with data governance and metadata management programs. But organizations shouldn't undertake a data catalog project without getting input from business users and planning a series of other steps that should be part of the process.
Those steps are detailed below. First, though, let's define what a data catalog is, outline the key features that catalogs provide and explain why they've become a core component of modern data management environments.
What is a data catalog?
In short, a data catalog is a reference application that enables business users, data scientists, BI analysts, data stewards and other workers to explore data sets, understand their contents, and collaborate and share knowledge about data assets with one another. Ideally, it helps them become more self-sufficient at finding and accessing relevant data to use in operational and analytics applications.
A data catalog collects metadata from databases, data warehouses, data lakes, BI systems and other sources and uses it to create a searchable inventory of data assets. It also provides a single point of reference for enterprise metadata management, which it can handle faster and more effectively than older types of metadata management systems. Many organizations also supplement their data catalogs with other metadata tools -- in particular, business glossaries and data dictionaries that provide additional information to help users understand data and its business context.
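To make the idea of a searchable inventory concrete, here is a minimal sketch in Python. The `CatalogEntry` fields, the sample assets and the `search` function are illustrative assumptions, not the design of any particular catalog product, which would typically index far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in the inventory (all fields are illustrative)."""
    name: str
    source: str
    description: str
    tags: list[str] = field(default_factory=list)


def search(entries: list[CatalogEntry], keyword: str) -> list[CatalogEntry]:
    """Return entries whose name, description or tags mention the keyword."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or any(needle in t.lower() for t in e.tags)
    ]


# Hypothetical assets harvested from three different kinds of sources.
inventory = [
    CatalogEntry("crm.customers", "CRM database", "Customer master records", ["customer", "pii"]),
    CatalogEntry("dw.sales_fact", "Data warehouse", "Daily sales by store", ["sales"]),
    CatalogEntry("lake.web_clicks", "Data lake", "Raw clickstream events", ["web", "customer behavior"]),
]

matches = search(inventory, "customer")
print([e.name for e in matches])  # ['crm.customers', 'lake.web_clicks']
```

The point of the sketch is that users search the metadata, not the underlying data, so assets in different systems show up in one result list.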
Why are data catalogs important?
Without a data catalog, useful data may be hidden from end users. As organizations collect more and more data, it commonly sprawls across various data stores. If business and analytics users can't find relevant data, business operations and analytics initiatives will be less effective. That's a big problem when organizations are increasingly looking to make data-driven business decisions.
Data catalogs help eliminate that problem by providing a unified view of data assets with built-in search and data discovery functions. In addition, they can automate aspects of the data management process -- for example, commercial data catalog tools from various vendors use AI and machine learning technology to create data profiles, check data quality, curate data sets and handle other tasks. Mechanisms for enforcing data governance policies and data security and privacy controls can also be embedded in data catalogs to help ensure that data is protected and used properly.
Key steps to build a data catalog
With that as background information, these are the 10 main steps to take in planning and building a data catalog for your organization.
1. Document metadata management's value to data governance
All effective data governance programs are supported by both business and technical metadata management. Metadata gives context to the contents of data sets and provides information that makes the data usable and understandable across an organization. Properly managing metadata helps organizations govern their data to improve data quality and increase operational effectiveness, through the implementation of enterprise data policies, practices and standards. Documenting those expected benefits can be part of your business case for a data catalog.
2. Identify data stewardship uses for the different metadata tools
Although the terms data catalog, business glossary and data dictionary are sometimes used interchangeably, they're not the same thing. A business glossary defines the business terms used across an organization, providing an authoritative source for understanding them. A data dictionary provides technical information about data, which can include attribute properties such as data type, length, valid values, default values, relationships with other data fields, data transformation rules, business rules and constraints. Dictionaries support the use of physical metadata that contains details about where data resides and how it's stored. Business glossaries are focused on the business aspects of data stewardship, while data dictionaries are the domain of technical data stewards. A data catalog can be used by both business and technical stewards, since it incorporates aspects of the other two tools.
3. Design a subject area model for your data
An effective data catalog follows the business use of the data, not simply the technical implementation of systems. A subject area model (SAM), which defines different subject areas for an organization's data and the business concepts that are contained in them, shows business users the location of data unconstrained by applications, files or databases. The SAM will serve as the underlying basis of your data architecture, and both the data catalog and business glossary should be based on it.
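A subject area model can be pictured as two small mappings: subject areas to business concepts, and business concepts to the physical assets that hold them. The sketch below assumes hypothetical subject areas, concepts and asset names purely for illustration.

```python
# Hypothetical subject area model: subject area -> business concepts.
subject_area_model = {
    "Customer": ["Customer", "Customer Address"],
    "Sales": ["Order", "Invoice"],
}

# Hypothetical mapping of each business concept to the physical assets holding it.
concept_locations = {
    "Customer": ["crm.customers", "billing.accounts"],
    "Customer Address": ["crm.addresses"],
    "Order": ["erp.orders"],
    "Invoice": ["billing.invoices"],
}


def assets_for_subject_area(area: str) -> list[str]:
    """List every physical asset that holds data for a subject area."""
    assets = []
    for concept in subject_area_model.get(area, []):
        assets.extend(concept_locations.get(concept, []))
    return sorted(set(assets))


print(assets_for_subject_area("Customer"))
# ['billing.accounts', 'crm.addresses', 'crm.customers']
```

A business user asks for "Customer" data and gets back locations in both the CRM and billing systems, without needing to know which application owns which table.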
4. Build a business glossary
Members of the data governance team and business data stewards should collaborate to design the business glossary and then populate it. An organization should have one enterprise business glossary, not a glossary for each functional area or -- even worse -- application. A robust business glossary for the entire enterprise is an essential component of effective data stewardship and business metadata management, and it can provide content for use in the data catalog. Business data stewards need to be involved in creating the glossary because they best understand their subject area's data and its associated business metadata.
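One practical obstacle to a single enterprise glossary is that functional areas often define the same term differently. The following sketch, which assumes hypothetical departmental glossaries stored as plain dictionaries, merges them and surfaces the conflicting terms that business data stewards would need to resolve.

```python
from collections import defaultdict


def merge_glossaries(glossaries: dict[str, dict[str, str]]):
    """Merge per-department glossaries into one and report conflicting terms.

    Returns (enterprise_glossary, conflicts). A term is a conflict when
    departments define it differently; stewards must resolve those by hand.
    """
    definitions = defaultdict(dict)  # term -> {department: definition}
    for department, glossary in glossaries.items():
        for term, definition in glossary.items():
            definitions[term.lower()][department] = definition.strip()

    enterprise, conflicts = {}, {}
    for term, by_department in definitions.items():
        distinct = set(by_department.values())
        if len(distinct) == 1:
            enterprise[term] = distinct.pop()
        else:
            conflicts[term] = by_department
    return enterprise, conflicts


# Hypothetical departmental glossaries.
departmental = {
    "sales": {
        "Customer": "A party that has placed at least one order.",
        "Order": "A confirmed request to purchase goods.",
    },
    "support": {"Customer": "Any party with an active support contract."},
}

enterprise, conflicts = merge_glossaries(departmental)
print(sorted(enterprise))  # ['order']
print(sorted(conflicts))   # ['customer']
```

The merge deliberately refuses to pick a winner for "customer"; choosing the authoritative definition is a stewardship decision, not something to automate.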
5. Build a data dictionary
The data dictionary should contain descriptions and mappings of every data table or file and all their metadata entities. It then becomes the basis for pulling the metadata into the data catalog. Again, the business data stewards are essential here, since they will provide guidance on the business metadata to be used in the data catalog -- by source, concept and subject area.
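As a rough illustration, a data dictionary entry can be modeled as a record of technical properties plus a link back to the business glossary. The `DictionaryEntry` fields and the `violations` helper below are assumptions made for this sketch; a real dictionary would carry more properties, such as defaults and transformation rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """Technical metadata for one column (field names are illustrative)."""
    table: str
    column: str
    data_type: type
    max_length: Optional[int] = None
    valid_values: Optional[frozenset] = None
    glossary_term: Optional[str] = None  # link back to the business glossary


def violations(entry: DictionaryEntry, value) -> list[str]:
    """Check a value against the entry's declared type, length and valid values."""
    if not isinstance(value, entry.data_type):
        return [f"expected {entry.data_type.__name__}"]
    problems = []
    if entry.max_length is not None and len(value) > entry.max_length:
        problems.append(f"longer than {entry.max_length}")
    if entry.valid_values is not None and value not in entry.valid_values:
        problems.append("not an allowed value")
    return problems


# Hypothetical entry for a status column.
status = DictionaryEntry(
    table="crm.customers",
    column="status",
    data_type=str,
    max_length=8,
    valid_values=frozenset({"active", "inactive"}),
    glossary_term="Customer Status",
)

print(violations(status, "active"))     # []
print(violations(status, "suspended"))  # ['longer than 8', 'not an allowed value']
print(violations(status, 42))           # ['expected str']
```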
6. Discover metadata from databases and other data sources
Data catalogs use metadata to identify data tables and files for users. A catalog searches the company's databases and other data repositories and loads the associated metadata into its inventory of data assets. Before an organization begins building a data catalog, metadata sources must be identified and recorded. This is a major step and, like the previous two, requires that the organization have a solid data stewardship program. In this case, business data stewards are needed to provide insight on the correct data sources to use.
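Metadata discovery usually means querying each source's own system catalog. The sketch below uses SQLite, via Python's standard `sqlite3` module, only because it is self-contained; the `discover_metadata` function and the sample tables are assumptions for illustration, and other databases expose equivalent information through their own system views.

```python
import sqlite3


def discover_metadata(conn: sqlite3.Connection) -> dict[str, list[dict]]:
    """Return {table_name: [column metadata, ...]} for every user table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    metadata = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            {"column": name, "type": col_type,
             "nullable": not notnull, "primary_key": bool(pk)}
            for _cid, name, col_type, notnull, _default, pk in columns
        ]
    return metadata


# Hypothetical source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL, region TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

discovered = discover_metadata(conn)
for table, columns in discovered.items():
    print(table, [c["column"] for c in columns])
```

Only structural metadata is read here; which of the discovered tables are the correct sources to expose in the catalog remains a question for the business data stewards.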
7. Profile the data to provide statistics for users
Data profiles are informative summaries that describe the contents of data sets to the users of a data catalog. For example, the profile of a database often includes the number of tables and files it contains and their row counts. In a business glossary, the data profiling would be focused on business metadata and its use across the organization by business data stewards and users.
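A basic table profile can be computed with a few aggregate queries. The sketch below again assumes SQLite for self-containment; `profile_table` and the sample data are hypothetical, and the statistics shown (row count, nulls and distinct values per column) are just one reasonable starting set.

```python
import sqlite3


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Compute the row count plus per-column null and distinct counts."""
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"table": table, "row_count": row_count, "columns": {}}
    for column in columns:
        nulls, distinct = conn.execute(
            f'SELECT SUM("{column}" IS NULL), COUNT(DISTINCT "{column}") FROM "{table}"'
        ).fetchone()
        # SUM returns NULL for an empty table, so fall back to 0.
        profile["columns"][column] = {"nulls": nulls or 0, "distinct": distinct}
    return profile


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", "EU"), (2, "b@example.com", None), (3, "c@example.com", "EU")],
)

profile = profile_table(conn, "customers")
print(profile["row_count"])          # 3
print(profile["columns"]["region"])  # {'nulls': 1, 'distinct': 1}
```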
8. Identify relationships among data sources
Discover related data across multiple data stores and build that information into the data catalog so users can understand the relationships. For example, a data analyst may need consolidated customer data for an analytics application. Through the data catalog and the data dictionary, the analyst may find that five files in five different systems contain relevant data.
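One simple way to surface candidate relationships is to look for column names that appear in more than one asset. The sketch below assumes hypothetical column inventories for five systems; matching on names alone is only a heuristic, so the results are candidates for a steward to confirm, not verified joins.

```python
from itertools import combinations

# Hypothetical column inventories for five systems.
sources = {
    "crm.customers": {"customer_id", "email", "region"},
    "erp.orders": {"order_id", "customer_id", "total"},
    "billing.invoices": {"invoice_id", "customer_id", "amount"},
    "support.tickets": {"ticket_id", "email", "status"},
    "web.sessions": {"session_id", "page", "duration"},
}


def shared_columns(sources: dict[str, set[str]]) -> dict[tuple[str, str], list[str]]:
    """Map each pair of assets to the column names they have in common."""
    links = {}
    for (a, cols_a), (b, cols_b) in combinations(sorted(sources.items()), 2):
        common = cols_a & cols_b
        if common:
            links[(a, b)] = sorted(common)
    return links


links = shared_columns(sources)
for pair, columns in links.items():
    print(pair, columns)
```

In this sample, four of the five assets are linked through `customer_id` or `email`, while `web.sessions` shares no column names and would need another matching method.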
9. Capture information on data lineage
Extract, transform and load (ETL) tools are used to extract data from source systems, transform and cleanse it, and load it into a target data repository. In building a data catalog, the metadata collected during the ETL process includes data lineage documentation that tracks where data originated, how it flows through systems and other information. Data lineage helps business users understand the data assets in a catalog and enables data stewards and analysts to trace data errors back to their root cause in source systems by examining the data flow.
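Tracing an error back to its source amounts to walking a lineage graph upstream. In the sketch below, the `lineage` mapping (each target asset to its direct sources) and the asset names are hypothetical stand-ins for what ETL jobs would record.

```python
# Hypothetical lineage recorded by ETL jobs: target -> direct sources.
lineage = {
    "reports.churn": ["dw.customer_dim", "dw.sales_fact"],
    "dw.customer_dim": ["staging.customers_clean"],
    "staging.customers_clean": ["crm.customers", "billing.accounts"],
    "dw.sales_fact": ["erp.orders"],
}


def upstream_sources(asset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return every asset that feeds `asset`, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def root_sources(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Upstream assets with no recorded parents: the original source systems."""
    return sorted(a for a in upstream_sources(asset, lineage) if a not in lineage)


print(root_sources("reports.churn", lineage))
# ['billing.accounts', 'crm.customers', 'erp.orders']
```

If the churn report shows bad customer counts, the trace narrows the investigation to three source systems and the intermediate steps between them.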
10. Organize the catalog for use by data consumers
Most databases and file systems are designed for use by IT. Data catalogs and business glossaries should be designed for data consumers -- such as business users and data analysts -- as much as for technologists. Again, their structure should be based on the subject area model that you designed earlier in the process. In addition, these tools should be accessible via PCs, tablets and smartphones. A data dictionary, by comparison, can be organized by functional area and application, given the technical nature of its content.
Best practices for building a data catalog
Building a data catalog, as well as a business glossary and a data dictionary, and then using them to collect, organize and curate metadata are tasks that should involve teams from both IT and the business. Doing so will ensure that the metadata focuses on the needs of business users and that it's managed consistently across the enterprise.
The following are some other data catalog best practices that organizations should keep in mind:
- Incorporate user permissions, usage monitoring, tagging of sensitive data and other data security and privacy protections.
- Enable collaboration through features such as the ability to rate and comment on data and chat with other catalog users.
- Develop a training program for end users to make sure they're familiar with the data catalog and can use it effectively.
- Create a process to keep the catalog up to date as data assets and business requirements change.
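The first practice in the list above, tagging sensitive data and applying user permissions, can be sketched as follows. The name patterns, tag names and clearance model are assumptions for illustration; commercial catalogs typically inspect the data itself and do not rely on column names alone.

```python
import re

# Hypothetical name patterns for two sensitivity tags.
SENSITIVE_PATTERNS = {
    "pii": re.compile(r"(email|phone|ssn|birth)", re.IGNORECASE),
    "financial": re.compile(r"(salary|card_number|iban)", re.IGNORECASE),
}


def tag_columns(columns: list[str]) -> dict[str, list[str]]:
    """Return {column: sorted sensitivity tags} for flagged columns only."""
    tagged = {}
    for column in columns:
        tags = sorted(
            tag for tag, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(column)
        )
        if tags:
            tagged[column] = tags
    return tagged


def visible_columns(columns: list[str], user_clearances: set[str]) -> list[str]:
    """Hide any column carrying a tag the user is not cleared for."""
    tagged = tag_columns(columns)
    return [c for c in columns if set(tagged.get(c, [])) <= user_clearances]


columns = ["customer_id", "Email", "birth_date", "salary", "region"]
print(tag_columns(columns))
# {'Email': ['pii'], 'birth_date': ['pii'], 'salary': ['financial']}
print(visible_columns(columns, {"pii"}))
# ['customer_id', 'Email', 'birth_date', 'region']
```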
Effective planning, development and implementation of a data catalog can bring metadata management into business operations and provide lasting business value by fostering a better understanding of your organization's data assets and making it easier for people to find, access and use them.