What steps are key to building a data catalog?

An enterprise data catalog can help data stewards and other users in an organization manage metadata and explore data assets. Here are 10 key steps for creating a data catalog.

Building a data catalog is an important initiative for many IT and data management teams, often done in conjunction with data governance and metadata management programs. But organizations shouldn't undertake a data catalog project without getting input from business users and planning a series of other steps that should be part of the process.

First, let's define what a data catalog is and the key features that catalogs provide. It's a reference application that enables business users, data analysts, data stewards and other workers to explore data sources, understand their contents, connect data assets to the right source systems and become more self-sufficient at finding and accessing data.

A data catalog collects metadata from databases, data warehouses, BI systems and big data platforms and uses it to create a searchable inventory of data assets. It also provides a single point of reference for enterprise metadata management, which it can handle faster and more effectively than older types of metadata management systems. Many organizations also supplement their data catalogs with other metadata management tools -- in particular, business glossaries and data dictionaries.

With that as background information, these are the 10 main steps to take in planning and building a data catalog for your organization.

1. Document metadata management's value to data governance

All effective data governance programs are supported by both business and technical metadata management. Metadata gives context to the content of data sets and provides information that makes the data usable and understandable across an organization. Properly managing metadata helps organizations govern their data to improve data quality and increase operational effectiveness, through the implementation of enterprise data policies, practices and standards. That can be part of your business case for a data catalog.

2. Identify data stewardship uses for the different metadata tools

Although the terms data catalog, business glossary and data dictionary are sometimes used interchangeably, they're not the same thing. A business glossary defines the business terms used across an organization, providing an authoritative source for understanding them. A data dictionary provides technical information about data, which can include the properties of such attributes as data type, length, valid values, default values, relationships with other data fields, data transformation rules, business rules and constraints. Dictionaries support use of physical metadata that contains details about where data resides and how it's stored. Business glossaries are focused on the business aspects of data stewardship, while data dictionaries are the domain of technical data stewards. A data catalog can be used by both business and technical stewards, since it incorporates aspects of the other two tools.
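To make the distinction concrete, here is a minimal Python sketch of how a data dictionary entry and a business glossary term might be combined into a single catalog view. The class name `DictionaryEntry`, its fields, the `catalog_view` helper and the sample table and term are all hypothetical, chosen only for illustration; real tools define their own metadata models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DictionaryEntry:
    """Technical metadata for one column (hypothetical layout)."""
    table: str
    column: str
    data_type: str
    length: Optional[int] = None
    valid_values: tuple = ()
    default_value: Optional[str] = None
    business_term: Optional[str] = None  # link to a glossary term, if any


# Business glossary: business term -> agreed definition (sample content).
glossary = {
    "Customer Status": "Whether a customer account is active or inactive.",
}

entry = DictionaryEntry(
    table="crm_customer",
    column="status_cd",
    data_type="CHAR",
    length=1,
    valid_values=("A", "I"),
    default_value="A",
    business_term="Customer Status",
)


def catalog_view(entry, glossary):
    """Merge technical and business metadata into one catalog record."""
    return {
        "asset": f"{entry.table}.{entry.column}",
        "type": f"{entry.data_type}({entry.length})" if entry.length else entry.data_type,
        "valid_values": list(entry.valid_values),
        "business_term": entry.business_term,
        "definition": glossary.get(entry.business_term, "(no glossary definition)"),
    }


view = catalog_view(entry, glossary)
print(view)
```

In this sketch, technical stewards would maintain the `DictionaryEntry` records, business stewards would maintain the glossary, and the catalog view serves both audiences.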

3. Design a subject area model for your data

An effective data catalog follows the business use of the data, not simply the technical implementation of systems. A subject area model (SAM), which defines different subject areas for an organization's data and the business concepts that are contained in them, shows business users the location of data unconstrained by applications, files or databases. The SAM will serve as the underlying basis of your data architecture, and both the data catalog and business glossary should be based on it.
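As a rough illustration, a subject area model can be thought of as a mapping from subject areas to business concepts, with physical data assets then assigned to concepts rather than to applications. The subject areas, concepts and asset names below are invented for the example, and `assets_by_subject_area` is a hypothetical helper, not part of any catalog product.

```python
# Subject area -> business concepts it contains (sample content).
subject_area_model = {
    "Customer": ["Customer", "Customer Status"],
    "Sales": ["Order", "Invoice"],
}

# Physical asset -> the business concept it holds (sample content).
asset_concepts = {
    "crm.customer": "Customer",
    "billing.invoice": "Invoice",
    "shop.orders": "Order",
}


def assets_by_subject_area(sam, asset_concepts):
    """Group physical assets under subject areas, independent of source system."""
    concept_to_area = {
        concept: area for area, concepts in sam.items() for concept in concepts
    }
    grouped = {area: [] for area in sam}
    for asset, concept in sorted(asset_concepts.items()):
        area = concept_to_area.get(concept)
        if area is not None:  # assets with unmapped concepts are skipped
            grouped[area].append(asset)
    return grouped


grouped = assets_by_subject_area(subject_area_model, asset_concepts)
print(grouped)
```

A business user browsing the "Sales" subject area would see both the billing and shop assets without needing to know which application owns each one.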

4. Build a business glossary

Members of the data governance team and business data stewards should collaborate to design the business glossary and then populate it. An organization should have one enterprise business glossary, not a glossary for each functional area or -- even worse -- application. A robust business glossary for the entire enterprise is an essential component of effective data stewardship and business metadata management, and it can provide content for use in the data catalog. Business data stewards need to be involved in creating the glossary because they best understand their subject area's data and its associated business metadata.

5. Build a data dictionary

The data dictionary should contain descriptions and mappings of every data table or file and all their metadata entities. It then becomes the basis for pulling the metadata into the data catalog. Again, the business data stewards are essential here, since they will provide guidance on the business metadata to be used in the data catalog -- by source, concept and subject area.

6. Discover metadata from databases and other data sources

Data catalogs use metadata to identify data tables and files for users. A catalog searches the company's databases and loads the metadata (not the actual data) into its inventory of data assets. Before an organization begins building a data catalog, metadata sources must be identified and recorded. This is a major step and, like the previous two, requires that the organization have a solid data stewardship program. In this case, business data stewards are needed to provide insight on the correct data sources to use.
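The idea of loading metadata rather than the data itself can be sketched with Python's built-in `sqlite3` module. This example assumes a SQLite source and uses an in-memory database with two made-up tables; commercial catalog tools use their own connectors, and other database engines expose metadata differently (for instance, through `information_schema` views). The function name `discover_metadata` is hypothetical.

```python
import sqlite3


def discover_metadata(conn):
    """Collect table and column metadata from a SQLite database.

    Only structural metadata is read; no rows of business data are loaded.
    """
    inventory = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns: cid, name, type, notnull, dflt_value, pk
        for _cid, name, col_type, notnull, _default, pk in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            inventory.append(
                {
                    "table": table,
                    "column": name,
                    "type": col_type,
                    "not_null_declared": bool(notnull),
                    "primary_key": bool(pk),
                }
            )
    return inventory


# Sample source database (illustrative tables only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, region TEXT)"
)
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

inventory = discover_metadata(conn)
for record in inventory:
    print(record)
```

Which databases to scan is still a stewardship decision; the script only automates the collection once the correct sources have been identified.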

7. Profile the data to provide statistics for users

Data profiles are informative summaries that explain the metadata to the users of a data catalog. For example, the profile of a database often includes the number of tables and files and their row counts. In a business glossary, this profiling would be focused on business metadata and its use across the organization by business data stewards and users.
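A basic profile of this kind can be computed with a few aggregate queries. The sketch below again assumes a SQLite source with a small invented `customer` table; the `profile_table` function and the choice of statistics (row count, null count and distinct count per column) are illustrative assumptions, not a standard.

```python
import sqlite3


def profile_table(conn, table):
    """Return row count plus null and distinct counts for each column."""
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    profile = {"table": table, "row_count": row_count, "columns": {}}
    for col in columns:
        nulls, distinct = conn.execute(
            f'SELECT SUM("{col}" IS NULL), COUNT(DISTINCT "{col}") FROM "{table}"'
        ).fetchone()
        # SUM() yields NULL on an empty table, so fall back to 0.
        profile["columns"][col] = {"nulls": nulls or 0, "distinct": distinct}
    return profile


# Sample data (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ada", "EU"), (2, "Grace", None), (3, "Linus", "EU")],
)

profile = profile_table(conn, "customer")
print(profile)
```

Statistics like these give catalog users a quick sense of how complete and varied a data set is before they request access to it.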

8. Identify relationships among data sources

Discover related data across multiple databases and build that information into the data catalog. For example, a data analyst may need consolidated customer data for an analytics application. Through the data catalog and the data dictionary, the analyst may find that five files in five different systems contain relevant customer data.
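One simple way to surface candidate relationships is to look for column names that appear in more than one source. The sketch below does that over a small, invented metadata inventory. Name matching is only a heuristic assumed here for illustration: columns with the same name are not guaranteed to hold the same data, so a data steward would still need to confirm each match.

```python
from collections import defaultdict


def find_shared_columns(inventory):
    """Map each column name to the sources that contain it, keeping only
    names found in two or more sources (case-insensitive match)."""
    sources_by_column = defaultdict(set)
    for source, columns in inventory.items():
        for column in columns:
            sources_by_column[column.lower()].add(source)
    return {
        column: sorted(sources)
        for column, sources in sources_by_column.items()
        if len(sources) > 1
    }


# Source -> column names, as harvested metadata (sample content).
inventory = {
    "crm.customer": ["customer_id", "name", "email"],
    "billing.invoice": ["invoice_id", "customer_id", "amount"],
    "support.ticket": ["ticket_id", "customer_id", "email"],
}

related = find_shared_columns(inventory)
for column, sources in related.items():
    print(column, "->", sources)
```

Here, `customer_id` links all three sources, which is the kind of lead an analyst looking for consolidated customer data could follow up on.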

9. Capture information on data lineage

Extract, transform and load (ETL) tools are used to extract data from source databases, transform and cleanse it, and load it into a target database. The ETL process also collects the associated metadata, which is used to populate the data catalog and data dictionary. That includes data lineage documentation that tracks where data originated, how it flows through systems and other information. Data lineage helps business users understand data assets in a catalog and enables data stewards and analysts to trace data errors back to their root cause in systems.
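Tracing an error back to its origin amounts to walking a lineage graph upstream. The following sketch assumes lineage has been captured as a simple mapping from each target asset to its direct sources; the asset names and the `trace_upstream` and `root_sources` helpers are hypothetical, and real ETL and catalog tools store lineage in their own formats.

```python
# Target asset -> its direct source assets (sample lineage, illustrative only).
lineage = {
    "reports.churn": ["dw.customer_dim"],
    "dw.customer_dim": ["staging.customer_clean"],
    "staging.customer_clean": ["crm.customer", "billing.account"],
}


def trace_upstream(asset, lineage):
    """Return every asset that feeds the given asset, directly or indirectly."""
    seen = []
    stack = [asset]
    while stack:
        current = stack.pop()
        for source in lineage.get(current, []):
            if source not in seen:  # guards against cycles and repeat visits
                seen.append(source)
                stack.append(source)
    return seen


def root_sources(asset, lineage):
    """Return upstream assets that have no recorded sources of their own."""
    return sorted(s for s in trace_upstream(asset, lineage) if s not in lineage)


upstream = trace_upstream("reports.churn", lineage)
roots = root_sources("reports.churn", lineage)
print("Upstream assets:", upstream)
print("Root sources:", roots)
```

If a churn report shows bad values, the root sources identify the operational systems where a steward would start investigating.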

10. Organize the catalog for use by data consumers

Most databases and file systems are designed for use by IT. Data catalogs and business glossaries should be designed for data consumers -- such as business users and data analysts -- as much as for technologists. Again, their structure should be based on the SAM that you designed earlier in the process. In addition, these tools should be accessible via PCs, tablets and smartphones. A data dictionary, by comparison, can be organized by functional area and application, given the technical nature of its content.

Building a data catalog, business glossary and data dictionary and then using them to create metadata management content is a task that should involve teams from both IT and the business. Doing so will ensure that the metadata focuses on the needs of business users and enable consistent management of it across the enterprise. Effective planning, development and implementation of a data catalog can bring metadata management into the business community and provide lasting business value from a better understanding of your organization's data assets.
