Data catalogs aim to make data discovery less exploratory
Several years ago, Uber found that data discovery "was the biggest problem our users faced" in analytics applications, according to Atul Gupte, product manager for the ride-sharing company's data science and analytics platforms. An internal survey showed that data scientists and other users wasted an average of three hours per week trying to find relevant data, Gupte said: "That's shameful."
The difficulties drove Uber to create Databook, a metadata system that functions as a data catalog. It lists available data sets and a variety of information about them to help users locate and understand data, Gupte said during a session at the 2019 Strata Data Conference in New York, where metadata management and data catalog best practices were big discussion topics.
The data platforms team at Bayer AG's Crop Science division took a similar step after data analysts complained that looking for data was too complicated. It built a system called Haystack that includes a data catalog and a business glossary with data definitions. More than 940,000 data objects are now listed in the catalog, Naghman Waheed, the unit's data platforms lead, said in another Strata session.
But data cataloging can itself be complicated. Forrester Research analyst Michele Goetz wrote in an April 2019 blog post that organizations may need two or three data catalogs to store different metadata for different users. And in a September 2019 report, Gartner analysts Guido De Simoni and Ehtisham Zaidi recommended using machine learning algorithms to automate the cataloging process.
This handbook looks more closely at data catalog best practices, challenges and trends. First, we detail advice on building and managing data catalogs. Next, consultant Andy Hayler expands on how machine learning can aid data catalogs. We close by exploring the concept of an enterprise data marketplace based on data catalog software.