Data catalogs fuel increased efficiency, speed-to-insight
By collecting and arranging data assets in a single environment, organizations are able to reduce the time to reach the insights that lead to data-driven decisions.
As organizations collect more data and develop more data assets, data catalogs can be critical.
Data catalogs are organized repositories where data users and analysts can search for and find the data they need for their work.
With organizations amassing terabytes and petabytes of data and, depending on the size of the enterprise, building hundreds and perhaps thousands of reports, dashboards, data models and other data assets, finding the table, chart or data set without knowing exactly where to look can be difficult -- and maybe almost impossible.
Data catalogs solve that problem by indexing data assets and making them easy to search, find and put to use in data-driven decisions.
The result is efficiency that increases speed-to-insight.
In addition, data catalogs can enable collaboration among business users working together or on similar projects, and support governance that sets limits on who can view and use which data.
Eckerson Group, a consulting firm founded in 2014 and based in Hingham, Mass., will host a virtual event Dec. 15 for chief data officers and other data workers entirely focused on data catalogs.
Leading up to the event, Wayne Eckerson, founder and principal consultant of the firm, discussed why a data catalog is important to data-driven organizations as their data grows exponentially.
In addition, he spoke about the evolution of data catalogs, which vendors specialize in data catalogs and which ones offer catalogs as part of a larger platform, and what makes one vendor's catalog different from another.
What exactly is a data catalog?
Wayne Eckerson: It's not unlike a card catalog in a library, except it's all digital. What it's doing is collecting metadata from all the sources of information throughout your enterprise, then pulling that metadata together in one place and indexing it so it can be easily searched. It provides a lot of descriptive characteristics of the data so people can profile it and see who else has used it and any annotations they have left. It's a great way to give people one-touch access to the information assets that are available to them in their organization instead of having to hunt around and ask people.
They're definitely used to facilitate self-service BI, and they're also a good way to curate data and manage access to it. Once you put all the metadata in one place, you can have a data curator go in and define who gets access to what metadata, and hence the actual data as well.
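To make that description concrete, here is a minimal Python sketch of the idea Eckerson outlines: metadata is gathered in one place, indexed by keyword, annotated by users and filtered by an access policy. It is an illustration only and does not reflect any vendor's product. The names `CatalogEntry`, `DataCatalog`, `allowed_roles` and the sample assets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset; the data itself stays in its source."""
    name: str
    source: str
    description: str
    allowed_roles: set[str]  # hypothetical access policy set by a curator
    annotations: list[str] = field(default_factory=list)


class DataCatalog:
    """Toy catalog: a keyword index over metadata plus role-based filtering."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}
        self._index: dict[str, set[str]] = {}  # keyword -> asset names

    def register(self, entry: CatalogEntry) -> None:
        # Store the metadata and index every word in it for search.
        self._entries[entry.name] = entry
        text = f"{entry.name} {entry.source} {entry.description}"
        for word in text.lower().replace("_", " ").split():
            self._index.setdefault(word, set()).add(entry.name)

    def annotate(self, name: str, note: str) -> None:
        # Users leave notes so that later users inherit their knowledge.
        self._entries[name].annotations.append(note)

    def search(self, keyword: str, role: str) -> list[CatalogEntry]:
        # Return only the matching assets this role is allowed to see.
        names = self._index.get(keyword.lower(), set())
        return [
            self._entries[n]
            for n in sorted(names)
            if role in self._entries[n].allowed_roles
        ]


# Hypothetical sample assets for illustration.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "quarterly_sales", "warehouse", "Revenue by region and quarter",
    {"analyst", "finance"},
))
catalog.register(CatalogEntry(
    "payroll", "hr_system", "Employee salary records", {"finance"},
))
catalog.annotate("quarterly_sales", "Excludes returns; join on region_id.")

for hit in catalog.search("revenue", role="analyst"):
    print(hit.name, hit.source, hit.annotations)
```

In this sketch, an analyst searching for "salary" gets nothing back because the payroll entry is restricted to the finance role, which mirrors the point that controlling access to the metadata is one way to control access to the data behind it.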
What do data catalogs enable organizations to do that they otherwise can't -- why are they important?
Eckerson: I mentioned self-service before -- they really facilitate finding and profiling data so you can accelerate your time-to-insight. Along those lines, the powerful thing about data catalogs is that they can capture the tribal knowledge of an organization that is usually in the heads of just one or two people who function as human data catalogs, and you have to know them to [access their knowledge]. It's not a very streamlined process. The idea is that we can collect all this metadata, and then people can not only search but also annotate what they found and how they used the data and leave breadcrumbs along their trail so that others who follow them -- whether the next day, the next month, the next year or the next decade -- have access to that tribal knowledge. That can really help improve the usage of those assets.
On the other side, they enable data governance and data curation. Not all data is equal from a privacy and security perspective, so you want to have access policies, and those can be enacted in the data catalog. Meanwhile, data stewards and curators can use data catalogs as a way to clean up the data, identify things that are missing, inconsistent or duplicative, and then fix them.
When were data catalogs first introduced and how have they evolved since then?
Eckerson: You started to see the first couple of catalogs around 2015. At that point, they were more for technical people and were led by IT. They were focused on security and privacy more than governance and compliance. They were mostly on-premises. They've evolved to be SaaS-oriented in some cases, business-led with stewards and curators on the business side taking ownership of them, and their usage has moved from just governance to also include analytics and self-service.
There was this notion when they first came out that they were going to be the enterprise data catalog, the one place to go to find stuff. Now, though, almost every BI tool and every other type of tool has a catalog of some sort. There are still some enterprise catalogs from vendors like Alation and Collibra, but tools like Tableau have a catalog that focuses on Tableau assets rather than enterprise assets. As a result, companies wind up having a hierarchy of catalogs, which points to another change, which is that catalogs now have to be more open and integrate with not only just data sources but also other catalogs.
Regarding data assets, how has what can be found in data catalogs evolved?
Eckerson: It used to be just data sets that were cataloged, but now it's queries, reports, schemas, even machine learning models. Another thing that's changing is that data catalogs used to just grab metadata, but that was frustrating for users. They knew what they wanted but didn't know how to get it. Now, the catalogs are closing that loop and adding access to the data sources themselves.
You mentioned Alation and Collibra -- are there other vendors who specialize in data catalogs, and do those two vendors do more than just provide a data catalog platform?
Eckerson: Alation and Collibra are pure-play data catalogs, and there are a few others. There's also Quest with its erwin data catalog, BigID and data.world.
But what's happening very rapidly is these data catalogs are becoming the foundation for all kinds of data governance functionality. We're seeing that the catalogs are morphing into data governance platforms that not only support cataloging functions but also business glossaries, data lineage, impact analysis, master data management, data access control and data quality. Vendors are starting to wrap all these governance-related activities into their products.
Regarding BI vendors, are data catalogs now a standard part of their platforms, or are data catalogs a way some vendors can differentiate themselves from others?
Eckerson: For BI vendors, it seems to be one of those requirements for doing business now -- any kind of user-facing analytics tool that gathers assets for users, like reports or even data science models, [has them]. Vendors will say that their catalogs aren't designed to house everything -- the assets that aren't developed in their platforms -- but they do integrate with a vendor like Alation so users have that view from within the BI vendor's catalog to the enterprise-wide catalog and can move between the two pretty easily.
How does one data catalog differ from another, or are all vendors offering essentially the same thing?
Eckerson: The first differentiator is whether it's on premises or cloud-based. Another is whether it's an enterprise data catalog or an embedded catalog in a BI tool or someplace else. The kinds of assets it catalogs can be different. And some catalogs focus on technical information while others focus on collaboration for business users to capture that tribal knowledge.
The original catalogs were focused on technical metadata. Big companies like IBM offered them, and when Alation came out, their catalog was much more focused on users and collaboration. Now, we're starting to see a melding of those two things into one.
Is there a way to quantify the value of data catalogs to organizations?
Eckerson: I haven't actually heard anyone talk about that. It results in better data governance, quicker access to information and more accurate information. It's infrastructure for data that supports a lot of different use cases and makes them all better.
Editor's note: This Q&A has been edited for clarity and conciseness.