What data lake governance challenges do organizations face?

Consultant Anne Marie Smith details five challenges that an organization may face in applying data governance policies to data lakes and offers advice on how to overcome them.

A successful data governance program applies policies, standards and processes to enable the effective and appropriate use of high-quality data across an organization. If your organization has a data lake environment and wants to get high-quality analytics results from it, you need to engage in proper data lake governance as part of your overall governance initiative.

But data lakes pose various challenges across all the disciplines of enterprise data management, including data governance. To start the discussion of the governance challenges, it's necessary to define what a data lake is: a type of data platform that holds vast amounts of raw data, typically left in its native format until it's needed for analytics uses.

While a traditional data warehouse stores data in relational tables, a data lake uses a flat architecture. Each data element is assigned a unique identifier and tagged with a set of metadata. As a result, a data lake is less structured than a data warehouse: Data is classified and organized when it's accessed for analysis, not when it's loaded into the data lake.
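To make the flat architecture concrete, here is a minimal sketch in Python of an in-memory stand-in for a data lake. The class names (LakeObject, FlatLake), the tag keys (subject, format) and the sample records are all hypothetical and chosen only for illustration; a real data lake would sit on object storage and use a far richer metadata model.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: opaque bytes plus metadata tags and an ID."""
    payload: bytes
    tags: dict
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class FlatLake:
    """A flat store: no tables and no hierarchy, only tagged objects."""

    def __init__(self):
        self._objects = {}

    def load(self, payload: bytes, **tags) -> str:
        # Data is stored as-is; nothing is parsed or validated at load time.
        obj = LakeObject(payload=payload, tags=tags)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def find(self, **wanted):
        # Objects are located by their metadata tags, not by a table schema.
        return [o for o in self._objects.values()
                if all(o.tags.get(k) == v for k, v in wanted.items())]


def read_orders(lake):
    # Structure is imposed only here, when the data is read for analysis.
    rows = []
    for obj in lake.find(subject="orders", format="json"):
        record = json.loads(obj.payload)
        rows.append({"order_id": int(record["id"]),
                     "amount": float(record["amt"])})
    return rows


lake = FlatLake()
lake.load(b'{"id": "1", "amt": "19.99"}', subject="orders", format="json")
lake.load(b'{"id": "2", "amt": "5.00"}', subject="orders", format="json")
lake.load(b"free-text support ticket", subject="tickets", format="text")
orders = read_orders(lake)
```

The load step accepts anything, including the free-text ticket, while read_orders decides what counts as an order and how its fields are typed. That flexibility is also why governance matters: without reliable tags, find has nothing trustworthy to search on.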

Effective data governance enables organizations to improve data quality and consistency and maximize the use of data for business decision-making, which can lead to better business planning and improved financial performance. Companion data management disciplines to data governance include data quality, metadata management and data security -- all of which factor into data lake governance.

Now, here are five data governance challenges for a data lake implementation.

1. Identification and maintenance of the correct sources of data

In many data lake implementations, the source metadata isn't captured or isn't available at all, making the validity of the data lake's contents questionable. For example, the system of record or the business owner of data sets may not be listed, or obviously redundant data may be causing issues for data analysts. At a minimum, the source metadata for all the data in a data lake should be recorded and made available to users to provide insight into its provenance.
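As a rough illustration of that minimum, the sketch below records a system of record and a business owner for each data set and reports which entries lack them. The SourceMetadata fields and the sample data set names are assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceMetadata:
    """Provenance facts recorded for one data set (hypothetical fields)."""
    dataset: str
    system_of_record: Optional[str] = None
    business_owner: Optional[str] = None


# Fields that every data set is expected to have filled in.
REQUIRED_FIELDS = ("system_of_record", "business_owner")


def missing_provenance(entries):
    """Map each data set name to the required source fields it lacks."""
    gaps = {}
    for entry in entries:
        missing = [name for name in REQUIRED_FIELDS if not getattr(entry, name)]
        if missing:
            gaps[entry.dataset] = missing
    return gaps


catalog = [
    SourceMetadata("customer_master", "CRM", "Sales Operations"),
    SourceMetadata("web_clickstream", system_of_record="Web analytics"),
    SourceMetadata("legacy_export"),
]
gaps = missing_provenance(catalog)
```

A report like gaps gives data stewards a concrete list of data sets to chase down before analysts start relying on them.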

2. Metadata management issues

Metadata gives context to the content of data sets and is an important component of making data understandable and usable in applications. But many data lake implementations ignore the need to apply the correct data definitions to the collected data. Also, since raw data is often loaded into a data lake, many organizations don't include the steps needed to validate the data or apply organizational data standards to it. This lack of proper metadata management makes the data in a data lake less useful for analytics.
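One way to picture the missing validation step is a small check of raw records against agreed data definitions. The following Python sketch assumes a hypothetical DATA_STANDARDS table with two fields; the validators test only the shape of a value, not whether the country code or date actually exists.

```python
import re

# Hypothetical organizational standards: field name -> (definition, validator).
# The validators check format only, not whether the value is a real code or date.
DATA_STANDARDS = {
    "country": (
        "two-letter uppercase country code",
        lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{2}", v) is not None,
    ),
    "signup_date": (
        "date written as YYYY-MM-DD",
        lambda v: isinstance(v, str)
        and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    ),
}


def validate(record):
    """Return a list of human-readable violations for one raw record."""
    problems = []
    for name, value in record.items():
        if name not in DATA_STANDARDS:
            # A field nobody has defined cannot be interpreted reliably.
            problems.append(f"{name}: no agreed definition")
            continue
        definition, is_valid = DATA_STANDARDS[name]
        if not is_valid(value):
            problems.append(f"{name}: expected {definition}, got {value!r}")
    return problems


clean = validate({"country": "DE", "signup_date": "2021-03-04"})
dirty = validate({"country": "Germany", "signup_dt": "03/04/2021"})
```

Here clean is empty, while dirty flags both a value that breaks the standard and a field name that has no definition at all.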

Figure: A comparison of data lake and data warehouse attributes

3. Lack of coordination on data governance and data quality

Not coordinating data lake governance and data quality work can result in poor-quality data entering a data lake. That may lead to inaccurate results when the data is used for analytics and to drive business decisions, causing a loss of confidence in the data lake and a general distrust of data across an organization. Effective data lake implementations involve data quality analysts and engineers working closely with the data governance team and business data stewards to apply data quality policies, profile data and take necessary actions to improve its quality.
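Data profiling can start very simply. The sketch below, written under the assumption that records arrive as Python dictionaries, computes a null rate and a distinct-value count for one column and compares the result with a quality policy. The function names and the 10% threshold are illustrative only.

```python
def profile(rows, column):
    """Summarize one column: row count, null rate and distinct non-null values."""
    values = [row.get(column) for row in rows]
    # Treat both None and the empty string as missing.
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }


def passes_policy(stats, max_null_rate):
    """Apply a governance-defined quality threshold to a column profile."""
    return stats["null_rate"] <= max_null_rate


rows = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C3", "email": None},
    {"customer_id": "C4", "email": "a@example.com"},
]
email_stats = profile(rows, "email")
email_ok = passes_policy(email_stats, max_null_rate=0.1)
id_ok = passes_policy(profile(rows, "customer_id"), max_null_rate=0.1)
```

The split between the two functions mirrors the coordination described above: data quality analysts produce the profile, and the governance team and business data stewards decide what threshold is acceptable.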

4. Lack of coordination on data governance and data security

Data security standards and policies that aren't applied properly as part of the governance process can cause problems with access to sensitive data, including personal data protected by privacy regulations. Although data lakes are intended to be a fairly open repository of data, they still need security and access control measures. The data governance and data security teams should work together during the data lake design and loading processes and in ongoing data governance efforts.
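As a simplified sketch of how governance tags and security rules can meet, the Python example below grants read access only when a role's clearance covers a data set's sensitivity tag. The sensitivity levels, role names and deny-by-default behavior are assumptions for illustration; production data lakes would rely on the access control features of their platform.

```python
# Hypothetical sensitivity levels, ordered from least to most restricted.
LEVELS = {"public": 0, "internal": 1, "personal": 2}

# Hypothetical mapping of roles to the highest level each may read.
ROLE_CLEARANCE = {"analyst": "internal", "privacy_officer": "personal"}


def can_read(role, dataset_tags):
    """Allow access only if the role's clearance covers the data set's level."""
    clearance = ROLE_CLEARANCE.get(role)
    if clearance is None:
        # Unknown roles get no access at all.
        return False
    # Untagged or oddly tagged data is treated as the most restricted level.
    most_restricted = max(LEVELS.values())
    sensitivity = LEVELS.get(dataset_tags.get("sensitivity"), most_restricted)
    return LEVELS[clearance] >= sensitivity


analyst_internal = can_read("analyst", {"sensitivity": "internal"})
analyst_personal = can_read("analyst", {"sensitivity": "personal"})
analyst_untagged = can_read("analyst", {})
```

Treating untagged data as restricted means a gap in metadata management fails safe instead of exposing personal data by accident.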

5. Conflict among business units that use the same data lake

Different departments may have different business rules for similar data, which can result in an inability to reconcile data differences for accurate analytics. Having a robust data governance program with an enterprise view of data policies, standards, procedures and definitions, including an enterprise business glossary, can reduce the issues that arise when multiple business units use one data lake. If an organization has multiple data lakes, each one should be included in the data lake governance process and have business data stewards assigned to it.
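An enterprise business glossary can be pictured as a lookup from each department's local term to one agreed definition. The entry, aliases and steward below are invented for this sketch; the point is only that two departments' labels resolve to the same enterprise term.

```python
# Hypothetical glossary: one agreed term with its definition, steward and
# the local names that individual departments use for the same concept.
GLOSSARY = {
    "active_customer": {
        "definition": "Customer with at least one purchase in the last 12 months",
        "steward": "Customer data steward",
        "aliases": {"sales": "live_account", "marketing": "engaged_contact"},
    },
}


def to_enterprise_term(department, local_term):
    """Resolve a department's local term to the enterprise term, if any."""
    for term, entry in GLOSSARY.items():
        if local_term == term or entry["aliases"].get(department) == local_term:
            return term
    # No match: the term needs review by a business data steward.
    return None


sales_term = to_enterprise_term("sales", "live_account")
marketing_term = to_enterprise_term("marketing", "engaged_contact")
unknown_term = to_enterprise_term("finance", "paying_user")
```

Both sales_term and marketing_term resolve to the same entry, so analytics built on either department's data can share one definition, while unknown_term signals a term that has not yet been reconciled.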

In conclusion, the value of a data lake can be enhanced significantly by including strong data governance, metadata management, data quality and data security processes in the design, loading and maintenance of the environment, with active participation by experienced professionals in all of those areas. Otherwise, your data lake might become more of a swamp.
