Who manages data lakes and what skills are needed?

Data engineers, data scientists and chief data officers are just some of the people who have the skills to manage data lakes.

One of the most common components of a modern data architecture is the data lake, a central repository that data flows into from multiple sources.

The concept of the data lake has evolved from a simple location for data collection into a more organized approach known as a data lakehouse. Whether it's called a data lake or a data lakehouse, managing the technology effectively requires specific skills and IT professionals.

What is a data lake?

A data lake is a large, open storage location that typically uses object storage as a unified repository for raw data from multiple sources, whether that data is structured, semi-structured or unstructured. Those sources can include event streaming data, operational and transactional data, and databases.

While data lakes can run in on-premises environments, they are more commonly built on cloud object storage services that provide large, scalable capacity, such as Amazon Simple Storage Service (S3), Google Cloud Storage or Microsoft Azure Data Lake Storage. Data lakes first emerged to support big data workloads on the Apache Hadoop big data platform.

A data lake architecture differs from a data warehouse in that warehouse data is transformed into a structured, organized format. That structure enables users to more easily query the data and use it for data analytics and business intelligence use cases. Data warehouses also provide data governance and data management capabilities.
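The difference can be illustrated with a minimal Python sketch. It is not tied to any vendor product: the event records, the `sales` table and its column names are all hypothetical, and an in-memory SQLite database stands in for a data warehouse. The lake side keeps each record exactly as it arrived, while the warehouse side transforms records into a fixed schema before loading them.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
raw_events = [
    '{"user_id": "a1", "amount": "19.99", "source": "web"}',
    '{"user_id": "b2", "amount": "5.00", "source": "mobile", "coupon": "SPRING"}',
]

# Lake-style storage: keep every record in its original form.
# No schema is enforced, so the extra "coupon" field is preserved.
lake = list(raw_events)

# Warehouse-style storage: transform each record into a fixed schema
# before loading. SQLite is only a stand-in for a real warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (user_id TEXT, amount REAL, source TEXT)"
)
for line in raw_events:
    event = json.loads(line)
    warehouse.execute(
        "INSERT INTO sales VALUES (?, ?, ?)",
        (event["user_id"], float(event["amount"]), event["source"]),
    )

# The structured copy can be queried directly with SQL.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(f"Total sales: {total:.2f}")
```

In this sketch, the warehouse table drops the `coupon` field because it is not part of the schema, while the lake copy retains it for later use.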

The concept of the data lakehouse -- a term coined by Databricks -- is an attempt to bring together the best of data lake and data warehouse technologies. A data lakehouse aims to combine the ease of use and open nature of a data lake with the data warehouse's ability to easily execute queries against data. A data lakehouse provides additional structure on top of a data lake -- often with the use of a data lake table format technology, such as Delta Lake, Apache Iceberg or Apache Hudi. It also uses a query engine technology, such as Apache Spark, Presto or Trino.

Who manages data lakes?

Managing data within an organization can be a multi-stakeholder effort. It can involve different job roles depending on the particular use case.

Data warehouses are often managed by data warehouse managers and data warehouse analysts. Those two roles involve data management and data analytics skills, which are typically tied to a specific data warehouse vendor technology.

Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes. With data lakehouses, management often involves multiple stakeholders in addition to data engineers, including data scientists. Business analysts also fit into the management mix. They are responsible for ensuring that data quality and metadata are properly managed to support business objectives.

As organizations begin to shift from data warehouse to data lake architectures, there is some overlap between the people who manage data warehouses and those who manage data lakes. Data still needs to come from multiple sources and still needs to be governed, and there is the same need for analytics so the data can be used effectively.

At the executive level -- whether it's a data warehouse, data lake or data lakehouse -- the chief data officer is often the role tasked with top-level responsibility for all data use.

What skills are necessary to manage data lakes?

Managing data lakes effectively requires a variety of skills:

  • Data engineering. These skills include the design, development, deployment and ongoing operation of data pipelines that bring data from source systems into a data lake. Data engineering and data pipelines often involve the use of skills and tools for extract, transform and load operations.
  • Data validation. Ensuring data is accurate and timely goes hand in hand with the data engineering skill set. Data validation is a core skill set that ensures only high-quality, usable data is ingested into a data lake.
  • Data science. To be sure the right data lands in a data lake, there is also a need for data science skills. Data science skills can help align data sources to generate the insights an organization is looking for.
  • Business analysis and data analytics. Determining which insights an organization wants often involves business analysis and data analytics skills. These skills help define the metrics an organization is looking to measure. Those metrics are often tied to a business goal or operational trend that needs to be monitored and analyzed.
  • Cloud management. As data lakes are increasingly deployed in the cloud, there is a need for cloud management skills -- including the ability to provision and manage cloud resources. A fundamental component of cloud management for data lakes is cost management, which helps organizations understand and budget for data lake usage and operations.
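The data engineering and data validation skills described above can be sketched as a small extract, transform and load pipeline. This is an illustrative example only, using the Python standard library: the CSV content, the field names and the `extract`, `validate`, `transform` and `load` helpers are hypothetical, and a JSON Lines string stands in for a file that would be written to object storage.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical CSV export from an operational system.
SOURCE_CSV = """order_id,amount,created_at
1001,25.50,2024-01-15
1002,not-a-number,2024-01-16
1003,12.00,
"""


def extract(text):
    """Read CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(row):
    """Accept a row only if its amount and date both parse."""
    try:
        float(row["amount"])
        datetime.strptime(row["created_at"], "%Y-%m-%d")
    except (ValueError, TypeError):
        return False
    return True


def transform(row):
    """Convert string fields into typed values."""
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"]),
        "created_at": row["created_at"],
    }


def load(rows):
    """Serialize rows as JSON Lines, a stand-in for an object storage write."""
    return "\n".join(json.dumps(row) for row in rows)


rows = extract(SOURCE_CSV)
valid = [transform(row) for row in rows if validate(row)]
rejected = [row for row in rows if not validate(row)]
output = load(valid)

print(f"Loaded {len(valid)} rows, rejected {len(rejected)}")
```

Keeping the rejected rows in a separate list, rather than silently dropping them, gives whoever is responsible for data quality something to review.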

Available certifications

There are a variety of paths toward certifications for those looking to verify their skills for managing data lakes. A modern data lake or lakehouse deployment often uses cloud resources and tools from a specific vendor to enable data management and data queries.

Managing a data lake is not an abstract idea. It's a hands-on effort that can benefit from specific certifications. The leading cloud and data lake vendors all have some form of training and available certification.

AWS Certified Data Analytics -- Specialty

Amazon Web Services and its S3 cloud object storage service are commonly used to enable data lakes.

This certification is geared toward people with experience working with AWS services. It validates skills in using AWS data lakes and analytics services.


Google Professional Data Engineer

This Google certification exam verifies the skills needed to build, deploy and use data models that benefit from data lake and analytics services running in Google Cloud.


Microsoft Certified: Azure Data Engineer Associate

Microsoft's Azure Data Lake Storage Gen2 is a popular option for building data lakes. With this certification, Microsoft provides validation for those looking to use Microsoft services for data lakes.


Databricks Lakehouse Platform Essentials

This certification helps users learn and validate data lakehouse skills on the Databricks platform. The platform integrates multiple open source technologies, including Apache Spark and Delta Lake.


Cloudera CCP Data Engineer

Cloudera is often associated with the open source Hadoop big data technology, which is one of the originators of the data lake concept. This certification validates the skills required to ingest, transform, store and analyze data in the Cloudera environment.


Informatica Cloud Data Warehouse & Data Lake Modernization Foundation Level

This certification is designed to help organizations that are moving from a data warehouse to a cloud data lake or data lakehouse model.


Dremio data lake training

Dremio, a company that builds data lakehouse technology, has expanded its training options with Dremio University, which provides certificates of completion.

