Who manages data lakes and what skills are needed?
Data engineers, data scientists and chief data officers are just some of the people who have the skills to manage data lakes.
A data lake is one of the most common components of a modern data architecture. It is a central repository that data flows into from across the organization.
The concept of the data lake has evolved from being just a location for data collection to a more organized approach known as a data lakehouse. Whether it's called a data lake or a data lakehouse, there is a need for certain skills and IT professionals to effectively manage the technology.
What is a data lake?
A data lake is a large, open storage location that typically uses object storage as a unified repository for structured, semi-structured and unstructured data coming from multiple sources. Those sources can include event streaming data, operational and transactional data, and databases.
While data lakes can be in on-premises environments, they are more commonly created with cloud object storage services that enable large scalable data capacity, such as Amazon Simple Storage Service (S3), Google Cloud Storage or Microsoft Azure Data Lake Storage. Data lakes first emerged to help enable big data workloads with the Apache Hadoop big data platform.
A data lake architecture differs from a data warehouse in that warehouse data is transformed into a format that provides structured data and organization. A data warehouse enables users to more easily query the data and use it for data analytics and business intelligence use cases. Data warehouses also provide data governance and data management capabilities.
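To make the difference concrete, here is a minimal sketch in Python. It assumes a few hypothetical order records (the field names `order_id`, `amount` and `region` are invented for illustration) that sit in a data lake exactly as they arrived, and shows the kind of transformation a data warehouse applies before the data is queried. It is an illustration of the idea, not any vendor's actual loading process.

```python
import json

# Hypothetical raw records as they might land in a data lake: stored as-is,
# with string-typed values and fields that vary from record to record.
raw_lake_objects = [
    '{"order_id": "1001", "amount": "19.99", "region": "EU"}',
    '{"order_id": "1002", "amount": "5.00"}',
    '{"order_id": "1003", "amount": "42.50", "region": "US", "coupon": "SPRING"}',
]


def to_warehouse_row(raw):
    """Transform one raw record into a fixed, typed, warehouse-style row."""
    record = json.loads(raw)
    return {
        "order_id": int(record["order_id"]),
        "amount": float(record["amount"]),
        # Missing values get an explicit default; extra fields are dropped.
        "region": record.get("region", "UNKNOWN"),
    }


warehouse_rows = [to_warehouse_row(raw) for raw in raw_lake_objects]

# Once every row has the same structure, an aggregate query is simple.
total_by_region = {}
for row in warehouse_rows:
    region = row["region"]
    total_by_region[region] = total_by_region.get(region, 0.0) + row["amount"]

print(total_by_region)
```

The raw strings stand in for objects in a data lake, and the list of uniform dictionaries stands in for a warehouse table. The trade-off is visible even at this scale: the lake keeps everything, including the `coupon` field, while the warehouse row is easier to query but keeps only what the schema defines.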
The concept of the data lakehouse, a term popularized by Databricks, is an attempt to bring together the best of data lake and data warehouse technologies. A data lakehouse aims to combine the ease of use and open nature of a data lake with the data warehouse's ability to easily execute queries against data. A data lakehouse provides additional structure on top of a data lake, often with the use of a data lake table format technology, such as Delta Lake, Apache Iceberg or Apache Hudi. It also uses a query engine technology, such as Apache Spark, Presto or Trino.
Who manages data lakes?
Managing data within an organization can be a multi-stakeholder effort. It can involve different job roles depending on the particular use case.
Data warehouses are often managed by data warehouse managers and data warehouse analysts. Those two roles involve data management and data analytics skills, which are typically tied to a specific data warehouse vendor technology.
Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes. With data lakehouses, management often involves other stakeholders in addition to data engineers, including data scientists. Business analysts also fit into the management mix. They are responsible for ensuring that data quality and metadata are properly managed to support business objectives.
As organizations begin to shift from data warehouse to data lake architectures, there is some overlap between the people who manage data warehouses and those who manage data lakes. Data still needs to come from multiple sources and it still needs to be governed. There is also the same need for analytics so the data can be used effectively.
At the executive level, whether the organization uses a data warehouse, data lake or data lakehouse, a chief data officer often holds the top level of responsibility for all data use.
What skills are necessary to manage data lakes?
There are a variety of skills that are necessary to effectively manage data lakes:
- Data engineering. These skills include the design, development, deployment and ongoing operations of data pipelines that bring data from source systems into a data lake. Data engineering and data pipelines often involve the use of skills and tools for extract, transform and load operations.
- Data validation. Ensuring data is accurate and timely goes hand in hand with the data engineering skill set. Data validation is a core skill set that ensures the data being ingested by a data lake is usable and of high quality.
- Data science. To be sure the right data lands in a data lake, there is also a need for data science skills. Data science skills can help align data sources to generate the insights an organization is looking for.
- Business analysis and data analytics. Determining what insights an organization wants often involves business analysis and data analytics skills. These skills help define what metrics an organization is looking to measure. These metrics often relate to a business goal or operational trend that needs to be monitored and analyzed.
- Cloud management. As data lakes are increasingly deployed in the cloud, there is a need for cloud management skills -- including the ability to provision and manage cloud resources. A fundamental component of cloud management for data lakes is cost management skills. This helps organizations understand and budget data lake usage and operations.
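The data engineering and data validation skills above can be sketched as a small extract, transform and load pipeline. The example below uses only the Python standard library. The source data, column names (`event_id`, `timestamp`, `value`) and function names are all hypothetical, and the `load` step simply returns newline-delimited JSON as a stand-in for writing an object to cloud storage. Real pipelines typically rely on dedicated tooling, so treat this as an outline of the steps rather than a recommended implementation.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical extract from a source system; two rows are deliberately bad.
SOURCE_CSV = """event_id,timestamp,value
1,2024-01-05T10:00:00,12.5
2,not-a-date,7.0
3,2024-01-05T10:02:00,
4,2024-01-05T10:03:00,3.25
"""


def extract(text):
    """Extract: read rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(row):
    """Validate: return a list of problems; an empty list means the row is usable."""
    problems = []
    try:
        datetime.fromisoformat(row["timestamp"])
    except ValueError:
        problems.append("bad timestamp")
    try:
        float(row["value"])
    except ValueError:
        problems.append("bad value")
    return problems


def transform(row):
    """Transform: convert string fields to typed values."""
    return {
        "event_id": int(row["event_id"]),
        "timestamp": row["timestamp"],
        "value": float(row["value"]),
    }


def load(rows):
    """Load: serialize to newline-delimited JSON (stand-in for an object store write)."""
    return "\n".join(json.dumps(row) for row in rows)


def run_pipeline(text):
    """Run all steps, keeping rejected rows so data quality can be reported."""
    accepted, rejected = [], []
    for row in extract(text):
        problems = validate(row)
        if problems:
            rejected.append((row["event_id"], problems))
        else:
            accepted.append(transform(row))
    return load(accepted), rejected


lake_object, rejected = run_pipeline(SOURCE_CSV)
print(lake_object)
print("rejected:", rejected)
```

Keeping the rejected rows, rather than silently dropping them, is a design choice that supports the data quality reporting described above: someone still has to decide whether the source system or the pipeline needs fixing.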
There are a variety of paths toward certifications for those looking to verify their skills for managing data lakes. A modern data lake or lakehouse deployment often uses cloud resources and tools from a specific vendor to enable data management and data queries.
Managing a data lake is a hands-on effort, and vendor-specific certifications can help validate the skills involved. The leading cloud and data lake vendors all offer some form of training and certification.
AWS Certified Data Analytics -- Specialty
Amazon Web Services and its S3 cloud object storage service are commonly used to enable data lakes.
This certification is geared toward people with experience working with AWS services. It validates skills in using AWS data lakes and analytics services.
Google Professional Data Engineer
This Google certification provides an examination that verifies skills to build, deploy and use data models that benefit from data lake and analytics services running in Google Cloud.
Microsoft Certified: Azure Data Engineer Associate
Microsoft's Azure Data Lake Storage Gen2 is a popular option for building data lakes. With this certification, Microsoft provides validation for those looking to use Microsoft services for data lakes.
Databricks Lakehouse Platform Essentials
This certification helps users learn and validate data lakehouse skills on the Databricks platform. The platform integrates multiple open source technologies, including Apache Spark and Delta Lake.
Cloudera CCP Data Engineer
Cloudera is often associated with the open source Hadoop big data technology, which is one of the technologies that gave rise to the data lake concept. This certification validates the skills required to ingest, transform, store and analyze data in the Cloudera environment.
Informatica Cloud Data Warehouse & Data Lake Modernization Foundation Level
This certification is designed to help organizations that are migrating from a data warehouse to a cloud data lake or data lakehouse model.
Dremio data lake training
Dremio, a company that builds data lakehouse technology, has expanded its training options with Dremio University, which provides certificates of completion.