Hadoop data lake

A Hadoop data lake is a data management platform comprising one or more Hadoop clusters. It is used principally to process and store nonrelational data, such as log files, internet clickstream records, sensor data, JSON objects, images and social media posts.

Such systems can also hold transactional data pulled from relational databases, but they're designed to support analytics applications, not to handle transaction processing. As public cloud platforms have become common sites for data storage, many people build Hadoop data lakes in the cloud.

Hadoop data lake architecture

While the data lake concept can be applied more broadly to include other types of systems, it most frequently involves storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes based on commodity server hardware. The reliance on HDFS has, over time, been supplemented with data stores using object storage technology, but non-HDFS Hadoop ecosystem components typically are part of the enterprise data lake implementation.

With the use of commodity hardware and Hadoop's standing as an open source technology, proponents claim that Hadoop data lakes provide a less expensive repository for analytics data than traditional data warehouses. In addition, their ability to hold a diverse mix of structured, unstructured and semistructured data can make them a more suitable platform for big data management and analytics applications than data warehouses based on relational software.

However, a Hadoop enterprise data lake can be used to complement an enterprise data warehouse (EDW) rather than to supplant it entirely. A Hadoop cluster can offload some data processing work from an EDW and, in effect, stand in as an analytical data lake. In such cases, the data lake can host new analytics applications, and altered data sets or summarized results can then be sent to the established data warehouse for further analysis.
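The offloading pattern can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the sample sales events, the `summarize` helper, the `sales_summary` table, and the use of an in-memory SQLite database as a stand-in for the EDW. A real deployment would run the aggregation on the cluster and load the results through the warehouse's own interfaces.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw events as they might sit in the data lake.
raw_events = [
    {"store": "north", "amount": 20.0},
    {"store": "north", "amount": 5.5},
    {"store": "south", "amount": 12.0},
]


def summarize(events):
    """Heavy lifting done on the lake side: total sales per store."""
    totals = defaultdict(float)
    for event in events:
        totals[event["store"]] += event["amount"]
    return sorted(totals.items())


# In-memory SQLite stands in for the enterprise data warehouse (assumption).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_summary (store TEXT PRIMARY KEY, total REAL)"
)

# Only the summarized results are sent on to the warehouse.
warehouse.executemany(
    "INSERT INTO sales_summary VALUES (?, ?)", summarize(raw_events)
)
warehouse.commit()

result = warehouse.execute(
    "SELECT store, total FROM sales_summary ORDER BY store"
).fetchall()
```

The warehouse receives two summary rows rather than every raw event, which is the sense in which the cluster takes processing work off the EDW.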

Figure: Schema of a Hadoop data lake. An emerging style of Hadoop data lake architecture supports storage centered on open source data processing frameworks.

Hadoop data lake best practices

The contents of a Hadoop data lake need not be immediately incorporated into a formal database schema or consistent data structure, which allows users to store raw data as is; information can then either be analyzed in its raw form or prepared for specific analytics uses as needed.

As a result, data lake systems tend to employ extract, load and transform (ELT) methods for collecting and integrating data, instead of the extract, transform and load (ETL) approaches typically used in data warehouses. Data can be extracted and processed outside of HDFS using MapReduce, Spark and other data processing frameworks.
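The difference between loading first and transforming later can be sketched in a few lines of Python. This is a minimal illustration, not a Hadoop implementation: the sample records, the `load_raw` and `read_with_schema` helpers, and the `CLICK_SCHEMA` mapping are all hypothetical, and a plain list stands in for the raw store.

```python
import json

# Hypothetical clickstream lines, kept exactly as they arrived.
RAW_LINES = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": "95", "referrer": "ad"}',
    "not valid json",
]


def load_raw(lines):
    """Load step: store every line untouched; no schema is enforced here."""
    return list(lines)


def read_with_schema(raw_store, schema):
    """Transform step, applied at read time: parse, coerce, skip bad rows."""
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines remain in the raw store but are skipped
        if not isinstance(record, dict):
            continue
        try:
            # Keep only the fields this analysis needs, cast to the wanted types.
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue


# One possible schema for one analytics use; others can coexist.
CLICK_SCHEMA = {"user": str, "page": str, "ms": int}

raw_store = load_raw(RAW_LINES)
rows = list(read_with_schema(raw_store, CLICK_SCHEMA))
```

Because the raw store still holds all three lines, a later analysis could apply a different schema, for example one that keeps `referrer`, without re-ingesting anything.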

Despite the common emphasis on retaining data in a raw state, data lake architectures often strive to employ schema-on-the-fly techniques to begin to refine and sort some data for enterprise uses. As a result, Hadoop data lakes have come to hold both raw and curated data.

As big data applications become more prevalent in companies, the data lake often is organized to support a variety of applications. While early Hadoop data lakes were often the province of data scientists, increasingly, these lakes are adding tools that allow analytics self-service for many types of users.

Hadoop data lake uses, challenges

Potential uses for Hadoop data lakes vary. For example, they can pool varied legacy data sources, collect network data from multiple remote locations and serve as a way station for data that is overloading another system.

Experimental analysis and archiving are among other Hadoop data lake uses. They have also become an integral part of Lambda architectures, a design pattern that couples batch with real-time data processing.
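The coupling of batch and real-time processing can be sketched in Python as follows. This is a toy model under stated assumptions: the `batch_view`, `SpeedLayer` and `serve` names and the page-view events are invented for illustration, and in-memory counters stand in for the cluster's batch jobs and a stream processor.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute page-view counts from the full stored history."""
    return Counter(event["page"] for event in historical_events)


class SpeedLayer:
    """Speed layer: counts events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1


def serve(batch_counts, speed_counts, page):
    """Serving layer: answer a query by combining both views."""
    return batch_counts[page] + speed_counts[page]


# History already stored in the data lake (hypothetical sample).
history = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch = batch_view(history)

# Events that arrived since the batch view was computed.
speed = SpeedLayer()
speed.ingest({"page": "/home"})
speed.ingest({"page": "/checkout"})

home_views = serve(batch, speed.counts, "/home")
```

In this arrangement the data lake plays the role of the complete, append-only history that the batch layer periodically reprocesses, while the speed layer covers only the gap since the last batch run.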

The Hadoop data lake isn't without its critics or challenges for users. Spark, as well as the Hadoop framework itself, can support file architectures other than HDFS. Meanwhile, data warehouse advocates contend that similar architectures -- for example, the data mart -- have a long lineage and that Hadoop and related open source technologies still need to mature significantly in order to match the functionality and reliability of data warehousing environments.

Experienced Hadoop data lake users say that a successful implementation requires a strong architecture and disciplined data governance policies; without those things, they warn, data lake systems can become out-of-control dumping grounds. Effective metadata management typically helps to drive successful enterprise data lake implementations.
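One way to picture metadata management acting as a governance control is a catalog that refuses to register a data set unless it has an owner, a description and a recognized zone. The Python sketch below is purely illustrative: `DatasetEntry`, `Catalog`, the zone names and the sample paths are assumptions, not part of any Hadoop component.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_ZONES = {"raw", "curated"}  # hypothetical zone names


@dataclass(frozen=True)
class DatasetEntry:
    path: str
    owner: str
    description: str
    zone: str
    registered: date
    tags: tuple = ()


class Catalog:
    """Toy metadata catalog that enforces minimal governance rules."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Reject undocumented or unowned data so the lake does not
        # fill up with files nobody can identify later.
        if not entry.owner.strip() or not entry.description.strip():
            raise ValueError(f"{entry.path}: owner and description are required")
        if entry.zone not in ALLOWED_ZONES:
            raise ValueError(f"{entry.path}: unknown zone {entry.zone!r}")
        self._entries[entry.path] = entry

    def search(self, tag):
        """Return the paths of all data sets carrying the given tag."""
        return sorted(e.path for e in self._entries.values() if tag in e.tags)


catalog = Catalog()
catalog.register(DatasetEntry(
    path="/lake/raw/clickstream",
    owner="web-team",
    description="Unprocessed clickstream logs",
    zone="raw",
    registered=date(2018, 1, 15),
    tags=("web", "logs"),
))
catalog.register(DatasetEntry(
    path="/lake/curated/sessions",
    owner="analytics",
    description="Sessionized clickstream data",
    zone="curated",
    registered=date(2018, 1, 16),
    tags=("web",),
))
```

A search such as `catalog.search("web")` then lets users find both the raw and curated data sets by tag instead of browsing directories.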

Hadoop vs. Azure Data Lakes

There are other versions of data lakes, which offer similar functionality to the Hadoop data lake and also tie into HDFS.

Microsoft launched its Azure Data Lake for big data analytical workloads in the cloud in 2016. It is compatible with Azure HDInsight, Microsoft's data processing service based on Hadoop, Spark, R and other open source frameworks. The main components of Azure Data Lake are Azure Data Lake Analytics, which is built on Apache YARN; Azure Data Lake Store; and U-SQL. The service uses Azure Active Directory for authentication and access control lists and includes enterprise-level features for manageability, scalability, reliability and availability.

Around the same time that Microsoft launched its data lake, Amazon Web Services (AWS) launched Data Lake Solutions -- an automated reference data lake implementation that guides users through creation of a data lake architecture on the AWS cloud, using AWS services, such as Amazon Simple Storage Service (S3) for storage and AWS Glue, a managed data catalog and ETL service.

This was last updated in January 2018
