What is a data lakehouse?
A data lakehouse is a data management architecture that combines the benefits of a traditional data warehouse and a data lake. It seeks to merge the ease of access and support for enterprise analytics capabilities found in data warehouses with the flexibility and relatively low cost of the data lake.
Data warehouses were developed in the 1980s as a high-performance storage tier that supported business intelligence (BI) and analytics independent of the operational transactional database. Data warehouses also supported various data features to enable high-performance queries and ACID (atomicity, consistency, isolation and durability) transactions to ensure data integrity.
Data lakes evolved from Hadoop in the early 2000s. They provide a low-cost storage tier for unstructured and semistructured data that reduces the cost of capturing big data. They traditionally suffer issues with performance, data quality and inconsistency owing to the way the data is managed and changed.
Data lakes and data warehouses require different processes for capturing data from operational systems and moving it into the target tier. Maintaining separate systems results in significant capital expense, ongoing operational expense and management overhead. Consequently, enterprises began adopting a two-tier architecture in which data is first captured into a data lake. The data is later transformed using extract, transform and load or extract, load and transform (ETL or ELT) processes into a structured SQL format suitable for the data warehouse. However, this multi-tier architecture creates delays, complexity and additional overhead.
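To make the transform-and-load step concrete, here is a minimal sketch in Python. The record fields and table name are invented for illustration, and the standard library's sqlite3 stands in for a real warehouse engine:

```python
import sqlite3

# Raw, semistructured records as they might land in a data lake
# (field names here are invented for illustration).
raw_events = [
    {"user": "alice", "amount": "19.99", "ts": "2023-04-01"},
    {"user": "bob", "amount": "5.00", "ts": "2023-04-02"},
    {"user": "alice", "amount": "bad-value", "ts": "2023-04-03"},
]

def transform(record):
    """Coerce a raw record into the typed shape the warehouse expects."""
    try:
        return (record["user"], float(record["amount"]), record["ts"])
    except (KeyError, ValueError):
        return None  # drop records that fail validation

# Load the cleaned rows into a SQL table; sqlite3 stands in for a
# real warehouse tier here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user TEXT, amount REAL, ts TEXT)")
rows = [r for r in (transform(e) for e in raw_events) if r is not None]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))  # 24.99 -- the malformed record was dropped
```

The validation step is exactly where the brittleness discussed below comes from: a change in the upstream record shape silently drops or breaks rows.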
A data lakehouse uses several improvements in data architectures, data processing and metadata management to capture all data to a common data platform, which can be efficiently shared for machine learning and BI applications.
What does a data lakehouse do?
Data lakehouses address four key problems with the traditional two-tier architecture that spans separate data lake and data warehouse tiers:
- Reliability improves because enterprises no longer need to engineer brittle ETL data transfers between systems that can break owing to data quality issues.
- Data staleness is reduced because data is available for analytics in a few hours, compared to the multiple days it sometimes took to transfer new data into the data warehouse.
- Advanced analytics that did not work well on top of traditional data warehouses can be executed on operational data using TensorFlow, PyTorch and XGBoost.
- Cost can decrease because enterprises can reduce ongoing ETL costs, and a single tier requires half the storage compared to two separate tiers.
How does a data lakehouse work?
A data lakehouse handles several key problems related to organizing data to efficiently support the uses of traditional data lakes and data warehouses. Data lakehouses work by addressing the following issues:
Data management. A data lakehouse needs to combine the benefits of easily ingesting raw data while also organizing the data using ETL processes for high-performance BI. Traditional data lakes organized data using the Parquet or Optimized Row Columnar (ORC) file formats. Newer open source technologies such as Delta Lake and Apache Iceberg add a transactional metadata layer that supports ACID transactions on these big data workloads.
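The core idea behind those ACID guarantees is a transaction log: readers only see data files listed in committed log entries, and committing is a single atomic file creation. A much-simplified, standard-library-only sketch of that idea (real Delta Lake logs are JSON action records in a `_delta_log` directory; the helper names below are invented):

```python
import json, os, tempfile

def commit(table_dir, version, added_files):
    """Atomically commit a table version by creating a log entry.
    O_CREAT | O_EXCL fails if the version file already exists, so two
    concurrent writers cannot both claim the same version number."""
    log = os.path.join(table_dir, "_log", f"{version:020d}.json")
    fd = os.open(log, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as f:
        json.dump({"add": added_files}, f)

def snapshot(table_dir):
    """Readers rebuild the table from committed log entries only, so
    half-written data files are invisible until their commit lands."""
    log_dir = os.path.join(table_dir, "_log")
    files = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            files.extend(json.load(f)["add"])
    return files

table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, "_log"))
commit(table, 0, ["part-000.parquet"])
commit(table, 1, ["part-001.parquet"])
print(snapshot(table))  # ['part-000.parquet', 'part-001.parquet']
```

A second writer trying to commit version 1 would get a FileExistsError and retry against the new snapshot, which is the essence of optimistic concurrency control on object storage.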
Artificial intelligence (AI) and machine learning support. Machine learning and data science tools have traditionally supported the ORC and Parquet file formats. These tools are also adopting DataFrames that can make it easier to optimize data for new models on the fly.
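As a small illustration of working with DataFrames on lake data, the pandas sketch below derives model features on the fly from records as they might be read out of Parquet files. The column names and values are invented for illustration:

```python
import pandas as pd

# Records as they might be read from Parquet files in the lake
# (column names and values are invented for illustration).
df = pd.DataFrame({
    "user": ["alice", "bob", "alice", "bob"],
    "amount": [19.99, 5.00, 3.50, 12.00],
})

# Derive per-user features for a model without a separate ETL pass.
features = (
    df.groupby("user")["amount"]
      .agg(total="sum", orders="count")
      .reset_index()
)
print(features)
```

The same DataFrame operations can be pointed at a small sample during development and at the full lakehouse table in production.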
SQL tuning. Traditional data warehouses were able to fine-tune the data structures of SQL data for the most common queries. Data lakehouses can optimize the data layout for traditional data warehouse file formats like ORC and Parquet on the fly using tools like Databricks Delta Engine.
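One concrete layout optimization is data skipping: if data is sorted or clustered by a commonly filtered column and per-file minimum/maximum statistics are kept, the engine can skip whole files that cannot match a query's filter. A simplified standard-library sketch of the idea (file names and statistics are invented):

```python
# Per-file min/max statistics let a query skip files that cannot
# match its filter -- one of the layout optimizations a lakehouse
# engine applies over Parquet or ORC. Values below are invented.
files = [
    {"name": "part-0.parquet", "min_date": "2023-01-01", "max_date": "2023-03-31"},
    {"name": "part-1.parquet", "min_date": "2023-04-01", "max_date": "2023-06-30"},
    {"name": "part-2.parquet", "min_date": "2023-07-01", "max_date": "2023-09-30"},
]

def files_to_scan(files, lo, hi):
    """Return only files whose [min, max] range overlaps the filter."""
    return [f["name"] for f in files
            if f["max_date"] >= lo and f["min_date"] <= hi]

# A query filtering on May dates touches one file instead of three.
print(files_to_scan(files, "2023-05-01", "2023-05-31"))  # ['part-1.parquet']
```

Skipping is only effective because the data was laid out by date in the first place, which is why the engine's ability to reorganize data layout matters.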
What are the features of a data lakehouse?
Features of a data lakehouse include:
- Low-cost object store. A data lakehouse needs to be able to store data in a low-cost object store, such as Amazon Simple Storage Service, Azure Blob Storage or Google Cloud Storage, typically using open file formats such as ORC or Parquet.
- Transactional metadata layer. This makes it possible to apply data management features important for data warehouse performance on top of the raw low-cost storage.
- Data optimization capabilities. A data lakehouse needs to be able to optimize the data without changing the underlying format, using techniques that include data layout optimization, caching and indexing.
- Declarative DataFrame API support. Most AI tools can access raw object store data using DataFrames. Declarative DataFrame API support adds the ability to optimize the structure and presentation of the data on the fly in response to a specific data science or AI workload.
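The "declarative" part means operations build a query plan instead of executing immediately, which gives the engine room to reorder them before running. The toy class below sketches that idea in plain Python; all names are invented, and real systems such as Spark do this at far greater scale and sophistication:

```python
# A toy "declarative DataFrame": operations accumulate a plan rather
# than executing eagerly, so the engine can optimize before running.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows, self.plan = rows, plan or []

    def filter(self, pred):
        return LazyFrame(self.rows, self.plan + [("filter", pred)])

    def select(self, *cols):
        return LazyFrame(self.rows, self.plan + [("select", cols)])

    def collect(self):
        # Optimization pass: run filters before projections so rows
        # are discarded early (a much-simplified predicate pushdown).
        ops = sorted(self.plan, key=lambda op: op[0] != "filter")
        rows = self.rows
        for kind, arg in ops:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

lf = LazyFrame([{"user": "alice", "amount": 19.99},
                {"user": "bob", "amount": 5.00}])
result = lf.select("user", "amount").filter(lambda r: r["amount"] > 10).collect()
print(result)  # [{'user': 'alice', 'amount': 19.99}]
```

Because nothing runs until collect(), the engine can also choose a physical layout, cached copy or index best suited to the whole plan rather than to each step in isolation.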
What are the advantages and disadvantages of data lakehouses?
The main advantage of a data lakehouse is that it can simplify the overall data engineering architecture by providing a single staging tier for all data and all types of applications and use cases. This can lower costs because teams manage only a single source of data.
The main disadvantage of data lakehouses is that they are a relatively new kind of architecture. It's not yet clear that enterprises will see the gains promised by a unified platform, which may be more difficult to change down the road. Data engineering teams also need to master new skills associated with running a more complex metadata management layer that can organize and optimize data on the fly, and data scientists need to learn new skills associated with optimizing their declarative DataFrame queries to get the benefit of the new architecture.
What is the difference between a data lakehouse, data warehouse and data lake?
Data lakehouses differ from data warehouses and data lakes in that all data can be staged in one tier with a data lakehouse. The data can then be optimized on the fly for different kinds of queries on both structured and unstructured data.
With data warehouses, all data must go through an ETL/ELT phase that optimizes the data as it is loaded into the storage tier. In return, this leads to faster read performance for SQL analytics use cases such as business intelligence.
With a data lake, all the raw data can be captured into a single object store and then structured on read. This lowers the cost of capturing more types of structured, semistructured and unstructured data. However, it can also reduce application performance and lead to data quality issues.
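The "structured on read" approach, often called schema-on-read, can be sketched in a few lines. The record shapes below are invented; the point is that validation happens at query time rather than at ingestion:

```python
import json

# Raw lines as they might sit in an object store: schemas vary and
# nothing is validated at write time.
raw = [
    '{"user": "alice", "amount": 19.99}',
    '{"user": "bob"}',
    'not json at all',
]

def read_with_schema(lines):
    """Apply structure at read time. Malformed or incomplete records
    surface here rather than at ingestion -- capture is cheap, but
    quality problems are deferred to every reader."""
    good, bad = [], 0
    for line in lines:
        try:
            rec = json.loads(line)
            good.append((rec["user"], float(rec["amount"])))
        except (ValueError, KeyError, TypeError):
            bad += 1
    return good, bad

rows, rejected = read_with_schema(raw)
print(rows, rejected)  # [('alice', 19.99)] 2
```

A lakehouse keeps this cheap capture path but layers transactional metadata on top, so quality checks can be enforced once, at the table level, instead of in every reader.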