Cloudera, the vendor that was once best known for its Hadoop big data efforts, is pressing forward into the world of cloud data lake technology with support for the Apache Iceberg cloud data lake table format.
Cloudera on Thursday said it is now supporting the open source format for its Cloudera Data Platform technology.
In recent years, Iceberg has emerged as a widely supported approach for providing structure to cloud data lakes. Many competing platforms, including Snowflake's Data Cloud, Dremio's data lakehouse and Starburst's Trino SQL-based data platform, already support Iceberg.
For Cloudera, the move to support Iceberg is part of the vendor's evolution as it modernizes from its Hadoop roots, where the entire concept of a data lake originated. A data lake, after all, is often defined as Hadoop Distributed File System (HDFS)-compatible storage.
In this Q&A, Ram Venkatesh, CTO of Cloudera, provides insight into the evolution of the data lake and the vendor's move to support Apache Iceberg.
Why is Cloudera now supporting Apache Iceberg for cloud data lakes?
Ram Venkatesh: We have done several generations of the data lake architecture, and Iceberg is a next generation of the same concept.
The key thing that Iceberg does for us is it enables us to work with cloud even better than we have in the past. It's convenient for people to put us in the Hadoop box, but with CDP [Cloudera Data Platform], which is a hybrid data platform, we have a technology that works not just on premises but also on AWS, Azure and Google Cloud. Iceberg was created with the need to support cloud stores from the ground up, and that's very exciting for us.
Iceberg also enables us to expose the same data set to multiple different analytical engines, including Spark, Hive, Impala and Presto. I think Iceberg, as an architecture, is the next big evolution of the data lake.
How have you seen the data lake concept evolve from its origination with Hadoop?
Venkatesh: A data lake in the initial days was just a big bucket with an HDFS API. We could tell customers that they didn't need to throw away data and could keep it all.
That was a great place to start, and then we said, "Let's bring some analytics to the data that's in your data lake with SQL." So the data lake then grew from just being a repository to a place where analytics can happen.
Now what's happening with data lakes is the rise of the data set, where there is a curated set of data that has semantics and schema. Increasingly for our customers, the data lake happens to be a place where data sets are discovered, created, processed and maintained.
Data lakes have become part of the overall mission-critical enterprise setup, as organizations are relying on data that's in data lakes to actually perform mission-critical analytics.
What challenges have you seen enterprises face with data lakes today?
Venkatesh: Organizations have multiple locations for data deployments. They might be using AWS for some data, have other applications in Microsoft Azure, and keep another set of data on premises. Being able to work efficiently across all the different deployments can be a challenge.
What customers really want is to be able to run a workload on premises. Then if at a future point they decide that the workload should really run on AWS, organizations want to take the workload with all the data, processing and security policies, and just move that over to another deployment they have in the cloud.
What is the legacy of Hadoop, and what is the path forward beyond Apache Iceberg?
Venkatesh: I think that there is the Hadoop ecosystem and then there's Hadoop itself. I think Hadoop itself is relevant for a set of customers that we continue to support.
Cloudera is actively supporting and helping customers keep their Hadoop installations running. But from a future standpoint, we see that a lot of the future-looking use cases are going to be more around this notion of a storage layer that's disaggregated from compute.
The Hadoop ecosystem, which includes things like Spark for analyzing data at scale, is alive and well. I also see projects like Iceberg as being part of the ecosystem, and it shows that innovation is happening and the ecosystem is getting larger, not smaller.
For organizations that have made significant investments in on-premises Hadoop technology, I see Apache Ozone as being a very natural successor to HDFS. We now have customers using Ozone at scale, and I see that as the natural evolution of where a lot of the HDFS deployments in the world are going to go over the next two or three years.