Apache Hudi grows cloud data lake maturity

The open source Apache Hudi project helps large organizations like Uber with stream processing capabilities that handle billions of records on data lakes every day.

The Apache open source data lake project has matured, as organizations around the world embrace the technology.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data lake project that enables stream data processing on top of Apache Hadoop-compatible cloud storage systems, including Amazon S3.

The project was originally developed at Uber in 2016, became open source in 2017 and entered the Apache Incubator in January 2019. As an open source effort, Hudi has been adopted by major tech vendors including Alibaba, Tencent, Uber and Kyligence.

On June 4, Hudi -- pronounced "hoodie" -- officially became a Top-Level Project at the Apache Software Foundation (ASF), a milestone that signifies the project has achieved a high level of code maturity and developer community involvement. The ASF is home to Hadoop, Spark, Kafka and other widely used database and data management programs.

How Hudi enables Uber's cloud data lake

While Hudi is now an open source effort used by multiple organizations, Uber has been a stalwart user.

Tanvi Kothari, engineering manager for data at Uber, said Uber uses Hudi to process over 500 billion records per day into Uber's 150+ petabyte (PB) data lake. 

Kothari runs the Global Data Warehouse team at Uber, which is responsible for the core data tables that serve Uber's entire business. She noted that Hudi supports incremental processing, for both reads and writes, for more than 10,000 tables and thousands of data pipelines at Uber.

Apache Hudi provides a stream processing layer on top of Hadoop Distributed File System (HDFS) compatible data stores, including Amazon S3.

"Hudi abstracts away a lot of the challenges that come with processing big data," Kothari said. "It helps you scale your ETL [Extract, Transform, Load] pipelines and improve data fidelity."

Hudi as a building block for cloud data lake analytics

Among the organizations that use Apache Hudi as a component of a larger offering is big data analytics vendor Kyligence Solutions, which has offices in Shanghai, China and San Jose, Calif. Shaofeng Shi, partner and chief architect at Kyligence, explained that his firm uses a number of Apache open source projects, including Apache Kylin, Hadoop and Spark technologies to help businesses manage data.

Shi said Apache Hudi provides Kyligence with a way to manage changing data sets directly on Hadoop Distributed File System (HDFS) or Amazon S3.

Kyligence started to use Hudi in 2019 for a U.S. customer. Conveniently, Shi noted, AWS unveiled an integration between Hudi and its Amazon Elastic MapReduce (EMR) service around the same time. The Kyligence Cloud service now also supports Hudi as a source format for online analytical processing for all its users.

The graduation of Hudi to a Top-Level Project at Apache is an achievement Shi said he's happy to see. Hudi has an open and enthusiastic community that even translated a series of Hudi articles into Chinese to make it easier for Chinese users to learn about the technology, he said.

How Hudi works to enable cloud data lake stream processing

Hudi provides the ability to consume streams of data and enables users to update data sets, said Vinoth Chandar, co-creator and vice president of Apache Hudi at the ASF.

Chandar said he sees the stream processing that Hudi enables as a style of data processing in which data lake administrators process data in incremental batches and can then put that data to use right away.

"A good way to actually think about Hudi is as a data store or database that provides transactional capabilities on top of data stored in [AWS] S3," Chandar said.
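To make the idea concrete, the following is a minimal, illustrative sketch of the two behaviors described above: upserting a batch of records as a single commit, and reading back only the records that changed after a given commit. It is a toy in-memory model, not the Hudi API. The `ToyTable` class, its methods and the sample trip records are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy in-memory stand-in for a Hudi-style table (hypothetical, not the Hudi API)."""

    # record key -> (commit id that last wrote the record, payload)
    rows: dict = field(default_factory=dict)
    # ordered list of commit ids, a simplified stand-in for a commit timeline
    commits: list = field(default_factory=list)

    def upsert(self, batch):
        """Insert new keys and overwrite existing keys as one commit."""
        commit_id = len(self.commits) + 1
        staged = dict(self.rows)  # stage changes on a copy
        for key, payload in batch:
            staged[key] = (commit_id, payload)
        # Publish the whole batch at once so readers never see half a commit.
        self.rows = staged
        self.commits.append(commit_id)
        return commit_id

    def incremental_read(self, after_commit):
        """Return only the records written by commits later than after_commit."""
        return {
            key: payload
            for key, (commit_id, payload) in self.rows.items()
            if commit_id > after_commit
        }


table = ToyTable()
c1 = table.upsert([("trip-1", {"fare": 10.0}), ("trip-2", {"fare": 7.5})])
c2 = table.upsert([("trip-2", {"fare": 8.0}), ("trip-3", {"fare": 12.0})])

# A downstream pipeline that already processed commit c1 pulls only what changed.
changed = table.incremental_read(after_commit=c1)
print(sorted(changed))  # ['trip-2', 'trip-3']
```

In this sketch, the second batch updates one existing record and adds a new one, and the incremental read returns just those two records rather than rescanning the whole table. A real deployment would persist files and commit metadata on storage such as HDFS or S3 instead of holding everything in memory.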

The graduation of Hudi to Top-Level Project status reflects the project's maturity, Chandar said.

However, though Hudi is now a Top-Level Project at Apache, it has not yet reached its 1.0 release. The most recent update is the 0.5.2 milestone, which came out on March 25.

Hudi developers are working now on the 0.6.0 release, which Chandar said is targeted for release by the end of June. That release will be a major milestone, with performance enhancements and improved data migration capabilities to help users bring data into a Hudi data lake, Chandar said.

"Our plan is to at least do a major release every quarter and then bug fix releases hopefully every month on top of the major release," he said.
