The open source Apache Hudi data lake project is helping power large deployments at a number of big enterprises, including Uber, Walmart and Disney+ Hotstar.
Apache Hudi (Hadoop Upserts, Deletes and Incrementals) is a technology that was originally developed at Uber in 2016 and became an open source project the following year.
In June 2021, Hudi became a Top-Level Project at the Apache Software Foundation, which was a major milestone for the project's maturity. Hudi provides a series of capabilities for data lakes, including a table format and services that enable organizations to effectively manage data for data queries, operations and analytics.
While Uber was the first big user of Hudi, other major enterprises have adopted it over the years. During a virtual meetup on Jan. 11, users from Uber, Walmart and Disney+ Hotstar detailed their growing deployments of Hudi.
Moving from HBase to Apache Hudi at Disney+ Hotstar
During the virtual meetup, Vinay Patil, senior software development engineer at Disney+ Hotstar, explained how his organization chose to move to Apache Hudi.
Disney+ Hotstar is an online media streaming service in India that serves a growing number of users. The service was originally built using the open source HBase data management system for its data lake but needed to boost scale and improve performance.
Patil noted that Disney+ Hotstar's data lake handles about 500 million read requests a day, and that with HBase the system was complex to manage and operate.
Patil and his team began looking at other frameworks for data lake management. After a proof-of-concept evaluation of Apache Hudi, Disney+ Hotstar decided to migrate to Hudi.
While there were some challenges with Hudi, including schema issues, Patil noted that the Hudi project has been able to fix those problems over time.
Uber continues to see benefits from Apache Hudi
Uber, which is where the Hudi project got its start, is continuing to both contribute to Apache Hudi and use it as part of the ride-sharing giant's data architecture.
During the virtual meetup, Meenal Binwade, senior software engineer at Uber, said that Apache Hudi currently powers an Uber data lake that manages more than 200 petabytes of data, spanning 7,000-plus tables and processing 800 billion records daily.
Hudi ingests data from different sources, including databases as well as Kafka event streams, and puts all the data into the Uber data lake. Data stored in the data lake is queried with multiple query engines, including Presto, she said.
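An ingestion pipeline like the one Binwade described is typically wired up through Hudi's Spark DataSource options. The sketch below is a minimal, hypothetical illustration of the core write options for upserting a Kafka-derived DataFrame into a Hudi table; the table name, key fields and storage path are assumptions, not details from Uber's deployment.

```python
def hudi_upsert_options(table_name, record_key, precombine_field):
    """Build the core Hudi write options for an upsert into a copy-on-write table.

    These are standard Hudi DataSource config keys; the values passed in
    here are illustrative only.
    """
    return {
        "hoodie.table.name": table_name,
        # Field that uniquely identifies a record within the table
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when keys collide
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Upsert rather than plain insert, so updates merge into existing files
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }

# With a live SparkSession and a DataFrame `df`, the write would look like:
# (df.write.format("hudi")
#    .options(**hudi_upsert_options("trips", "trip_id", "event_ts"))
#    .mode("append")
#    .save("s3://data-lake/trips"))
```

Once written this way, the table can be registered with a metastore and queried from engines such as Presto, which reads the Hudi file layout directly.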
Binwade detailed several Hudi data table services that Uber uses.
Among the capabilities is a table cleaning service that cleans up old data snapshot files, freeing up storage space. She noted that there is also a compaction service that compacts data, while the replication service replicates data incrementally across data centers.
Uber has recently also started to use the Hudi table clustering service. Binwade said the goal of the service is to rewrite data to optimize its layout for queries while preserving data freshness.
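The table services Binwade described are configured through writer options rather than separate jobs when run inline. The dictionary below is a hypothetical sketch of what such a configuration might look like; the retention and trigger values are assumptions for illustration, not Uber's settings.

```python
# Illustrative Hudi table-service configuration: cleaning, compaction
# (for merge-on-read tables) and clustering. Values are assumptions.
table_service_options = {
    # Cleaning: keep file versions for the last N commits, delete older
    # snapshot files to free up storage space
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
    # Compaction: merge delta log files into base files after N delta commits
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: rewrite data to optimize file sizes and layout for queries
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}
```

These options would be merged into the writer's option map alongside the basic table and key configuration, so the services run as part of each ingestion commit.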
Hudi helps Walmart fill the data lake
During the virtual meetup, Sam Guleff, engineering manager at Walmart, the world's largest retailer, explained how Walmart uses Hudi.
Guleff came to Walmart by way of Jet.com, which was acquired by Walmart in 2016 to help build out the retailer's e-commerce efforts.
Guleff recounted how Jet.com had built its own custom framework for merging data into an HDFS data store for a data lake. It became clear to Guleff and his team over time that building and maintaining their own custom framework for data lakes was a losing proposition.
As for why his team chose Hudi, Guleff explained that it was the first data lake framework they evaluated, and that they wanted an open source approach.
Walmart conducted its initial evaluation in 2019, before Databricks had made Delta Lake open source.
Apache Iceberg, another emerging open source data lake platform, wasn't a mature effort at that time either, another reason Walmart decided on Hudi. That said, Guleff noted that Walmart is always looking at different technologies and has plans to reevaluate both Delta Lake and Iceberg at some point.
As part of its production pipeline, Guleff said Walmart is merging approximately 330 GB of data per day into its Apache Hudi cloud data lake. Walmart isn't just a user of Hudi at this point either. The company has also contributed bug fixes and code to the project.
"We're involved in ecosystem and we want to contribute back," Guleff said.