Apache Beam users detail stream data processing applications
At a Beam conference, users detail how they're using the open source data processing technology at large scale in support of a growing number of data pipelines.
The open source Apache Beam batch and stream data processing technology is finding a home in a growing number of large organizations.
At the recent Beam Summit hybrid conference, users from Google, Twitter, Spotify, Adobe, Intuit, LinkedIn and others outlined how and why they are using the Apache Beam technology.
Beam became a top-level project at the Apache Software Foundation in 2017. It enables organizations to manage data pipeline workflows for both batch and stream data processing.
Apache Beam is widely used at Google, according to Kerry Donny-Clark, engineering manager for Apache Beam at Google.
During a keynote on July 18, Donny-Clark noted that Google uses Beam to support data processing for YouTube, Waze, the Vertex AI machine learning platform and the Google Dataplex data fabric.
Google has no overarching mandate directing its service teams to use Beam; rather, each team adopted it because it met that team's needs, Donny-Clark said. He highlighted that Beam supports multiple languages, including Java, Python and Go, which is helpful for developers who use different programming tools.
"There's no command that Google developers need to use Beam," Donny-Clark said. "But they found Beam useful for a wide variety of use cases throughout the company and of course, that tells me that Beam can support things truly at Google scale."
Spotify Wrapped powered by Beam stream data processing
Streaming music service provider Spotify is also a big user of Apache Beam.
In another keynote, Spotify data engineer Rickard Zwahlen said his organization has used Beam since it moved away from its own on-premises Hadoop cluster for data processing.
One of the largest data processing jobs that Spotify runs is in support of the Spotify Wrapped service, which provides a year-end wrap-up for users about what music they listened to.
"At this point Beam pipelines are a large majority of all our scheduled jobs," Zwahlen said.
Using Beam has not been without some problems for Spotify. A challenge Spotify faced when moving on from Hadoop was that much of the tooling that the company was using was written in the Scala programming language, which is not directly supported by Beam. So Spotify built its own open source project, Scio, that provides a Scala API to interface with Apache Beam.
Twitter increasingly using Beam for data processing
Twitter is also embracing Apache Beam to support its vast data pipeline infrastructure.
In a Monday keynote, Lohit Vijayarenu, senior staff engineer at the social media platform, said that on a typical day, Twitter has more than 50,000 data pipelines running across various data processing systems. All of those systems generate more than 200 petabytes of data every day.
Twitter has used a number of different technologies for data stream processing over the years, including Apache Heron, a technology that Twitter originally developed.
After evaluating different technologies, Twitter decided Beam would be a good fit and first implemented it to support Twitter's ad engagement analytics platform. Vijayarenu cited improved performance, programming language support and a thriving open source community as among the key reasons Beam was an attractive choice for Twitter.
After initially deploying on the ad engagement system, Vijayarenu noted that Beam is now supporting all of Twitter's machine learning data pipeline traffic as well as system monitoring for the Twitter service itself.
"The future for Beam at Twitter looks very bright," Vijayarenu said. "Our goal is to see if we can migrate all the pipeline's data processing pipelines into Apache Beam, trying to unify both batch and stream processing into one place."
The Beam Summit was held July 18-20, with the in-person events in Austin, Texas.