A variety of open source, real-time data streaming platforms are available today for enterprises looking to drive business insights from data as quickly as possible. The options include Spark Streaming, Kafka Streams, Flink, Hazelcast Jet, Streamlio, Storm, Samza and Flume -- some of which can be used in tandem with each other.
Enterprises are adopting these real-time data streaming platforms for tasks such as making sense of a business marketing campaign, improving financial trading or recommending marketing messages to consumers at critical junctures in the customer journey. These are all time-critical areas that can be used for improving business decisions or baked into applications driven by data from a variety of sources.
With the open source community offering several options for real-time data streaming -- each with its own strengths -- which is best suited for your organization? Experts and data decision-makers discuss below.
Selection criteria and overview
Before choosing a platform, IT decision-makers need to settle on key selection criteria. These include target use cases, processing semantics -- exactly once or at least once -- and application language support, according to Kevin Petrie, senior director and technology evangelist at data integration vendor Attunity, which was acquired by Qlik. Exactly once processing means that each record is delivered and consumed once and only once.
Petrie said he believes that exactly once processing semantics are important, especially for finance applications. Kafka Streams, Spark Streaming, Flink and Samza support exactly once processing.
Some of the other real-time data streaming platforms don't natively support exactly once processing. Storm requires another layer called Trident to achieve exactly once, and Flume only supports at least once processing, which can lead to duplicate records that hurt data quality and consume extra bandwidth and CPU, Petrie said.
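To make the duplicate-record problem concrete, here is a minimal Python sketch. It assumes each record carries a unique key; the Payment type, the record_id field and the sample transactions are hypothetical. It only illustrates the effect of redelivery on a running total and is not how Kafka Streams, Flink or Trident implement exactly once processing internally.

```python
from typing import Iterable, NamedTuple


class Payment(NamedTuple):
    record_id: str  # hypothetical unique key assigned by the producer
    amount: int     # amount in cents


def total_at_least_once(deliveries: Iterable[Payment]) -> int:
    """Sum every delivery, including any redelivered duplicates."""
    return sum(p.amount for p in deliveries)


def total_deduplicated(deliveries: Iterable[Payment]) -> int:
    """Sum each record_id only once, ignoring redeliveries."""
    seen = set()
    total = 0
    for p in deliveries:
        if p.record_id in seen:
            continue  # duplicate delivery: already counted
        seen.add(p.record_id)
        total += p.amount
    return total


deliveries = [
    Payment("tx-1", 1000),
    Payment("tx-2", 2500),
    Payment("tx-2", 2500),  # simulated redelivery after a missed acknowledgment
    Payment("tx-3", 400),
]

print(total_at_least_once(deliveries))  # 6400: tx-2 is counted twice
print(total_deduplicated(deliveries))   # 3900: each transaction counted once
```

In a finance application, the gap between the two totals is the kind of data quality problem Petrie describes.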
Spark Streaming and Flink shine in the area of application language compatibility -- with support for Java, Scala and Python languages, Petrie said. Generally, developers can use Java or Scala with most of these processing platforms. Kafka's KSQL is appealing to data professionals with more traditional SQL backgrounds because, as the name suggests, it provides an interactive SQL interface.
Additionally, many enterprises use Attunity software to automate the process for publishing transactional data to Kafka at high scale and low latency, with minimal disruption to production systems. Enterprises tend to prefer Spark Streaming when they need to run stream processing on top of these Kafka transactional data streams.
What exactly constitutes real-time?
There is considerable debate over what real-time means for these data platforms. "Real-time is business time," Forrester analyst Mike Gualtieri said. In financial trading, for example, real-time may have requirements on the order of milliseconds or microseconds. Most business applications, however, work fine when real-time results can be delivered in a few seconds or even a few minutes. These windows are still much smaller than batch-oriented analytics that may require hours or days to deliver results.
Spark has a larger community
Spark Streaming, a stream analytics service directly integrated into the Apache Spark platform, has become the most popular open source, real-time streaming analytics platform, said Mike Gualtieri, an analyst at Forrester Research. An earlier version of Spark Streaming used a microbatch process for stream processing. It ran small batch jobs in quick succession to approximate streaming, which led to some performance challenges. As a result, the Spark community, which continues to grow, has reimplemented Spark Streaming to provide better performance and lower latency.
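The microbatch idea can be sketched in a few lines of Python. This is an illustration of the general technique under simple assumptions, not Spark's implementation: the micro_batches function, the one-second interval and the sample events are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Group events into consecutive fixed-length intervals (microbatches)."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival_time, payload in events:
        batch_index = int(arrival_time // interval)
        batches[batch_index].append(payload)
    return dict(batches)


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
batches = micro_batches(events, interval=1.0)
print(batches)  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

Event "a" arrives at 0.1 seconds but cannot be processed until its batch closes at 1.0 seconds. That wait, bounded by the batch interval, is the latency cost that a record-at-a-time engine avoids.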
Beyond exactly once processing, access to all components of the Apache Spark platform, and support for Java, Scala and Python languages, Spark Streaming supports the merging of streaming data with historical data.
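As a rough illustration of what merging streaming data with historical data means, the following Python sketch compares each incoming event against a stored baseline. The customer names, amounts, threshold factor and flag_unusual function are hypothetical, and a real Spark job would express this as a join rather than a dictionary lookup.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical historical data: average purchase per customer, in cents.
historical_avg: Dict[str, int] = {"alice": 5000, "bob": 1200}


def flag_unusual(
    stream: Iterable[Tuple[str, int]],
    history: Dict[str, int],
    factor: float = 3.0,
) -> Iterator[Tuple[str, int, bool]]:
    """Yield each event with a flag set when it exceeds factor x the baseline."""
    for customer, amount in stream:
        baseline = history.get(customer)
        # Customers with no history are never flagged in this sketch.
        unusual = baseline is not None and amount > factor * baseline
        yield customer, amount, unusual


stream = [("alice", 4800), ("bob", 9000), ("carol", 700)]
results = list(flag_unusual(stream, historical_avg))
print(results)
# [('alice', 4800, False), ('bob', 9000, True), ('carol', 700, False)]
```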
Flink has technical respect
Despite being less dominant than Spark Streaming, Flink is known to be much more real time than Spark, Gualtieri said. Flink can also run pipelines written with Apache Beam, the open source programming model to which Google contributed, for real-time processing. Flink has a much smaller community, but it has extreme technical respect, according to Gualtieri.
"Flink has some prospects as the chief competitor to Spark in the open source world," Gualtieri said.
The Flink community has also been making progress on streaming SQL, which helps business analysts build reporting and simple applications on real-time data, said Michael Winters, product manager at Camunda, a business process management vendor. Streaming SQL greatly expands the user base of a streaming platform. Uber, for example, built an internal company platform called AthenaX to make streaming SQL widely accessible across the organization.
Kafka shines for microservices
Kafka Streams is one of the leading real-time data streaming platforms and is a great tool to use either as a big data message bus or to handle peak data ingestion loads -- something that most storage engines can't handle, said Tal Doron, director of technology innovation at GigaSpaces, an in-memory computing platform.
However, it also introduces additional latency in real-time scenarios for three reasons: It's another component in the workflow, it relies on disk-based data duplication to provide high availability and it has no event-driven capabilities.
Kafka Streams is often used on the back end for integrating microservices and may complement other real-time data streaming platforms, like Spark and Flink. Most of the other real-time data streaming platforms can integrate with Kafka to enable stream processing and stream analytics. Kafka often sends data to these platforms to be analyzed.
For example, Cloud Elements, an API integration platform, has adopted Kafka Streams as a service mesh in its migration from a monolithic application to microservices. Ross Garrett, vice president of product at Cloud Elements, said that Kafka stood out as the best option for this migration.
In many cases, request-response patterns are not the most efficient way for communication between microservices since they create coupling and dependencies that are counter to the objectives of a true microservices architecture. Instead, an event-oriented pattern removes the dependencies created by direct service calls. Kafka Streams is an ideal solution to manage these event streams, Garrett said.
Garrett added that the Kafka Streams API is incredibly lightweight, making stream processing available as an application programming model to each microservice individually, while leaning on the benefits from Kafka's core competencies around scalability and fault tolerance.
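The difference between request-response and event-oriented communication can be shown with a toy, in-process example. The EventBus class, the "order-placed" topic and the two subscribers below are hypothetical stand-ins for a Kafka topic and its consumers; this is not the Kafka Streams API, and it leaves out persistence, partitioning and fault tolerance.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Toy in-process stand-in for a topic-based event stream."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Dict) -> None:
        # The publisher does not know who consumes the event, or how many.
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
shipped: List[str] = []
emailed: List[str] = []

# Two independent services react to the same event.
bus.subscribe("order-placed", lambda e: shipped.append(e["order_id"]))
bus.subscribe("order-placed", lambda e: emailed.append(e["order_id"]))

# The order service publishes once and calls neither service directly.
bus.publish("order-placed", {"order_id": "o-42"})

print(shipped, emailed)  # ['o-42'] ['o-42']
```

Adding a third consumer requires only another subscribe call. The publishing service does not change, which is the decoupling Garrett describes.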
Using Kafka as an integration tier
Attunity's Petrie is seeing many of the vendor's customers layering stream processing on top of Kafka to address real-time processing and analytics use cases. For example, one of the largest payment processors in Europe uses Attunity to copy transactions in real time to a Spark-based machine learning platform that continuously checks fraud risk.
As with any technology, data and analytics teams need to weigh the advantages of specialization against the complexity and additional work it creates. Most enterprises that Attunity works with tend to keep things relatively simple -- by coupling Spark with Kafka to efficiently address multiple use cases, for example.
Additionally, a Fortune 100 food processing firm Attunity works with uses Spark and Kafka to optimize its supply chain. This approach also can support more advanced use cases, as is the case with a Fortune 100 pharmaceutical firm that is using Attunity software to feed clinical records into a lambda architecture for both historical and real-time machine learning, Petrie said.