Confluent on Wednesday introduced a new infinite data retention capability for Apache Kafka event streaming in the cloud.
The new feature, now generally available, is part of Confluent's Project Metamorphosis, a series of efforts that got underway on May 6 to improve Apache Kafka operations. Kafka is a popular event streaming technology that can generate many storable individual data points.
The challenge often is how to store that data: Kafka users typically retain data for only seven days, according to Confluent. The promise of the infinite retention capability on Confluent Cloud is that Kafka users can now easily store as much data as they want.
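To put the seven-day figure in concrete terms: in open source Apache Kafka, time-based retention is controlled per topic by the `retention.ms` setting, which defaults to 604,800,000 milliseconds (seven days), and a value of -1 disables time-based deletion. The sketch below only does the arithmetic. The helper names `retention_ms` and `topic_config` are hypothetical, and the article does not describe how Confluent Cloud exposes infinite retention, so this should not be read as Confluent's implementation.

```python
from typing import Dict, Optional

MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def retention_ms(days: Optional[float]) -> int:
    """Convert a retention period in days to a `retention.ms` value.

    Passing None means "keep forever", which Apache Kafka expresses as -1.
    """
    if days is None:
        return -1
    if days < 0:
        raise ValueError("retention period cannot be negative")
    return int(days * MS_PER_DAY)


def topic_config(days: Optional[float]) -> Dict[str, str]:
    """Build a topic-level config fragment (hypothetical helper).

    Kafka topic configs are passed as string key/value pairs.
    """
    return {"retention.ms": str(retention_ms(days))}


if __name__ == "__main__":
    print(topic_config(7))     # the seven-day default
    print(topic_config(None))  # no time-based deletion
```

Retention by time is only one limit. Storage capacity on the brokers is the other practical constraint, which is the scaling problem the analysts quoted here describe.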
The infinite retention feature can help address some major shortcomings of Apache Kafka, said Dave Menninger, senior vice president and research director at Ventana Research. First, Menninger noted that scaling a Kafka cluster is not trivial as it often requires a fair amount of manual effort to configure.
Second, users must scale storage and compute together. Confluent's release addresses both of these problems, making it easier to scale and enabling organizations to scale storage and compute separately to better meet their particular needs.
"It's probably unlikely that many organizations will store data in Kafka forever, but they certainly will expand the amount of history they keep in Kafka," Menninger said. "With infinite, the decision of where to store data and how much to store will be driven much more by functional requirements than by architectural limitations as it is today."
The challenges of Kafka scalability
Kafka users could store data for long periods before, but it often required multiple steps, including some form of adapter for live data and long-term storage, said Dan Rosanova, group product manager for Confluent Cloud.
One specific use case is training machine learning models, which can be used for fraud detection. Previously, users of this kind of application had to process event streaming data in real time and also train the model on historical data sets.
"Having to write an adapter to access historical data has been a friction point for quite a few customers as they just want to point their model to any point in the data history and then replay from there," Rosanova said.
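The idea of pointing a model at any moment in the history and replaying from there can be illustrated with a toy, in-memory log. This is a minimal sketch, not Kafka client code: the `Event` type and `replay_from` function are hypothetical, and the log is assumed to be sorted by timestamp. With a real cluster, the Java consumer offers `offsetsForTimes` and `seek` for the same purpose.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Event:
    timestamp_ms: int  # event time in milliseconds
    payload: str


def replay_from(log: List[Event], start_ms: int) -> Iterator[Event]:
    """Yield every event at or after start_ms, in original order.

    Assumes `log` is already sorted by timestamp, as an append-only
    event stream would be.
    """
    timestamps = [event.timestamp_ms for event in log]
    # Binary search for the first event whose timestamp is >= start_ms.
    first = bisect_left(timestamps, start_ms)
    yield from log[first:]


if __name__ == "__main__":
    history = [Event(100, "login"), Event(200, "purchase"), Event(300, "refund")]
    for event in replay_from(history, 200):
        print(event.timestamp_ms, event.payload)
```

The longer the retained history, the earlier the starting point a consumer can pick, without a separate adapter to fetch older data from another store.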
Infinite Kafka scalability differs from a data lake
One potential route for users to store Kafka event stream data on their own is some form of data lake. The challenge, however, is that a data lake does not necessarily store data in a structured schema, according to Rosanova. Tools are available that can create structure within a data lake, but that process tends to require additional steps, he noted.
With the Confluent infinite approach, the Kafka event stream data is stored and retained in its original time sequence, making it easier for users to simply replay and access data when needed, in the same format in which it was created.
Going a step further, a key part of enabling data analytics on top of the infinite Kafka data capability is Confluent's ksqlDB event streaming database. Confluent previewed ksqlDB in November 2019 and integrated it with the Confluent Cloud service on April 6.
Rosanova explained that ksqlDB can enable interactive queries on a materialized view of data, which is a snapshot of a data set. With the combination of infinite data retention and ksqlDB, the materialized view of a data set can go back as far as a user wants, he said.
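As a rough illustration of why retention length matters for a materialized view, the sketch below builds a per-key total by folding over an event history. It is plain Python rather than ksqlDB, and the `materialize` function and the account/amount event shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each event is (account_id, amount); both fields are illustrative.
EventRecord = Tuple[str, float]


def materialize(events: Iterable[EventRecord]) -> Dict[str, float]:
    """Fold an event history into a snapshot of per-account totals."""
    totals: Dict[str, float] = defaultdict(float)
    for account, amount in events:
        totals[account] += amount
    return dict(totals)


if __name__ == "__main__":
    history = [("acct-1", 10.0), ("acct-2", 5.0), ("acct-1", 2.5)]
    # View over the full retained history.
    print(materialize(history))
    # View over only the most recent events, as if older ones had expired.
    print(materialize(history[-2:]))
```

The two printed views differ because the second one no longer sees the oldest event, which is the limitation that longer retention removes.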
Apache Kafka 2.6 is coming
The next major milestone is the Apache Kafka 2.6 release, which is currently in development in the open source community.
Several innovations are coming in Apache Kafka 2.6 that will help further improve scalability, said Tim Berglund, senior director of developer advocacy at Confluent. Among them are features that will make dynamic configuration changes more practical and make it easier to observe the runtime behavior of a cluster.
"The theme I see in Apache Kafka 2.6 is a continued march toward being a truly cloud-native platform," Berglund said. "Apache Kafka has been run in the cloud from its early days, but the KIPs [Kafka Improvement Proposals] associated with observability, configuration and horizontal scale are all markers of a system at home in the cloud."