Data lakehouse pioneer Databricks said on Tuesday at its Data + AI Summit that it has extended its platform with a series of enhancements to accelerate data lake operations.
Also at the conference, held in San Francisco, the vendor said it is committed to making its data lakehouse technology open source.
The well-funded vendor raised a $1.6 billion round of venture capital in August 2021, giving the company a valuation of $38 billion.
Databricks' data lakehouse platform has multiple components, with the Delta Lake technology as the foundation. But while Databricks had pledged in 2019 to make the platform open source, not all elements of the current Delta Lake technology have actually been made available as open source.
That will change, according to Databricks. The vendor said it is committing to contribute its proprietary Delta Lake code to the open source project but did not say when all the code will be available as open source.
In other developments from the conference, Databricks said it is now also making its Unity Catalog generally available, a year after the vendor previewed the data governance technology.
Databricks on Tuesday also unveiled the preview of its new SQL Serverless platform offering, which enables a data lakehouse as a service. Rather than running a persistent set of resources, the serverless platform enables organizations to start up and operate a data lakehouse analysis environment on a consumption basis.
Rounding out Databricks' latest product moves is a new effort to accelerate Spark Structured Streaming called Project Lightspeed.
Data lakehouse efforts in context
Databricks is in a competitive race with Snowflake, MongoDB, Oracle, Teradata and others to translate the massive volumes of enterprise data into analytics-ready data that can provide meaningful insight and context, said Hyoun Park, an analyst with Amalgam Insights.
The vendor is taking steps across its platform to make its case to enterprises for being the record and engine of choice for new analytics applications, Park said.
"The biggest challenge that Databricks faces is that it is seeking to be best in class across a wide variety of data management, analytics, governance and app development capabilities while there are standalone solutions in each of these areas that also have competitive differentiators," Park said. "Although Databricks has quickly established itself as a top general platform for the current generation of analytic and AI use cases, it faces competition in each area that it seeks to expand into."
Park said he is also optimistic about Databricks' effort to make all of its Delta Lake code open source.
"The open source push will allow more developers to be familiar with the full power of Delta Lake capabilities, which is better for Databricks in the long run," Park said.
Also, Databricks' intent to make the full capabilities of Delta Lake open source is an important step to drive wider adoption and build a deeper trust with users, said analyst Sanjeev Mohan, founder of SanjMo Advisory.
Meanwhile, the general availability of the Unity Catalog will improve the security and governance of lakehouse assets such as files, tables and machine learning models, Mohan said.
"This is essential to protect sensitive data," Mohan said of the Unity Catalog.
The next big areas that Databricks needs to work on are streaming data and incremental ETL (extract, transform and load) capabilities, Mohan said. Project Lightspeed could help Databricks fill the gap on streaming with better performance and enhanced functionality, he noted.
Databricks CEO commits to open source data lakehouse
In a press briefing at the Data + AI Summit, which is being held in person and virtually, Databricks CEO Ali Ghodsi outlined his strategy for making the Delta Lake data lakehouse technology open source.
"We're committing to donating any future things that we build in Delta Lake to the open source project," Ghodsi said. "Open source for us helps to enable adoption."
A common approach for many open source efforts is for developers to work upstream first, meaning the code is developed first in open source and then flows down into commercial tools. Databricks is taking the inverse approach, building new features in a closed, proprietary version first and then contributing the code to open source.
In response to a question from TechTarget about not working upstream first, Ghodsi said it's complicated to develop software and ensure it is high quality.
"We found that we can build a proprietary version faster and then open source it," Ghodsi said.
Lightspeed ahead for streaming data lakehouse technology
Ghodsi also highlighted the Project Lightspeed effort, which aims to lower latency for real-time data lakehouse operations.
With Lightspeed, Databricks is looking to improve Spark Structured Streaming with better performance.
Spark Structured Streaming is a stream processing engine built on the Apache Spark SQL engine. Lightspeed works with Apache Kafka data streaming sources, but Ghodsi noted that it can do more than Kafka alone.
For instance, a user can run machine learning in real time and detect anomalies with structured streaming, which isn't easily doable with Kafka, Ghodsi said.
"Kafka is primarily a way in which you can store real-time data that's coming in, and our streaming engine lets you do much more advanced processing on it," he said.
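To make the idea of detecting anomalies on streaming data concrete, here is a minimal sketch in plain Python. It does not use Spark Structured Streaming, Kafka or any Databricks API, and it stands in a simple rolling z-score for the machine learning Ghodsi described. The function name `flag_anomalies` and the `window` and `threshold` parameters are hypothetical, chosen only for illustration.

```python
from collections import deque
from math import sqrt
from typing import Deque, Iterable, Iterator, Tuple


def flag_anomalies(
    readings: Iterable[float],
    window: int = 5,
    threshold: float = 3.0,
) -> Iterator[Tuple[float, bool]]:
    """Yield (value, is_anomaly) for each reading as it arrives.

    Hypothetical helper: a rolling z-score over the last `window` values,
    used here as a simple stand-in for a real-time ML model.
    """
    recent: Deque[float] = deque(maxlen=window)
    for value in readings:
        is_anomaly = False
        # Only score a value once a full window of history is available.
        if len(recent) == window:
            mean = sum(recent) / window
            variance = sum((x - mean) ** 2 for x in recent) / window
            std = sqrt(variance)
            # Flag values far from the recent mean; skip when the window is flat.
            if std > 0 and abs(value - mean) > threshold * std:
                is_anomaly = True
        # The deque drops its oldest value automatically once it is full.
        recent.append(value)
        yield value, is_anomaly


if __name__ == "__main__":
    # A made-up stream of sensor readings with one obvious spike.
    stream = [10, 11, 10, 12, 11, 50, 11, 10]
    for value, is_anomaly in flag_anomalies(stream):
        print(value, "ANOMALY" if is_anomaly else "ok")
```

Because the function is a generator, each reading is scored as soon as it arrives rather than after the whole batch is collected, which is the basic distinction between stream processing and simply storing incoming events.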