Data lake query vendor Starburst on Wednesday added new features to its Galaxy cloud service in an effort to improve reliability and enable easier access to different data lake technologies and deployments.
Starburst is one of the leading commercial vendors behind the open source Trino query engine, the foundation of the Starburst Galaxy service.
Among the enhancements to the Starburst Galaxy cloud service is a new feature the vendor refers to as the Great Lakes connector, which is designed to ease access to data lakes that use the Apache Iceberg and Delta Lake technologies.
Starburst is also aiming to improve the scalability of Trino with a capability known as granular fault tolerance, a technology the vendor has been talking about since last year. The capability enables large, long-running queries to survive failures: if part of a query fails, it can be retried without restarting the entire query from scratch.
Among the users of Starburst's Trino-based platform is networking technology provider BlueCat, based in Toronto.
The company's platform enables users to gain visibility and control over their DNS traffic and detect threats. Cory Darby, the director of engineering, said BlueCat is interested in the new fault tolerance capabilities because they will enable better operational efficiency.
"Trino is our exclusive query engine in our data platform, and all data once at rest is accessed through Trino," Darby said. "DNS data is accessed through Trino to help aid with spotting anomalies in real time as well as to trend and analyze historical traffic for operational awareness."
How granular fault tolerance improves data lake queries
Granular fault tolerance is a capability that users have been waiting for, said Martin Traverso, co-creator of Trino and CTO of Starburst.
With the new fault tolerance capability, if a query fails partway through, Trino automatically retries the failed work until the query completes.
Among the reasons a query can fail is that it consumes more memory or compute resources than are available in a given Trino cluster, Traverso noted. Faulty hardware is another: a machine can die while executing a query. Cloud resources can also become unavailable over time; spot instance compute capacity, for example, is variable and can be reclaimed.
"So if you were running a query and the query failed, previously you had to restart the query, because up until now, there was no way to recover and continue from where the query left off," Traverso said. "So you had to restart the whole query from the beginning."
To enable granular fault tolerance, Traverso said the vendor changed several aspects of how Trino executes queries. Trino now splits a query into smaller pieces and executes each piece in succession until a final result is produced, so a failure requires retrying only the affected piece.
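The execution model Traverso describes can be sketched in a few lines of Python. This is a hypothetical illustration, not Starburst's implementation: a query is modeled as a sequence of stages, each stage's result persists once it succeeds, and a transient failure reruns only the failed stage rather than the whole query.

```python
# Hypothetical sketch of granular fault tolerance, not Starburst's code:
# the query is split into units of work, and only a failed unit is retried.

def run_with_retries(task, max_attempts=5):
    """Run one unit of work, retrying it on transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up only after exhausting retries

def execute_query(stages):
    """Execute stages in succession; results of completed stages persist,
    so a failure reruns only the failed stage, not the whole query."""
    results = []
    for stage in stages:
        results.append(run_with_retries(stage))
    return results

# Simulate a stage that fails twice (e.g. a reclaimed spot instance)
# before succeeding on the third attempt.
attempts = {"count": 0}

def flaky_stage():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("spot instance reclaimed")  # simulated transient failure
    return "partial result"

print(execute_query([lambda: 40 + 2, flaky_stage]))  # prints [42, 'partial result']
```

The key contrast with the old behavior is in `execute_query`: the first stage's result is kept while the second stage retries, whereas previously the entire query would have been restarted from the beginning.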
Great Lakes connector
Starburst already has specific connectors for Iceberg, Delta Lake and Apache Hive data lake formats, but with the existing connectors, each data lake technology has been treated separately, Traverso said.
With the Great Lakes connector, a single connector links to Iceberg, Delta Lake or Apache Hive, which Traverso said can reduce complexity and aid migrations.
Previously, for example, migrating from one data lake format to another took more effort and required some query rewriting. With Great Lakes, the formats are abstracted away, and Traverso said users can move from one to another more easily, as well as federate queries across multiple deployments.
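The abstraction the Great Lakes connector provides can be sketched as a single entry point that detects a table's format and routes to the right reader, so queries never name a format-specific connector. This is an illustrative Python sketch, not Starburst's code; the reader functions and metadata shape are assumptions.

```python
# Hypothetical sketch of a format-abstracting connector, not Starburst's code.

def read_iceberg(path):  # stand-in for an Iceberg table reader
    return f"iceberg rows from {path}"

def read_delta(path):    # stand-in for a Delta Lake table reader
    return f"delta rows from {path}"

def read_hive(path):     # stand-in for a Hive table reader
    return f"hive rows from {path}"

READERS = {"iceberg": read_iceberg, "delta": read_delta, "hive": read_hive}

def detect_format(table_metadata):
    """A real connector would inspect table metadata on storage
    (e.g. Iceberg metadata files or a Delta transaction log)."""
    return table_metadata["format"]

def scan_table(table_metadata):
    """Single entry point: the caller never picks a format-specific connector."""
    reader = READERS[detect_format(table_metadata)]
    return reader(table_metadata["path"])

print(scan_table({"format": "iceberg", "path": "s3://lake/t1"}))
```

Because callers only ever invoke `scan_table`, switching a table from one format to another changes the dispatch, not the queries, which is the migration benefit Traverso describes.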
Traverso noted that Starburst is developing a number of new capabilities that will help users in the future.
One such feature, known as polymorphic table functions, enables SQL functions to reach out to other database systems to execute custom processing.
"One of the things that we've seen people struggle a lot with when integrating with third- party databases is they want to take advantage of specific syntax and functionality in those databases," Traverso said. "Polymorphic table functions allow us to model functions, where you provide your query or specific thing you want to process on the other system and then feed that data back into Trino dynamically."