Databricks extends data lakehouse platform to healthcare

Healthcare data exists in widely varying formats. Getting it into a data lake where it can be used for analytics and machine learning is a challenge Databricks is looking to meet.

Databricks on Wednesday released its new data lakehouse platform for the healthcare and life sciences industries.

The San Francisco-based vendor develops a data lakehouse platform, a technology that combines the capabilities of a data warehouse and a data lake.

The Databricks platform uses the open source Delta Lake technology as its foundation and then provides additional capabilities for data queries with Delta Engine, which is based on Apache Spark open source query technology.

So far in 2022, Databricks has launched a series of industry-specific offerings for its data lakehouse, including one for financial services and one for retail.

The healthcare and life sciences release, generally available now, is the latest addition, taking aim at the specific challenges for data analytics and machine learning in that industry vertical.

Healthcare has long been a challenge from the data analytics perspective, said Hyoun Park, an analyst at Amalgam Insights. 

Among the challenges are the size of data sets as well as the variety of different healthcare systems, which often lack standardized data formats, Park noted.

He added that the different data formats often prevent healthcare data from being effectively stored within traditional relational databases because it can be difficult to define specific fields across systems.

"Given the multi-modal and critical nature of healthcare data usage, the data lake approach is promising in supporting healthcare analytics and machine learning challenges that overwhelm traditional database and data warehouse approaches," Park said.

"Databricks' focus on the healthcare space provides healthcare providers with an option for considering how to manage their largest data sets and semi-structured data ecosystem to support smarter analytics," he said.

Data lakehouse platform optimized for healthcare requirements

Michael Sanky, global industry lead for healthcare and life sciences at Databricks, explained that the new offering brings a series of capabilities he referred to as accelerators, which help to enable common workflows.

For example, Sanky said medical image data is a challenge for many healthcare and life sciences organizations. To address that, Databricks has built an accelerator for medical images that helps train machine learning models to detect potential metastatic cancer.

Another example Sanky cited is a data ingestion capability, built through a partnership with data analytics vendor Lovelytics, that handles data in the FHIR (Fast Healthcare Interoperability Resources) format prevalent in healthcare.

The Delta Lake technology at the core of Databricks enables users to ingest data in different formats, including JSON (JavaScript Object Notation).

"FHIR data is normally in a JSON format but is optimized for exchanging transactional healthcare messages and is not optimized for analytics," Sanky said.

Sanky added that there are multiple steps needed to optimize the FHIR data so that it can be used for analytics. That process is what Databricks now enables with its healthcare platform, he said.
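
As a rough illustration of the kind of reshaping Sanky describes, the Python sketch below flattens nested FHIR-style JSON into flat rows that are easier to query. It is not Databricks' or Lovelytics' implementation. The sample bundle, the helper names flatten_patient and bundle_to_rows, and the choice of output columns are all assumptions made for this example, and a real pipeline would cover many more resource types and edge cases.

```python
import json

# Hypothetical sample data: a tiny FHIR-style Bundle holding two Patient
# resources and one Observation. Real bundles are far larger and more varied.
SAMPLE_BUNDLE = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1", "gender": "female",
                  "birthDate": "1980-04-02",
                  "name": [{"family": "Doe", "given": ["Jane"]}]}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "status": "final"}},
    {"resource": {"resourceType": "Patient", "id": "p2", "gender": "male",
                  "birthDate": "1975-11-30",
                  "name": [{"family": "Roe", "given": ["John", "Q"]}]}}
  ]
}
"""


def flatten_patient(resource):
    """Turn one nested Patient resource into a flat dict (one table row)."""
    # A Patient can carry several names; this sketch keeps only the first.
    names = resource.get("name") or [{}]
    first = names[0]
    return {
        "patient_id": resource.get("id"),
        "family_name": first.get("family"),
        "given_name": " ".join(first.get("given", [])),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


def bundle_to_rows(bundle_json):
    """Parse a bundle and return flat rows for its Patient resources only."""
    bundle = json.loads(bundle_json)
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == "Patient":
            rows.append(flatten_patient(resource))
    return rows


rows = bundle_to_rows(SAMPLE_BUNDLE)
for row in rows:
    print(row)
```

Each resulting row has a fixed set of columns, which is the shape analytics tools generally expect, whereas the original message nests names and other repeating fields inside arrays.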

Making healthcare information more usable in the data lakehouse

Another challenge for healthcare and life sciences users is being able to handle patient data.

Databricks also has a partnership with healthcare AI vendor John Snow Labs. The vendor provides natural language processing to help extract data from medical reports so it can be queried in the lakehouse.

Looking forward, Sanky said Databricks will be looking to bring its Delta Sharing capabilities to the healthcare industry. Delta Sharing is a technology that Databricks unveiled in May 2021 to enable collaboration across data lakehouses.

"Data exchange in healthcare is a very important area," Sanky said. "Being able to connect the healthcare ecosystem around data exchange and Delta Sharing is the biggest 'what's next' for us over the next six to 12 months."