Getty Images

Data mesh helping fuel Sloan Kettering's cancer research

The cancer hospital and research center began using tools from data management vendor Dremio two years ago to decentralize its data operations and improve speed-to-insight.

Memorial Sloan Kettering Cancer Center accelerated its research efforts to find a cure for the deadly disease by using a data mesh approach to data management developed with software tools from Dremio.

Data projects that once took Memorial Sloan Kettering (MSK) weeks or even months to complete now sometimes take just hours or, at most, merely days.

That reduced time to make data available to end users, in turn, speeds up the research process by enabling researchers to access data exponentially faster as they work to find cures for cancer.

Data mesh is a decentralized approach to data management.

Traditionally, data was centralized. It was overseen by a dedicated team of data experts who were responsible for managing their entire organization's data.

But under that format, end users were forced to submit requests for models, reports, dashboards and other data assets and then had to wait for the data experts to deliver the requested materials. The result was a process that could take months depending on the complexity of the request and the number of requests submitted by others.

Data mesh removes data oversight from a centralized team and federates data across the different domains -- departments such as finance, human resources and marketing -- within an organization.

Data teams within those domains manage and analyze their own data, with the intended advantages being better subject knowledge by those working with the data and lessened demand on a central team to reduce bottlenecks.

In addition, the data systems within the different domains are connected so data can be shared across domains and users in different departments can collaborate when necessary.

MSK, meanwhile, was founded as the New York Cancer Hospital in 1884. Its current form is the result of the 1980 merger of Memorial Hospital and the Sloan-Kettering Institute. It is based on the Upper East Side of Manhattan in New York City. In addition, MSK has eight satellite facilities in New York and New Jersey.

In its 2021-22 rankings, U.S. News and World Report rated MSK the second-best hospital for cancer care in the United States, behind only the University of Texas MD Anderson Cancer Center and ahead of both the Mayo Clinic and Dana-Farber Brigham Cancer Center.

The problem

Before adopting tools from data management vendor Dremio and taking a data mesh approach, MSK struggled with deriving timely insights from its vast amount of data and navigating myriad regulations designed to keep medical data private.

The organization is treating and researching more than 400 different types of cancers.   It annually sees 20,000 inpatient visits and 700,000 outpatient visits. In addition, it must adhere to more than 1,800 research protocols.

Every lab test, visit and research protocol results in data. All that data must be managed effectively to have meaning, whether that's informing the treatment of an individual patient or furthering research.

Data mesh's four architectural principles
Four architectural principles of data mesh.

"The big challenge with doing all of this is data management," said Arfath Pasha, a software engineer at MSK, during a webinar hosted by Dremio. "Different modalities of data can be difficult. These are not simple datasets. And also there are significant regulatory concerns, so we need to make sure the data is handled well and patient information remains private."

To make sense of all that data and keep all that patient information private, data governance is critical for MSK, Pasha continued.

"Data governance is a big challenge," he said. "Also there are many technical barriers to bringing all this data from the data marts where they exist to the researchers and making it easy for them use for their analysis and computations."

Like most organizations, MSK took a centralized approach to data management.

Its data was ingested into an on-premises data warehouse where data engineers and other intermediaries transformed the data. From there, it was moved into a data lake. Once in the data lake, it was accessible to data managers and software engineers who distributed the data upon request to end users via email.

Data transformation, however, was particularly onerous.

Pasha called MSK's data "highly dimensional" -- spread across categories such as clinical, genome research, radiology and pathology.

In addition, the cancer center's data is both structured and unstructured. Some data that gets manually input needs correction, while other data is incomplete and needs to be filled in. There are also frequently different versions of the same data, and datasets are often massive. All of it must be kept private.

Turning all that complex data into meaningful datasets took time.

Meanwhile, once all that data was transformed and stored in a data lake, the process of submitting and filling requests was glacial. In addition, the center had little control over data once it was distributed to end users. Pasha noted they emailed Excel files to one another, leading to duplicate datasets and exposure to security leaks.

"It was a massive data management challenge," Pasha said

About two years ago, MSK decided to try something different.

Data mesh

While many organizations have migrated their data to the cloud, MSK's data is still kept on premises. The center plans to eventually move its data to the cloud. But like many organizations with decades of history and complex systems in place that were built long ago, it hasn't yet modernized that aspect of its data operations.

One major consideration for MSK as it sought a better way to manage its data was an on-premises option with a path to the cloud, according to Pasha. Others were a powerful query engine and a no-copy data architecture, that prevented users from making their own copies of data to share.

Mainly, however, MSK wanted a decentralized architecture.

It wanted to eliminate the cumbersome extract, transform and load (ETL) pipelines that were needed for each individual project and slowed its data operations. It also wanted more modern ETL capabilities to eliminate some of the manual labor, documentation support to better organize and track data, a mature data governance model, and an easy interface that more users could use to consume data.

"We looked at the various vendors and providers and decided to go with Dremio," Pasha said.

Dremio, based in Santa Clara, Calif., offers a data lakehouse platform that combines the capabilities of data warehouses and data lakes. Its tools help automate the ingestion, access and processing of data. As the creator of the open source Apache Arrow columnar memory format for flat and hierarchical data, its technology enables in-memory analytics.

MSK, however, chose Dremio for its own specific set of reasons, according to Pasha:

  • A low barrier to entry.
  • Support for an on-premises deployment with a path to the cloud.
  • A unified semantic layer to ensure terms are defined and consistent across MSK and provide an environment in which access controls can be established.
  • The ability to make and share copies of data within the system to prevent employees from emailing Excel spreadsheets.
  • An easy-to-use spreadsheet interface that connects to Tableau, which MSK uses for its analytics.
  • Low-code/no-code tools for developing and managing data pipelines.
  • High performance that can handle large datasets.
  • Data lineage and data catalog capabilities.

MSK now has a distributed data structure. It takes advantage of Dremio's connectors to eliminate much of the cumbersome ETL process and an easy-to-use interface that enables researchers to run their own queries to further eliminates burdens previously placed on data experts.

Unstructured data, such as pathology slides, radiology scans and genomics files, are pulled from their source by data product owners with MSK's domains and automatically transformed and cataloged at scale before being deposited in a domain-specific MinIO object storage facility. Structured tables are extracted from their sources by Dremio connectors and deposited directly in both Dremio's query engine for immediate access as well as MinIO object storage.

Tableau then sits on top of Dremio's query engine and enables end users to access and query all data in a domain.

"After working with Dremio for about two years now, we think we've built a self-service data platform through which many different types of users can easily access this data," Pasha said.


The main benefit of MSK's embrace of data mesh has been time savings, according to Pasha.

Eliminating certain difficult ETL tasks and implementing self-service capabilities has enabled users to move data projects from start to finish far faster than before MSK started using Dremio.

"Overall, we've been able to reduce the time of delivery of data to the consumers from weeks or months to now minutes or days because of our increased connectivity," Pasha said.

There are many technical barriers to bringing all this data from the data marts where they exist to the researchers and making it easy for them use for their analysis and computations.
Arfath PashaSoftware engineer, Memorial Sloan Kettering Cancer Center

There have been additional benefits as well, he continued.

Data mesh has made it easier for data consumers to track and share data across domains. It has also fostered trust between data product owners and data consumers that didn't previously exist.

"Through Dremio's lineage capabilities, users can see exactly where the data came from," Pasha said. "As the underlying data gets updated, the semantic layer gets updated as well, which gives users assurance that they are working with data that is rich and clean as opposed to having the data pass through many hands before getting to the consumer and not being able to be the producer of the data."

Meanwhile,MSK plans to expand its use of Dremio to further its use of data mesh, according to Pasha.

One key initiative will be to standardize data governance across all domains and establish standards for how data is shared between domains.

Another will be to build a hierarchy within domains that include a super administrator at the top, tenant administrators in the middle, and end users at the bottom. Super administrators won't simply become what the centralized data team used to be, and bottlenecks won't develop again.

Finally, MSK plans to automate more processes to further reduce burdens on data engineers and stewards and ultimately make access to data even faster.

"Dremio allows for all of this so we understand how data is being shared and who has access to what types of data," Pasha said.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.

Dig Deeper on Data governance

Business Analytics
Content Management