4 high-value use cases for synthetic data in healthcare

Synthetic data generation and use can bolster clinical research, application development and data privacy protection efforts in the healthcare sector.

The hype around emerging technologies -- like generative AI -- in healthcare has brought significant attention to the potential value of analytics for stakeholders pursuing improved care quality, revenue cycle management and risk stratification.

But strategies to advance big data analytics hinge on the availability, quality and accessibility of data, which can create barriers for healthcare organizations.

Synthetic data -- artificially generated information not taken from real-world sources -- has been proposed as a potential solution to many of healthcare's data woes, but the approach comes with a host of pros and cons.

To navigate these trade-offs successfully, healthcare stakeholders must identify relevant applications for synthetic data generation and use within the enterprise. Here, in alphabetical order, TechTarget Editorial's Healthtech Analytics will explore four use cases for synthetic healthcare data.

Application development

Proponents of synthetic data emphasize its potential to replicate the correlations and statistical characteristics of real-world data without the associated risks and costs. In doing so, these data sets lend themselves to the development of data-driven healthcare applications.

Much of the real-world data that would be used to build these tools is stored in a tabular format, meaning that the ability to generate tabular synthetic data could help streamline application development.
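
As a rough illustration of what "replicating correlations and statistical characteristics" can mean for tabular data, the Python sketch below fits a simple two-column Gaussian model to a table and samples new rows from it. This is a deliberately minimal baseline, not the GAN-based methods discussed in the studies cited in this section; the column meanings (age, systolic blood pressure), the function names and the stand-in "real" table are all hypothetical and exist only for the example.

```python
import math
import random
import statistics

def fit_pair(rows):
    """Estimate means, standard deviations and Pearson correlation for two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # The second column is built from z1 and z2 so that it correlates with the first.
        rows.append((mx + sx * z1, my + sy * (rho * z1 + resid * z2)))
    return rows

# Hypothetical stand-in for a real table: (age, systolic blood pressure).
# In practice this would be loaded from a governed data source.
rng = random.Random(0)
real = []
for _ in range(2000):
    age = rng.gauss(55, 12)
    real.append((age, 100 + 0.5 * age + rng.gauss(0, 10)))

real_params = fit_pair(real)
synthetic = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_pair(synthetic)

print(f"real correlation:      {real_params[4]:.2f}")
print(f"synthetic correlation: {synthetic_params[4]:.2f}")
```

No synthetic row is copied from the source table, yet the means, spreads and correlation stay close to the originals. Real healthcare tables mix categorical, skewed and missing values, which is where more expressive generators such as GANs come in.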

In a March 2023 study published in MultiMedia Modeling, researchers examined the potential of deep learning-based approaches to generate complex tabular data sets. They found that generative adversarial networks (GANs) tasked with creating synthetic tabular healthcare data were viable across a host of applications, even with the added complexity of differing numbers of variables and feature distributions.

In a 2018 paper for the 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), a research team demonstrated how a GAN architecture can be used to generate synthetic wearable sensor data for remote patient monitoring systems.

In 2021, winners of ONC's Synthetic Health Data Challenge highlighted novel uses of the open source Synthea model, a synthetic health data generator. Among the winning proposals were tools to improve medication diversification, spatiotemporal analytics of the opioid epidemic and comorbidity modeling.

Synthetic data is also valuable in developing and testing healthcare-driven AI and machine learning (ML) technologies.

A 2019 study published in Sensors detailed the importance of behavior-based sensor data for testing ML applications in healthcare. However, existing approaches for generating synthetic data can be limited in terms of their realism and complexity.

To overcome this, the research team developed an ML-driven synthetic data generation approach for creating sensor data. The analysis revealed that this method generated high-quality data, even when constrained by a small amount of ground truth data. Further, the approach outperformed existing methods, including random data generation.

A research team writing in NPJ Digital Medicine in 2020 explored how a framework combining outlier analysis, graphical modeling, resampling and latent variable identification could be used to produce realistic synthetic healthcare data for assessing ML applications.

This approach is designed to help tackle issues like complex interactions between variables and data missingness, which can arise during the synthetic data generation process. Using primary care data, the researchers were able to use their method to generate synthetic data sets "that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers."
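
One simple way to check whether a synthetic feature's distribution resembles the real one is a two-sample Kolmogorov-Smirnov statistic, sketched below in Python. This is an assumption chosen for illustration and not necessarily the test the study's authors used; the function name and sample values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b.

    Returns 0.0 when the two samples have identical empirical distributions
    and 1.0 when their value ranges do not overlap at all.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at one of the sample points.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical values for a single feature in each data set.
real_feature = [1, 2, 3, 4]
synthetic_feature = [3, 4, 5, 6]
shift = ks_statistic(real_feature, synthetic_feature)
```

A small statistic for every feature suggests the marginal distributions match. Feature dependencies, which the study also assessed, need separate checks such as comparing correlation matrices.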

Further, the study found that this method had a low risk of generating synthetic data that was very similar or identical to real patients.
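
A basic version of this kind of check can be sketched as a nearest-record search: for each synthetic row, find the closest real row and flag any that fall within a chosen distance. The Python below is a minimal sketch under stated assumptions, not the study's procedure; the threshold, function names and sample records are hypothetical.

```python
import math

def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its closest real row."""
    return min(math.dist(synthetic_row, r) for r in real_rows)

def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Return synthetic rows that sit within `threshold` of any real row."""
    return [
        s for s in synthetic_rows
        if nearest_real_distance(s, real_rows) <= threshold
    ]

# Hypothetical (age, systolic blood pressure) records.
# Features should be scaled to comparable units before measuring distance;
# that step is skipped here to keep the example short.
real = [(54, 130.0), (61, 142.0), (47, 118.0)]
synthetic = [(54, 130.0), (58, 137.5), (47.2, 118.1)]

flagged = flag_near_copies(synthetic, real, threshold=0.5)
print(flagged)
```

Here the first synthetic row is an exact copy of a real record and the third is nearly one, so both are flagged, while the second is kept. The brute-force search is fine for small tables; larger data sets would need an indexed nearest-neighbor method.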

Synthetic data's utility for healthcare-related application development is closely tied to its value for clinical research.

Clinical research

Clinical research -- particularly clinical trials -- is key to advancing innovations that improve patient outcomes and quality of life. But conducting this research is challenging due to issues like a lack of data standards and EHR missingness.

Researchers can overcome some of these obstacles by turning to synthetic data.

EHRs are valuable data sources for investigating diagnoses and treatments, but concerns about data quality and patient privacy create hurdles to their use. A research team looking to tackle these issues investigated the plausibility of synthetic EHR data generation in a 2021 study published in the journal Computational Intelligence.

The study emphasizes that synthetic EHRs are needed to complement existing real-world data, as these could promote access to data, cost-efficiency, test efficiency, privacy protection, data completeness and benchmarking.

However, synthetic data generation methods for this purpose must preserve key ground truth characteristics of real-world EHR data -- including biological relationships between variables and privacy protections.

The research team proposed a framework to generate and evaluate synthetic healthcare data with these ground truth considerations in mind and found that the approach could successfully be applied to two distinct research use cases that rely on EHR-sourced cross-sectional data sets.

Similar methods are also useful for generating synthetic scans and test results, such as electrocardiograms (ECGs). Experts writing in the February 2021 issue of Electronics found that using GANs to create synthetic ECGs for research can potentially address data anonymization and data leakage.

In June, a team from Johns Hopkins University successfully developed a method to generate synthetic liver tumor computed tomography (CT) scans, which could help tackle the ongoing scarcity of high-quality tumor images.

The lack of real-world, annotated tumor CTs makes it difficult to curate the large-scale medical imaging data sets necessary to advance research into cancer detection algorithms.

Synthetic data is also helpful in bolstering infectious disease research. In 2020, Washington University researchers turned to synthetic data to accelerate COVID-19 research, allowing stakeholders to produce relevant data and share it among collaborators more efficiently.

The value of synthetic data in healthcare research is further underscored by efforts from government agencies and academic institutions to promote its use.

The United States Veterans Health Administration's Arches platform is designed to facilitate research collaboration by providing access to both real-world and synthetic veteran data, while the Agency for Healthcare Research and Quality offers its Synthetic Healthcare Database for Research to researchers who need access to high-quality medical claims data.

Alongside clinical research applications, synthetic health data also shows promise in emerging use cases like digital twin technology.

Digital twins

Digital twins serve as virtual representations of real-world processes or entities. The approach has garnered attention in healthcare for its ability to help represent individual patients and populations across various data-driven use cases.

Synthetic data has shown promise in bolstering the data that underpins a digital twin, and some health systems are already pursuing projects that combine synthetic data generation with digital twin modeling.

One such project, spearheaded by Cleveland Clinic and MetroHealth, aims to use digital twins to gain insights into neighborhood-level health disparities and their impact on patient outcomes.

Addressing the social determinants of health (SDOH) -- non-medical factors, such as housing and food security, that impact health -- is a major priority across the healthcare industry. To date, healthcare organizations have found success in building care teams to tackle SDOH and developing SDOH screening processes, but other approaches are needed to meaningfully advance health equity.

In an interview, leadership from Cleveland Clinic discussed how the Digital Twin Neighborhoods project hopes to utilize de-identified EHR data to generate synthetic populations that are closely matched to those of the real-world neighborhoods that Cleveland Clinic and MetroHealth serve.

By incorporating SDOH alongside geographic, biological and social information, the researchers hope to better understand existing disparities and their drivers. Using digital twins, the research team can explore the health profile of a community by simulating how various interventions might impact health status and outcomes over time.

The synthetic and real-world data used to run these simulations will help demonstrate how chronic disease risk, environmental exposures and other factors contribute to increased mortality and lower life expectancy through the lens of place-based health.

Using the digital twin models, the researchers will pursue initial projects assessing regional mental health and modifiable cardiovascular risk factor reduction.

This approach allows Cleveland Clinic and MetroHealth to safely use existing EHR data to inform health equity initiatives without unnecessarily risking patient privacy. Privacy preservation is itself one of the most promising applications for synthetic data.

Patient privacy preservation

Protecting patient privacy is paramount when health systems consider using data to improve care and reduce costs. Healthcare data de-identification helps ensure that the sharing and use of patient information is HIPAA-compliant, but the process cannot fully eliminate the risk of patient re-identification.

Removing or obscuring protected health information (PHI), as required by HIPAA, is only one aspect of de-identification. Another involves obscuring potential relationships between de-identified variables that could lead to re-identification.
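
To see why combinations of otherwise harmless variables matter, consider counting how many records share each combination of values. The Python sketch below uses a k-anonymity-style group count, which is one common way to quantify this risk and is offered here as an assumption, not something the article's sources prescribe; the field names and records are hypothetical.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the rarest combination of quasi-identifier values.

    A result of 1 means at least one record is unique on those fields
    and could be singled out even though no direct identifier is present.
    """
    counts = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return min(counts.values())

# Hypothetical de-identified records: no names or record numbers remain.
records = [
    {"zip3": "441", "birth_year": 1950, "sex": "F"},
    {"zip3": "441", "birth_year": 1950, "sex": "F"},
    {"zip3": "441", "birth_year": 1987, "sex": "M"},
]

k_zip_only = smallest_group_size(records, ["zip3"])
k_combined = smallest_group_size(records, ["zip3", "birth_year", "sex"])
print(k_zip_only, k_combined)
```

On ZIP prefix alone, every record blends into a group of three. Once birth year and sex are combined with it, one record becomes unique, which is the kind of relationship between variables that de-identification has to account for.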

Synthetic data can help create another layer of privacy preservation by replicating the statistical characteristics and correlations in the real-world data, enabling stakeholders to create a data set that doesn't contain PHI.

In doing so, both the privacy and value of the original data are protected, and that information can be used to inform many analytics projects. While no approach to patient privacy protection is completely foolproof, combining data de-identification, synthetic data use and the application of privacy-enhancing technologies strengthens patient privacy preservation efforts.

In 2021, a team from the Institute for Informatics at Washington University School of Medicine in St. Louis demonstrated synthetic data's potential to protect privacy while conducting clinical studies.

The researchers showed that, using software known as MDClone, users can build effective synthetic data sets for medical research that are statistically similar to real data while preserving privacy more effectively than traditional de-identification.

The study authors noted that these capabilities have the potential to significantly speed up critical research.

These four use cases represent an array of opportunities for using synthetic data to transform healthcare and clinical research. While not without pitfalls, synthetic data is likely to see sustained interest across the industry as stakeholders continue to explore advanced technologies like digital twins and AI.

Shania Kennedy has been covering news related to health IT and analytics since 2022.
