Healthcare has many use cases that can benefit from analytics. From advances in pharmaceutical research and cancer treatment to understanding public health, there are many insights analytics can provide in the healthcare sector.
However, personal health data is expensive to collect. For example, clinical trials for a new drug or vaccine can require thousands of participants, and recruiting those volunteers takes many work hours. Additionally, because healthcare data is so sensitive and subject to so many privacy regulations, there are few cases of open data or data sharing for secondary analysis.
One way to combat the challenges holding back healthcare analytics is through synthetic data.
What is synthetic data?
Synthetic data is data generated to have the same statistical properties as data collected through real-world observations. It is typically created algorithmically from a real data set and is often used as a stand-in for operational data in testing.
Synthetic data can also be generated from an analyst's domain expertise or a preexisting model, without a real data set to match statistically. Both the source of the data and the method used to synthesize it affect its utility in a model.
Typically, data synthesized from a real data set has higher utility than data synthesized from expertise alone, though the gap depends on the depth of the analyst's domain expertise.
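To make the two approaches concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the method described in any particular book or tool: the "real" data set is itself simulated as a stand-in, the two-column model (age and systolic blood pressure) is a deliberately simple one, and every parameter value, including the analyst's guess, is made up. It synthesizes one data set by fitting the model to the stand-in data and another from the guessed parameters alone, then compares how well each preserves the correlation between the two columns.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit(ages, bps):
    """Estimate model parameters from an observed data set.

    Assumed model: age ~ Normal, and
    blood pressure = intercept + slope * age + Normal noise.
    """
    mean_age, mean_bp = statistics.fmean(ages), statistics.fmean(bps)
    sxy = sum((a - mean_age) * (b - mean_bp) for a, b in zip(ages, bps))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    slope = sxy / sxx
    intercept = mean_bp - slope * mean_age
    residuals = [b - (intercept + slope * a) for a, b in zip(ages, bps)]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "intercept": intercept,
        "slope": slope,
        "sd_noise": statistics.stdev(residuals),
    }


def synthesize(params, n, rng):
    """Draw n synthetic (age, blood pressure) records from the model."""
    ages = [rng.gauss(params["mean_age"], params["sd_age"]) for _ in range(n)]
    bps = [
        params["intercept"] + params["slope"] * a + rng.gauss(0.0, params["sd_noise"])
        for a in ages
    ]
    return ages, bps


rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for a real data set. These records are simulated, NOT real patients,
# and every parameter value here is made up for illustration.
HIDDEN_TRUTH = {"mean_age": 55.0, "sd_age": 12.0,
                "intercept": 95.0, "slope": 0.5, "sd_noise": 10.0}
real_ages, real_bps = synthesize(HIDDEN_TRUTH, 2000, rng)

# Approach 1: synthesize from the "real" data by fitting the model to it.
fitted = fit(real_ages, real_bps)
synth_ages, synth_bps = synthesize(fitted, 2000, rng)

# Approach 2: synthesize from an analyst's guess, with no data set to fit.
# The guess is hypothetical and intentionally somewhat off target.
EXPERT_GUESS = {"mean_age": 50.0, "sd_age": 15.0,
                "intercept": 100.0, "slope": 0.3, "sd_noise": 20.0}
expert_ages, expert_bps = synthesize(EXPERT_GUESS, 2000, rng)

# One crude utility check: is the age / blood pressure correlation preserved?
r_real = pearson(real_ages, real_bps)
r_synth = pearson(synth_ages, synth_bps)
r_expert = pearson(expert_ages, expert_bps)
print(f"real: {r_real:.2f}  fitted synthetic: {r_synth:.2f}  "
      f"expert synthetic: {r_expert:.2f}")
```

In this sketch the fitted synthetic data tracks the stand-in data more closely only because the analyst's guess was chosen to be imperfect; a better-informed guess would narrow the gap. A single correlation is also a very coarse measure of utility, and practical evaluations compare many more properties of the data.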
Why use synthetic data?
While real data is more desirable for analysis, using real data can pose several problems.
For one, new machine learning models require large data sets. Data collection for these projects is extensive. If the model is completely new and analysts are exploring its efficacy, collecting the data for it may be cost-prohibitive.
Synthetic data is also beneficial to researchers. Similar to exploratory models, research requires a large data set. The problem? Research often isn't funded unless the researcher can prove it's likely to be effective. Synthetic data can help research analysts fine-tune their models to be sure they work before investing in real data collection.
Synthetic data assists in healthcare
In the new book Practical Synthetic Data Generation, published by O'Reilly Media, authors Khaled El Emam, Lucy Mosquera and Richard Hoptroff explore how data is synthesized, how to evaluate its utility and the use cases for synthetic data. One expansive use case is healthcare.
Because health data is protected by privacy regulations such as HIPAA, using real health data for secondary analysis is difficult in most cases. While medical journals support researchers in making the data they've used publicly available so that replication studies are easier to produce, it's still not common practice.
The problem with opening up health data ties back to privacy regulations. Data breaches are a growing threat, and the risk of reidentification attacks on public data means that anonymization has to be extremely thorough. While these security and privacy practices are necessary to protect individuals, they lower the utility of real-world data, especially given the complexity of most data in this industry.
Cancer research is one of the areas with the most complex data where synthetic data has proven valuable. Real data on patients' doctor visits, treatments, tests and prescriptions is as sensitive as it is complex, but synthetic data with the same statistical properties can be shared openly and help advance treatments for future patients with similar forms of the disease.
The digital health movement can also benefit from synthetic data. Medical devices, such as CPAP (continuous positive airway pressure) machines and heart monitors, collect troves of complex, continuous data on patients that affect their future care. This data would benefit from secondary analysis, which can inform future treatment. Analysis of synthetic versions of this data can lead to adaptations in treatment and a better understanding of how different behaviors influence patients' conditions.
Practical Synthetic Data Generation covers additional use cases for synthetic data, as well as tactics for implementing synthesis, different synthesis methods and utility evaluation methods. The first chapter of the book introduces some of the basics of synthetic data generation.