Understanding De-Identified Data, How to Use It in Healthcare

Healthcare data de-identification provides significant opportunities to bolster medical research and patient care, but the process is not without its pitfalls.

De-identified data has become an important tool in medical research and for providers looking to enhance patient care. While sharing identifiable patient data between organizations could violate the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the de-identification process allows that information to be shared in a HIPAA-compliant way.

De-identified data sharing can then help medical researchers advance tools and treatments. It also enables collaborative efforts among large providers. Overall, de-identified data plays a critical role in improving patient experience.


The process of de-identification involves removing personally identifiable information (PII), such as name and Social Security number, as well as protected health information (PHI), like medical history and insurance information, before patient data is processed or shared.

Removing all direct identifiers from patient data can allow healthcare organizations to share it without the potential of violating HIPAA.

While direct identifiers are removed from the data to keep a patient’s identity confidential, indirect identifiers, such as race, age, and gender, can remain untouched in some cases to allow researchers to study data trends.
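As a rough illustration of that distinction, the Python sketch below drops direct identifiers from a single record while leaving indirect ones in place. The field names and the identifier list are hypothetical, chosen only for this example; a real protocol would be driven by the organization's own schema and HIPAA guidance.

```python
# Hypothetical field names; a real schema and identifier list would differ.
DIRECT_IDENTIFIERS = {"name", "ssn", "insurance_id"}

def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record without direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

# Made-up example record, not real patient data.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "insurance_id": "INS-12345",
    "age": 54,
    "race": "Asian",
    "gender": "F",
}

deidentified = strip_direct_identifiers(patient)
print(deidentified)  # only age, race, and gender remain
```

Using a block list keeps the sketch short. An allow list of approved fields may be the safer design in practice, because a newly added column is then excluded by default.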

According to HHS, de-identification also “supports the secondary use of data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors.”

De-identification is a crucial part of the healthcare data lifecycle and plays a vital role in advancing medical research while also protecting patient privacy.


Data sharing allows those in the healthcare field to create better tools and treatments to improve patient care and outcomes. However, HIPAA stipulates that patient information must be protected and generally cannot be shared with other entities without the patient’s knowledge and consent.

By de-identifying data, providers can share information with other organizations to advance medical research and treatment. Additionally, de-identifying the data removes some liability regarding HIPAA violations.

Furthermore, the use of de-identified data can enhance collaborative research efforts in healthcare. In 2021, a group of healthcare providers came together to form Truveta, a company focused on using big data analytics to enhance care insights.

By combining de-identified data on tens of millions of patients from its member providers, drawn from thousands of care facilities across the United States, Truveta can make large datasets available for use in medical research. At the time of writing, the collective boasts nearly 30 members.

However, the use of de-identified data is nuanced and can be fraught with potential pitfalls. Experts indicate that the advent of technologies like connected devices and artificial intelligence (AI) has changed the way healthcare organizations conceptualize patient privacy and data sharing.

Data can be de-identified to various degrees, with basic de-identification obscuring information such as name or date of birth. HIPAA requires healthcare organizations to take this a step further by hiding or removing both PII and PHI to ensure that patient privacy is protected.

However, removing this information doesn’t necessarily eliminate the risk of patient re-identification. Information like an individual’s internet protocol (IP) address or the device ID associated with a pacemaker, for example, could be used to re-identify a patient.

AI technologies in healthcare are also known to be capable of re-identifying individuals, even when they are trained on de-identified data, raising questions about which privacy approaches should be used to address this risk. Some researchers have recommended that HIPAA be amended to account for the use of machine learning on healthcare data.

Obscuring data elements that can be tied to a specific individual, as directed by HIPAA, is one aspect of data de-identification. The second aspect is more complicated: it concerns how a combination of factors within one or more datasets relates to one another.

For example, an analysis investigating the impact of social determinants of health (SDOH) on US patients with a specific cancer type could contain enough data elements for some patients to be re-identified, even if the project has robust data de-identification protocols.

The combination of cancer treatment regimen, timeframe, and income could be used alongside additional information, such as social media posts, to re-identify a patient in this cohort. Suppose the patient pool included a wealthy individual who could afford a new treatment that was cost-prohibitive for most patients. If that person's diagnosis was public knowledge and coincided with the timeframe of the analysis, bad actors could theoretically home in on and re-identify them.
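One common way to screen for this kind of risk is to count how many records share each combination of indirect identifiers and to flag combinations that appear fewer than k times, an idea borrowed from k-anonymity. The source does not name this technique, so it is offered here only as an illustration. The Python sketch below uses hypothetical field names and made-up records.

```python
from collections import Counter

# Hypothetical quasi-identifier columns, chosen for illustration only.
QUASI_IDENTIFIERS = ("treatment", "diagnosis_year", "income_band")

def small_groups(records, quasi_identifiers=QUASI_IDENTIFIERS, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Made-up cohort: one patient stands out on treatment and income.
cohort = [
    {"treatment": "standard chemo", "diagnosis_year": 2021, "income_band": "middle"},
    {"treatment": "standard chemo", "diagnosis_year": 2021, "income_band": "middle"},
    {"treatment": "standard chemo", "diagnosis_year": 2021, "income_band": "middle"},
    {"treatment": "novel therapy", "diagnosis_year": 2021, "income_band": "high"},
]

risky = small_groups(cohort, k=2)
print(risky)  # {('novel therapy', 2021, 'high'): 1}
```

A combination that appears only once marks a record that could be singled out if an outsider already knows those attributes. Such records are candidates for suppression or coarser grouping before the dataset is shared.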

To prevent this, healthcare stakeholders can transform data at the individual data-point level, whether cryptographically, mathematically, or otherwise, so that it is not visible to the data user. Alternatively, they can ensure that the analytics being performed on the data are not designed, consciously or unconsciously, to identify individuals.
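As a minimal sketch of what such transformations might look like, the Python example below replaces a device ID with a keyed hash (a cryptographic transformation) and generalizes an exact age into a ten-year band (a mathematical one). The key, field names, and band width are assumptions made for this example. Keyed hashing is pseudonymization, and on its own it may not meet HIPAA's de-identification requirements.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be generated securely and
# stored apart from the shared dataset.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a keyed SHA-256 digest (hex string)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def age_band(age: int, width: int = 10) -> str:
    """Generalize an exact age into a coarse band such as '50-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Made-up record, not real patient data.
record = {"device_id": "PM-0042-XYZ", "age": 54}

shared = {
    "device_token": pseudonymize(record["device_id"]),
    "age_band": age_band(record["age"]),
}
print(shared["age_band"])  # 50-59
```

Because the same key always yields the same token, records for one device can still be linked across datasets without exposing the raw ID. That linkage is useful for research, but it also means the key must be guarded as carefully as the identifiers it protects.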

Some tools, such as privacy-enhancing technologies (PETs), can assist healthcare stakeholders with these goals. There are three main types: algorithmic PETs, which alter how data are represented; architectural PETs, which focus on the structure of the data or computation environments; and augmentation PETs, which involve using historical data distributions to generate realistic synthetic datasets.
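To make the augmentation idea concrete, the Python sketch below generates synthetic rows by sampling each column independently from the values seen in a small set of historical records. This is a deliberately naive illustration with hypothetical field names. It ignores correlations between columns and carries no formal privacy guarantee, so it should not be read as how any production PET works.

```python
import random

def synthesize(records, n, seed=0):
    """Draw n synthetic rows by sampling each column independently
    from the values observed in the historical records."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    columns = list(records[0].keys())
    return [
        {col: rng.choice([r[col] for r in records]) for col in columns}
        for _ in range(n)
    ]

# Made-up historical records with hypothetical fields.
historical = [
    {"age_band": "40-49", "gender": "F", "treatment": "standard chemo"},
    {"age_band": "50-59", "gender": "M", "treatment": "standard chemo"},
    {"age_band": "50-59", "gender": "F", "treatment": "radiation"},
]

synthetic = synthesize(historical, n=5)
for row in synthetic:
    print(row)
```

Each synthetic row mixes values from different patients, so it does not have to match any single real record, although it can by chance. Values that are common in the historical data are drawn more often, which roughly preserves each column's distribution.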

By developing a de-identification protocol that complies with HIPAA and takes additional privacy considerations into account, providers can share patient data to support medical advances while maintaining patient privacy.


De-identified data is often leveraged in research to build advanced analytics tools for healthcare.

In a recent study, researchers used de-identified data to develop an AI tool to predict 30-day mortality risk in patients with cancer. Using the tool, medical professionals can identify patients who are at high risk of death and provide early intervention for reversible complications.

Additionally, the tool can identify patients who are approaching end of life (EoL) and refer them to early palliative and hospice care.

In this case, the use of de-identified data can contribute to improved quality of life and symptom management for the patient. The study’s authors noted that early referral for these services could transform cancer care by reducing unnecessary and expensive treatments at EoL, which can conflict with patient preferences and lower their quality of life.

De-identified data can also be used in developing predictive analytics tools. To address healthcare gaps created by the COVID-19 pandemic, UnitedHealthcare developed one such tool that used de-identified data to address social determinants of health and improve patient care.

The Centers for Disease Control and Prevention (CDC) indicated that SDOH have a greater influence on a person’s health than their access to healthcare services or their genetics, making tackling social determinants key to enhancing population health.

To eliminate care gaps, UnitedHealthcare created an advocacy system to assist members who might be struggling due to their social environment. Through predictive analytics and a machine learning model, the advocacy system can evaluate de-identified data from members and determine the need for social services.

The data is then loaded into an agent dashboard used by UnitedHealthcare advocates. When a member calls in, advocates can connect the caller to community resources at low or no cost.

De-identified data allows medical professionals both to develop tools that better serve patients and to advance research that produces improved outcomes.
