ChatGPT Achieves High Accuracy in Clinical Decision-Making Tasks

Researchers from Mass General Brigham found ChatGPT to be 72 percent accurate overall across the clinical decision-making process and in various care settings.

Mass General Brigham researchers demonstrated that ChatGPT can achieve “impressive accuracy” in clinical decision-making, with increased performance as the tool is fed more clinical information, according to a study published last week in the Journal of Medical Internet Research (JMIR).

The research team highlighted that large language models (LLMs) and artificial intelligence (AI)-based chatbots are advancing rapidly, with some already showing promise for healthcare applications. However, LLMs’ capacity to assist across the full clinical reasoning and decision-making process had not been thoroughly investigated.

Thus, the study aimed to evaluate ChatGPT’s capacity for clinical decision support across medical specialties and within both primary care and emergency department settings.

The researchers did so by inputting 36 published clinical vignettes into the model and tasking it with making recommendations for differential diagnoses, diagnostic testing, final diagnosis, and management for each case. ChatGPT’s recommendations were based on each patient’s gender, age, and case acuity as described in the vignettes.
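For illustration only, a workflow like this could be scripted against an LLM API, although the study worked with ChatGPT directly rather than through code. The sketch below assumes the OpenAI Python client; the vignette text, model name, and question prompts are hypothetical placeholders, not the study's actual materials.

```python
# Illustrative sketch of feeding a clinical vignette to an LLM stage by stage.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# vignette and prompts are invented placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()

vignette = (
    "A 45-year-old woman presents to the emergency department with "
    "acute chest pain radiating to the left arm..."  # placeholder text
)

# One question per stage of the care scenario described in the study.
questions = [
    "List a differential diagnosis for this patient.",
    "What diagnostic tests would you order next?",
    "What is the most likely final diagnosis?",
    "Outline an appropriate management plan.",
]

for question in questions:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are assisting with clinical decision support."},
            {"role": "user", "content": f"{vignette}\n\n{question}"},
        ],
    )
    print(response.choices[0].message.content)
```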

“Our paper comprehensively assesses decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario, from differential diagnosis all the way through testing, diagnosis, and management,” said corresponding author Marc Succi, MD, associate chair of innovation and commercialization and strategic innovation leader at Mass General Brigham, in the press release describing the study.

The model’s accuracy was measured as the proportion of correct responses to the questions posed in each vignette, as calculated by human scorers.
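As a simple illustration of that metric, accuracy is just the share of responses the human scorers judged correct; the values below are made up.

```python
# Hypothetical scorer judgments (1 = correct, 0 = incorrect) for a handful
# of vignette questions; the study's real scores are not reproduced here.
scores = [1, 0, 1, 1, 0, 1, 1]

accuracy = sum(scores) / len(scores)
print(f"Accuracy: {accuracy:.1%}")  # 71.4% for this made-up set
```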

Per these criteria, ChatGPT achieved an overall accuracy of 71.7 percent across all 36 clinical vignettes.

The tool’s highest performance, 76.9 percent, was in making a final diagnosis, while its lowest performance was 60.3 percent for generating an initial differential diagnosis. The model was also 68 percent accurate in clinical management decisions. Performance was consistent across primary and emergency care settings.

ChatGPT also demonstrated inferior performance on differential diagnosis and clinical management question types compared to answering questions about general medical knowledge. Additionally, the LLM’s answers did not exhibit gender bias.

“No real benchmarks [exist], but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident,” Succi explained. “This tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy.”

These findings led the researchers to conclude that ChatGPT’s performance was “impressive,” but they noted two major limitations that require further investigation before the tool can be implemented in clinical care: the unclear composition of the LLM’s training data and the possibility of model hallucinations.

The results also underscore the role of advanced technologies in assisting, rather than replacing, clinicians.

“ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do,” said Succi. “That is important because it tells us where physicians are truly experts and adding the most value—in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed.”

Moving forward, the research team will investigate whether AI tools can improve patient care and outcomes for hospitals in resource-constrained areas.

This research is part of a growing effort to explore the potential of LLMs in healthcare.

In June, a team from New York University (NYU) Grossman School of Medicine shared that NYUTron, their LLM for forecasting readmissions, length of stay, and other clinical outcomes, had been deployed across NYU Langone Health.

The tool leverages unaltered text from electronic health records (EHRs) to predict 30-day all-cause readmission, in-hospital mortality, comorbidity index, length of stay, and insurance denials.

During development and validation, NYUTron identified 85 percent of patients who died in the hospital, a 7 percent improvement over standard in-hospital mortality prediction methods.

The tool also performed well on length of stay, accurately forecasting the actual length of stay for 79 percent of patients, a 12 percent improvement over standard methods.
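The 85 percent mortality figure is, in effect, a sensitivity (recall) measure: the share of patients who actually died that the model flagged. A minimal sketch of that calculation, using invented labels and predictions:

```python
# Invented ground-truth labels and model predictions (1 = died in hospital).
# Recall (sensitivity) = true positives / all actual positives.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]

true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = true_positives / sum(actual)
print(f"Recall: {recall:.0%}")  # 75% for this toy example
```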
