LLMs struggle with clinical reasoning, study finds
A Mass General Brigham study found that LLMs often fail at identifying differential diagnoses but can deliver accurate final diagnoses when given complete patient information.
Large language models, or LLMs, such as publicly available AI chatbots, lack clinical reasoning capabilities when presented with incomplete data and need to remain under physician supervision, according to a new study by Mass General Brigham researchers published April 13 in JAMA Network Open.
LLMs delivered a correct final diagnosis more than 90% of the time when they received comprehensive data about a patient case, but they failed to provide appropriate differential diagnoses more than 80% of the time, the study found.
Differential diagnoses are "the most important part of medicine," according to Marc Succi, M.D., the study's senior author and executive director of the MESH Incubator at Mass General Brigham. "It's a list of plausible diagnoses."
Differential diagnoses include all the potential conditions that could cause a given set of symptoms. For instance, similar symptoms can point to both heart failure and asthma, Succi explained. After creating a list of possible conditions based on a patient's symptoms, physicians perform testing to diagnose one condition and rule out the others.
"You differentiate by ordering lab tests like a CT scan or blood tests, and eventually try and narrow it down to the final diagnosis, and so 'differential' really sets the stage for the entire subsequent medical visit," Succi explained. "That's why it's so important."
The latest study follows Succi's 2023 research, which found that ChatGPT 3.5 performed accurate clinical decision-making 72% of the time. The researchers wanted to see whether the chatbots had made progress in clinical decision-making since that study three years ago. They also wanted to conduct a comprehensive evaluation of the latest LLMs, including GPT-5.
A balanced approach to clinical reasoning
To test the diagnostic capabilities of LLMs, researchers compared 21 off-the-shelf AI chatbots available for public use, which included the latest versions of ChatGPT, DeepSeek, Claude, Gemini and Grok. They also introduced a new benchmarking tool called Proportional Index of Medical Evaluation for LLMs (PrIME-LLM), which produces more balanced results by measuring the clinical competence of LLMs across five domains: differential diagnosis, diagnostic testing, final diagnosis, management and miscellaneous clinical reasoning questions.
The study noted that a majority of LLMs are limited to "static knowledge retrieval," and their performance is typically evaluated on multiple-choice tests like the U.S. Medical Licensing Exam or board exams. But Succi wanted to evaluate the chatbots as medicine is actually practiced. That meant conducting a "stepwise evaluation," he said.
"The LLM has to form its own differential. Then it has to order its own lab test, get information based on those lab tests, and then it collects it all and makes a diagnosis," Succi said. "So it's more reflective of how doctors actually practice medicine."
To evaluate the LLMs, researchers entered basic patient data into the models, such as age, gender and symptoms. Then they added physical examination findings and lab results. The LLMs improved in accuracy when given lab results and imaging alongside the text. Researchers tested the LLMs across all aspects of a medical visit, including differential diagnosis, lab tests and patient management, Succi said.
The PrIME-LLM analysis differs from other studies that use an average score, which can mask weaknesses. An average score "can hide a model that performs very badly in one of those domains, but really well in other domains," Succi explained.
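Succi's point about averages can be made concrete with a small sketch. The per-domain scores below are hypothetical, and the article does not detail the actual PrIME-LLM formula; the example only shows how two models with near-identical averages can differ sharply in their weakest domain:

```python
# Hypothetical per-domain scores (not from the study) for two models
# evaluated on the five PrIME-LLM domains.
model_a = {"differential": 0.30, "testing": 0.95, "final": 0.95,
           "management": 0.95, "misc": 0.95}
model_b = {"differential": 0.80, "testing": 0.82, "final": 0.84,
           "management": 0.80, "misc": 0.80}

def average(scores):
    """Simple mean across domains -- the kind of score that can mask gaps."""
    return sum(scores.values()) / len(scores)

def weakest_domain(scores):
    """Lowest per-domain score -- exposes where a model breaks down."""
    return min(scores.values())

# The averages look nearly identical...
print(round(average(model_a), 2))  # 0.82
print(round(average(model_b), 2))  # 0.81

# ...but the weakest-domain view reveals model A's differential-diagnosis gap.
print(weakest_domain(model_a))  # 0.3
print(weakest_domain(model_b))  # 0.8
```

A domain-aware index along these lines is one way to surface a model that, as Succi put it, "performs very badly in one of those domains, but really well in other domains."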
Researchers evaluated the LLMs on 29 published clinical cases. GPT-5 and Grok 4 scored highest in the study for overall performance, at 78%, compared with 64% for Gemini 1.5 Flash.
In a press release, Arya S. Rao, the lead study author, a MESH researcher and a candidate in the Harvard/MIT M.D.-Ph.D. program, noted that although the models performed well at delivering a final diagnosis with complete patient data, they struggled with incomplete information at the open-ended beginning of a patient case.
"Basically the models are best at converging on an answer when all information is provided," Succi noted. "It's like an open book test as opposed to reasoning through uncertainty."
Can providers rely on LLMs for diagnosis?
Succi said AI tools are suitable as a "copilot assistant" for "low-risk" tasks such as note-taking, documentation and billing. However, they should not suggest medical tests or diagnoses without physician oversight.
"These findings suggest that despite progress, current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making," the study stated.
In addition, providers cannot rely on LLMs when they're working with incomplete data, according to Succi.
"The key issue here is not whether the model can sometimes get the answer right, but whether it reasons reliably with uncertain data," Succi explained. "And I think in the context of medicine, there's so much uncertainty that that's the primary challenge in instituting these things."
He added, "There's no room for errors."
Going forward, Succi and the researchers plan to continue evaluating the chatbots' performance in different areas of medicine.
Brian T. Horowitz started covering health IT news in 2010 and the tech beat overall in 1996.