AI outperforms docs on clinical reasoning, but not ready for solo work

While an advanced LLM outperformed physicians and older models across clinical reasoning tasks, researchers warn that AI cannot yet replace clinicians and urge new testing approaches.

New research shows that a large language model outperformed physicians in various clinical reasoning tasks; however, the study's authors cautioned that the findings do not mean that AI tools are ready to autonomously practice medicine.

The question of whether AI tools can accurately perform clinical reasoning tasks has been top of mind since LLMs exploded onto the healthcare scene in late 2022. Generally, research shows that LLMs' clinical reasoning abilities are improving, but the models still struggle with certain tasks and should remain under human supervision.

However, few studies have compared the clinical reasoning capabilities of advanced LLMs with the baseline performance of human physicians. Thus, researchers from Harvard Medical School and Beth Israel Deaconess Medical Center sought to establish these baselines and assess an LLM's performance against them in a new study published in Science.

The researchers evaluated the clinical reasoning capabilities of the OpenAI o1 series. They compared the AI model's performance against hundreds of physicians across various experiments, including published patient vignettes, evaluations of new emergency room patients, and clinical tasks involving diagnoses and clinical management planning.

Overall, the AI model outperformed physicians across the experiments, including those using real, unstructured clinical data from an emergency department's EHR. In the ER experiment, the model was presented with patients at various points in their diagnostic journey: the researchers provided information at each stage, from triage to admission decisions, and asked the model to generate likely diagnoses and a treatment plan. The o1 model outperformed both GPT-4o and two expert attending physicians, as assessed by two other attending physicians.
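
The staged design described above can be pictured as a simple evaluation harness that feeds the model a cumulative record at each point in the diagnostic journey and asks for diagnoses and a plan. The sketch below is illustrative only, not the study's actual code: the stage labels, prompt wording, patient details, and use of the OpenAI Python client are all assumptions.

```python
# Illustrative sketch of a staged clinical-reasoning evaluation.
# NOT the study's code: stage names, prompt text, and patient data are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical cumulative snapshots of one ER patient's record.
stages = [
    ("triage", "58M, chest pain radiating to left arm, BP 150/90, HR 102."),
    ("initial labs", "Troponin elevated at 0.8 ng/mL; ECG shows ST depression."),
    ("admission decision", "Pain persists despite nitroglycerin; cardiology consulted."),
]

history = ""
for stage, new_info in stages:
    history += f"\n[{stage}] {new_info}"
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{
            "role": "user",
            "content": (
                "You are assisting with a diagnostic exercise. "
                f"Given the record so far:{history}\n"
                "List the most likely diagnoses and a next-step treatment plan."
            ),
        }],
    )
    print(f"--- {stage} ---")
    print(response.choices[0].message.content)
```

In the study itself, outputs like these were graded by attending physicians rather than scored automatically, which is what allowed the head-to-head comparison with human clinicians.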

In another experiment, researchers used five clinical vignettes to test the AI model's ability to provide next steps in clinical management. Using a mixed-effects model, they found that o1-preview scored 41 percentage points higher than GPT-4 alone, 41.9 percentage points higher than physicians using GPT-4, and 48.4 percentage points higher than physicians using conventional resources.
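
For readers unfamiliar with the statistical approach: a mixed-effects model estimates the average score gap between arms (e.g., o1-preview vs. physicians with conventional resources) while treating each vignette as a random effect, so that case-to-case difficulty does not bias the comparison. Below is a minimal sketch using statsmodels; the scores and arm labels are entirely made up for illustration and are not the study's data.

```python
# Minimal sketch of a mixed-effects comparison across clinical vignettes.
# Scores are fabricated toy values, NOT the study's data; fitting such a tiny
# dataset may also trigger convergence warnings.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "vignette": ["v1", "v1", "v2", "v2", "v3", "v3"] * 2,
    "arm": (["o1_preview"] * 6) + (["physician_conventional"] * 6),
    "score": [92, 88, 95, 90, 89, 93,   45, 50, 42, 48, 44, 47],
})

# Fixed effect: arm; random intercept: vignette (absorbs case difficulty).
model = smf.mixedlm("score ~ arm", data, groups=data["vignette"])
result = model.fit()
print(result.summary())  # the 'arm' coefficient estimates the score gap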

"Our findings suggest that LLMs have now eclipsed most benchmarks of clinical reasoning," the researchers concluded.

Are humans-in-the-loop still necessary for clinical AI?

In short, the answer is yes.

The researchers noted the study's limitations, including that it examined only six aspects of clinical reasoning; dozens of other identified tasks, some with potentially greater impact on actual clinical care, still need to be studied.

They also emphasized that the study assessed only text-based performance for both humans and AI, whereas clinical medicine is multifaceted and involves non-text inputs, including auditory and visual information.

"A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," Peter Brodeur, a Harvard Medical School clinical fellow in medicine at Beth Israel Deaconess and the study's co-first author, said in a press release. "Humans should be the ultimate baseline when it comes to evaluating performance and safety."

The researchers noted that new testing and research approaches are required as AI models evolve, including new benchmarks, human-computer interaction studies and prospective clinical trials.

"Models are increasingly capable," Brodeur said in the press release. "We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling."

Anuja Vaidya has covered the healthcare industry since 2012. She currently covers the virtual healthcare landscape, including telehealth, remote patient monitoring and digital therapeutics.