NIH study reveals pitfalls of AI in clinical decision-making

GPT-4V scored highly on the New England Journal of Medicine's Image Challenge, but made mistakes when tasked with explaining its reasoning and describing medical images.

Researchers from the National Institutes of Health (NIH) demonstrated that a multimodal AI can achieve high accuracy on a medical diagnostic quiz, but falls short when prompted to describe medical images and explain the reasoning behind its answers.

To evaluate AI's potential in clinical settings, the research team tasked Generative Pre-trained Transformer 4 with Vision (GPT-4V) with answering 207 questions from the New England Journal of Medicine (NEJM) Image Challenge.

The challenge -- designed to help healthcare professionals test their diagnostic abilities -- is an online quiz that prompts users to select a diagnosis from a set of multiple-choice answers after reviewing clinical images and a text-based description of patient symptoms and presentation.

The researchers asked the AI to both answer the questions and provide a rationale for each answer, including a description of the image presented, a summary of current, relevant clinical knowledge and step-by-step reasoning for how GPT-4V selected its answer.

Nine clinicians of various specialties were also tasked with answering the same questions, first in a closed-book environment with no access to outside resources, then in an open-book setting where they could refer to external sources.

From there, the research team provided the clinicians with the correct answers and the AI's responses, asking them to score GPT-4V's ability to describe the images, summarize medical knowledge and provide step-by-step reasoning.

The analysis revealed that both clinicians and the AI scored highly in choosing the correct diagnosis. In closed-book settings, the AI outperformed the clinicians, whereas humans outperformed the model in open-book settings.

Further, GPT-4V frequently made mistakes when explaining its reasoning and describing medical images, even in cases where it selected the correct answer.

Despite the study's small sample size, the researchers noted that their findings shed light on how multimodal AI could be used to provide clinical decision support.

"This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making," said Zhiyong Lu, Ph.D., corresponding author of the study and senior investigator at NIH's National Library of Medicine (NLM), in a press release. "Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine."

However, the research team emphasized that AI-based clinical decision support tools must be rigorously evaluated before being relied upon in practice.

"Integration of AI into healthcare holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner," explained Stephen Sherry, Ph.D., NLM acting director. "However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis."

Shania Kennedy has been covering news related to health IT and analytics since 2022.
