General-purpose AI beats out specialized clinical AI in some assessments
A new study challenges the value proposition of specialized clinical AI tools, showing they underperformed compared to general-purpose AI models across medical benchmarks.
After large language models exploded onto the scene in late 2022, developers rushed to explore their use in healthcare, creating clinical AI tools for healthcare-specific use cases. But now, a new study reveals that general-purpose AI can outperform specialized clinical AI on several medical benchmarks.
The study, published in Nature Medicine, tested two specialized LLM-based clinical AI tools, OpenEvidence and UpToDate Expert AI, against three general-purpose frontier LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. The results call into question the industry's focus on designing LLMs specifically for healthcare use cases.
Investment in specialized clinical AI is growing. Earlier this year, OpenEvidence raised $250 million in a closed series D funding round, sending its valuation skyrocketing to $12 billion. Since then, the company has expanded rapidly, releasing audio telehealth, AI coding, prescription and prioritization features.
The study authors noted that, though proprietary clinical AI tools claim to provide enhanced clinical performance over general-purpose AI, their architectures, base models and training pipelines are not publicly available. As a result, providers must assess the tools' value and safety without independent evidence, making it harder for them to scrutinize how clinical AI performs relative to general-purpose tools.
Thus, researchers from NYU Langone Health and the University of Texas at Austin set out to evaluate the tools against three medical benchmarks.
The researchers tested the AI models on three types of assessments: 500 US Medical Licensing Examination-style MedQA questions assessing medical knowledge, 500 HealthBench items evaluating agreement with expert clinicians and 100 real clinical queries (RCQ) drawn from questions physicians had posed to LLMs. Twelve clinicians conducted a randomized, blinded review for the RCQ stage.
Model performance varied, with general-purpose AI coming out on top
The general-purpose frontier AI tools outperformed the specialized clinical AI tools in all three evaluations, the study revealed.
In the MedQA questions assessment, Gemini achieved the highest accuracy at 97.4%, followed by GPT at 94.2% and Claude at 90.2%. Meanwhile, OpenEvidence achieved an accuracy of 89.6% and UpToDate achieved 88.4%.
Similarly, GPT scored highest in the HealthBench assessment, receiving a score of 88 on a 100-point scale, followed by Gemini at 79.3 and Claude at 77. Both specialized clinical AI tools scored lower: OpenEvidence at 62.6 and UpToDate at 61.3.
In the RCQ benchmark evaluation, two performance tiers emerged. The first tier, which comprised the general-purpose tools, outperformed the second tier of clinical AI tools on most individual questions, not just on average. The researchers also included Google Search AI Overview in the RCQ evaluation because it is routinely encountered by clinicians. The clinical AI tools performed comparably to the Google Search AI Overview on the RCQ.
"Clinical AI tools may carry institutional legitimacy and are likely safe for routine use, but our results show that they are not superior to frontier models on knowledge, communication or clinical alignment," the researchers wrote.
However, the researchers are not necessarily arguing that providers should use only general-purpose AI tools. Rather, they suggest that providers develop hospital-specific LLMs that leverage institutional data and use them alongside general-purpose models for less-sensitive tasks.
Anuja Vaidya has covered the healthcare industry since 2012. She currently covers healthcare IT and innovation, including artificial intelligence, digital healthcare, EHRs and interoperability.