How a multi-agent AI system can help identify cognitive decline
Mass General Brigham researchers have created an open-source, multi-agent AI system that analyzes clinical notes to identify early cognitive decline without extra clinical work.
Early cognitive decline often goes undetected until symptoms are advanced, not because clinicians miss warning signs, but because many existing tools to screen for cognitive decline are difficult to deploy consistently across large patient populations. Traditional cognitive assessments take time, require trained staff and depend on patient participation -- resources many health systems lack.
To address that gap, researchers at Mass General Brigham (MGB) developed an AI system that screens routine clinical notes for cognitive concerns without creating additional tasks for clinicians. The approach is described in a study published in npj Digital Medicine.
"If you ask providers to do extra work, adoption fails," said Lidia Moura, M.D., Ph.D., director of population health in neurology at MGB and co-senior author on the study. "We are living in an environment where workforce shortages and increasing clinical demand are colliding."
Large-scale screening without added work
Faced with growing demand and limited clinical capacity, the researchers designed a cognitive screening approach that does not require clinicians to do additional work. Rather than administering dedicated assessments or relying on patient surveys, the AI system uses narrative clinical notes already written during care.
By reviewing progress notes, histories and other routine documentation, the system looks for contextual patterns associated with cognitive concern. Its purpose is to flag patients who may benefit from formal evaluation.
"This is a big cast-net screening tool. It's not to replace diagnostics," Moura said. "We were aiming to support population health by identifying those who need formal screening at the right time, so they do not miss the window [for therapy]."
The system runs automatically and does not require orders, forms or direct clinician involvement for the screening stage.
Building autonomy through a multi-agent design
The project initially depended on expert-guided prompt engineering, with clinicians repeatedly adjusting instructions for large language models (LLMs). That approach proved difficult to scale and produced inconsistent results.
To address those limitations, the team shifted to a multi-agent autonomous AI architecture modeled on how clinicians work together in practice.
"It is usually not one person's call to interpret these complex brain health conditions from notes," said Hossein Estiri, Ph.D., co-senior author of the study and director of the MGB Clinical Augmented Intelligence research group. "It is more like a clinical team that talks to each other."
The resulting system coordinates five specialized agents. One reviews notes for cognitive concerns, while others focus on identifying false positives and false negatives. Two summarizer agents integrate those findings and refine the system's reasoning over time.
"The power of the agent approach is that we can break a complicated problem into subtasks and clearly define what each agent is responsible for," said study first author Jiazi Tian, an MGB data scientist.
The researchers said that transparency was a core design goal.
"Each agent is clearly documented, and we can see exactly what they are thinking and how they are refining the prompt," Tian said. "That means the results are not a black box. They are statistically effective, but also clinically meaningful."
Performance, tradeoffs and guardrails
The analysis of the AI system included more than 3,300 clinical notes from 200 anonymized patients at Mass General Brigham.
The autonomous AI system demonstrated 98% specificity during testing, reducing the likelihood of false-positive flags. Sensitivity was lower when evaluated against real-world prevalence (62%) than when assessed against a balanced training data set (91%), a result the authors say was expected given the system's conservative design.
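For readers less familiar with these metrics, the short snippet below shows how specificity and sensitivity are calculated from a confusion matrix. The counts are invented to mirror the reported percentages for illustration only; they are not the study's data.

```python
# Specificity = TN / (TN + FP): how often patients without cognitive decline
# are correctly left unflagged.
# Sensitivity = TP / (TP + FN): how often patients with cognitive decline
# are correctly flagged.
# The counts below are hypothetical, chosen only to mirror the reported figures.
tp, fn = 62, 38   # hypothetical 100 patients with cognitive decline
tn, fp = 98, 2    # hypothetical 100 patients without cognitive decline

sensitivity = tp / (tp + fn)   # 0.62 -> more missed cases under a conservative design
specificity = tn / (tn + fp)   # 0.98 -> very few false-positive flags

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```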
Independent expert review supported the system's reasoning in 58% of cases where human reviewers initially disagreed, suggesting that the AI often applied clinically appropriate judgment in cases that at first appeared to be errors.
However, the researchers caution that real-world use would require population-specific validation, recalibration and continued monitoring.
"We really want this to be used, but we want it to be used responsibly," Moura said.
How close are we to clinical use?
As part of the study, the research team made the underlying framework, Pythia, available as open source. They said that the work is meant to be examined and tested further rather than deployed immediately.
"This is not a finished product," Estiri said. "We want others to validate it, improve it and use it responsibly."
The system can run on standard infrastructure using widely available LLMs and may be suitable for pilot use as a decision-support tool. The researchers stress, however, that clinical decisions must remain in human hands.
"At the end of the day, decisions have to be made by humans," Estiri said.
Many clinical AI tools falter not because they lack accuracy, but because they require new workflows. This work suggests a different approach, in which screening operates quietly in the background and adoption depends on validation and governance rather than clinician behavior change.
Elizabeth Stricker, BSN, RN, comes from a nursing and healthcare leadership background, and covers health technology and leadership trends for B2B audiences.