MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow

2503.18968

Ziyue Wang et el.

🤗 2025-03-31

TL;DR
#

Multi-modal Large Language Models (MLLMs) show potential in medical diagnosis, yet face hurdles like limited visual input perception and reasoning inconsistencies. Current MLLMs primarily function in one-hop question answering, falling short of the expert-driven, evidence-based analysis needed in real-world medical diagnosis. These models often exhibit hallucinations and inconsistencies in reasoning, making it difficult to strictly adhere to established medical criteria. This limitation raises safety concerns, highlighting the need for advancements to improve reliability and clinical applicability.

To solve these issues, the paper introduces MedAgent-Pro, an evidence-based reasoning agentic system. It uses hierarchical workflow that integrates medical guidelines and expert tools. Multiple tool agents process multi-modal inputs & analyze indicators, providing diagnoses based on quantitative and qualitative evidence. The system achieves superior results in 2D & 3D medical diagnosis tasks, demonstrating its effectiveness.

Key Takeaways
#

Why does it matter?
#

This paper introduces a new method, that uses reasoning to improve medical diagnoses by incorporating evidence. The demonstrated enhancements in diagnostic accuracy & interpretability, is significant for pushing the boundaries of AI in healthcare. This work could inspire new research into AI-driven diagnostic tools and personalized medicine approaches.

Visual Insights
#

🔼 This figure compares the performance of existing large multimodal language models (MLLMs) and the MedAgent-Pro framework in diagnosing two diseases: glaucoma and heart disease. It demonstrates that MLLMs, while capable of processing multi-modal data, often lack the precision and comprehensive reasoning necessary for reliable diagnoses. Their limitations are highlighted in red text. In contrast, MedAgent-Pro, which leverages a reasoning agentic workflow, provides more accurate and evidence-based diagnoses, supported by visual evidence and relevant literature citations (green text). This showcases MedAgent-Pro’s improved diagnostic accuracy and interpretability.
read the caption
Figure 1: Comparison of existing MLLMs and our MedAgent-Pro framework on two disease diagnoses. Red text highlights the limitations of MLLMs, while green text represents the evidence-based diagnosis provided by MedAgent-Pro. Our approach enhances diagnostic accuracy while offering comprehensive literature support and visual evidence.

Method	Glaucoma		Heart Disease
Method	mACC	F1	mACC	F1
GPT-4o [1]	-	-	-	-
LLaVa-Med [33]	50.0	0.0	50.0	0.0
Janus-Pro-7B [13]	53.4	13.3	52.3	10.7
BioMedClip [74]	58.1	21.3	47.0	37.8
MedAgent-Pro (MOE Decider)	90.4	76.4	66.8	52.6
MedAgent-Pro (LLM Decider)	75.9	44.8	63.8	44.1

🔼 This table compares the performance of different single foundation models (GPT-40, LLaVa-Med, Janus-Pro-7B, and BioMedClip) and the proposed MedAgent-Pro model (with two different decider agents: MOE and LLM) on two medical diagnosis tasks: glaucoma and heart disease. The comparison uses two metrics: macro-averaged accuracy (MACC) and F1 score. The ‘-’ symbol indicates that the method failed to provide a diagnosis for most cases. The results show that MedAgent-Pro significantly outperforms the other single foundation models on both tasks.
read the caption
Table 1: Comparison with single foundation models on two diagnosis tasks (%). '-' means the method refuses to give a clear diagnosis for most of the cases.

In-depth insights
#

Agentic Med Diag
#

Agentic medical diagnosis represents a paradigm shift in AI-driven healthcare, moving beyond isolated diagnostic tasks towards comprehensive, collaborative systems. These systems leverage the strengths of Large Language Models (LLMs) and specialized AI tools within an agentic framework to mimic the reasoning and decision-making processes of human clinicians. Key features include the ability to integrate multi-modal patient data, access and apply medical knowledge, perform quantitative image analysis, and provide explainable diagnoses with supporting evidence. Agentic architectures address the limitations of directly applying LLMs to medical diagnosis, such as hallucinations, inconsistencies, and a lack of quantitative analysis capabilities. By combining LLMs with external tools and structured workflows, agentic systems can achieve greater accuracy, reliability, and interpretability in medical decision-making. The focus on evidence-based reasoning and adherence to clinical guidelines ensures that diagnoses are well-founded and aligned with established medical protocols. Future research will likely explore enhancing the collaboration and communication between agents, incorporating more diverse data modalities, and validating the clinical utility of these systems in real-world healthcare settings.

Evidence Workflow
#

An evidence workflow, crucial in medical AI, systematically gathers and analyzes data to support diagnoses. It begins with data collection from various sources (images, text, patient data), followed by preprocessing to standardize and clean the information. AI models then extract relevant features, which are synthesized into evidence. This evidence is evaluated against medical knowledge and guidelines, generating a diagnosis with supporting rationale. Transparency is key, detailing the evidence used. The workflow is iterative, refining diagnoses as new evidence emerges. It enhances reliability by grounding decisions in verifiable data, improving clinical trust and patient outcomes. Standardizing this process ensures consistent and explainable AI application in healthcare.

Hierarchical AI
#

Hierarchical AI represents a multi-layered approach to artificial intelligence, mirroring the structured organization found in complex systems like the human brain. This paradigm facilitates sophisticated problem-solving by decomposing intricate tasks into manageable sub-tasks, each addressed by specialized AI modules. The benefits are multi-fold: enhanced modularity allows for easier updates and maintenance, improved interpretability provides insights into the decision-making process at each level, and greater scalability enables the system to handle increasingly complex scenarios. A hierarchical architecture is not without its challenges, requiring careful design to ensure seamless communication and coordination between different AI modules. Effective management of information flow and decision delegation is crucial for optimal performance. Moreover, the creation of robust and adaptable hierarchical AI systems demands advanced techniques in areas such as reinforcement learning, knowledge representation, and automated planning. The potential applications span diverse fields, including robotics, natural language processing, and autonomous systems, offering a pathway towards more intelligent and adaptable AI solutions.

Multi-Modal Focus
#

In the context of medical diagnosis, a “Multi-Modal Focus” signifies the integration of diverse data types for a more comprehensive patient assessment. This includes combining visual information from medical images (X-rays, MRIs) with structured data like patient history, lab results, and clinical notes. MLLMs are limited in quantitative analysis, making them hard to apply in clinical practice. A multi-modal approach aims to leverage the strengths of each modality, enabling more accurate and robust diagnoses. This is especially crucial when a single data source is insufficient or ambiguous, relying heavily on human doctors. The development of AI systems capable of autonomously handling the entire diagnostic workflow has become a key research focus. This integration also addresses the challenge of data heterogeneity, as medical information is often stored in various formats and locations, requiring sophisticated data fusion techniques for effective analysis.

Interpretability
#

While the paper doesn’t explicitly have a section labeled “Interpretability,” the core concept is woven throughout its design and evaluation. MedAgent-Pro prioritizes explainability by design, contrasting with end-to-end MLLM approaches often acting as black boxes. The hierarchical workflow – task-level planning and case-level execution – promotes transparency. Each stage leverages specialized tools, providing traceable steps in the diagnostic process. The use of both quantitative and qualitative indicators, combined with visual evidence and clinical guidelines offers a multifaceted rationale behind the diagnosis. This is further enhanced by the MOE decider, assigning weights to different indicators to promote a more in-depth understanding of how the diagnosis is made rather than just a result. The case study (illustrated in Figure 3) exemplifies how MedAgent-Pro makes decisions transparent and justified. Finally, MedAgentPro’s ability to provide supporting evidence offers users a more grounded approach compared to MLLMs.

REFUGE2 winners			Ophthalmology Expert MLLMs
Team Name	AUC	Rank	Method	mAcc	F1
VUNO EYE TEAM	88.3	1	RetiZero [63]	50.8	18.4
MIG	87.6	2	VisionUnite [38]	85.8	73.1
MAI	86.1	3	MedAgent-Pro (LLM decider)	75.9	44.8
MedAgent-Pro (MOE decider)	95.1	-	MedAgent-Pro (MOE decider)	90.4	76.4

Indicators				Single Indicator		Multiple Indicators
vCDR	RT	PPA	DH	mACC	F1	MOE decider		LLM decider
vCDR	RT	PPA	DH	mACC	F1	mACC	F1	mACC	F1
✓				81.7	65.9	-	-	-	-
	✓			70.8	31.3	-	-	-	-
		✓		81.0	74.6	-	-	-	-
			✓	66.8	29.6	-	-	-	-
✓	✓			-	-	87.0	55.0	71.5	55.4
✓		✓		-	-	93.8	78.7	69.7	52.0
	✓	✓		-	-	80.4	70.4	52.8	14.3
✓	✓	✓		-	-	90.1	81.5	73.3	61.3
✓	✓	✓	✓	-	-	90.4	76.4	75.9	44.8

MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow

TL;DR
#

Key Takeaways
#

Why does it matter?
#

Visual Insights
#

In-depth insights
#

Agentic Med Diag
#

Evidence Workflow
#

Hierarchical AI
#

Multi-Modal Focus
#

Interpretability
#

More visual insights
#

Full paper
#

TL;DR#

Key Takeaways#

Why does it matter?#

Visual Insights#

In-depth insights#

Agentic Med Diag#

Evidence Workflow#

Hierarchical AI#

Multi-Modal Focus#

Interpretability#

More visual insights#

Full paper#

TL;DR
#

Key Takeaways
#

Why does it matter?
#

Visual Insights
#

In-depth insights
#

Agentic Med Diag
#

Evidence Workflow
#

Hierarchical AI
#

Multi-Modal Focus
#

Interpretability
#

More visual insights
#

Full paper
#