AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Schmidgall, Samuel; Ziaei, Rojin; Harris, Carl; Reis, Eduardo; Jopling, Jeffrey; Moor, Michael

arXiv:2405.07960 (cs)

[Submitted on 13 May 2024 (v1), last revised 25 May 2025 (this version, v5)]

Abstract: Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the usage of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents sourced from Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools, such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvement with the notebook tool, which allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore novel patient-centric metrics that this interactive environment enables for the first time.

Subjects:	Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Cite as:	arXiv:2405.07960 [cs.HC]
	(or arXiv:2405.07960v5 [cs.HC] for this version)
	https://doi.org/10.48550/arXiv.2405.07960

Submission history

From: Samuel Schmidgall
[v1] Mon, 13 May 2024 17:38:53 UTC (9,883 KB)
[v2] Wed, 22 May 2024 01:57:23 UTC (9,914 KB)
[v3] Thu, 30 May 2024 22:56:17 UTC (9,638 KB)
[v4] Sun, 20 Oct 2024 18:58:58 UTC (20,571 KB)
[v5] Sun, 25 May 2025 02:19:37 UTC (18,190 KB)
