Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Jin, Qiao; Chen, Fangyuan; Zhou, Yiliang; Xu, Ziyang; Cheung, Justin M.; Chen, Robert; Summers, Ronald M.; Rousseau, Justin F.; Ni, Peiyun; Landsman, Marc J; Baxter, Sally L.; Al'Aref, Subhi J.; Li, Yijia; Chen, Alex; Brejt, Josef A.; Chiang, Michael F.; Peng, Yifan; Lu, Zhiyong

doi:10.1038/s41746-024-01185-7


arXiv:2401.08396 (cs)

[Submitted on 16 Jan 2024 (v1), last revised 31 Aug 2024 (this version, v4)]



Abstract: Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on multiple-choice accuracy alone. Our study extends the current scope by comprehensively analyzing GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before such multimodal AI models are integrated into clinical workflows.
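The abstract separates two measurements: accuracy over all questions, and the share of correctly answered questions whose rationale is still flawed. A minimal sketch of that distinction follows, assuming a hypothetical per-case annotation record; the `Case` fields, function names, and toy data are illustrative only and are not taken from the paper or its data.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical annotation for one question (not the paper's schema)."""
    answer_correct: bool    # did the model pick the right option?
    image_flawed: bool      # flawed image comprehension in the rationale
    knowledge_flawed: bool  # flawed recall of medical knowledge
    reasoning_flawed: bool  # flawed step-by-step reasoning


def accuracy(cases):
    """Fraction of all cases answered correctly."""
    return sum(c.answer_correct for c in cases) / len(cases)


def flawed_rate_among_correct(cases):
    """Fraction of correctly answered cases with at least one flawed rationale."""
    correct = [c for c in cases if c.answer_correct]
    if not correct:
        return 0.0
    flawed = sum(
        c.image_flawed or c.knowledge_flawed or c.reasoning_flawed
        for c in correct
    )
    return flawed / len(correct)


# Toy data, invented purely for illustration.
toy = [
    Case(True, False, False, False),
    Case(True, True, False, False),
    Case(True, False, False, True),
    Case(True, False, False, False),
    Case(False, True, True, False),
]

print(accuracy(toy))                   # 4 of 5 cases correct
print(flawed_rate_among_correct(toy))  # 2 of the 4 correct cases are flawed
```

The second metric is conditioned on a correct final answer, so it can be high even when overall accuracy is high, which is the pattern the abstract reports.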

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2401.08396 [cs.CV]
	(or arXiv:2401.08396v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.08396
Journal reference:	npj Digital Medicine, 2024
Related DOI:	https://doi.org/10.1038/s41746-024-01185-7

Submission history

From: Qiao Jin
[v1] Tue, 16 Jan 2024 14:41:20 UTC (6,405 KB)
[v2] Wed, 24 Jan 2024 17:12:51 UTC (8,286 KB)
[v3] Mon, 22 Apr 2024 23:04:41 UTC (3,142 KB)
[v4] Sat, 31 Aug 2024 23:51:14 UTC (201 KB)
