The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Jeong, Daniel P.; Mani, Pranav; Garg, Saurabh; Lipton, Zachary C.; Oberst, Michael

arXiv:2411.08870 (cs)

[Submitted on 13 Nov 2024 (v1), last revised 28 Feb 2025 (this version, v2)]

Abstract: Several recent works seek to adapt general-purpose large language models (LLMs) and vision-language models (VLMs) for medical applications through continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining improves performance on various downstream medical tasks, such as answering medical exam questions. In this paper, we compare ten "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question answering (QA). For instance, on clinical-note-based QA tasks in the 3-shot setting, medical LLMs outperform their base models in only 26.7% of cases, reach a (statistical) tie in 16.7% of cases, and perform significantly worse in the remaining 56.7% of cases. Our conclusions are based on (i) comparing each medical model directly against its base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
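To make points (i) and (iii) concrete, the following is a minimal sketch of how a single medical-versus-base comparison could be labeled as a win, tie, or loss while accounting for statistical uncertainty. The abstract does not specify which statistical procedure the authors use, so the percentile bootstrap over paired per-question correctness shown here is an assumption, as are the function names `paired_bootstrap_verdict` and `win_tie_loss_rates`, the 95% interval, and the number of resamples.

```python
import random


def paired_bootstrap_verdict(medical_correct, base_correct,
                             n_resamples=2000, alpha=0.05, seed=0):
    """Label one medical-vs-base comparison as 'win', 'tie', or 'loss'.

    Inputs are per-question 0/1 correctness lists for the SAME questions.
    Assumption: a percentile bootstrap on the paired accuracy difference;
    the abstract does not say which test the paper actually uses.
    """
    if len(medical_correct) != len(base_correct):
        raise ValueError("both models must be scored on the same questions")
    n = len(medical_correct)
    rng = random.Random(seed)

    # Paired differences: +1 medical-only correct, -1 base-only correct.
    diffs = [m - b for m, b in zip(medical_correct, base_correct)]

    # Resample questions with replacement and record the mean difference.
    boot_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()

    # Percentile confidence interval for the accuracy difference.
    lower = boot_means[int((alpha / 2) * n_resamples)]
    upper = boot_means[int((1 - alpha / 2) * n_resamples) - 1]

    if lower > 0:
        return "win"    # medical model significantly better
    if upper < 0:
        return "loss"   # medical model significantly worse
    return "tie"        # interval contains zero


def win_tie_loss_rates(verdicts):
    """Aggregate verdicts over many (model pair, dataset) comparisons."""
    total = len(verdicts)
    return {k: verdicts.count(k) / total for k in ("win", "tie", "loss")}
```

Under this sketch, each (medical model, base model, dataset) triple yields one verdict, and percentages like those quoted in the abstract would correspond to the shares returned by `win_tie_loss_rates`. Pairing the two models on the same questions keeps question difficulty from inflating the variance of the difference.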

Comments:	Extended version of EMNLP 2024 paper arXiv:2411.04118. Includes additional results on clinical note QA tasks and supervised fine-tuning evaluations
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2411.08870 [cs.CL]
	(or arXiv:2411.08870v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.08870

Submission history

From: Daniel Jeong
[v1] Wed, 13 Nov 2024 18:50:13 UTC (1,425 KB)
[v2] Fri, 28 Feb 2025 07:34:44 UTC (2,155 KB)
