MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

Tu, Jianhong; Ni, Zhuohao; Crispino, Nicholas; Yu, Zihao; Bendersky, Michael; Gunel, Beliz; Jia, Ruoxi; Liu, Xin; Lyu, Lingjuan; Song, Dawn; Wang, Chenguang

arXiv:2411.10557 (cs)

[Submitted on 15 Nov 2024 (v1), last revised 28 Jun 2025 (this version, v3)]

Abstract: We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary the amount of vision-language data across controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as few as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2411.10557 [cs.CL]
	(or arXiv:2411.10557v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.10557

Submission history

From: Jianhong Tu
[v1] Fri, 15 Nov 2024 20:09:59 UTC (2,668 KB)
[v2] Tue, 19 Nov 2024 05:16:28 UTC (2,668 KB)
[v3] Sat, 28 Jun 2025 18:24:35 UTC (312 KB)
