OneLLM: One Framework to Align All Modalities with Language

Han, Jiaming; Gong, Kaixiong; Zhang, Yiyuan; Wang, Jiaqi; Zhang, Kaipeng; Lin, Dahua; Qiao, Yu; Gao, Peng; Yue, Xiangyu

arXiv:2312.03700 (cs)

[Submitted on 6 Dec 2023 (v1), last revised 9 Jan 2025 (this version, v2)]

Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at this https URL
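
The idea of a universal projection module built by "mixing multiple image projection modules and dynamic routing" can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the OneLLM code: the class name `ToyUniversalProjection`, the use of plain linear maps as stand-ins for the projection modules, the per-token soft (softmax) router, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyUniversalProjection:
    """Toy mixture of projection experts with soft routing.

    Hypothetical stand-in: each expert is a single linear map, and the
    router is a linear layer followed by a softmax over experts.
    """

    def __init__(self, dim_in, dim_out, num_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        # One (dim_in x dim_out) weight matrix per projection expert.
        self.experts = rng.normal(0.0, 0.02, size=(num_experts, dim_in, dim_out))
        # Router weights: map each token to one logit per expert.
        self.router = rng.normal(0.0, 0.02, size=(dim_in, num_experts))

    def route(self, tokens):
        """Return per-token mixing weights of shape (num_tokens, num_experts)."""
        return softmax(tokens @ self.router, axis=-1)

    def __call__(self, tokens):
        weights = self.route(tokens)
        # Apply every expert to every token: (num_tokens, num_experts, dim_out).
        expert_out = np.einsum("nd,kde->nke", tokens, self.experts)
        # Weighted sum over experts: (num_tokens, dim_out).
        return np.einsum("nk,nke->ne", weights, expert_out)


# Four encoder tokens of width 16, projected to a hypothetical LLM width of 32.
tokens = np.random.default_rng(1).normal(size=(4, 16))
upm = ToyUniversalProjection(dim_in=16, dim_out=32)
projected = upm(tokens)
print(projected.shape)
```

In this sketch the same module can be fed tokens from any modality, and only the routing weights change with the input. Whether the actual model routes per token or per sample, and what form its projection modules take, should be checked against the paper and the released code.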

Comments:	Accepted by CVPR 2024. Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2312.03700 [cs.CV]
	(or arXiv:2312.03700v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.03700

Submission history

From: Jiaming Han
[v1] Wed, 6 Dec 2023 18:59:19 UTC (3,188 KB)
[v2] Thu, 9 Jan 2025 09:12:06 UTC (3,188 KB)
