mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Ye, Jiabo; Xu, Haiyang; Liu, Haowei; Hu, Anwen; Yan, Ming; Qian, Qi; Zhang, Ji; Huang, Fei; Zhou, Jingren

arXiv:2408.04840 (cs)

[Submitted on 9 Aug 2024 (v1), last revised 13 Aug 2024 (this version, v2)]

Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2408.04840 [cs.CV]
	(or arXiv:2408.04840v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.04840

Submission history

From: Jiabo Ye
[v1] Fri, 9 Aug 2024 03:25:42 UTC (4,123 KB)
[v2] Tue, 13 Aug 2024 08:10:32 UTC (4,123 KB)
