OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Wu, Size; Wu, Zhonghua; Gong, Zerui; Tao, Qingyi; Jin, Sheng; Li, Qinyue; Li, Wei; Loy, Chen Change

arXiv:2505.23661 (cs)

[Submitted on 29 May 2025 (v1), last revised 2 Jun 2025 (this version, v3)]

Abstract: In this report, we present OpenUni, a simple, lightweight, and fully open-source baseline for unifying multimodal understanding and generation. Inspired by prevailing practices in unified model learning, we adopt an efficient training strategy that minimizes training complexity and overhead by bridging off-the-shelf multimodal large language models (LLMs) and diffusion models through a set of learnable queries and a lightweight transformer-based connector. With a minimalist choice of architecture, we demonstrate that OpenUni can: 1) generate high-quality and instruction-aligned images, and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. To support open research and community advancement, we release all model weights, training code, and our curated training datasets (including 23M image-text pairs) at this https URL.
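The bridging idea in the abstract can be illustrated with a toy sketch. This is not the released OpenUni code: it is a minimal pure-Python reading of "learnable queries plus a lightweight transformer-based connector", and every name and size in it (`attention_layer`, `build_condition`, the widths 8 and 4, the 3 queries, the single-layer stand-in for the frozen multimodal LLM) is an assumption made for illustration. It shows one plausible data flow: learnable queries are appended to the prompt embeddings, a frozen model processes the sequence, and the hidden states at the query positions are passed through a small connector to produce conditioning vectors of the width a diffusion model would expect.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix; stands in for pretrained or trainable weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def rand_attention(dim, rng):
    """Query/key/value projection weights for one attention layer."""
    return {name: rand_matrix(dim, dim, rng) for name in ("q", "k", "v")}


def attention_layer(x, w):
    """Single-head self-attention with a residual connection."""
    q, k, v = matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scale = math.sqrt(len(q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    mixed = matmul(weights, v)
    return [[xi + mi for xi, mi in zip(xr, mr)] for xr, mr in zip(x, mixed)]


def build_condition(prompt_emb, queries, frozen_llm, connector, proj):
    """Map a prompt to diffusion conditioning vectors via learnable queries."""
    # 1) Append the learnable queries to the prompt token embeddings.
    sequence = prompt_emb + queries
    # 2) Hypothetical stand-in for the frozen multimodal LLM: one attention
    #    layer, so the queries can read the prompt tokens.
    hidden = attention_layer(sequence, frozen_llm)
    # 3) Keep only the hidden states at the query positions.
    query_hidden = hidden[len(prompt_emb):]
    # 4) Connector (assumed here: one attention layer plus a linear projection
    #    to the conditioning width of the diffusion model).
    refined = attention_layer(query_hidden, connector)
    return matmul(refined, proj)


rng = random.Random(0)
llm_dim, cond_dim, num_prompt, num_queries = 8, 4, 5, 3  # toy sizes (assumed)

prompt_emb = rand_matrix(num_prompt, llm_dim, rng, scale=1.0)
queries = rand_matrix(num_queries, llm_dim, rng)   # would be trained
frozen_llm = rand_attention(llm_dim, rng)          # would stay frozen
connector = rand_attention(llm_dim, rng)           # would be trained
proj = rand_matrix(llm_dim, cond_dim, rng)         # would be trained

condition = build_condition(prompt_emb, queries, frozen_llm, connector, proj)
print(len(condition), len(condition[0]))  # prints "3 4"
```

In this sketch only `queries`, `connector`, and `proj` would receive gradients, which is one way to read the abstract's claim of low training overhead: the two large pretrained models are reused as-is and only the small bridge between them is learned.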

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.23661 [cs.CV]
	(or arXiv:2505.23661v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.23661

Submission history

From: Size Wu
[v1] Thu, 29 May 2025 17:09:44 UTC (463 KB)
[v2] Fri, 30 May 2025 12:25:06 UTC (463 KB)
[v3] Mon, 2 Jun 2025 13:04:26 UTC (464 KB)
