Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Xie, Shenghao; Zu, Wenqiang; Zhao, Mingyang; Su, Duo; Liu, Shilong; Shi, Ruohua; Li, Guoqi; Zhang, Shanghang; Ma, Lei

arXiv:2410.22217 (cs)

[Submitted on 29 Oct 2024 (v1), last revised 30 Oct 2024 (this version, v2)]

Abstract: Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next-token prediction paradigm. Recently, there has been growing interest in extending this success to vision foundation models. In this survey, we review recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for the next generation of vision foundation models, i.e., unifying both understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models and present a formal definition of autoregression along with its advantages. Next, we categorize autoregressive vision foundation models by their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey to comprehensively summarize autoregressive vision foundation models under the trend of unifying understanding and generation. A collection of related resources is available at this https URL.

Comments:	17 pages, 1 table, 2 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.22217 [cs.CV]
	(or arXiv:2410.22217v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.22217

Submission history

From: Shenghao Xie
[v1] Tue, 29 Oct 2024 16:48:22 UTC (90 KB)
[v2] Wed, 30 Oct 2024 17:51:26 UTC (91 KB)
