Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Man, Yunze; Huang, De-An; Liu, Guilin; Sheng, Shiwei; Liu, Shilong; Gui, Liang-Yan; Kautz, Jan; Wang, Yu-Xiong; Yu, Zhiding

Computer Science > Computer Vision and Pattern Recognition

arXiv:2505.23766 (cs)

[Submitted on 29 May 2025]

Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective.

Comments:	CVPR 2025. Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2505.23766 [cs.CV]
	(or arXiv:2505.23766v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2505.23766

Submission history

From: Yunze Man
[v1] Thu, 29 May 2025 17:59:56 UTC (20,955 KB)
