Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

Jang, You-Won; Heo, Yu-Jung; Kim, Jaeseok; Lee, Minsu; Chang, Du-Seong; Zhang, Byoung-Tak


arXiv:2509.21251 (cs)

[Submitted on 25 Sep 2025]

Authors:You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang


Abstract: Vision-language understanding has been an active research area in recent years, driven by the development of Large Language Models (LLMs). However, current models still struggle with problems that require multi-step reasoning, even when the question itself is very simple. Recent studies address this by using LLMs to iteratively generate sub-questions and sub-answers. This approach has two disadvantages: 1) the fine-grained visual content of the image is unavailable to LLMs that cannot read visual information, and 2) black-box LLMs have inaccessible internal mechanisms, which makes results difficult to reproduce. To solve these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers that help infer the answer to the main question, and the Reasoner reasons over the main question while taking the generated sub-question information into account. Our experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, reasons more accurately than previous works.
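
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `self_questioning_vqa`, the callable signatures, the fixed `num_steps` stopping rule, and the toy stand-in models are all assumptions made for the example. The abstract does not specify what inputs each component receives or how many iterations are run.

```python
from typing import Callable, List, Tuple

# Hypothetical role signatures (assumptions, not taken from the paper).
# Each role takes the image plus some text context and returns text.
QAPair = Tuple[str, str]
Questioner = Callable[[object, str, List[QAPair]], str]
Answerer = Callable[[object, str], str]
Reasoner = Callable[[object, str, List[QAPair]], str]


def self_questioning_vqa(
    image: object,
    main_question: str,
    questioner: Questioner,
    answerer: Answerer,
    reasoner: Reasoner,
    num_steps: int = 3,  # assumed fixed iteration count
) -> Tuple[str, List[QAPair]]:
    """Collect sub-question/sub-answer pairs, then answer the main question."""
    history: List[QAPair] = []
    for _ in range(num_steps):
        # Questioner proposes a sub-question given the image and prior pairs.
        sub_question = questioner(image, main_question, history)
        # Answerer answers that sub-question while looking at the image.
        sub_answer = answerer(image, sub_question)
        history.append((sub_question, sub_answer))
    # Reasoner answers the main question using the collected pairs.
    final_answer = reasoner(image, main_question, history)
    return final_answer, history


# Toy stand-ins so the sketch runs without any model weights.
def toy_questioner(image, main_question, history):
    return f"sub-question {len(history) + 1} about: {main_question}"


def toy_answerer(image, question):
    return f"answer to [{question}]"


def toy_reasoner(image, main_question, history):
    return f"final answer using {len(history)} sub-QA pairs"


if __name__ == "__main__":
    answer, pairs = self_questioning_vqa(
        None, "What is the man holding?",
        toy_questioner, toy_answerer, toy_reasoner, num_steps=2,
    )
    print(answer)
    for sub_q, sub_a in pairs:
        print(sub_q, "->", sub_a)
```

In a real system the three callables would wrap vision-language models; passing them in as plain functions keeps the control flow separate from any particular model API.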

Comments:	This paper was accepted to the "CLVL: 5th Workshop on Closing the Loop Between Vision and Language (ICCV 2023 CLVL workshop)."
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2509.21251 [cs.CV]
	(or arXiv:2509.21251v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2509.21251

Submission history

From: Youwon Jang
[v1] Thu, 25 Sep 2025 14:45:06 UTC (1,339 KB)
