II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Kil, Jihyung; Tavazoee, Farideh; Kang, Dongyeop; Kim, Joo-Kyung


arXiv:2402.11058 (cs)

[Submitted on 16 Feb 2024 (v1), last revised 3 Jun 2024 (this version, v3)]


Abstract: Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have focused only on a model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to its answer using one of two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, demanding only "single-hop" reasoning, whereas only a few questions require "multi-hop" reasoning. Moreover, while the recent V&L model struggles with such complex multi-hop reasoning questions even when using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.
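The path analysis described in the abstract (counting hops and noting whether each one is visual or beyond-visual) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hop` class, the `analyze_path` function, the field names, and the example triplets are all hypothetical, and the sketch assumes each hop already carries a reasoning-type label, whereas II-MMR estimates this from a prompted reasoning path.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Hop:
    """One step of a reasoning path, written as a knowledge triplet (hypothetical format)."""
    subject: str
    relation: str
    obj: str
    kind: str  # assumed label: "visual" or "beyond-visual"


def analyze_path(path: List[Hop]) -> Dict[str, Union[int, str, bool]]:
    """Summarize a reasoning path: hop count, single- vs multi-hop, and reasoning types."""
    num_hops = len(path)
    kinds = {hop.kind for hop in path}
    return {
        "num_hops": num_hops,
        "hop_case": "single-hop" if num_hops <= 1 else "multi-hop",
        "needs_beyond_visual": "beyond-visual" in kinds,
    }


# Illustrative path for a made-up question such as "What sport is the man playing?"
example_path = [
    Hop("man", "holding", "racket", "visual"),
    Hop("racket", "used for", "tennis", "beyond-visual"),
]
result = analyze_path(example_path)
print(result)
```

Grouping benchmark questions by the returned `hop_case` and `needs_beyond_visual` values is one way to report accuracy per reasoning case rather than as a single overall number.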

Comments:	Accepted to ACL 2024 Findings
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2402.11058 [cs.CV]
	(or arXiv:2402.11058v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.11058

Submission history

From: Jihyung Kil
[v1] Fri, 16 Feb 2024 20:14:47 UTC (9,384 KB)
[v2] Fri, 31 May 2024 17:30:13 UTC (9,396 KB)
[v3] Mon, 3 Jun 2024 01:09:38 UTC (9,396 KB)
