Multimodal Rationales for Explainable Visual Question Answering

Li, Kun; Vosselman, George; Yang, Michael Ying

arXiv:2402.03896 (cs)

[Submitted on 6 Feb 2024 (v1), last revised 10 Jun 2025 (this version, v3)]

Abstract: Visual Question Answering (VQA) is the challenging task of predicting the answer to a question about the content of an image. Prior work evaluates answering models solely by the accuracy of their predicted answers. However, the reasoning behind the predictions is disregarded in such a "black box" system, so the trustworthiness of the predictions cannot be ascertained. Even more concerning, in some cases these models predict correct answers despite focusing on irrelevant visual regions or textual tokens. To develop an explainable and trustworthy answering system, we propose a novel model termed MRVQA (Multimodal Rationales for VQA), which provides visual and textual rationales to support its predicted answers. To measure the quality of the generated rationales, we introduce a new metric, the vtS (visual-textual Similarity) score, which assesses them from both visual and textual perspectives. Because this setting requires annotations beyond those of standard VQA, MRVQA is trained and evaluated on samples synthesized from existing datasets. Extensive experiments across three explainable VQA (EVQA) datasets demonstrate that MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the explainable VQA model. The code and the synthesized dataset are released under this https URL.

Comments:	Accepted to CVPR workshops 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2402.03896 [cs.CV]
	(or arXiv:2402.03896v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.03896

Submission history

From: Kun Li
[v1] Tue, 6 Feb 2024 11:07:05 UTC (21,102 KB)
[v2] Mon, 24 Mar 2025 20:48:53 UTC (531 KB)
[v3] Tue, 10 Jun 2025 09:20:36 UTC (97 KB)
