Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Liu, Huabin; Ilievski, Filip; Snoek, Cees G. M.

arXiv:2501.05069 (cs)

[Submitted on 9 Jan 2025 (v1), last revised 25 Mar 2025 (this version, v2)]

Abstract: This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
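
As a rough illustration of how the four steps named in the abstract could fit together, the sketch below builds a small tree of statements, scores leaf statements against the video, combines the results bottom-up, and expands leaves whose score is uncertain. Everything in it is an assumption made for illustration and is not taken from the paper: the `Node` class, the `verify` and `decompose` callables (stand-ins for a VLM scorer and an LLM decomposer), the all-children-must-hold rule, and the thresholds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One statement in a hypothetical entailment tree."""
    statement: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical interfaces: a real system might back these with a VLM and an LLM.
Verifier = Callable[[str], float]        # statement -> entailment score in [0, 1]
Decomposer = Callable[[str], List[str]]  # statement -> simpler sub-statements


def evaluate(node: Node, verify: Verifier, decompose: Decomposer,
             low: float = 0.4, high: float = 0.6,
             depth: int = 0, max_depth: int = 2) -> bool:
    """Return True if the statement is judged entailed by the video."""
    # Tree reasoning (assumed rule): a parent holds only if all children hold.
    if node.children:
        return all(evaluate(c, verify, decompose, low, high, depth + 1, max_depth)
                   for c in node.children)
    # Entailment verification: score the leaf statement against the video.
    score = verify(node.statement)
    # Dynamic expansion (assumed rule): split a leaf whose score is uncertain.
    if low < score < high and depth < max_depth:
        subs = decompose(node.statement)
        if subs:
            node.children = [Node(s) for s in subs]
            return all(evaluate(c, verify, decompose, low, high, depth + 1, max_depth)
                       for c in node.children)
    return score >= 0.5


def pick_answer(trees: Dict[str, Node], verify: Verifier,
                decompose: Decomposer) -> Optional[str]:
    """Return the first candidate answer whose tree is entailed, else None."""
    for answer, root in trees.items():
        if evaluate(root, verify, decompose):
            return answer
    return None


# Toy data standing in for model outputs; the scores are made up.
toy_scores = {
    "a person opens a fridge": 0.9,
    "the person takes out food": 0.5,   # uncertain, so it gets expanded
    "the person holds an item": 0.8,
    "the item is food": 0.7,
    "a person opens a door": 0.9,
    "the person leaves the room": 0.1,
}
toy_splits = {
    "the person takes out food": ["the person holds an item", "the item is food"],
}


def toy_verify(statement: str) -> float:
    return toy_scores.get(statement, 0.0)


def toy_decompose(statement: str) -> List[str]:
    return toy_splits.get(statement, [])


# Tree construction is done by hand here; one tree per candidate answer.
tree_a = Node("the person is hungry",
              [Node("a person opens a fridge"), Node("the person takes out food")])
tree_b = Node("the person is leaving",
              [Node("a person opens a door"), Node("the person leaves the room")])

result_a = evaluate(tree_a, toy_verify, toy_decompose)
result_b = evaluate(tree_b, toy_verify, toy_decompose)
chosen = pick_answer({"hungry": tree_a, "leaving": tree_b}, toy_verify, toy_decompose)
```

With these toy scores, the uncertain leaf in `tree_a` is split into two sub-statements that both pass, so `tree_a` is accepted, while `tree_b` is rejected because one of its leaves scores low. One side effect of organizing the reasoning this way is that the accepted tree itself serves as a trace of which statements supported the answer.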

Comments:	Accepted by CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.05069 [cs.CV]
	(or arXiv:2501.05069v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05069

Submission history

From: Huabin Liu
[v1] Thu, 9 Jan 2025 08:44:42 UTC (5,193 KB)
[v2] Tue, 25 Mar 2025 03:46:09 UTC (5,635 KB)
