Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

Jahagirdar, Soumya; Mathew, Minesh; Karatzas, Dimosthenis; Jawahar, C. V.


arXiv:2309.01380 (cs)

[Submitted on 4 Sep 2023 (v1), last revised 11 Sep 2023 (this version, v2)]



Abstract: Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.01380 [cs.CV]
	(or arXiv:2309.01380v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.01380

Submission history

From: Soumya Jahagirdar
[v1] Mon, 4 Sep 2023 06:11:00 UTC (4,412 KB)
[v2] Mon, 11 Sep 2023 07:01:24 UTC (4,412 KB)
