TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Gupta, Ayush; Roy, Anirban; Chellappa, Rama; Bastian, Nathaniel D.; Velasquez, Alvaro; Jha, Susmit

arXiv:2506.09445 (cs)

[Submitted on 11 Jun 2025]

Abstract: We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with a start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark, which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
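
The abstract does not say how a predicted start and end time is compared with a reference segment, or how the consistency constraint is computed. As an illustration only, the sketch below shows temporal intersection-over-union (IoU), a common way to measure overlap between two time segments. The function name `temporal_iou` and the (start, end) tuple representation in seconds are assumptions made for this example, not taken from the paper.

```python
def temporal_iou(pred, ref):
    """Intersection-over-union of two (start, end) segments in seconds.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    p_start, p_end = pred
    r_start, r_end = ref
    if p_end < p_start or r_end < r_start:
        raise ValueError("segment end must not precede its start")
    # Length of the overlapping interval, clamped at zero for disjoint segments.
    intersection = max(0.0, min(p_end, r_end) - max(p_start, r_start))
    # Union is the sum of both lengths minus the shared part.
    union = (p_end - p_start) + (r_end - r_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A predicted segment of 2 s to 6 s against a reference of 4 s to 8 s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

A score of 1.0 means the two segments coincide and 0.0 means they do not overlap; any threshold applied to this score would be a further assumption.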

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2506.09445 [cs.CV]
	(or arXiv:2506.09445v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2506.09445

Submission history

From: Anirban Roy
[v1] Wed, 11 Jun 2025 06:52:31 UTC (14,927 KB)
