Video Action Differencing

Burgess, James; Wang, Xiaohan; Zhang, Yuhui; Rau, Anita; Lozano, Alejandro; Dunlap, Lisa; Darrell, Trevor; Yeung-Levy, Serena

Computer Science > Computer Vision and Pattern Recognition

arXiv:2503.07860 (cs)

[Submitted on 10 Mar 2025]



Abstract: How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at this https URL and code at this http URL.
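
The three stages named in the abstract (action difference proposal, keyframe localization, frame differencing) can be pictured as a simple pipeline over a video pair. The sketch below is illustrative only and is not the authors' released code: `run_pipeline`, `DifferenceResult`, the dictionary frame representation, and the three `toy_*` stage functions are all hypothetical names introduced here. In the described method each stage would be backed by a specialized foundation model; here each is a trivial stand-in so the control flow can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a frame is a small dict of measurements.
# A real system would operate on image arrays instead.
Frame = Dict[str, float]


@dataclass
class DifferenceResult:
    description: str         # candidate difference, e.g. "squat is deeper"
    frames: Tuple[int, int]  # chosen keyframe index in video A and video B
    prediction: str          # "a", "b", or "neither"


def run_pipeline(
    action: str,
    video_a: Sequence[Frame],
    video_b: Sequence[Frame],
    propose: Callable[[str], List[str]],
    localize: Callable[[str, Sequence[Frame]], int],
    compare: Callable[[str, Frame, Frame], str],
) -> List[DifferenceResult]:
    """Chain the three stages for one video pair (hypothetical interface)."""
    results = []
    for description in propose(action):            # stage 1: propose differences
        idx_a = localize(description, video_a)     # stage 2: find keyframes
        idx_b = localize(description, video_b)
        prediction = compare(                      # stage 3: compare keyframes
            description, video_a[idx_a], video_b[idx_b]
        )
        results.append(DifferenceResult(description, (idx_a, idx_b), prediction))
    return results


# Toy stand-ins for the three stages (assumptions, not real models).
def toy_propose(action: str) -> List[str]:
    return ["squat is deeper"] if action == "squat" else []


def toy_localize(description: str, video: Sequence[Frame]) -> int:
    # Pick the frame with the greatest depth as the relevant keyframe.
    return max(range(len(video)), key=lambda i: video[i]["depth"])


def toy_compare(description: str, frame_a: Frame, frame_b: Frame) -> str:
    if frame_a["depth"] == frame_b["depth"]:
        return "neither"
    return "a" if frame_a["depth"] > frame_b["depth"] else "b"


video_a = [{"depth": 0.1}, {"depth": 0.9}, {"depth": 0.2}]
video_b = [{"depth": 0.1}, {"depth": 0.3}, {"depth": 0.6}]
results = run_pipeline(
    "squat", video_a, video_b, toy_propose, toy_localize, toy_compare
)
```

Passing the stages in as callables keeps the pipeline independent of any particular model, so each stand-in could be swapped for a language model, a localization model, or a visual question answering model without changing the loop.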

Comments:	ICLR 2025 (International Conference on Learning Representations) Project page: this http URL Benchmark: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2503.07860 [cs.CV]
	(or arXiv:2503.07860v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.07860

Submission history

From: James Burgess
[v1] Mon, 10 Mar 2025 21:18:32 UTC (13,699 KB)
