Temporal Preference Optimization for Long-Form Video Understanding

Li, Rui; Wang, Xiaohan; Zhang, Yuhui; Wang, Zeyu; Yeung-Levy, Serena

arXiv:2501.13919 (cs)

[Submitted on 23 Jan 2025 (v1), last revised 30 Jan 2025 (this version, v2)]

Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks (LongVideoBench, MLVU, and Video-MME) demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: this https URL.
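
Illustrative note: the abstract describes learning from pairs of well-grounded and less accurate responses but does not state the training objective. As a minimal sketch only, assuming a DPO-style (Direct Preference Optimization) pairwise loss, which is one common choice for preference learning and not necessarily the authors' exact formulation, the per-pair loss could look like the following. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for a single preference pair (illustrative assumption).

    Each argument is the summed log-probability of a full response:
    "chosen" is the better-grounded answer, "rejected" the less accurate one,
    scored by the model being trained (policy) and a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Example: the policy favors the chosen response more than the reference does,
# so the loss falls below log(2), the value for an undecided model.
loss = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

Under this assumed objective, the loss shrinks as the trained model assigns relatively more probability to the better-grounded response than the reference model does.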

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2501.13919 [cs.CV]
	(or arXiv:2501.13919v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.13919

Submission history

From: Rui Li
[v1] Thu, 23 Jan 2025 18:58:03 UTC (11,898 KB)
[v2] Thu, 30 Jan 2025 17:35:08 UTC (11,897 KB)
