OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Li, Yifei; Niu, Junbo; Miao, Ziyang; Ge, Chunjiang; Zhou, Yuanhang; He, Qihao; Dong, Xiaoyi; Duan, Haodong; Ding, Shuangrui; Qian, Rui; Zhang, Pan; Zang, Yuhang; Cao, Yuhang; He, Conghui; Wang, Jiaqi

arXiv:2501.05510 (cs)

[Submitted on 9 Jan 2025 (v1), last revised 27 Mar 2025 (this version, v2)]

Abstract: Temporal awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps when benchmarking advanced online video understanding capability. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at this https URL.
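To illustrate the timestamp-dependent setting the abstract describes, the following is a minimal sketch and not the OVO-Bench evaluation pipeline. The `Frame` type and the helpers `visible_frames` and `first_answerable_time` are hypothetical names introduced only for this example, and text labels stand in for visual content. It assumes an online model may only use frames up to the moment a question is posed, and that in the forward active responding scenario the answer is withheld until the needed evidence appears in the stream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    timestamp: float  # seconds from the start of the stream
    label: str        # toy stand-in for the frame's visual content


def visible_frames(stream: List[Frame], query_time: float) -> List[Frame]:
    """Frames an online model could have seen when asked at query_time.

    Backward tracing and real-time understanding both draw only on this
    prefix of the stream; an offline model would see the whole list.
    """
    return [f for f in stream if f.timestamp <= query_time]


def first_answerable_time(
    stream: List[Frame], query_time: float, target_label: str
) -> Optional[float]:
    """Forward active responding (toy version).

    Scan frames arriving at or after query_time and return the timestamp of
    the first one carrying the needed evidence, or None if it never appears.
    Assumes the stream is sorted by timestamp.
    """
    for f in stream:
        if f.timestamp >= query_time and f.label == target_label:
            return f.timestamp
    return None


# Hypothetical four-frame stream used only for illustration.
stream = [
    Frame(0.0, "door closed"),
    Frame(1.0, "door opens"),
    Frame(2.0, "person enters"),
    Frame(3.0, "person sits"),
]

seen = [f.label for f in visible_frames(stream, query_time=1.5)]
wait_until = first_answerable_time(stream, query_time=1.5, target_label="person sits")
```

In this toy stream, a question posed at 1.5 s can draw on the first two frames only, while a question that depends on the person sitting down cannot be answered until the 3.0 s frame arrives.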

Comments:	CVPR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.05510 [cs.CV]
	(or arXiv:2501.05510v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.05510

Submission history

From: Yifei Li
[v1] Thu, 9 Jan 2025 19:00:01 UTC (10,794 KB)
[v2] Thu, 27 Mar 2025 17:40:09 UTC (12,863 KB)
