VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models

Li, Chenglin; Chen, Qianglong; Li, Zhi; Tao, Feng; Zhang, Yin

arXiv:2411.09105 (cs)

[Submitted on 14 Nov 2024 (v1), last revised 1 Jul 2025 (this version, v2)]

Abstract: Recent advancements in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception. Existing benchmarks typically rely on real-world, annotated videos, which lack control over video content and inherent difficulty, limiting their diagnostic power. To bridge this gap, we propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments, designed to evaluate the cognitive abilities of LVLMs. By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty. This approach enables a focused evaluation of video cognitive abilities, independent of prior knowledge from visual scene semantics. The dataset includes 800 videos and 3,280 question-answer pairs, featuring tasks related to abstract concepts, symbolic elements, and multimodal integration, with varying levels of difficulty. Experimental results show that even state-of-the-art (SOTA) models, such as GPT-4o, achieve an average performance of 48.8% on tasks involving abstract concepts. Additionally, performance drops by 15% as task complexity increases, highlighting the challenges LVLMs face in maintaining consistent performance. Through this work, we hope to show the limitations of current LVLMs and offer insights into how they can more effectively emulate human cognitive processes in the future.
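To make the idea of programmatic, difficulty-controlled generation concrete, the following is a minimal toy sketch in Python. It is not the VideoCogQA engine: the names `SHAPES`, `Sample`, and `make_sample`, the 64x64 grid, the counting question, and the rule that difficulty sets the number of on-screen objects are all assumptions made purely for illustration. It shows only how a generator can emit a synthetic clip together with a question whose ground-truth answer is known by construction.

```python
import random
from dataclasses import dataclass

# Hypothetical symbol vocabulary; the real benchmark's elements are not specified here.
SHAPES = ["circle", "square", "triangle"]
GRID = 64  # assumed side length of the toy canvas


@dataclass
class Sample:
    frames: list   # each frame is a list of (shape, x, y) tuples
    question: str
    answer: int    # ground truth, known because we generated the scene


def make_sample(difficulty: int, n_frames: int = 8, seed: int = 0) -> Sample:
    """Build one toy 'video' and a counting question about it.

    Assumption: difficulty only controls how many objects are on screen.
    """
    rng = random.Random(seed)  # seeded so every sample is reproducible
    n_objects = 2 + 2 * difficulty
    objects = [
        {
            "shape": rng.choice(SHAPES),
            "x": rng.randrange(GRID),
            "y": rng.randrange(GRID),
            "dx": rng.choice([-1, 1]),
            "dy": rng.choice([-1, 1]),
        }
        for _ in range(n_objects)
    ]

    frames = []
    for _ in range(n_frames):
        # Record the current scene, then move every object one step (wrapping at edges).
        frames.append([(o["shape"], o["x"], o["y"]) for o in objects])
        for o in objects:
            o["x"] = (o["x"] + o["dx"]) % GRID
            o["y"] = (o["y"] + o["dy"]) % GRID

    target = rng.choice(SHAPES)
    answer = sum(o["shape"] == target for o in objects)
    return Sample(frames, f"How many {target}s appear in the video?", answer)
```

Because the scene is built from explicit parameters, the answer never needs human annotation, and raising `difficulty` changes the task in a known, measurable way. That is the property the abstract contrasts with real-world annotated videos.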

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.09105 [cs.CV]
	(or arXiv:2411.09105v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.09105

Submission history

From: Chenglin Li
[v1] Thu, 14 Nov 2024 00:26:26 UTC (6,084 KB)
[v2] Tue, 1 Jul 2025 03:47:15 UTC (8,648 KB)
