VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Zhou, Shijie; Vilesov, Alexander; He, Xuehai; Wan, Ziyu; Zhang, Shuwang; Nagachandra, Aditya; Chang, Di; Chen, Dongdong; Wang, Xin Eric; Kadambi, Achuta

arXiv:2508.02095 (cs)

[Submitted on 4 Aug 2025 (v1), last revised 6 Aug 2025 (this version, v2)]

Abstract: Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust understanding of the dynamic real world yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open-source and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, and demonstrate their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

Comments:	ICCV 2025, Project Website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2508.02095 [cs.CV]
	(or arXiv:2508.02095v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2508.02095

Submission history

From: Shijie Zhou
[v1] Mon, 4 Aug 2025 06:06:06 UTC (16,500 KB)
[v2] Wed, 6 Aug 2025 19:21:50 UTC (16,500 KB)
