Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Zhang, Haoji; Gu, Xin; Li, Jiawen; Ma, Chixiang; Bai, Sule; Zhang, Chubin; Zhang, Bowen; Zhou, Zhichao; He, Dongliang; Tang, Yansong

arXiv:2508.04416 (cs)

[Submitted on 6 Aug 2025 (v1), last revised 3 Sep 2025 (this version, v2)]

Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks such as video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at this https URL.
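
As a rough illustration of what "densely sample new video frames on demand" could involve, the sketch below computes evenly spaced timestamps inside a requested time window. It is not the paper's toolbox: the function name `dense_sample_timestamps`, its parameters, and the `max_frames` budget are all hypothetical, chosen only to make the idea concrete.

```python
from typing import List


def dense_sample_timestamps(start_s: float, end_s: float, fps: float,
                            max_frames: int = 32) -> List[float]:
    """Return evenly spaced timestamps (in seconds) inside [start_s, end_s].

    Hypothetical helper: the name, the parameters, and the max_frames cap
    are assumptions for illustration and are not taken from the paper.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    if fps <= 0 or max_frames < 1:
        raise ValueError("fps and max_frames must be positive")
    # Frames the requested rate would yield, capped by the frame budget.
    n = min(max_frames, max(1, int((end_s - start_s) * fps)))
    step = (end_s - start_s) / n
    # Sample at the centre of each of the n equal sub-intervals.
    return [start_s + (i + 0.5) * step for i in range(n)]


if __name__ == "__main__":
    # A 4-second window at 2 frames per second gives 8 timestamps.
    print(dense_sample_timestamps(10.0, 14.0, fps=2.0))
```

Under these assumptions, a model that localizes a relevant segment could request a short window at a higher rate than the initial sparse pass, while the cap bounds the number of extra frames added to the context.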

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2508.04416 [cs.CV]
	(or arXiv:2508.04416v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2508.04416

Submission history

From: Haoji Zhang
[v1] Wed, 6 Aug 2025 13:03:21 UTC (9,215 KB)
[v2] Wed, 3 Sep 2025 07:11:03 UTC (9,239 KB)
