Search | arXiv e-print repository

DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025

Authors: Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo

Abstract: In this report, we present the winning solution that achieved the 1st place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at https://github.com/PanasonicConnect/DIVE

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2505.09165

BusOut is NP-complete

Authors: Takehiro Ishibashi, Ryo Yoshinaka, Ayumi Shinohara

Abstract: This study examines the computational complexity of the decision problem modeled on the smartphone game Bus Out. The objective of the game is to load all the passengers in a queue onto appropriate buses using a limited number of bus parking spots by selecting and dispatching the buses on a map. We show that the problem is NP-complete, even for highly restricted instances. We also show that it is hard to approximate the minimum number of parking spots needed to solve a given instance.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2407.03610

VDMA: Video Question Answering with Dynamically Generated Multi-Agents

Authors: Noriyuki Kugo, Tatsuya Ishibashi, Kosuke Ono, Yuji Sato

Abstract: This technical report provides a detailed description of our approach to the EgoSchema Challenge 2024. The EgoSchema Challenge aims to identify the most appropriate responses to questions regarding a given video clip. In this paper, we propose Video Question Answering with Dynamically Generated Multi-Agents (VDMA). This method complements existing response generation systems by employing a multi-agent system with dynamically generated expert agents, with the aim of providing the most accurate and contextually appropriate responses. This report details the stages of our approach, the tools employed, and the results of our experiments.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 4 pages, 2 figures

arXiv:2307.01467

Technical Report for Ego4D Long Term Action Anticipation Challenge 2023

Authors: Tatsuya Ishibashi, Kosuke Ono, Noriyuki Kugo, Yuji Sato

Abstract: In this report, we describe the technical details of our approach for the Ego4D Long-Term Action Anticipation Challenge 2023. The aim of this task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video. To accomplish this task, we introduce three improvements to the baseline model, which consists of an encoder that generates clip-level features from the video, an aggregator that integrates multiple clip-level features, and a decoder that outputs Z future actions. The improvements are: 1) a model ensemble of SlowFast and SlowFast-CLIP; 2) label smoothing to relax order constraints for future actions; 3) constraining the prediction of the action class (verb, noun) based on word co-occurrence. Our method outperformed the baseline and placed second on the public leaderboard.

Submitted 4 July, 2023; originally announced July 2023.

Showing 1–4 of 4 results for author: Ishibashi, T