-
Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval
Authors:
Kevin Flanagan,
Dima Damen,
Michael Wray
Abstract:
Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) scores while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at https://github.com/keflanagan/MomentofUntruth
Submitted 13 February, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
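The NA-VMR abstract above combines two measurements: moment retrieval accuracy (Recall@1) and negative query rejection accuracy. The following is a minimal sketch of how such metrics could be computed, not the authors' evaluation code. It assumes that a model signals rejection by returning `None`, that a retrieval counts as correct when temporal IoU reaches a threshold of 0.5, and that the function names are hypothetical; the abstract specifies none of these details.

```python
from typing import List, Optional, Tuple

# A moment is a (start_seconds, end_seconds) pair.
Span = Tuple[float, float]


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    preds: List[Optional[Span]],
    gts: List[Span],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the abstract
) -> float:
    """Fraction of positive queries whose top prediction overlaps the ground truth.

    A prediction of None (the query was wrongly rejected) counts as a miss.
    """
    hits = sum(
        1
        for p, g in zip(preds, gts)
        if p is not None and temporal_iou(p, g) >= iou_threshold
    )
    return hits / len(gts)


def rejection_accuracy(preds_for_negatives: List[Optional[Span]]) -> float:
    """Fraction of negative queries for which the model returned no moment."""
    rejected = sum(1 for p in preds_for_negatives if p is None)
    return rejected / len(preds_for_negatives)
```

Under these assumptions, a model that always predicts a moment scores zero on `rejection_accuracy`, which is the false-positive behaviour the abstract attributes to the standard task formulation.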
-
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
Authors:
Toby Perrett,
Ahmad Darkhalil,
Saptarshi Sinha,
Omar Emara,
Sam Pollard,
Kranti Parida,
Kaiting Liu,
Prajwal Gatti,
Siddhant Bansal,
Kevin Flanagan,
Jacob Chalk,
Zhifan Zhu,
Rhodri Guerrier,
Fahd Abdelazim,
Bin Zhu,
Davide Moltisanti,
Michael Wray,
Hazel Doughty,
Dima Damen
Abstract:
We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments.
We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC.
HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
Submitted 25 March, 2025; v1 submitted 6 February, 2025;
originally announced February 2025.
-
vailá: Versatile Anarcho Integrated Liberation Ánalysis in Multimodal Toolbox
Authors:
Paulo Roberto Pereira Santiago,
Abel Gonçalves Chinaglia,
Kira Flanagan,
Bruno L. S. Bedo,
Ligia Yumi Mochida,
Juan Aceros,
Aline Bononi,
Guilherme Manna Cesar
Abstract:
Human movement analysis is crucial in health and sports biomechanics for understanding physical performance, guiding rehabilitation, and preventing injuries. However, existing tools are often proprietary, expensive, and function as "black boxes", limiting user control and customization. This paper introduces vailá (Versatile Anarcho Integrated Liberation Ánalysis in Multimodal Toolbox), an open-source, Python-based platform designed to enhance human movement analysis by integrating data from multiple biomechanical systems. vailá supports data from diverse sources, including retroreflective motion capture systems, inertial measurement units (IMUs), markerless video capture technology, electromyography (EMG), force plates, and GPS or GNSS systems, enabling comprehensive analysis of movement patterns. Developed entirely in Python 3.11.9, which offers improved efficiency and long-term support, and featuring a straightforward installation process, vailá is accessible to users without extensive programming experience. In this paper, we also present several workflow examples that demonstrate how vailá allows the rapid processing of large batches of data, independent of the type of collection method. This flexibility is especially valuable in research scenarios where unexpected data collection challenges arise, ensuring no valuable data point is lost. We demonstrate the application of vailá in analyzing sit-to-stand movements in pediatric disability, showcasing its capability to provide deeper insights even with unexpected movement patterns. By fostering a collaborative and open environment, vailá encourages users to innovate, customize, and freely explore their analysis needs, potentially contributing to the advancement of rehabilitation strategies and performance optimization.
Submitted 7 October, 2024;
originally announced October 2024.
-
Video Editing for Video Retrieval
Authors:
Bin Zhu,
Kevin Flanagan,
Adriano Fragomeni,
Michael Wray,
Dima Damen
Abstract:
Though pre-training vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role with manually annotated clips with start and end times, which requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips. The teacher weights are updated from the student's after the student's performance increases. Our method is model agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models: COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over initial clips across all three retrieval models.
Submitted 7 September, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
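The abstract above states that the teacher weights are updated from the student's after the student's performance increases. One way to read that rule is sketched below. This is an illustration under stated assumptions, not the authors' implementation: weights are modelled as plain dictionaries, the update is a full copy (the abstract does not say whether it is a copy or a moving average), and `maybe_update_teacher` is a hypothetical name.

```python
import copy
from typing import Dict


def maybe_update_teacher(
    teacher: Dict[str, float],
    student: Dict[str, float],
    score: float,
    best_score: float,
) -> float:
    """Copy student weights into the teacher only when the student improves.

    Returns the best validation score seen so far.
    Assumption: "updated from the student's" is interpreted as a full copy.
    """
    if score > best_score:
        teacher.clear()
        teacher.update(copy.deepcopy(student))
        return score
    return best_score


# Usage: the teacher changes only after an improving evaluation.
teacher = {"w": 0.0}
student = {"w": 1.0}
best = maybe_update_teacher(teacher, student, score=0.6, best_score=0.5)
student["w"] = 2.0  # the student keeps training on the edited clips
best = maybe_update_teacher(teacher, student, score=0.55, best_score=best)
```

With this gating, the teacher that edits the training clips always corresponds to the best student seen so far, so a temporary drop in student performance does not degrade the edited clips.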
-
Learning Temporal Sentence Grounding From Narrated EgoVideos
Authors:
Kevin Flanagan,
Dima Damen,
Michael Wray
Abstract:
The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional benchmarks on which this task is evaluated, these datasets offer finer-grained sentences to ground in notably longer videos. In this paper, we develop an approach for learning to ground sentences in these datasets using only narrations and their corresponding rough narration timestamps. We propose to artificially merge clips to train for temporal grounding in a contrastive manner using text-conditioning attention. This Clip Merging (CliMer) approach is shown to be effective when compared with a high performing TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and from 10.7 to 13.0 on EPIC-Kitchens. Code and data splits available from: https://github.com/keflanagan/CliMer
Submitted 26 October, 2023;
originally announced October 2023.