
Showing 1–13 of 13 results for author: Alcázar, J L

  1. arXiv:2506.01850  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.MM

    MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

    Authors: Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

    Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed t…

    Submitted 2 June, 2025; originally announced June 2025.

  2. arXiv:2502.20361  [pdf, other]

    cs.CV

    OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

    Authors: Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, Alejandro Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan León Alcázar, Anthony Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem

    Abstract: Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementat…

    Submitted 27 February, 2025; originally announced February 2025.

  3. arXiv:2501.02699  [pdf, other]

    cs.CV cs.AI

    EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

    Authors: Andrés Villa, Juan León Alcázar, Motasem Alfarra, Vladimir Araujo, Alvaro Soto, Bernard Ghanem

    Abstract: Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate fr…

    Submitted 5 January, 2025; originally announced January 2025.

    Comments: 12 pages, 4 figures, 8 tables

  4. arXiv:2305.18418  [pdf, other]

    cs.CV cs.AI cs.LG

    Just a Glimpse: Rethinking Temporal Information for Video Continual Learning

    Authors: Lama Alssum, Juan Leon Alcazar, Merey Ramazanova, Chen Zhao, Bernard Ghanem

    Abstract: Class-incremental learning is one of the most important settings for the study of Continual Learning, as it closely resembles real-world application scenarios. With constrained memory sizes, catastrophic forgetting arises as the number of classes/tasks increases. Studying continual learning in the video domain poses even more challenges, as video data contains a large number of frames, which place…

    Submitted 28 June, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted at CLVision Workshop - CVPR23 (Best Paper Award)

  5. arXiv:2212.04842  [pdf, other]

    cs.CV cs.AI

    PIVOT: Prompting for Video Continual Learning

    Authors: Andrés Villa, Juan León Alcázar, Motasem Alfarra, Kumail Alhamoud, Julio Hurtado, Fabian Caba Heilbron, Alvaro Soto, Bernard Ghanem

    Abstract: Modern machine learning pipelines are limited due to data availability, storage quotas, privacy regulations, and expensive annotation processes. These constraints make it difficult or impossible to train and update large-scale models on such dynamic annotated sets. Continual learning directly approaches this problem, with the ultimate goal of devising methods where a deep neural network effectivel…

    Submitted 4 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: CVPR 2023

  6. arXiv:2203.14250  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    End-to-End Active Speaker Detection

    Authors: Juan Leon Alcazar, Moritz Cordes, Chen Zhao, Bernard Ghanem

    Abstract: Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This…

    Submitted 25 July, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

  7. arXiv:2201.09381  [pdf, other]

    cs.CV

    vCLIMB: A Novel Video Class Incremental Learning Benchmark

    Authors: Andrés Villa, Kumail Alhamoud, Juan León Alcázar, Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem

    Abstract: Continual learning (CL) is under-explored in the video domain. The few existing works contain splits with imbalanced class distributions over the tasks, or study the problem in unsuitable datasets. We introduce vCLIMB, a novel video continual learning benchmark. vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning. In contrast to previous…

    Submitted 6 April, 2022; v1 submitted 23 January, 2022; originally announced January 2022.

    Comments: An updated version of our CVPR 2022 paper (oral); v2 adds minor text changes. The code of our benchmark can be found at: https://vclimb.netlify.app/

  8. arXiv:2112.00431  [pdf, other]

    cs.CV cs.AI

    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

    Authors: Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem

    Abstract: The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-t…

    Submitted 28 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: 12 Pages, 6 Figures, 7 Tables

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2022

  9. arXiv:2109.05569  [pdf, other]

    cs.CV

    MovieCuts: A New Dataset and Benchmark for Cut Type Recognition

    Authors: Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, Bernard Ghanem

    Abstract: Understanding movies and their structural patterns is a crucial task in decoding the craft of video editing. While previous works have developed tools for general analysis, such as detecting characters or recognizing cinematography properties at the shot level, less effort has been devoted to understanding the most basic video edit, the Cut. This paper introduces the Cut type recognition task, whi…

    Submitted 24 October, 2022; v1 submitted 12 September, 2021; originally announced September 2021.

    Comments: Paper's website: https://www.alejandropardo.net/publication/moviecuts/

    Journal ref: ECCV 2022

  10. arXiv:2108.04294  [pdf, other]

    cs.CV cs.MM

    Learning to Cut by Watching Movies

    Authors: Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, Bernard Ghanem

    Abstract: Video content creation keeps growing at an incredible pace; yet, creating engaging stories remains challenging and requires non-trivial video editing expertise. Many video editing components are astonishingly hard to automate primarily due to the lack of raw video materials. This paper focuses on a new task for computational video editing, namely the task of ranking cut plausibility. Our key idea i…

    Submitted 29 September, 2021; v1 submitted 9 August, 2021; originally announced August 2021.

    Comments: Accepted at ICCV2021. Paper website: https://alejandropardo.net/publication/learning-to-cut/

  11. arXiv:2106.01667  [pdf, other]

    cs.CV

    APES: Audiovisual Person Search in Untrimmed Video

    Authors: Juan Leon Alcazar, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem, Fabian Caba Heilbron

    Abstract: Humans are arguably one of the most important subjects in video streams; many real-world applications, such as video summarization or video editing workflows, often require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person reidentification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Au…

    Submitted 3 June, 2021; originally announced June 2021.

  12. arXiv:2005.09812  [pdf, other]

    cs.CV cs.SD eess.AS

    Active Speakers in Context

    Authors: Juan Leon Alcazar, Fabian Caba Heilbron, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem

    Abstract: Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify who of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationshi…

    Submitted 19 May, 2020; originally announced May 2020.

  13. arXiv:1904.05847  [pdf, other]

    cs.CV

    MAIN: Multi-Attention Instance Network for Video Segmentation

    Authors: Juan Leon Alcazar, Maria A. Bravo, Ali K. Thabet, Guillaume Jeanneret, Thomas Brox, Pablo Arbelaez, Bernard Ghanem

    Abstract: Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Netwo…

    Submitted 11 April, 2019; originally announced April 2019.