
Showing 1–25 of 25 results for author: Rota, P

  1. arXiv:2506.01667  [pdf, other]

    cs.CV

    EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models

    Authors: Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota

    Abstract: Large Multimodal Models (LMMs) have demonstrated strong performance in various vision-language tasks. However, they often struggle to comprehensively understand Earth Observation (EO) data, which is critical for monitoring the environment and the effects of human activity on it. In this work, we present EarthMind, a novel vision-language framework for multi-granular and multi-sensor EO data unders…

    Submitted 2 June, 2025; originally announced June 2025.

  2. arXiv:2504.21682  [pdf, ps, other]

    cs.CV

    Visual Text Processing: A Comprehensive Review and Unified Evaluation

    Authors: Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe

    Abstract: Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manip…

    Submitted 5 June, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  3. arXiv:2503.21851  [pdf, other]

    cs.CV

    On Large Multimodal Models as Open-World Image Classifiers

    Authors: Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci

    Abstract: Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited i…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: 23 pages, 13 figures, code is available at https://github.com/altndrr/lmms-owc

  4. arXiv:2503.15686  [pdf, other]

    cs.CV

    Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

    Authors: Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe

    Abstract: The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propo…

    Submitted 23 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

    Comments: CVPR 2025 Accepted

  5. arXiv:2408.16412  [pdf, ps, other]

    cs.CV

    Text-Enhanced Zero-Shot Action Recognition: A training-free approach

    Authors: Massimo Bosetti, Shibingfeng Zhang, Benedetta Liberatori, Giacomo Zara, Elisa Ricci, Paolo Rota

    Abstract: Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZSVAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require ext…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: accepted to ICPR 2024

  6. arXiv:2406.12321  [pdf, other]

    cs.AI cs.CL cs.CV

    Automatic benchmarking of large multimodal models via iterative experiment programming

    Authors: Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci

    Abstract: Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Gi…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 31 pages, 6 figures, code is available at https://github.com/altndrr/apex

  7. arXiv:2405.20675  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling

    Authors: Kidist Amde Mekonnen, Nicola Dall'Asen, Paolo Rota

    Abstract: Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models, achieving remarkable performance in image synthesis tasks. However, these models face challenges in terms of widespread adoption due to their reliance on sequential denoising steps during sample generation. This dependence leads to substantial computational requirements, making them unsuitable for res…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki, Finland

  8. arXiv:2404.10864  [pdf, other]

    cs.CV

    Vocabulary-free Image Classification and Semantic Segmentation

    Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

    Abstract: Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC)…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Under review, 22 pages, 10 figures, code is available at https://github.com/altndrr/vicss. arXiv admin note: text overlap with arXiv:2306.00917

  9. arXiv:2404.07560  [pdf, other]

    cs.RO cs.AI

    Socially Pertinent Robots in Gerontological Healthcare

    Authors: Xavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García, Chris Reinke, Soraya Arias, Federica Arrigoni, Alex Auternaud, Lauriane Blavette, Cigdem Beyan, Luis Gomez Camara, Ohad Cohen, Alessandro Conti, Sébastien Dacunha, Christian Dondrup, Yoav Ellinson, Francesco Ferro, Sharon Gannot, Florian Gras, Nancie Gunson, Radu Horaud, Moreno D'Incà, Imad Kimouche, Séverin Lemaignan, Oliver Lemon, Cyril Liotard, et al. (19 additional authors not shown)

    Abstract: Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilitie…

    Submitted 11 February, 2025; v1 submitted 11 April, 2024; originally announced April 2024.

  10. arXiv:2404.05426  [pdf, other]

    cs.CV

    Test-Time Zero-Shot Temporal Action Localization

    Authors: Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

    Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore…

    Submitted 11 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  11. arXiv:2308.09139  [pdf, other]

    cs.CV

    The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

    Authors: Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci

    Abstract: Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this wo…

    Submitted 22 August, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV2023, 14 pages, 7 figures, code is available at https://github.com/giaczara/dallv

  12. arXiv:2306.00917  [pdf, other]

    cs.CV

    Vocabulary-free Image Classification

    Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

    Abstract: Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, term…

    Submitted 12 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted at NeurIPS2023, 19 pages, 8 figures, code is available at https://github.com/altndrr/vic

  13. arXiv:2305.05268  [pdf, other]

    cs.CV cs.AI

    Rotation Synchronization via Deep Matrix Factorization

    Authors: Gk Tejus, Giacomo Zara, Paolo Rota, Andrea Fusiello, Elisa Ricci, Federica Arrigoni

    Abstract: In this paper we address the rotation synchronization problem, where the objective is to recover absolute rotations starting from pairwise ones, where the unknowns and the measures are represented as nodes and edges of a graph, respectively. This problem is an essential task for structure from motion and simultaneous localization and mapping. We focus on the formulation of synchronization via neur…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: To be published in ICRA 2023

  14. arXiv:2304.01110  [pdf, other]

    cs.CV

    AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation

    Authors: Giacomo Zara, Subhankar Roy, Paolo Rota, Elisa Ricci

    Abstract: Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent in the source. In this work we deviate from the prior work of training a specialized open-set classifier or weighted adversarial learning b…

    Submitted 4 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPR 2023

  15. arXiv:2301.03322  [pdf, other]

    cs.CV

    Simplifying Open-Set Video Domain Adaptation with Contrastive Learning

    Authors: Giacomo Zara, Victor Guilherme Turrisi da Costa, Subhankar Roy, Paolo Rota, Elisa Ricci

    Abstract: In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: Currently under review at Computer Vision and Image Understanding (CVIU) journal

  16. arXiv:2211.06742  [pdf, other]

    cs.CV cs.AI

    Deep Unsupervised Key Frame Extraction for Efficient Video Classification

    Authors: Hao Tang, Lei Ding, Songsong Wu, Bin Ren, Nicu Sebe, Paolo Rota

    Abstract: Video processing and analysis have become an urgent task since a huge amount of videos (e.g., Youtube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open p…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: Accepted to TOMM

  17. arXiv:2210.05246  [pdf, other]

    cs.CV cs.AI

    Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

    Authors: Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

    Abstract: Automatically understanding emotions from visual data is a fundamental task for human behaviour understanding. While models devised for Facial Expression Recognition (FER) have demonstrated excellent performances on many datasets, they often suffer from severe performance degradation when trained and tested on different datasets due to domain shift. In addition, as face images are considered highl…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted at BMVC2022, 13 pages, 4 figures, code is available at https://github.com/altndrr/clup

  18. arXiv:2207.12842  [pdf, other]

    cs.CV

    Unsupervised Domain Adaptation for Video Transformers in Action Recognition

    Authors: Victor G. Turrisi da Costa, Giacomo Zara, Paolo Rota, Thiago Oliveira-Santos, Nicu Sebe, Vittorio Murino, Elisa Ricci

    Abstract: Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily affected by domain shift. In this paper, we propose…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: Accepted at ICPR 2022

  19. Uncertainty-aware Contrastive Distillation for Incremental Semantic Segmentation

    Authors: Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Moin Nabi, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: A fundamental and challenging problem in deep learning is catastrophic forgetting, i.e. the tendency of neural networks to fail to preserve the knowledge acquired from old tasks when learning new tasks. This problem has been widely investigated in the research community and several Incremental Learning (IL) approaches have been proposed in the past years. While earlier works in computer vision hav…

    Submitted 20 May, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

    Comments: TPAMI

  20. arXiv:2202.00432  [pdf, other]

    cs.CV

    Continual Attentive Fusion for Incremental Learning in Semantic Segmentation

    Authors: Guanglei Yang, Enrico Fini, Dan Xu, Paolo Rota, Mingli Ding, Hao Tang, Xavier Alameda-Pineda, Elisa Ricci

    Abstract: Over the past years, semantic segmentation, as many other tasks in computer vision, benefited from the progress in deep neural networks, resulting in significantly improved performance. However, deep architectures trained with gradient-based techniques suffer from catastrophic forgetting, which is the tendency to forget previously learned knowledge while learning new tasks. Aiming at devising stra…

    Submitted 1 February, 2022; originally announced February 2022.

  21. arXiv:2103.03510  [pdf, other]

    cs.CV

    Variational Structured Attention Networks for Deep Visual Representation Learning

    Authors: Guanglei Yang, Paolo Rota, Xavier Alameda-Pineda, Dan Xu, Mingli Ding, Elisa Ricci

    Abstract: Convolutional neural networks have enabled major progresses in addressing pixel-level prediction tasks such as semantic segmentation, depth estimation, surface normal prediction and so on, benefiting from their powerful capabilities in visual representation learning. Typically, state of the art models integrate attention mechanisms for improved deep feature representations. Recently, some works ha…

    Submitted 15 December, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: Accepted at IEEE Transactions on Image Processing (TIP)

  22. arXiv:2101.10382  [pdf, other]

    cs.LG cs.CL cs.CV

    Curriculum Learning: A Survey

    Authors: Petru Soviany, Radu Tudor Ionescu, Paolo Rota, Nicu Sebe

    Abstract: Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs. Curriculum learning strategies have been successfully employed in all areas of machine learning, in a wide range of tasks. Howeve…

    Submitted 11 April, 2022; v1 submitted 25 January, 2021; originally announced January 2021.

    Comments: Accepted at the International Journal of Computer Vision

  23. arXiv:2001.00238  [pdf, other]

    cs.CV

    Low-Budget Label Query through Domain Alignment Enforcement

    Authors: Jurandy Almeida, Cristiano Saltori, Paolo Rota, Nicu Sebe

    Abstract: Deep learning revolution happened thanks to the availability of a massive amount of labelled data which have contributed to the development of models with extraordinary inference capabilities. Despite the public availability of a large quantity of datasets, to address specific requirements it is often necessary to generate a new set of labelled data. Quite often, the production of labels is costly…

    Submitted 29 March, 2020; v1 submitted 1 January, 2020; originally announced January 2020.

  24. arXiv:1911.06849  [pdf, other]

    cs.CV cs.LG

    Curriculum Self-Paced Learning for Cross-Domain Object Detection

    Authors: Petru Soviany, Radu Tudor Ionescu, Paolo Rota, Nicu Sebe

    Abstract: Training (source) domain bias affects state-of-the-art object detectors, such as Faster R-CNN, when applied to new (target) domains. To alleviate this problem, researchers proposed various domain adaptation methods to improve object detection results in the cross-domain setting, e.g. by translating images with ground-truth labels from the source domain to the target domain using Cycle-GAN. On top…

    Submitted 20 January, 2021; v1 submitted 15 November, 2019; originally announced November 2019.

    Comments: Accepted for publication in Computer Vision and Image Understanding

  25. arXiv:1710.00568  [pdf, other]

    cs.CV

    Indirect Match Highlights Detection with Deep Convolutional Neural Networks

    Authors: Marco Godi, Paolo Rota, Francesco Setti

    Abstract: Highlights in a sport video are usually referred as actions that stimulate excitement or attract attention of the audience. A big effort is spent in designing techniques which find automatically highlights, in order to automatize the otherwise manual editing process. Most of the state-of-the-art approaches try to solve the problem by training a classifier using the information extracted on the tv-…

    Submitted 2 October, 2017; originally announced October 2017.

    Comments: "Social Signal Processing and Beyond" workshop, in conjunction with ICIAP 2017