Showing 1–50 of 60 results for author: Cornia, M

  1. arXiv:2505.21062  [pdf, ps, other]

    cs.CV

    Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

    Authors: Davide Lobba, Fulvio Sanguigni, Bin Ren, Marcella Cornia, Rita Cucchiara, Nicu Sebe

    Abstract: While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-define…

    Submitted 27 May, 2025; originally announced May 2025.

  2. arXiv:2505.20405  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.MM

    What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

    Authors: Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara

    Abstract: Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the…

    Submitted 26 May, 2025; originally announced May 2025.

  3. arXiv:2505.15323  [pdf, other]

    cs.CL

    Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

    Authors: Silvia Cappelletti, Tobia Poppi, Samuele Poppi, Zheng-Xin Yong, Diego Garcia-Olano, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 13 pages, 5 figures, 7 tables

  4. arXiv:2504.14011  [pdf, other]

    cs.CV cs.AI cs.MM

    Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation

    Authors: Fulvio Sanguigni, Davide Morelli, Marcella Cornia, Rita Cucchiara

    Abstract: In recent years, the fashion industry has increasingly adopted AI technologies to enhance customer experience, driven by the proliferation of e-commerce platforms and virtual applications. Among the various tasks, virtual try-on and multimodal fashion image editing -- which utilizes diverse input modalities such as text, garment sketches, and body poses -- have become a key area of research. Diffu…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: IJCNN 2025

  5. arXiv:2503.15621  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

    Authors: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation…

    Submitted 19 March, 2025; originally announced March 2025.

  6. arXiv:2503.14604  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

    Authors: Sara Sarto, Marcella Cornia, Rita Cucchiara

    Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations…

    Submitted 30 May, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

    Comments: IJCAI 2025. GitHub repo: https://github.com/aimagelab/awesome-captioning-evaluation

  7. arXiv:2503.01980  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

    Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: CVPR 2025

  8. arXiv:2412.03665  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

    Authors: Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimo…

    Submitted 4 December, 2024; originally announced December 2024.

    Comments: ECCV 2024 Workshop on Green Foundation Models

  9. arXiv:2411.19331  [pdf, other]

    cs.CV cs.AI cs.CL

    Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

    Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

    Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-…

    Submitted 28 November, 2024; originally announced November 2024.

  10. arXiv:2411.16863  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

    Authors: Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce…

    Submitted 2 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: CVPR 2025

  11. arXiv:2410.23409  [pdf, other]

    cs.CV cs.AI

    TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

    Authors: Alessandro D'Amelio, Giuseppe Cartella, Vittorio Cuculo, Manuele Lucchi, Marcella Cornia, Rita Cucchiara, Giuseppe Boccignone

    Abstract: Attention guides our gaze to fixate the proper location of the scene and holds it in that location for the deserved amount of time given current processing demands, before shifting to the next one. As such, gaze deployment crucially is a temporal process. Existing computational models have made significant strides in predicting spatial aspects of observer's visual scanpaths (where to look), while…

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: Accepted at WACV 2025

  12. arXiv:2410.18195  [pdf, other]

    cs.CV cs.RO

    Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

    Authors: Luca Barsellotti, Roberto Bigazzi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the enviro…

    Submitted 19 February, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://aimagelab.github.io/pin/

  13. arXiv:2410.07336  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

    Authors: Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed p…

    Submitted 9 October, 2024; originally announced October 2024.

  14. arXiv:2408.16827  [pdf, other]

    cs.CV

    Fluent and Accurate Image Captioning with a Self-Trained Reward Model

    Authors: Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: ICPR 2024

  15. arXiv:2408.14547  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

    Authors: Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genu…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: BMVC 2024

  16. arXiv:2407.20341  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper,…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  17. arXiv:2407.20337  [pdf, other]

    cs.CV cs.AI cs.MM

    Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

    Authors: Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

    Abstract: Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedd…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  18. arXiv:2405.13127  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Towards Retrieval-Augmented Architectures for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

    Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This wor…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  19. arXiv:2404.15406  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

    Authors: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at int…

    Submitted 22 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models

  20. arXiv:2404.06542  [pdf, other]

    cs.CV

    Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

    Authors: Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets in…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. Project page: https://aimagelab.github.io/freeda/

  21. arXiv:2403.14828  [pdf, other]

    cs.CV

    Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try…

    Submitted 25 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  22. arXiv:2403.08933  [pdf, other]

    cs.CV cs.AI

    Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

    Authors: Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara

    Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively wo…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to IEEE Signal Processing Letters 2024

  23. arXiv:2402.18673  [pdf, other]

    cs.CV cs.AI

    Trends, Applications, and Challenges in Human Attention Modelling

    Authors: Giuseppe Cartella, Marcella Cornia, Vittorio Cuculo, Alessandro D'Amelio, Dario Zanca, Giuseppe Boccignone, Rita Cucchiara

    Abstract: Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned…

    Submitted 22 April, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted at IJCAI 2024 Survey Track

  24. arXiv:2402.12451  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    The Revolution of Multimodal Large Language Models: A Survey

    Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-foll…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (Findings)

  25. arXiv:2311.16254  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

    Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of…

    Submitted 23 July, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: ECCV 2024

  26. arXiv:2309.05551  [pdf, other]

    cs.CV

    OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

    Authors: Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In th…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: International Conference on Image Analysis and Processing (ICIAP) 2023

  27. arXiv:2308.12383  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

    Authors: Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information whi…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  28. arXiv:2306.07346  [pdf, other]

    cs.CV cs.AI cs.MM

    Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

    Authors: Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

    Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data duri…

    Submitted 22 January, 2025; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: Computer Vision and Image Understanding (2025)

  29. arXiv:2305.13501  [pdf, other]

    cs.CV cs.AI cs.MM

    LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

    Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a give…

    Submitted 3 August, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: ACM Multimedia 2023

  30. arXiv:2304.02051  [pdf, other]

    cs.CV cs.AI cs.MM

    Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-co…

    Submitted 23 August, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: ICCV 2023

  31. arXiv:2304.02049  [pdf, other]

    cs.CV cs.AI cs.LG

    Multi-Class Unlearning for Image Classification via Weight Filtering

    Authors: Samuele Poppi, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any clas…

    Submitted 8 June, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: IEEE Intelligent Systems (2024)

  32. arXiv:2304.00500  [pdf, other]

    cs.CV cs.AI cs.MM

    Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

    Authors: Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

    Abstract: Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study on deepfake detection generated by s…

    Submitted 21 May, 2024; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  33. arXiv:2303.12112  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

    Authors: Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contr…

    Submitted 20 July, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: CVPR 2023 (highlight paper)

  34. arXiv:2301.07150  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    Embodied Agents for Efficient Exploration and Smart Scene Description

    Authors: Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment w…

    Submitted 17 January, 2023; originally announced January 2023.

    Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2023)

  35. arXiv:2208.08109  [pdf, other]

    cs.CV

    Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions

    Authors: Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typi…

    Submitted 17 August, 2022; originally announced August 2022.

    Journal ref: International Journal on Document Analysis and Recognition (IJDAR), 2022, 1-11

  36. arXiv:2208.07682  [pdf, other]

    cs.CV cs.DL

    The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

    Authors: Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting -- even of the same author over a wide time-span -- and the scarcity of data from ancient, poorly represented languages. With…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: Accepted at ICPR 2022

  37. arXiv:2207.14757  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

    Authors: Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara

    Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-ver…

    Submitted 29 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  38. arXiv:2207.13162  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Retrieval-Augmented Transformer for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, wi…

    Submitted 22 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  39. Embodied Navigation at the Art Gallery

    Authors: Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: the one of a complete art museum. We name this environ…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted by 21st International Conference on Image Analysis and Processing (ICIAP 2021)

  40. arXiv:2204.08532  [pdf, other]

    cs.CV cs.AI cs.GR cs.MM

    Dress Code: High-Resolution Multi-Category Virtual Try-On

    Authors: Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, Rita Cucchiara

    Abstract: Image-based virtual try-on strives to transfer the appearance of a clothing item onto the image of a target person. Prior work focuses mainly on upper-body clothes (e.g. t-shirts, shirts, and tops) and neglects full-body or lower-body items. This shortcoming arises from a main factor: current publicly available datasets for image-based virtual try-on do not account for this variety, thus limiting…

    Submitted 13 July, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

    Comments: ECCV 2022 - Video Demo: https://www.youtube.com/watch?v=qr6TW3uTHG4

  41. Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

    Authors: Federico Landi, Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field demand the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: Accepted by 26th International Conference on Pattern Recognition (ICPR 2022)

  42. arXiv:2202.10492  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    CaMEL: Mean Teacher Learning for Image Captioning

    Authors: Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay betw…

    Submitted 21 February, 2022; originally announced February 2022.

  43. arXiv:2111.12727  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

    Authors: Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara

    Abstract: This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To…

    Submitted 30 November, 2023; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted to IJCV

  44. arXiv:2109.08521  [pdf, other]

    cs.RO cs.AI cs.CV

    Focus on Impact: Indoor Exploration with Intrinsic Motivation

    Authors: Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Exploration of indoor environments has recently experienced a significant interest, also thanks to the introduction of deep neural agents built in a hierarchical fashion and trained with Deep Reinforcement Learning (DRL) on simulated environments. Current state-of-the-art methods employ a dense extrinsic reward that requires the complete a priori knowledge of the layout of the training environment…

    Submitted 4 February, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

    Comments: Published in IEEE Robotics and Automation Letters. To appear in ICRA 2022

    Journal ref: IEEE Robotics and Automation Letters (Volume: 7, Issue: 2, April 2022)

  45. arXiv:2109.00020  [pdf, other]

    cs.LG cs.CL cs.CV cs.NE

    Working Memory Connections for LSTM

    Authors: Federico Landi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the standard de facto for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influ…

    Submitted 31 August, 2021; originally announced September 2021.

    Comments: Accepted for publication in Neural Networks

  46. arXiv:2107.06912  [pdf, other]

    cs.CV cs.CL

    From Show to Tell: A Survey on Deep Learning-based Image Captioning

    Authors: Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara

    Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these y…

    Submitted 30 November, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

  47. Learning to Select: A Fully Attentive Approach for Novel Object Captioning

    Authors: Marco Cagrandi, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: ICMR 2021

  48. Out of the Box: Embodied Navigation in the Real World

    Authors: Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The research field of Embodied AI has witnessed substantial progress in visual navigation and exploration thanks to powerful simulating platforms and the availability of 3D data of indoor and photorealistic environments. These two factors have opened the doors to a new generation of intelligent agents capable of achieving nearly perfect PointGoal Navigation. However, such architectures are commonl…

    Submitted 12 May, 2021; originally announced May 2021.

  49. arXiv:2104.10252  [pdf, other]

    cs.CV

    Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis

    Authors: Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: As the request for deep learning solutions increases, the need for explainability is even more fundamental. In this setting, particular attention has been given to visualization techniques, that try to attribute the right relevance to each input pixel with respect to the output of the network. In this paper, we focus on Class Activation Mapping (CAM) approaches, which provide an effective visualiz…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: CVPR 2021 Workshop on Responsible Computer Vision

  50. arXiv:2007.07268  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    Explore and Explain: Self-supervised Navigation and Recounting

    Authors: Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moment…

    Submitted 14 July, 2020; originally announced July 2020.

    Comments: ICPR 2020