Showing 1–12 of 12 results for author: Baldrati, A

  1. arXiv:2502.04263

    cs.CV cs.AI cs.LG

    Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

    Authors: Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov

    Abstract: Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal co…

    Submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted for publication at ICLR 2025

  2. arXiv:2407.03056

    cs.CV cs.AI cs.LG

    Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

    Authors: Marco Mistretta, Alberto Baldrati, Marco Bertini, Andrew D. Bagdanov

    Abstract: Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt lear…

    Submitted 30 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted for publication at ECCV24

  3. arXiv:2405.02951

    cs.CV cs.IR

    iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

    Authors: Lorenzo Agnolucci, Alberto Baldrati, Marco Bertini, Alberto Del Bimbo

    Abstract: Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: Extended version of the ICCV2023 paper arXiv:2303.15247

  4. arXiv:2403.14828

    cs.CV

    Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try…

    Submitted 25 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  5. arXiv:2310.08368

    cs.CV

    Mapping Memes to Words for Multimodal Hateful Meme Classification

    Authors: Giovanni Burbi, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo

    Abstract: Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwin…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: ICCV2023 CLVL Workshop

  6. Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

    Authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo

    Abstract: Given the recent advances in multimodal image pretraining where visual models trained with semantically dense textual supervision tend to have better generalization capabilities than those trained using categorical attributes or through unsupervised techniques, in this work we investigate how recent CLIP model can be applied in several tasks in artwork domain. We perform exhaustive experiments on…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Proc. of Florence Heri-Tech 2022: The Future of Heritage Science and Technologies: ICT and Digital Heritage, 2022

  7. arXiv:2309.05551

    cs.CV

    OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

    Authors: Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In th…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: International Conference on Image Analysis and Processing (ICIAP) 2023

  8. arXiv:2308.11485

    cs.CV

    Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

    Authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo

    Abstract: Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP mo…

    Submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted in ACM Transactions on Multimedia Computing Communications and Applications (TOMM)

  9. arXiv:2307.14063

    cs.CV

    ECO: Ensembling Context Optimization for Vision-Language Models

    Authors: Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo

    Abstract: Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning text…

    Submitted 26 July, 2023; originally announced July 2023.

  10. arXiv:2305.13501

    cs.CV cs.AI cs.MM

    LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

    Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a give…

    Submitted 3 August, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: ACM Multimedia 2023

  11. arXiv:2304.02051

    cs.CV cs.AI cs.MM

    Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-co…

    Submitted 23 August, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: ICCV 2023

  12. arXiv:2303.15247

    cs.CV cs.CL cs.IR

    Zero-Shot Composed Image Retrieval with Textual Inversion

    Authors: Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo

    Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), th…

    Submitted 19 August, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: ICCV2023