Showing 1–7 of 7 results for author: Cocchi, F

Searching in archive cs.
  1. arXiv:2506.03988  [pdf, ps, other]

    cs.CV cs.LG

    RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors

    Authors: Hicham Eddoubi, Jonas Ricker, Federico Cocchi, Lorenzo Baraldi, Angelo Sotgiu, Maura Pintor, Marcella Cornia, Lorenzo Baraldi, Asja Fischer, Rita Cucchiara, Battista Biggio

    Abstract: AI-generated images have reached a quality level at which humans are incapable of reliably distinguishing them from real images. To counteract the inherent risk of fraud and disinformation, the detection of AI-generated images is a pressing challenge and an active research topic. While many of the presented methods claim to achieve high detection accuracy, they are usually evaluated under idealize…

    Submitted 9 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

  2. arXiv:2503.15621  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

    Authors: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation…

    Submitted 19 March, 2025; originally announced March 2025.

  3. arXiv:2411.16863  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

    Authors: Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce…

    Submitted 2 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: CVPR 2025

  4. arXiv:2407.20337  [pdf, other]

    cs.CV cs.AI cs.MM

    Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

    Authors: Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

    Abstract: Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedd…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  5. arXiv:2404.15406  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

    Authors: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at int…

    Submitted 22 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models

  6. arXiv:2402.12451  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    The Revolution of Multimodal Large Language Models: A Survey

    Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-foll…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (Findings)

  7. arXiv:2311.16254  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

    Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of…

    Submitted 23 July, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: ECCV 2024