Showing 1–9 of 9 results for author: Moratelli, N

  1. arXiv:2503.15621  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

    Authors: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation…

    Submitted 19 March, 2025; originally announced March 2025.

  2. arXiv:2412.09353  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Causal Graphical Models for Vision-Language Compositional Understanding

    Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships…

    Submitted 15 April, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

    Comments: Accepted at ICLR 2025

  3. arXiv:2412.03665  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

    Authors: Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimo…

    Submitted 4 December, 2024; originally announced December 2024.

    Comments: ECCV 2024 Workshop on Green Foundation Models

  4. arXiv:2411.16863  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

    Authors: Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce…

    Submitted 2 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

    Comments: CVPR 2025

  5. arXiv:2410.07336  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

    Authors: Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed p…

    Submitted 9 October, 2024; originally announced October 2024.

  6. arXiv:2408.16827  [pdf, other]

    cs.CV

    Fluent and Accurate Image Captioning with a Self-Trained Reward Model

    Authors: Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: ICPR 2024

  7. arXiv:2408.14547  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

    Authors: Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genu…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: BMVC 2024

  8. arXiv:2404.15406  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

    Authors: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at int…

    Submitted 22 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models

  9. arXiv:2402.12451  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    The Revolution of Multimodal Large Language Models: A Survey

    Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-foll…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (Findings)