Showing 1–11 of 11 results for author: Vivoli, E

  1. arXiv:2503.08561  [pdf, other]

    cs.CV

    ComicsPAP: understanding comic strips by picking the correct panel

    Authors: Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Souibgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas

    Abstract: Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, Com…

    Submitted 24 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  2. arXiv:2502.21054  [pdf, other]

    cs.CV eess.IV eess.SP

    HoloMine: A Synthetic Dataset for Buried Landmines Recognition using Microwave Holographic Imaging

    Authors: Emanuele Vivoli, Lorenzo Capineri, Marco Bertini

    Abstract: The detection and removal of landmines is a complex and risky task that requires advanced remote sensing techniques to reduce the risk for the professionals involved in this task. In this paper, we propose a novel synthetic dataset for buried landmine detection to provide researchers with a valuable resource to observe, measure, locate, and address issues in landmine detection. The dataset consist…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: under review

  3. arXiv:2409.16159  [pdf, other]

    cs.CV

    ComiCap: A VLMs pipeline for dense captioning of Comic Panels

    Authors: Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

    Abstract: The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted at ECCV 2024 Workshop (AI for Visual Art), repo: https://github.com/emanuelevivoli/ComiCap

  4. arXiv:2409.09502  [pdf, other]

    cs.CV

    One missing piece in Vision and Language: A Survey on Comics Understanding

    Authors: Emanuele Vivoli, Mohamed Ali Souibgui, Andrey Barsky, Artemis Llabrés, Marco Bertini, Dimosthenis Karatzas

    Abstract: Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challeng…

    Submitted 8 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: under review. project website: https://github.com/emanuelevivoli/awesome-comics-understanding

  5. arXiv:2409.01835  [pdf, other]

    cs.CV cs.CL

    Towards Generative Class Prompt Learning for Fine-grained Visual Recognition

    Authors: Soumitri Chattopadhyay, Sanket Biswas, Emanuele Vivoli, Josep Lladós

    Abstract: Although foundational vision-language models (VLMs) have proven to be very successful for various semantic discrimination tasks, they still struggle to perform faithfully for fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well on a different domain without fine-tuning. We attribute these to the limitations of the VLM's semantic representations an…

    Submitted 7 September, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted in BMVC 2024

  6. arXiv:2407.03550  [pdf, other]

    cs.CV

    CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

    Authors: Emanuele Vivoli, Marco Bertini, Dimosthenis Karatzas

    Abstract: The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as objec…

    Submitted 31 October, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at NeurIPS 2024 (D&B)

  7. arXiv:2407.03540  [pdf, other]

    cs.CV

    Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

    Authors: Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

    Abstract: Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compa…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at MANPU - COMICS workshop at ICDAR

  8. arXiv:2403.03719  [pdf, other]

    cs.CV

    Multimodal Transformer for Comics Text-Cloze

    Authors: Emanuele Vivoli, Joan Lafuente Baeza, Ernest Valveny Llobet, Dimosthenis Karatzas

    Abstract: This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introd…

    Submitted 6 March, 2024; originally announced March 2024.

  9. arXiv:2302.01451  [pdf, other]

    cs.CL cs.CV

    CTE: A Dataset for Contextualized Table Extraction

    Authors: Andrea Gemelli, Emanuele Vivoli, Simone Marinai

    Abstract: Relevant information in documents is often summarized in tables, helping the reader to identify useful facts. Most benchmark datasets support either document layout analysis or table understanding, but lack in providing data to apply both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), which aims to extract and define the structure of tables considering the tex…

    Submitted 13 February, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

  10. arXiv:2209.06730  [pdf, other]

    cs.CV

    MUST-VQA: MUltilingual Scene-text VQA

    Authors: Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez

    Abstract: In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and it is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a m…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To appear in the Text In Everything Workshop at ECCV 2022

  11. Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents

    Authors: Andrea Gemelli, Emanuele Vivoli, Simone Marinai

    Abstract: Tables are widely used in several types of documents since they can bring important information in a structured way. In scientific papers, tables can sum up novel discoveries and summarize experimental results, making the research comparable and easily understandable by scholars. Several methods perform table analysis working on document images, losing useful information during the conversion from…

    Submitted 23 August, 2022; originally announced August 2022.

    Comments: ICPR 2022