Showing 1–50 of 80 results for author: Karatzas, D

  1. arXiv:2506.20588  [pdf, ps, other]

    cs.CV

    TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

    Authors: Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

    Abstract: The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts th…

    Submitted 25 June, 2025; originally announced June 2025.

  2. arXiv:2506.19702  [pdf, ps, other]

    cs.AI

    LLM-Driven Medical Document Analysis: Enhancing Trustworthy Pathology and Differential Diagnosis

    Authors: Lei Kang, Xuanshuo Fu, Oriol Ramos Terrades, Javier Vazquez-Corral, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Medical document analysis plays a crucial role in extracting essential clinical insights from unstructured healthcare records, supporting critical tasks such as differential diagnosis. Determining the most probable condition among overlapping symptoms requires precise evaluation and deep medical expertise. While recent advancements in large language models (LLMs) have significantly enhanced perfor…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted at ICDAR 2025

  3. arXiv:2505.07496  [pdf, other]

    cs.CV cs.LG

    DocVXQA: Context-Aware Visual Explanations for Document Question Answering

    Authors: Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas

    Abstract: We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively for…

    Submitted 12 May, 2025; originally announced May 2025.

  4. arXiv:2504.16101  [pdf, other]

    eess.SP cs.AI cs.LG

    xLSTM-ECG: Multi-label ECG Classification via Feature Fusion with xLSTM

    Authors: Lei Kang, Xuanshuo Fu, Javier Vazquez-Corral, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, highlighting the critical need for efficient and accurate diagnostic tools. Electrocardiograms (ECGs) are indispensable in diagnosing various heart conditions; however, their manual interpretation is time-consuming and error-prone. In this paper, we propose xLSTM-ECG, a novel approach that leverages an extended Long Sh…

    Submitted 14 April, 2025; originally announced April 2025.

  5. arXiv:2504.09249  [pdf, other]

    cs.CV cs.IR cs.LG cs.MM

    NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

    Authors: Aniket Pal, Sanket Biswas, Alloy Das, Ayush Lodh, Priyanka Banerjee, Soumitri Chattopadhyay, Dimosthenis Karatzas, Josep Llados, C. V. Jawahar

    Abstract: Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neura…

    Submitted 12 April, 2025; originally announced April 2025.

  6. arXiv:2504.08616  [pdf, other]

    cs.CV

    Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition

    Authors: Lei Kang, Xuanshuo Fu, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the "right to be forgotten" underscores the necessity for methods that can expunge sensitive inf…

    Submitted 11 April, 2025; originally announced April 2025.

  7. arXiv:2503.08561  [pdf, other]

    cs.CV

    ComicsPAP: understanding comic strips by picking the correct panel

    Authors: Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Souibgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas

    Abstract: Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, Com…

    Submitted 24 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  8. arXiv:2502.03692  [pdf, other]

    cs.LG cs.CL cs.CR

    DocMIA: Document-Level Membership Inference Attacks against DocVQA Models

    Authors: Khanh Nguyen, Raouf Kerkouche, Mario Fritz, Dimosthenis Karatzas

    Abstract: Document Visual Question Answering (DocVQA) has introduced a new paradigm for end-to-end document understanding, and quickly became one of the standard benchmarks for multimodal LLMs. Automating document processing workflows, driven by DocVQA models, presents significant potential for many business sectors. However, documents tend to contain highly sensitive information, raising concerns about pri…

    Submitted 5 February, 2025; originally announced February 2025.

    Comments: ICLR 2025

  9. arXiv:2411.03730  [pdf, ps, other]

    cs.LG cs.CR cs.CV

    NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA

    Authors: Marlon Tobaben, Mohamed Ali Souibgui, Rubèn Tito, Khanh Nguyen, Raouf Kerkouche, Kangsoo Jung, Joonas Jälkö, Lei Kang, Andrey Barsky, Vincent Poulain d'Andecy, Aurélie Joseph, Aashiq Muhamed, Kevin Kuo, Virginia Smith, Yusuke Yamasaki, Takumi Fukami, Kenta Niwa, Iifan Tyou, Hiro Ishii, Rio Yokota, Ragul N, Rintu Kutum, Josep Llados, Ernest Valveny, Antti Honkela, et al. (2 additional authors not shown)

    Abstract: The Privacy Preserving Federated Learning Document VQA (PFL-DocVQA) competition challenged the community to develop provably private and communication-efficient solutions in a federated setting for a real-life use case: invoice processing. The competition introduced a dataset of real invoice documents, along with associated questions and answers requiring information extraction and reasoning over…

    Submitted 3 June, 2025; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: 33 pages, 7 figures; published in TMLR 06/2025 https://openreview.net/forum?id=3HKNwejEEq

    Journal ref: Transactions on Machine Learning Research, ISSN 2835-8856, 2025

  10. arXiv:2409.16159  [pdf, other]

    cs.CV

    ComiCap: A VLMs pipeline for dense captioning of Comic Panels

    Authors: Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

    Abstract: The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Accepted at ECCV 2024 Workshop (AI for Visual Art), repo: https://github.com/emanuelevivoli/ComiCap

  11. arXiv:2409.09502  [pdf, other]

    cs.CV

    One missing piece in Vision and Language: A Survey on Comics Understanding

    Authors: Emanuele Vivoli, Mohamed Ali Souibgui, Andrey Barsky, Artemis Llabrés, Marco Bertini, Dimosthenis Karatzas

    Abstract: Vision-language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero-shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challeng…

    Submitted 8 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: under review. project website: https://github.com/emanuelevivoli/awesome-comics-understanding

  12. arXiv:2408.07259  [pdf, other]

    cs.CV cs.AI

    GRIF-DM: Generation of Rich Impression Fonts using Diffusion Models

    Authors: Lei Kang, Fei Yang, Kai Wang, Mohamed Ali Souibgui, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Fonts are integral to creative endeavors, design processes, and artistic productions. The appropriate selection of a font can significantly enhance artwork and endow advertisements with a higher level of expressivity. Despite the availability of numerous diverse font designs online, traditional retrieval-based methods for font selection are increasingly being supplanted by generation-based approac…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Accepted to ECAI2024

  13. Image-text matching for large-scale book collections

    Authors: Artemis Llabrés, Arka Ujjal Dey, Dimosthenis Karatzas, Ernest Valveny

    Abstract: We address the problem of detecting and mapping all books in a collection of images to entries in a given book catalogue. Instead of performing independent retrieval for each book detected, we treat the image-text mapping problem as a many-to-many matching process, looking for the best overall match between the two sets. We combine a state-of-the-art segmentation method (SAM) to detect book spines…

    Submitted 29 July, 2024; originally announced July 2024.

    Journal ref: Document Analysis Systems, Lecture Notes in Computer Science, vol. 14994, pp. 89-102, Springer, 2024

  14. arXiv:2407.03550  [pdf, other]

    cs.CV

    CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

    Authors: Emanuele Vivoli, Marco Bertini, Dimosthenis Karatzas

    Abstract: The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as objec…

    Submitted 31 October, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at NeurIPS 2024 (D&B)

  15. arXiv:2407.03540  [pdf, other]

    cs.CV

    Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

    Authors: Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

    Abstract: Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compa…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at MANPU - COMICS workshop at ICDAR

  16. arXiv:2405.06636  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Federated Document Visual Question Answering: A Pilot Study

    Authors: Khanh Nguyen, Dimosthenis Karatzas

    Abstract: An important handicap of document analysis research is that documents tend to be copyrighted or contain private information, which prohibits their open publication and the creation of centralised, large-scale document datasets. Instead, documents are scattered in private data silos, making extensive training over heterogeneous data a tedious task. In this work, we explore the use of a federated le…

    Submitted 22 May, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

  17. arXiv:2404.19031  [pdf, other]

    cs.CV cs.AI

    Machine Unlearning for Document Classification

    Authors: Lei Kang, Mohamed Ali Souibgui, Fei Yang, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Document understanding models have recently demonstrated remarkable performance by leveraging extensive collections of user documents. However, since documents often contain large amounts of personal data, their usage can pose a threat to user privacy and weaken the bonds of trust between humans and AI services. In response to these concerns, legislation advocating "the right to be forgotten" has…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted to ICDAR2024

  18. arXiv:2404.19024  [pdf, other]

    cs.CV

    Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

    Authors: Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted to ICDAR2024

  19. arXiv:2404.10702  [pdf, other]

    cs.MM

    Retrieval Augmented Verification for Zero-Shot Detection of Multimodal Disinformation

    Authors: Arka Ujjal Dey, Artemis Llabrés, Ernest Valveny, Dimosthenis Karatzas

    Abstract: The rise of disinformation on social media, especially through the strategic manipulation or repurposing of images, paired with provocative text, presents a complex challenge for traditional fact-checking methods. In this paper, we introduce a novel zero-shot approach to identify and interpret such multimodal disinformation, leveraging real-time evidence from credible sources. Our framework goes b…

    Submitted 9 April, 2025; v1 submitted 16 April, 2024; originally announced April 2024.

  20. arXiv:2403.03719  [pdf, other]

    cs.CV

    Multimodal Transformer for Comics Text-Cloze

    Authors: Emanuele Vivoli, Joan Lafuente Baeza, Ernest Valveny Llobet, Dimosthenis Karatzas

    Abstract: This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introd…

    Submitted 6 March, 2024; originally announced March 2024.

  21. arXiv:2312.10108  [pdf, other]

    cs.CV cs.AI cs.LG

    Privacy-Aware Document Visual Question Answering

    Authors: Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Joonas Jälkö, Vincent Poulain D'Andecy, Aurelie Joseph, Lei Kang, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas

    Abstract: Document Visual Question Answering (DocVQA) has quickly grown into a central task of document understanding. But despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time, highlighting privacy issues in state of the art multi-modal LLM…

    Submitted 2 September, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 35 pages, 12 figures, accepted for publication at the 18th International Conference on Document Analysis and Recognition, ICDAR 2024

  22. arXiv:2309.02356  [pdf, other]

    cs.CV

    STEP -- Towards Structured Scene-Text Spotting

    Authors: Sergi Garcia-Bordils, Dimosthenis Karatzas, Marçal Rusiñol

    Abstract: We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression. Contrary to generic scene text OCR, structured scene-text spotting seeks to dynamically condition both scene text detection and recognition on user-provided regular expressions. To tackle this task, we propose the Structured TExt sPotter (ST…

    Submitted 11 December, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: 15 pages, 11 figures

    MSC Class: 68T01 (Primary) 68T10; 68T45; 68T05; 68T07 (Secondary)

    ACM Class: I.2.1; I.2.6; I.2.10

  23. arXiv:2309.01380  [pdf, other]

    cs.CV

    Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

    Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, whic…

    Submitted 11 September, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

  24. Reading Between the Lanes: Text VideoQA on the Road

    Authors: George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and te…

    Submitted 14 June, 2025; v1 submitted 8 July, 2023; originally announced July 2023.

  25. arXiv:2306.03287  [pdf, other]

    cs.CV

    ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

    Authors: Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, Mingyu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yuechen Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Cheng-Lin Liu, Jiebo Luo, Shuicheng Yan, Min Zhang, Dimosthenis Karatzas, Xing Sun, et al. (2 additional authors not shown)

    Abstract: Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extracti…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: ICDAR 2023 Competition on SVRD report (to appear in ICDAR 2023)

  26. arXiv:2304.11966  [pdf, other]

    cs.CV

    ICDAR 2023 Competition on Reading the Seal Title

    Authors: Wenwen Yu, Mingyu Liu, Mingrui Chen, Ning Lu, Yinlong Wen, Yuliang Liu, Dimosthenis Karatzas, Xiang Bai

    Abstract: Reading seal title text is a challenging task due to the variable shapes of seals, curved text, background noise, and overlapped text. However, this important element is commonly found in official and financial scenarios, and has not received the attention it deserves in the field of OCR technology. To promote research in this area, we organized the ICDAR 2023 competition on reading the seal title (Re…

    Submitted 5 June, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: ICDAR2023 Competition on ReST report (to appear in ICDAR 2023)

  27. arXiv:2304.04376  [pdf, other]

    cs.CV

    ICDAR 2023 Video Text Reading Competition for Dense and Small Text

    Authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai

    Abstract: Recently, video text detection, tracking, and recognition in natural scenes are becoming very popular in the computer vision community. However, most existing algorithms and benchmarks focus on common text cases (e.g., normal size, density) and single scenarios, while ignoring extreme video text challenges, i.e., dense and small text in various scenarios. In this competition report, we establish a…

    Submitted 10 April, 2023; originally announced April 2023.

    Journal ref: ICDAR 2023 competition

  28. arXiv:2302.05658  [pdf, other]

    cs.CL cs.AI cs.LG

    DocILE Benchmark for Document Information Localization and Extraction

    Authors: Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas

    Abstract: This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific…

    Submitted 3 May, 2023; v1 submitted 11 February, 2023; originally announced February 2023.

    Comments: Accepted to ICDAR 2023

  29. arXiv:2212.05935  [pdf, other]

    cs.CV cs.AI cs.CL

    Hierarchical multimodal transformers for Multi-Page DocVQA

    Authors: Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

    Abstract: Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where qu…

    Submitted 1 April, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

  30. arXiv:2211.05588  [pdf, other]

    cs.CV

    Watching the News: Towards VideoQA Models that can Read

    Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel V…

    Submitted 7 December, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

  31. arXiv:2209.10474  [pdf, other]

    cs.CV

    Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

    Authors: Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over…

    Submitted 21 September, 2022; originally announced September 2022.

  32. arXiv:2209.06730  [pdf, other]

    cs.CV

    MUST-VQA: MUltilingual Scene-text VQA

    Authors: Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez

    Abstract: In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and it is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a m…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To appear at the Text In Everything Workshop at ECCV 2022

  33. arXiv:2209.06717  [pdf, other]

    cs.CV

    Out-of-Vocabulary Challenge Report

    Authors: Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel, Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas

    Abstract: This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely, the recognition of scene text instances unseen at training time. The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To appear at the Text In Everything Workshop at ECCV 2022

  34. arXiv:2203.04814  [pdf, other]

    cs.CV

    Text-DIAE: A Self-Supervised Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

    Authors: Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas

    Abstract: In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. E…

    Submitted 18 August, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Preprint

  35. arXiv:2202.12985  [pdf, other]

    cs.CV cs.AI

    OCR-IDL: OCR Annotations for Industry Document Library Dataset

    Authors: Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain models that are later fine-tuned on downstream tasks. One of the problems of pretraining approaches is the inconsistent usage of pretraining data with different OCR engines, leading to incomparable results between models. In other words, it is not obvious whether the performance…

    Submitted 25 February, 2022; originally announced February 2022.

  36. arXiv:2111.05547  [pdf, other]

    cs.CV cs.LG

    ICDAR 2021 Competition on Document Visual Question Answering

    Authors: Rubèn Tito, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

    Abstract: In this report we present the results of the ICDAR 2021 edition of the Document Visual Question Answering challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced one on Infographics VQA. Infographics VQA is based on a new dataset of more than 5,000 infographics images and 30,000 question-answer pairs. The winning methods have scored 0.6120 AN…

    Submitted 10 November, 2021; originally announced November 2021.

  37. arXiv:2110.02623  [pdf, other]

    cs.CV cs.AI

    Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

    Authors: Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

    Abstract: The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forc…

    Submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted to WACV 2022

  38. arXiv:2110.01705  [pdf, other]

    cs.CV

    Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning

    Authors: Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models and is undesirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data or inc…

    Submitted 2 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: Accepted to WACV 2022

  39. Asking questions on handwritten document collections

    Authors: Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritt…

    Submitted 1 October, 2021; originally announced October 2021.

    Comments: pre-print version

    Journal ref: Int. J. Document Anal. Recognit., vol. 24, no. 3, pp. 235-249, 2021

  40. arXiv:2105.05300  [pdf, other]

    cs.CV

    One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition

    Authors: Mohamed Ali Souibgui, Ali Furkan Biten, Sounak Dey, Alicia Fornés, Yousri Kessentini, Lluis Gomez, Dimosthenis Karatzas, Josep Lladós

    Abstract: Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models). This is the case, for example, for historical ciphered manuscripts, which are usually written with invented alphabets to hide the message contents. Thus, in this paper we address this problem through a data generation technique…

    Submitted 5 October, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

    Comments: Accepted in WACV 2022

  41. Document Collection Visual Question Answering

    Authors: Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

    Abstract: Current tasks and methods in Document Understanding aim to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and related task, where questions are posed…

    Submitted 8 June, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

  42. arXiv:2104.12756  [pdf, other

    cs.CV cs.CL

    InfographicVQA

    Authors: Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C. V Jawahar

    Abstract: Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images using Visual Question Answering techniques. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language question…

    Submitted 22 August, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  43. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction

    Authors: Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar

    Abstract: Scanned receipt OCR and key information extraction (SROIE) represent the processes of recognizing text from scanned receipts, extracting key texts from them, and saving the extracted texts to structured documents. SROIE plays a critical role in many document analysis applications and holds great commercial potential, but very few research works and advances have been published in this area.…

    Submitted 18 March, 2021; originally announced March 2021.

  44. arXiv:2012.04329  [pdf, other

    cs.CV

    StacMR: Scene-Text Aware Cross-Modal Retrieval

    Authors: Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

    Abstract: Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in imag…

    Submitted 8 December, 2020; originally announced December 2020.

  45. arXiv:2009.09809  [pdf, other

    cs.CV

    Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

    Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text…

    Submitted 21 September, 2020; originally announced September 2020.

  46. arXiv:2008.08899  [pdf, other

    cs.CV cs.IR

    Document Visual Question Answering Challenge 2020

    Authors: Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C. V. Jawahar

    Abstract: This paper presents the results of the Document Visual Question Answering Challenge, organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem: Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image. On the other hand, the second task is…

    Submitted 17 July, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: to be published as a short paper in DAS 2020

  47. arXiv:2008.04991  [pdf, other

    cs.CV

    Retrieval Guided Unsupervised Multi-domain Image-to-Image Translation

    Authors: Raul Gomez, Yahui Liu, Marco De Nadai, Dimosthenis Karatzas, Bruno Lepri, Nicu Sebe

    Abstract: Image-to-image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that image descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, sy…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Submitted to ACM MM '20, October 12-16, 2020, Seattle, WA, USA

  48. arXiv:2007.03375  [pdf, other

    cs.CV

    Location Sensitive Image Retrieval and Tagging

    Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

    Abstract: People from different parts of the globe describe objects and concepts in distinct manners. Visual appearance can thus vary across different geographic locations, which makes location relevant contextual information when analysing visual data. In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. We present LocSens, a model that l…

    Submitted 7 July, 2020; originally announced July 2020.

    MSC Class: 68T07 ACM Class: I.2.10

    Journal ref: ECCV 2020

  49. arXiv:2007.03098  [pdf, other

    cs.CV

    Text Recognition -- Real World Data and Where to Find Them

    Authors: Klára Janoušková, Jiri Matas, Lluis Gomez, Dimosthenis Karatzas

    Abstract: We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcription to weak annotations and edit distance guided neighbourhood search. It produces nearly error-f…

    Submitted 17 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: 10 pages

  50. arXiv:2007.00398  [pdf, other

    cs.CV cs.IR

    DocVQA: A Dataset for VQA on Document Images

    Authors: Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

    Abstract: We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the exis…

    Submitted 5 January, 2021; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: accepted at WACV 2021