TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models

Fhima, Jonathan; Avraham, Elad Ben; Nuriel, Oren; Kittenplon, Yair; Ganz, Roy; Aberdam, Aviad; Litman, Ron

arXiv:2411.04642 (cs)

[Submitted on 7 Nov 2024]

Abstract: Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM. Initially, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.
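
The abstract describes compressing a variable number of OCR tokens, each with layout information, into a short fixed-length sequence. The sketch below illustrates one generic way such a compression could work: a small set of query vectors cross-attends over the OCR tokens, so the output length depends only on the number of queries. This is an assumption made for illustration, not the architecture confirmed by the abstract. The names `embed_ocr_token`, `compress`, `DIM`, and `NUM_QUERIES`, the hash-seeded embeddings, and the random queries are all hypothetical stand-ins for learned parameters.

```python
import math
import random


def embed_ocr_token(word, box, dim):
    """Hypothetical embedding of one OCR token (assumption, not from the paper).

    box = (x0, y0, x1, y1), normalized to [0, 1].
    """
    # Deterministic per-word vector; stands in for a learned embedding table.
    rng = random.Random(word)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Inject layout by adding the box coordinates cyclically across dimensions.
    return [v + box[i % 4] for i, v in enumerate(vec)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(tokens, queries):
    """Single-head cross-attention: each query attends over all OCR tokens.

    The output has len(queries) rows regardless of len(tokens).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return out


DIM, NUM_QUERIES = 8, 4  # illustrative sizes only

# Random queries stand in for parameters that would be learned in pretraining.
_rng = random.Random(0)
queries = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Made-up OCR output: (word, normalized bounding box).
ocr = [
    ("Invoice", (0.10, 0.05, 0.30, 0.10)),
    ("Total", (0.60, 0.80, 0.70, 0.85)),
    ("$42.00", (0.75, 0.80, 0.90, 0.85)),
]
tokens = [embed_ocr_token(w, b, DIM) for w, b in ocr]
compressed = compress(tokens, queries)
print(len(compressed), len(compressed[0]))  # prints: 4 8
```

Passing a longer OCR list to `compress` still yields `NUM_QUERIES` output vectors, which is the property that keeps the input to the LLM short no matter how much text the image contains.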

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.04642 [cs.CV]
	(or arXiv:2411.04642v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2411.04642

Submission history

From: Elad Ben Avraham
[v1] Thu, 7 Nov 2024 11:54:01 UTC (17,483 KB)
