DUBLIN -- Document Understanding By Language-Image Network

Aggarwal, Kriti; Khandelwal, Aditi; Tanmay, Kumar; Khan, Owais Mohammed; Liu, Qiang; Choudhury, Monojit; Chauhan, Hardik Hansrajbhai; Som, Subhojit; Chaudhary, Vishrav; Tiwary, Saurabh

arXiv:2305.14218 (cs)

[Submitted on 23 May 2023 (v1), last revised 27 Oct 2023 (this version, v4)]

Abstract: Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Text Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve competitive performance on RVL-CDIP document classification. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
ACM classes:	F.2.2; I.2.7
Cite as:	arXiv:2305.14218 [cs.CV]
	(or arXiv:2305.14218v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.14218

Submission history

From: Aditi Khandelwal
[v1] Tue, 23 May 2023 16:34:09 UTC (748 KB)
[v2] Wed, 24 May 2023 07:03:56 UTC (749 KB)
[v3] Sat, 17 Jun 2023 05:53:08 UTC (6,787 KB)
[v4] Fri, 27 Oct 2023 15:08:31 UTC (7,169 KB)