ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

Barua, Deeparghya Dutta; Sourove, Md Sakib Ul Rahman; Fahim, Md; Haider, Fabiha; Shifat, Fariha Tanjim; Adib, Md Tasmim Rahman; Uddin, Anam Borhan; Ishmam, Md Farhan; Alam, Md Farhad

arXiv:2410.14991 (cs)

[Submitted on 19 Oct 2024 (v1), last revised 2 Jun 2025 (this version, v2)]

Abstract: Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of proper benchmarks, which challenges models known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little regional relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset, ChitroJera, totaling over 15k samples from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate current large vision language models (LVLMs) using prompt-based techniques, which achieve the best overall performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.

Comments:	Accepted at ECML PKDD 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2410.14991 [cs.CV]
	(or arXiv:2410.14991v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.14991

Submission history

From: Md Farhan Ishmam
[v1] Sat, 19 Oct 2024 05:45:21 UTC (10,316 KB)
[v2] Mon, 2 Jun 2025 12:38:12 UTC (9,911 KB)
