Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Raja, Rahul; Vats, Arpita

arXiv:2503.04797 (cs)

[Submitted on 2 Mar 2025 (v1), last revised 22 Apr 2025 (this version, v2)]

Abstract: Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora on criteria such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

Comments:	Accepted in NACCL
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2503.04797 [cs.CL]
	(or arXiv:2503.04797v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.04797

Submission history

From: Rahul Raja
[v1] Sun, 2 Mar 2025 21:22:53 UTC (1,149 KB)
[v2] Tue, 22 Apr 2025 05:10:55 UTC (1,142 KB)
