Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Doddapaneni, Sumanth; Aralikatte, Rahul; Ramesh, Gowtham; Goyal, Shreya; Khapra, Mitesh M.; Kunchukuttan, Anoop; Kumar, Pratyush

arXiv:2212.05409 (cs)

[Submitted on 11 Dec 2022 (v1), last revised 24 May 2023 (this version, v3)]

Abstract: Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion, is crucial. In this work, we aim to improve NLU capabilities for Indic languages by making contributions along three axes: (i) monolingual corpora, (ii) NLU test sets, and (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpus, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families, a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at this https URL.

Comments:	ACL 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2212.05409 [cs.CL]
	(or arXiv:2212.05409v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.05409

Submission history

From: Sumanth Doddapaneni
[v1] Sun, 11 Dec 2022 04:45:50 UTC (1,951 KB)
[v2] Tue, 13 Dec 2022 18:47:27 UTC (1,951 KB)
[v3] Wed, 24 May 2023 17:05:16 UTC (1,714 KB)
