Showing 1–1 of 1 results for author: Khapra, M S

  1. arXiv:2104.05596

    cs.CL

    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    Authors: Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

    Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the w…

    Submitted 12 June, 2023; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: Accepted to the Transactions of the Association for Computational Linguistics (TACL)