Showing 1–2 of 2 results for author: Paavola, J

Searching in archive cs.
  1. arXiv:2506.00469  [pdf, ps, other]

    cs.CL

    Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

    Authors: Shaoxiong Ji, Zihao Li, Jaakko Paavola, Indraneil Paul, Hengyu Luo, Jörg Tiedemann

    Abstract: This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from mor…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: EMMA-500 Gen 2; refer to Gen 1 in arXiv:2409.17892

  2. arXiv:2409.17892  [pdf, other]

    cs.CL

    EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

    Authors: Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow

    Abstract: In this work, we introduce EMMA-500, a large-scale multilingual language model continue-trained on texts across 546 languages designed for enhanced multilingual performance, focusing on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains.…

    Submitted 11 February, 2025; v1 submitted 26 September, 2024; originally announced September 2024.