
Showing 1–6 of 6 results for author: Aulamo, M

  1. arXiv:2505.14423  [pdf, ps, other]

    cs.CL

    Scaling Low-Resource MT via Synthetic Data Generation with LLMs

    Authors: Ona de Gibert, Joseph Attieh, Teemu Vahtola, Mikko Aulamo, Zihao Li, Raúl Vázquez, Tiancheng Hu, Jörg Tiedemann

    Abstract: We investigate the potential of LLM-generated synthetic data for improving low-resource machine translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its high overall quality. We study its practical application by (i) iden…

    Submitted 20 May, 2025; originally announced May 2025.

  2. arXiv:2503.10267  [pdf, ps, other]

    cs.CL

    An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

    Authors: Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, et al. (10 additional authors not shown)

    Abstract: Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 langu…

    Submitted 4 June, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

    Comments: ACL'2025 Main Proceedings

  3. arXiv:2403.14009  [pdf, other]

    cs.CL

    A New Massive Multilingual Dataset for High-Performance Language Technologies

    Authors: Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

    Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performa…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  4. arXiv:2311.14838  [pdf, other]

    cs.CL

    OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

    Authors: Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

    Abstract: Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researc…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: Code on Github: https://github.com/hplt-project/OpusCleaner and https://github.com/hplt-project/OpusTrainer

  5. arXiv:2212.01936  [pdf, other]

    cs.CL

    Democratizing Neural Machine Translation with OPUS-MT

    Authors: Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raul Vazquez, Sami Virpioja

    Abstract: This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-opt…

    Submitted 4 July, 2023; v1 submitted 4 December, 2022; originally announced December 2022.

  6. arXiv:1809.07978  [pdf, other]

    cs.CL

    Paraphrase Detection on Noisy Subtitles in Six Languages

    Authors: Eetu Sjöblom, Mathias Creutz, Mikko Aulamo

    Abstract: We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find out that GRAN outperforms WA and is more robust to noisy training data. B…

    Submitted 21 September, 2018; originally announced September 2018.

    Comments: To appear in Proceedings of W-NUT at EMNLP 2018, Brussels, Belgium, 1 November 2018