
Showing 1–10 of 10 results for author: Ploeger, E

  1. arXiv:2505.20264  [pdf, ps, other]

    cs.CL cs.AI

    We Need to Measure Data Diversity in NLP -- Better and Broader

    Authors: Dong Nguyen, Esther Ploeger

    Abstract: Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.

    Submitted 26 May, 2025; originally announced May 2025.

  2. arXiv:2412.08473  [pdf, ps, other]

    cs.CL

    Multi-perspective Alignment for Increasing Naturalness in Neural Machine Translation

    Authors: Huiyuan Lai, Esther Ploeger, Rik van Noord, Antonio Toral

    Abstract: Neural machine translation (NMT) systems amplify lexical biases present in their training data, leading to artificially impoverished language in output translations. These language-level characteristics render automatic translations different from text originally written in a language and human translations, which hinders their usefulness in, for example, creating evaluation datasets. Attempts to in…

    Submitted 30 May, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

    Comments: Accepted to ACL 2025 main; 9 pages

  3. arXiv:2411.19799  [pdf, other]

    cs.CL

    INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

    Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, et al. (34 additional authors not shown)

    Abstract: The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other th…

    Submitted 29 November, 2024; originally announced November 2024.

  4. arXiv:2411.05527  [pdf, other]

    cs.CL

    How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

    Authors: Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, Miryam de Lhoneux

    Abstract: Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing wid…

    Submitted 16 May, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

  5. arXiv:2408.17308  [pdf, other]

    cs.CL

    Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation

    Authors: Esther Ploeger, Huiyuan Lai, Rik van Noord, Antonio Toral

    Abstract: Machine translations are found to be lexically poorer than human translations. The loss of lexical diversity through MT poses an issue in the automatic translation of literature, where it matters not only what is written, but also how it is written. Current methods for increasing lexical diversity in MT are rigid. Yet, as we demonstrate, the degree of lexical diversity can vary considerably across…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Accepted to EAMT 2024

  6. arXiv:2407.05022  [pdf, other]

    cs.CL

    A Principled Framework for Evaluating on Typologically Diverse Languages

    Authors: Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva

    Abstract: Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world's languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets…

    Submitted 17 March, 2025; v1 submitted 6 July, 2024; originally announced July 2024.

    Comments: Revised version

  7. arXiv:2402.04222  [pdf, other]

    cs.CL

    What is "Typological Diversity" in NLP?

    Authors: Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

    Abstract: The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is co…

    Submitted 2 October, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: EMNLP 2024: Main Conference

  8. arXiv:2402.01513  [pdf, other]

    cs.CL

    Multilingual Gradient Word-Order Typology from Universal Dependencies

    Authors: Emi Baylor, Esther Ploeger, Johannes Bjerva

    Abstract: While information from the field of linguistic typology has the potential to improve performance on NLP tasks, reliable typological data is a prerequisite. Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format. Furthermore, typological categorisations by definition differ significantly from the continuous nature of phe…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: EACL 2024

  9. arXiv:2310.19567  [pdf, other]

    cs.CL cs.AI

    CreoleVal: Multilingual Multitask Benchmarks for Creoles

    Authors: Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

    Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning…

    Submitted 6 May, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to TACL

  10. arXiv:2310.13440  [pdf, other]

    cs.CL

    The Past, Present, and Future of Typological Databases in NLP

    Authors: Emi Baylor, Esther Ploeger, Johannes Bjerva

    Abstract: Typological information has the potential to be beneficial in the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or lingui…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP Findings