
Showing 1–15 of 15 results for author: Lent, H

  1. arXiv:2505.24525  [pdf, ps, other]

    cs.CL

    Limited-Resource Adapters Are Regularizers, Not Linguists

    Authors: Marcell Fekete, Nathaniel R. Robinson, Ernests Lavrinovics, E. Djeride Jean-Baptiste, Raj Dabre, Johannes Bjerva, Heather Lent

    Abstract: Cross-lingual transfer from related high-resource languages is a well-established strategy to enhance low-resource language technologies. Prior work has shown that adapters show promise for, e.g., improving low-resource machine translation (MT). In this work, we investigate an adapter souping method combined with cross-attention fine-tuning of a pre-trained MT model to leverage language transfer f…

    Submitted 30 May, 2025; originally announced May 2025.

  2. arXiv:2504.06669  [pdf, other]

    cs.CL cs.AI

    NLP Security and Ethics, in the Wild

    Authors: Heather Lent, Erick Galinkin, Yiyi Chen, Jens Myrup Pedersen, Leon Derczynski, Johannes Bjerva

    Abstract: As NLP models are used by a growing number of end-users, an area of increasing importance is NLP Security (NLPSec): assessing the vulnerability of models to malicious attacks and developing comprehensive countermeasures against them. While work at the intersection of NLP and cybersecurity has the potential to create safer NLP for all, accidental oversights can result in tangible harm (e.g., breach…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: Accepted to TACL

  3. arXiv:2411.05527  [pdf, other]

    cs.CL

    How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

    Authors: Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, Miryam de Lhoneux

    Abstract: Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing wid…

    Submitted 16 May, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

  4. arXiv:2409.12683  [pdf, ps, other]

    cs.CL cs.AI

    Connecting Ideas in 'Lower-Resource' Scenarios: NLP for National Varieties, Creoles and Other Low-resource Scenarios

    Authors: Aditya Joshi, Diptesh Kanojia, Heather Lent, Hour Kaing, Haiyue Song

    Abstract: Despite excellent results on benchmarks over a small subset of languages, large language models struggle to process text from languages situated in 'lower-resource' scenarios such as dialects/sociolects (national or social varieties of a language), Creoles (languages arising from linguistic contact between multiple languages) and other low-resource languages. This introductory tutorial will identi…

    Submitted 19 September, 2024; originally announced September 2024.

    Comments: Selected as a full-day tutorial at COLING 2025

  5. arXiv:2408.11749  [pdf, other]

    cs.CL cs.CR

    Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks

    Authors: Yiyi Chen, Russa Biswas, Heather Lent, Johannes Bjerva

    Abstract: Large Language Models (LLMs) are susceptible to malicious influence by cyber attackers through intrusions such as adversarial, backdoor, and embedding inversion attacks. In response, the burgeoning field of LLM Security aims to study and defend against such threats. Thus far, the majority of works in this area have focused on monolingual English models; however, emerging research suggests that mul…

    Submitted 16 December, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: 11 pages, 4 figures, 7 tables

  6. arXiv:2402.03137  [pdf, other]

    cs.CL cs.LG

    Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

    Authors: Kushal Tatariya, Heather Lent, Johannes Bjerva, Miryam de Lhoneux

    Abstract: Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolin…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 5 pages, Accepted to SIGTYP 2024 @ EACL

  7. arXiv:2401.12192  [pdf, other]

    cs.CL cs.AI cs.CR

    Text Embedding Inversion Security for Multilingual Language Models

    Authors: Yiyi Chen, Heather Lent, Johannes Bjerva

    Abstract: Textual data is often represented as real-numbered embeddings in NLP, particularly with the popularity of large language models (LLMs) and Embeddings as a Service (EaaS). However, storing sensitive information as embeddings can be susceptible to security breaches, as research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model. While defence mechani…

    Submitted 5 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: 18 pages, 17 Tables, 6 Figures

  8. arXiv:2310.19567  [pdf, other]

    cs.CL cs.AI

    CreoleVal: Multilingual Multitask Benchmarks for Creoles

    Authors: Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

    Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning…

    Submitted 6 May, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to TACL

  9. arXiv:2206.04371  [pdf, other]

    cs.CL

    Ancestor-to-Creole Transfer is Not a Walk in the Park

    Authors: Heather Lent, Emanuele Bugliarello, Anders Søgaard

    Abstract: We aim to learn language models for Creole languages for which large volumes of data are not readily available, and therefore explore the potential transfer from ancestor languages (the 'Ancestry Transfer Hypothesis'). We find that standard transfer methods do not facilitate ancestry transfer. Surprisingly, different from other non-Creole languages, a very distinct two-phase pattern emerges for Cr…

    Submitted 9 June, 2022; originally announced June 2022.

    Comments: Workshop on Insights from Negative Results in NLP 2022

  10. arXiv:2206.00437  [pdf, other]

    cs.CL cs.CY

    What a Creole Wants, What a Creole Needs

    Authors: Heather Lent, Kelechi Ogueji, Miryam de Lhoneux, Orevaoghene Ahia, Anders Søgaard

    Abstract: In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: LREC 2022

  11. arXiv:2203.10020  [pdf, other]

    cs.CL

    Challenges and Strategies in Cross-Cultural NLP

    Authors: Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, Anders Søgaard

    Abstract: Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogo…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: ACL 2022 - Theme track

  12. arXiv:2109.06074  [pdf, other]

    cs.CL

    On Language Models for Creoles

    Authors: Heather Lent, Emanuele Bugliarello, Miryam de Lhoneux, Chen Qiu, Anders Søgaard

    Abstract: Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and what grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much s…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: CoNLL 2021

  13. arXiv:2108.03509  [pdf]

    cs.CL

    Compositional Generalization in Multilingual Semantic Parsing over Wikidata

    Authors: Ruixiang Cui, Rahul Aralikatte, Heather Lent, Daniel Hershcovich

    Abstract: Semantic parsing (SP) allows humans to leverage vast knowledge resources through natural interaction. However, parsers are mostly designed for and evaluated on English resources, such as CFQ (Keysers et al., 2020), the current standard benchmark based on English data generated from grammar rules and oriented towards Freebase, an outdated knowledge base. We propose a method for creating a multiling…

    Submitted 31 May, 2022; v1 submitted 7 August, 2021; originally announced August 2021.

    Comments: Accepted to TACL; Authors' final version, pre-MIT Press publication; Previous title: Multilingual Compositional Wikidata Questions

  14. arXiv:2010.05567  [pdf, other]

    cs.CL

    Joint Semantic Analysis with Document-Level Cross-Task Coherence Rewards

    Authors: Rahul Aralikatte, Mostafa Abdou, Heather Lent, Daniel Hershcovich, Anders Søgaard

    Abstract: Coreference resolution and semantic role labeling are NLP tasks that capture different aspects of semantics, indicating, respectively, which expressions refer to the same entity, and what semantic roles expressions serve in the sentence. However, they are often closely interdependent, and both generally necessitate natural language understanding. Do they form a coherent abstract representation of d…

    Submitted 12 October, 2020; originally announced October 2020.

  15. arXiv:1909.02392  [pdf, other]

    cs.CL

    Rewarding Coreference Resolvers for Being Consistent with World Knowledge

    Authors: Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Hershcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, Anders Søgaard

    Abstract: Unresolved coreference is a bottleneck for relation extraction, and high-quality coreference resolvers may produce an output that makes it a lot easier to extract knowledge triples. We show how to improve coreference resolvers by forwarding their input to a relation extraction system and reward the resolvers for producing triples that are found in knowledge bases. Since relation extraction systems…

    Submitted 11 November, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

    Comments: To appear in EMNLP 2019 (with corrected Fig. 2)