Showing 1–13 of 13 results for author: Caines, A

  1. arXiv:2506.16190  [pdf, other]

    cs.CL

    Web(er) of Hate: A Survey on How Hate Speech Is Typed

    Authors: Luna Wang, Andrew Caines, Alice Hutchings

    Abstract: The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber's notion of ideal types, we argue for a reflexive approach in dataset creation, urging rese…

    Submitted 19 June, 2025; originally announced June 2025.

  2. arXiv:2501.06374  [pdf, other]

    cs.CL

    AFRIDOC-MT: Document-level MT Corpus for African Languages

    Authors: Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow

    Abstract: This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine tra…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: under review

  3. arXiv:2410.22906  [pdf, other]

    cs.CL

    From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

    Authors: Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

    Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmar…

    Submitted 30 October, 2024; originally announced October 2024.

  4. arXiv:2410.11462  [pdf, other]

    cs.CL

    Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

    Authors: Richard Diehl Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery, Lisa Beinborn

    Abstract: Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensiona…

    Submitted 15 October, 2024; originally announced October 2024.

  5. arXiv:2404.12489  [pdf, other]

    cs.CL

    Grammatical Error Correction for Code-Switched Sentences by Learners of English

    Authors: Kelvin Wey Han Chan, Christopher Bryant, Li Nguyen, Andrew Caines, Zheng Yuan

    Abstract: Code-switching (CSW) is a common phenomenon among multilingual speakers where multiple languages are used in a single discourse or utterance. Mixed language utterances may still contain grammatical errors however, yet most existing Grammar Error Correction (GEC) systems have been trained on monolingual data and not developed with CSW in mind. In this work, we conduct the first exploration into the…

    Submitted 6 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics

  6. Prompting open-source and commercial language models for grammatical error correction of English learner text

    Authors: Christopher Davis, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, Paula Buttery

    Abstract: Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchm…

    Submitted 6 April, 2025; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: 8 pages with appendices; accepted to ACL Findings 2024

  7. arXiv:2311.08886  [pdf, other]

    cs.CL

    CLIMB: Curriculum Learning for Infant-inspired Model Building

    Authors: Richard Diehl Martinez, Zebulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, Paula Buttery, Lisa Beinborn

    Abstract: We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabul…

    Submitted 15 November, 2023; originally announced November 2023.

  8. arXiv:2307.08393  [pdf, other]

    cs.CL cs.LG

    On the application of Large Language Models for language teaching and assessment technology

    Authors: Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery

    Abstract: The recent release of very large language models such as PaLM and GPT-4 has made an unprecedented impact in the popular media and public consciousness, giving rise to a mixture of excitement and fear as to their capabilities and potential uses, and shining a light on natural language processing research which had not previously received so much attention. The developments offer great promise for e…

    Submitted 17 July, 2023; originally announced July 2023.

    Comments: Accepted at the AIED2023 workshop: Empowering Education with LLMs - the Next-Gen Interface and Content Generation

  9. arXiv:2303.07991  [pdf, other]

    cs.CL cs.LG

    Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers

    Authors: Kamil Bujel, Andrew Caines, Helen Yannakoudakis, Marek Rei

    Abstract: Long-sequence transformers are designed to improve the representation of longer texts by language models and their performance on downstream document-level tasks. However, not much is understood about the quality of token-level predictions in long-form models. We investigate the performance of such architectures in the context of document classification with unsupervised rationale extraction. We f…

    Submitted 14 March, 2023; originally announced March 2023.

  10. arXiv:2210.16228  [pdf, other]

    cs.CL

    Probing for targeted syntactic knowledge through grammatical error detection

    Authors: Christopher Davis, Christopher Bryant, Andrew Caines, Marek Rei, Paula Buttery

    Abstract: Targeted studies testing knowledge of subject-verb agreement (SVA) indicate that pre-trained language models encode syntactic information. We assert that if models robustly encode subject-verb agreement, they should be able to identify when agreement is correct and when it is incorrect. To that end, we propose grammatical error detection as a diagnostic probe to evaluate token-level contextual rep…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: CoNLL 2022

  11. arXiv:2104.05753  [pdf, other]

    cs.CL cs.AI

    Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

    Authors: Felermino D. M. A. Ali, Andrew Caines, Jaimito L. A. Malavi

    Abstract: Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the…

    Submitted 12 April, 2021; originally announced April 2021.

  12. arXiv:2011.07109  [pdf, other]

    cs.CL

    The Teacher-Student Chatroom Corpus

    Authors: Andrew Caines, Helen Yannakoudakis, Helena Edmondson, Helen Allen, Pascual Pérez-Paredes, Bill Byrne, Paula Buttery

    Abstract: The Teacher-Student Chatroom Corpus (TSCC) is a collection of written conversations captured during one-to-one lessons between teachers and learners of English. The lessons took place in an online chatroom and therefore involve more interactive, immediate and informal language than might be found in asynchronous exchanges such as email correspondence. The fact that the lessons were one-to-one mean…

    Submitted 13 November, 2020; originally announced November 2020.

    Comments: NLP4CALL

  13. arXiv:2004.11327  [pdf, other]

    cs.CL cs.LG

    Adaptive Forgetting Curves for Spaced Repetition Language Learning

    Authors: Ahmed Zaidi, Andrew Caines, Russell Moore, Paula Buttery, Andrew Rice

    Abstract: The forgetting curve has been extensively explored by psychologists, educationalists and cognitive scientists alike. In the context of Intelligent Tutoring Systems, modelling the forgetting curve for each user and knowledge component (e.g. vocabulary word) should enable us to develop optimal revision strategies that counteract memory decay and ensure long-term retention. In this study we explore a…

    Submitted 23 April, 2020; originally announced April 2020.

    Comments: Artificial Intelligence for Education 2020 (AIED)