Showing 1–36 of 36 results for author: Madabushi, H T

  1. arXiv:2506.01419  [pdf, ps, other]

    cs.CL

    UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

    Authors: Joseph Marvin Imperial, Abdullah Barayan, Regina Stodden, Rodrigo Wilkens, Ricardo Munoz Sanchez, Lingyun Gao, Melissa Torgbi, Dawn Knight, Gail Forey, Reka R. Jablonkai, Ekaterina Kochmar, Robert Reynolds, Eugenio Ribeiro, Horacio Saggion, Elena Volodina, Sowmya Vajjala, Thomas Francois, Fernando Alva-Manchego, Harish Tayyar Madabushi

    Abstract: We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized int…

    Submitted 2 June, 2025; originally announced June 2025.

  2. arXiv:2505.23323  [pdf, ps, other]

    cs.CL

    Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

    Authors: Harish Tayyar Madabushi, Melissa Torgbi, Claire Bonial

    Abstract: In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either "stochastic parrots" or in possession of "emergent" advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data,…

    Submitted 29 May, 2025; originally announced May 2025.

  3. arXiv:2505.11004  [pdf, other]

    cs.CL cs.AI

    Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

    Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend…

    Submitted 22 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

  4. arXiv:2504.20051  [pdf, other]

    cs.CL

    Evaluating Large Language Models on Multiword Expressions in Multilingual and Code-Switched Contexts

    Authors: Frances Laureano De Leon, Harish Tayyar Madabushi, Mark G. Lee

    Abstract: Multiword expressions, characterised by non-compositional meanings and syntactic irregularities, are an example of nuanced language. These expressions can be used literally or idiomatically, leading to significant changes in meaning. While large language models have demonstrated strong performance across many tasks, their ability to handle such linguistic subtleties remains uncertain. Therefore, t…

    Submitted 10 April, 2025; originally announced April 2025.

  5. arXiv:2503.04736  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Standardizing Intelligence: Aligning Generative AI for Regulatory and Operational Compliance

    Authors: Joseph Marvin Imperial, Matthew D. Jones, Harish Tayyar Madabushi

    Abstract: Technical standards, or simply standards, are established documented guidelines and rules that facilitate the interoperability, quality, and accuracy of systems and processes. In recent years, we have witnessed an emerging paradigm shift where the adoption of generative AI (GenAI) models has increased tremendously, spreading implementation interests across standard-driven industries, including eng…

    Submitted 3 February, 2025; originally announced March 2025.

  6. arXiv:2501.08716  [pdf, other]

    cs.CL

    The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities

    Authors: Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. Ho…

    Submitted 15 January, 2025; originally announced January 2025.

    Comments: The code for this paper is available at: https://github.com/UKPLab/arxiv2025-inherent-limits-plms

  7. arXiv:2501.08502  [pdf, other]

    cs.CL cs.AI

    Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom

    Authors: Melissa Torgbi, Andrew Clayman, Jordan J. Speight, Harish Tayyar Madabushi

    Abstract: We collect novel data in the public service domain to evaluate the capability of the state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), specifically focusing on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public serv…

    Submitted 14 January, 2025; originally announced January 2025.

  8. arXiv:2501.04661  [pdf, other]

    cs.CL cs.AI

    Assessing Language Comprehension in Large Language Models Using Construction Grammar

    Authors: Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

    Abstract: Large Language Models, despite their significant capabilities, are known to fail in surprising and unpredictable ways. Evaluating their true 'understanding' of language is particularly challenging due to the extensive web-scale data they are trained on. Therefore, we construct an evaluation to systematically assess natural language understanding (NLU) in LLMs by leveraging Construction Grammar (Cx…

    Submitted 8 January, 2025; originally announced January 2025.

  9. arXiv:2407.13297  [pdf, other]

    cs.CL

    SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audiences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children's reading materials), where the goal is to reduce the ambiguity of text content and increase its overall readability for a spe…

    Submitted 4 October, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Camera-ready for EMNLP 2024 (Findings)

  10. arXiv:2407.03181  [pdf, other]

    cs.CL

    Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs

    Authors: Haritz Puerto, Tilek Chubakov, Xiaodan Zhu, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Requiring a large language model (LLM) to generate intermediary reasoning steps, known as Chain of Thought (CoT), has been shown to be an effective way of boosting performance. Previous approaches have focused on generating multiple independent CoTs, combining them through ensembling or other post-hoc strategies to enhance reasoning. In this work, we introduce a novel approach where LLMs are fine-…

    Submitted 27 May, 2025; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: ACL 2025 Main

  11. arXiv:2406.16167  [pdf, other]

    cs.CL

    FS-RAG: A Frame Semantics Based Approach for Improved Factual Accuracy in Large Language Models

    Authors: Harish Tayyar Madabushi

    Abstract: We present a novel extension to Retrieval Augmented Generation with the goal of mitigating factual inaccuracies in the output of large language models. Specifically, our method draws on the cognitive linguistic theory of frame semantics for the indexing and retrieval of factual information relevant to helping large language models answer queries. We conduct experiments to demonstrate the effective…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: program code and prompts available at https://github.com/H-TayyarMadabushi/A-Frame-Semantics-based-approach-for-Improved-Factual-Accuracy-in-Large-Language-Models

  12. arXiv:2403.11025  [pdf, other]

    cs.CL

    Pre-Trained Language Models Represent Some Geographic Populations Better Than Others

    Authors: Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi

    Abstract: This paper measures the skew in how well two families of LLMs represent diverse geographic populations. A spatial probing task is used with geo-referenced corpora to measure the degree to which pre-trained language models from the OPT and BLOOM series represent diverse populations around the world. Results show that these models perform much better for some populations than others. In particular,…

    Submitted 16 March, 2024; originally announced March 2024.

  13. arXiv:2403.04872  [pdf, other]

    cs.CL

    Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

    Authors: Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

    Abstract: Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models hand…

    Submitted 7 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted for publication at Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Data and code available at https://github.com/francesita/code-mixed-probes

  14. arXiv:2402.12593  [pdf, other]

    cs.CL

    Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

    Authors: Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi

    Abstract: Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current works in controllable text generation have yet to explore using these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context le…

    Submitted 4 October, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Camera-ready for EMNLP 2024 (Main)

  15. arXiv:2401.07923  [pdf, other]

    cs.CL

    Word Boundary Information Isn't Useful for Encoder Language Models

    Authors: Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: All existing transformer-based approaches to NLP using subword tokenisation algorithms encode whitespace (word boundary information) through the use of special space symbols (such as ## or _) forming part of tokens. These symbols have been shown to a) lead to reduced morphological validity of tokenisations, and b) give substantial vocabulary redundancy. As such, removing these symbols has been…

    Submitted 27 February, 2025; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: 9th Workshop on Representation Learning for NLP

  16. arXiv:2309.05454  [pdf, other]

    cs.CL

    Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators to properly assess the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate…

    Submitted 3 November, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Final camera-ready for EMNLP GEM Workshop 2023

  17. arXiv:2309.01809  [pdf, other]

    cs.CL

    Are Emergent Abilities in Large Language Models just In-Context Learning?

    Authors: Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Large language models, comprising billions of parameters and pre-trained on extensive web-scale corpora, have been claimed to acquire certain capabilities without having been specifically trained on them. These capabilities, referred to as "emergent abilities," have been a driving force in discussions regarding the potentials and risks of language models. A key challenge in evaluating emergent abi…

    Submitted 15 July, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted to ACL 2024

  18. arXiv:2308.13315  [pdf]

    cs.CL

    Construction Grammar and Language Models

    Authors: Harish Tayyar Madabushi, Laurence Romain, Petar Milin, Dagmar Divjak

    Abstract: Recent progress in deep learning and natural language processing has given rise to powerful models that are primarily trained on a cloze-like task and show some evidence of having access to substantial linguistic information, including some constructional knowledge. This groundbreaking discovery presents an exciting opportunity for a synergistic relationship between computational methods and Const…

    Submitted 4 September, 2023; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: Accepted for publication in The Cambridge Handbook of Construction Grammar, edited by Mirjam Fried and Kiki Nikiforidou. To appear in 2024

  19. arXiv:2210.17301  [pdf, other]

    cs.CL

    Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5

    Authors: Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Aline Villavicencio, Iryna Gurevych

    Abstract: We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two tasks, one of which depends on the other. We test these models on the FigLang2022 shared task which requires participants to predict language inference labels on figurative language along with corresponding textual explanations of the inference predictions. O…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted for publication in the Proceedings of the Second Workshop on Figurative Language Processing (colocated with EMNLP 2022). Code and models at https://github.com/Rachneet/cross-task-figurative-explanations

  20. arXiv:2206.04184  [pdf, other]

    cs.CL

    Abstraction not Memory: BERT and the English Article System

    Authors: Harish Tayyar Madabushi, Dagmar Divjak, Petar Milin

    Abstract: Article prediction is a task that has long defied accurate linguistic description. As such, this task is ideally suited to evaluate models on their ability to emulate native-speaker intuition. To this end, we compare the performance of native English speakers and pre-trained models on the task of article prediction set up as a three way choice (a/an, the, zero). Our experiments with BERT show that…

    Submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted for publication at 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022). Data and code available at https://github.com/H-TayyarMadabushi/Abstraction-not-Memory-BERT-and-the-English-Article-System-NAACL-2022

  21. arXiv:2205.11306  [pdf, ps, other]

    cs.CL

    Sample Efficient Approaches for Idiomaticity Detection

    Authors: Dylan Phelps, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Deep neural models, in particular Transformer-based pre-trained language models, require a significant amount of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample efficient methods of idiomaticity detection. In particular we study the impact of…

    Submitted 23 May, 2022; originally announced May 2022.

  22. arXiv:2204.10050  [pdf, other]

    cs.CL

    SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio

    Abstract: This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask inclu…

    Submitted 30 May, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Data available at https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity and competition website at https://sites.google.com/view/semeval2022task2-idiomaticity

  23. arXiv:2204.05185  [pdf, other]

    cs.CL cs.LG

    Uniform Complexity for Text Generation

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Large language models (LLMs) have shown promising results in a wide array of generative NLP tasks, such as summarization and machine translation. In the context of narrative generation, however, existing models still do not capture factors that contribute to producing consistent text. For instance, it is logical that a piece of text or a story should be uniformly readable throughout and that this…

    Submitted 19 October, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Final camera-ready for EMNLP 2023

  24. arXiv:2204.04058  [pdf, other]

    cs.CL

    Improving Tokenisation by Alternative Treatment of Spaces

    Authors: Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hin…

    Submitted 22 October, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: EMNLP 2022

  25. arXiv:2110.05663  [pdf, other]

    cs.CL

    Learned Construction Grammars Converge Across Registers Given Increased Exposure

    Authors: Jonathan Dunn, Harish Tayyar Madabushi

    Abstract: This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to different registers will acquire different construct…

    Submitted 11 October, 2021; originally announced October 2021.

  26. UoB at SemEval-2021 Task 5: Extending Pre-Trained Language Models to Include Task and Domain-Specific Information for Toxic Span Prediction

    Authors: Erik Yan, Harish Tayyar Madabushi

    Abstract: Toxicity is pervasive in social media and poses a major threat to the health of online communities. The recent introduction of pre-trained language models, which have achieved state-of-the-art results in many NLP tasks, has transformed the way in which we approach natural language processing. However, the inherent nature of pre-training means that they are unlikely to capture task-specific statist…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: Published in Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021); Code available at: https://github.com/erikdyan/toxic_span_detection

    Journal ref: 2021.semeval-1.28 (2021) 243-248

  27. arXiv:2109.04413  [pdf, other]

    cs.CL

    AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio

    Abstract: Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions al…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Findings of EMNLP 2021. Code available at: https://github.com/H-TayyarMadabushi/AStitchInLanguageModels

  28. arXiv:2011.04134  [pdf, other]

    cs.CL

    CxGBERT: BERT meets Construction Grammar

    Authors: Harish Tayyar Madabushi, Laurence Romain, Dagmar Divjak, Petar Milin

    Abstract: While lexico-semantic elements no doubt capture a large amount of linguistic information, it has been argued that they do not capture all information contained in text. This assumption is central to constructionist approaches to language which argue that language consists of constructions, learned pairings of a form and a function or meaning that are either frequent or have a meaning that cannot b…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: 28th International Conference on Computational Linguistics (COLING 2020)

  29. arXiv:2010.09078  [pdf, other]

    cs.CL

    Incorporating Count-Based Features into Pre-Trained Models for Improved Stance Detection

    Authors: Anushka Prakash, Harish Tayyar Madabushi

    Abstract: The explosive growth and popularity of Social Media has revolutionised the way we communicate and collaborate. Unfortunately, this same ease of accessing and sharing information has led to an explosion of misinformation and propaganda. Given that stance detection can significantly aid in veracity prediction, this work focuses on boosting automated stance detection, a task on which pre-trained mode…

    Submitted 18 October, 2020; originally announced October 2020.

  30. arXiv:2010.09072  [pdf, other]

    cs.CL

    UoB at SemEval-2020 Task 1: Automatic Identification of Novel Word Senses

    Authors: Eleri Sarsfield, Harish Tayyar Madabushi

    Abstract: Much as the social landscape in which languages are spoken shifts, language too evolves to suit the needs of its users. Lexical semantic change analysis is a burgeoning field of semantic analysis which aims to trace changes in the meanings of words over time. This paper presents an approach to lexical semantic change detection based on Bayesian word sense induction suitable for novel word sense id…

    Submitted 18 October, 2020; originally announced October 2020.

  31. arXiv:2010.07988  [pdf, other]

    cs.CL

    CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets -- RoBERTa Ensembles and The Continued Relevance of Handcrafted Features

    Authors: Calum Perrio, Harish Tayyar Madabushi

    Abstract: This paper presents our submission to Task 2 of the Workshop on Noisy User-generated Text. We explore improving the performance of a pre-trained transformer-based language model fine-tuned for text classification through an ensemble implementation that makes use of corpus level information and a handcrafted feature. We test the effectiveness of including the aforementioned features in accommodatin…

    Submitted 15 October, 2020; originally announced October 2020.

  32. arXiv:2008.08547  [pdf, ps, other]

    cs.CL

    UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information

    Authors: Wah Meng Lim, Harish Tayyar Madabushi

    Abstract: Pre-trained language model word representation, such as BERT, have been extremely successful in several Natural Language Processing tasks significantly improving on the state-of-the-art. This can largely be attributed to their ability to better capture semantic information contained within a sentence. Several tasks, however, can benefit from information available at a corpus level, such as Term Fr…

    Submitted 19 August, 2020; originally announced August 2020.

  33. arXiv:2006.04597  [pdf, ps, other]

    cs.CL cs.LG cs.NE

    CS-Embed at SemEval-2020 Task 9: The effectiveness of code-switched word embeddings for sentiment analysis

    Authors: Frances Adriana Laureano De Leon, Florimond Guéniat, Harish Tayyar Madabushi

    Abstract: The growing popularity and applications of sentiment analysis of social media posts has naturally led to sentiment analysis of posts written in multiple languages, a practice known as code-switching. While recent research into code-switched posts has focused on the use of multilingual word embeddings, these embeddings were not trained on code-switched data. In this work, we present word-embeddings…

    Submitted 7 September, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: Accepted at SemEval-2020, COLING

  34. arXiv:2003.11563  [pdf, other]

    cs.CL cs.LG stat.ML

    Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data

    Authors: Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle

    Abstract: The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news document…

    Submitted 16 March, 2020; originally announced March 2020.

    Comments: NLP4IF 2019

  35. arXiv:2003.03813  [pdf]

    cs.CL cs.LG stat.ML

    Keeping it simple: Implementation and performance of the proto-principle of adaptation and learning in the language sciences

    Authors: Petar Milin, Harish Tayyar Madabushi, Michael Croucher, Dagmar Divjak

    Abstract: In this paper we present the Widrow-Hoff rule and its applications to language data. After contextualizing the rule historically and placing it in the chain of neurally inspired artificial learning models, we explain its rationale and implementational considerations. Using a number of case studies we illustrate how the Widrow-Hoff rule offers unexpected opportunities for the computational simulati…

    Submitted 28 August, 2021; v1 submitted 8 March, 2020; originally announced March 2020.

  36. arXiv:1908.05441  [pdf, other]

    cs.CL cs.AI

    Multi-class Hierarchical Question Classification for Multiple Choice Science Exams

    Authors: Dongfang Xu, Peter Jansen, Jaycie Martin, Zhengnan Xie, Vikas Yadav, Harish Tayyar Madabushi, Oyvind Tafjord, Peter Clark

    Abstract: Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed class…

    Submitted 15 August, 2019; originally announced August 2019.