Search | arXiv e-print repository

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Authors: Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmar… ▽ More Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits. △ Less

Submitted 30 October, 2024; originally announced October 2024.

arXiv:2410.11462 [pdf, other]

Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Authors: Richard Diehl Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery, Lisa Beinborn

Abstract: Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensiona… ▽ More Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias. △ Less

Submitted 15 October, 2024; originally announced October 2024.

arXiv:2403.19424 [pdf, other]

The Role of Syntactic Span Preferences in Post-Hoc Explanation Disagreement

Authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Abstract: Post-hoc explanation methods are an important tool for increasing model transparency for users. Unfortunately, the currently used methods for attributing token importance often yield diverging patterns. In this work, we study potential sources of disagreement across methods from a linguistic perspective. We find that different methods systematically select different classes of words and that metho… ▽ More Post-hoc explanation methods are an important tool for increasing model transparency for users. Unfortunately, the currently used methods for attributing token importance often yield diverging patterns. In this work, we study potential sources of disagreement across methods from a linguistic perspective. We find that different methods systematically select different classes of words and that methods that agree most with other methods and with humans display similar linguistic preferences. Token-level differences between methods are smoothed out if we compare them on the syntactic span level. We also find higher agreement across methods by estimating the most important spans dynamically instead of relying on a fixed subset of size $k$. We systematically investigate the interaction between $k$ and spans and propose an improved configuration for selecting important tokens. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: Long paper accepted to LREC-Coling 2024 main conference. Please cite the conference proceedings version when available

arXiv:2311.08886 [pdf, other]

CLIMB: Curriculum Learning for Infant-inspired Model Building

Authors: Richard Diehl Martinez, Zebulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, Paula Buttery, Lisa Beinborn

Abstract: We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabul… ▽ More We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behavior of the model. In the objective curriculum, we explore different variations of combining the conventional masked language modeling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Our results did not yield consistent improvements over our own non-curriculum learning baseline across a range of linguistic benchmarks; however, we do find marginal gains on select tasks. Our analysis highlights key takeaways for specific combinations of tasks and settings which benefit from our proposed curricula. We moreover determine that careful selection of model architecture, and training hyper-parameters yield substantial improvements over the default baselines provided by the BabyLM challenge. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2310.13348 [pdf, other]

Analyzing Cognitive Plausibility of Subword Tokenization

Authors: Lisa Beinborn, Yuval Pinter

Abstract: Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognit… ▽ More Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and a worse coverage of derivational morphemes, in contrast with prior work. △ Less

Submitted 20 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 (main)

arXiv:2310.05619 [pdf, other]

Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods

Authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Abstract: Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting a k number of tokens. In this work, we propose a way to determine the number of optimal k tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agree… ▽ More Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting a k number of tokens. In this work, we propose a way to determine the number of optimal k tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit highest agreement on most method--method and method--human agreement metrics with a static k. Their advantage over other methods disappears with dynamic ks which mainly improve Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation. △ Less

Submitted 3 November, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: Short paper accepted to EMNLP 2023 main conference. Please cite the EMNLP version when available

arXiv:2302.12695 [pdf, other]

Cross-Lingual Transfer of Cognitive Processing Complexity

Authors: Charlotte Pouw, Nora Hollenstein, Lisa Beinborn

Abstract: When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural comp… ▽ More When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity and show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages, despite being fine-tuned only on English data. We quantify the sensitivity of the model to structural complexity and distinguish a range of complexity characteristics. Our results indicate that the model develops a meaningful bias towards sentence length but also integrates cross-lingual differences. We conduct a control experiment with randomized word order and find that the model seems to additionally capture more complex structural information. △ Less

Submitted 27 February, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: Accepted at Findings of EACL 2023

ACM Class: I.2.7

arXiv:2209.14780 [pdf, other]

Perturbations and Subpopulations for Testing Robustness in Token-Based Argument Unit Recognition

Authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Abstract: Argument Unit Recognition and Classification aims at identifying argument units from text and classifying them as pro or against. One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences. Previous research suggests that fine-tuning language models on the token-level yields more robust res… ▽ More Argument Unit Recognition and Classification aims at identifying argument units from text and classifying them as pro or against. One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences. Previous research suggests that fine-tuning language models on the token-level yields more robust results for classifying sentences compared to training on sentences directly. We reproduce the study that originally made this claim and further investigate what exactly token-based systems learned better compared to sentence-based ones. We develop systematic tests for analysing the behavioural differences between the token-based and the sentence-based system. Our results show that token-based models are generally more robust than sentence-based models both on manually perturbed examples and on specific subpopulations of the data. △ Less

Submitted 29 September, 2022; originally announced September 2022.

Comments: Accepted at the 9th Workshop on Argument Mining, co-located with COLING 2022. Please cite the published version when available

arXiv:2106.03471 [pdf, other]

Relative Importance in Sentence Processing

Authors: Nora Hollenstein, Lisa Beinborn

Abstract: Determining the relative importance of the elements in a sentence is a key factor for effortless natural language understanding. For human language processing, we can approximate patterns of relative importance by measuring reading fixations using eye-tracking technology. In neural language models, gradient-based saliency methods indicate the relative importance of a token for the target objective… ▽ More Determining the relative importance of the elements in a sentence is a key factor for effortless natural language understanding. For human language processing, we can approximate patterns of relative importance by measuring reading fixations using eye-tracking technology. In neural language models, gradient-based saliency methods indicate the relative importance of a token for the target objective. In this work, we compare patterns of relative importance in English language processing by humans and models and analyze the underlying linguistic patterns. We find that human processing patterns in English correlate strongly with saliency-based importance in language models and not with attention-based importance. Our results indicate that saliency could be a cognitively more plausible metric for interpreting neural language models. The code is available on GitHub: https://github.com/beinborn/relative_importance △ Less

Submitted 7 June, 2021; originally announced June 2021.

Comments: accepted at ACL 2021

arXiv:2104.05433 [pdf, other]

Multilingual Language Models Predict Human Reading Behavior

Authors: Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäger, Lisa Beinborn

Abstract: We analyze if large language models are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transform… ▽ More We analyze if large language models are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transformer models implicitly encode relative importance in language in a way that is comparable to human processing mechanisms. We find that BERT and XLM models successfully predict a range of eye tracking features. In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: accepted at NAACL 2021

arXiv:2011.04592 [pdf, other]

Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

Authors: Ece Takmaz, Sandro Pezzelle, Lisa Beinborn, Raquel Fernández

Abstract: When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during l… ▽ More When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled $\textit{sequentially}$. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural${-}$particularly when gaze is encoded with a dedicated recurrent component. △ Less

Submitted 9 November, 2020; originally announced November 2020.

Comments: In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)

arXiv:2011.02070 [pdf, other]

Probing Multilingual BERT for Genetic and Typological Signals

Authors: Taraka Rama, Lisa Beinborn, Steffen Eger

Abstract: We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages and compute language distances based on the mBERT representations. We 1) employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance, 2) perform distance matrix regression analysis,… ▽ More We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages and compute language distances based on the mBERT representations. We 1) employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance, 2) perform distance matrix regression analysis, finding that the language distances can be best explained by phylogenetic and worst by structural factors and 3) present a novel measure for measuring diachronic meaning stability (based on cross-lingual representation variability) which correlates significantly with published ranked lists based on linguistic approaches. Our results contribute to the nascent field of typological interpretability of cross-lingual text representations. △ Less

Submitted 3 November, 2020; originally announced November 2020.

Comments: COLING 2020

arXiv:2002.08880 [pdf, other]

The Fluidity of Concept Representations in Human Brain Signals

Authors: Eva Hendrikx, Lisa Beinborn

Abstract: Cognitive theories of human language processing often distinguish between concrete and abstract concepts. In this work, we analyze the discriminability of concrete and abstract concepts in fMRI data using a range of analysis methods. We find that the distinction can be decoded from the signal with an accuracy significantly above chance, but it is not found to be a relevant structuring factor in cl… ▽ More Cognitive theories of human language processing often distinguish between concrete and abstract concepts. In this work, we analyze the discriminability of concrete and abstract concepts in fMRI data using a range of analysis methods. We find that the distinction can be decoded from the signal with an accuracy significantly above chance, but it is not found to be a relevant structuring factor in clustering and relational analyses. From our detailed comparison, we obtain the impression that human concept representations are more fluid than dichotomous categories can capture. We argue that fluid concept representations lead to more realistic models of human language processing because they better capture the ambiguity and underspecification present in natural language use. △ Less

Submitted 20 February, 2020; originally announced February 2020.

Comments: 12 pages, 5 figures, 1 table

arXiv:1906.01539 [pdf, other]

Blackbox meets blackbox: Representational Similarity and Stability Analysis of Neural Language Models and Brains

Authors: Samira Abnar, Lisa Beinborn, Rochelle Choenni, Willem Zuidema

Abstract: In this paper, we define and apply representational stability analysis (ReStA), an intuitive way of analyzing neural language models. ReStA is a variant of the popular representational similarity analysis (RSA) in cognitive neuroscience. While RSA can be used to compare representations in models, model components, and human brains, ReStA compares instances of the same model, while systematically v… ▽ More In this paper, we define and apply representational stability analysis (ReStA), an intuitive way of analyzing neural language models. ReStA is a variant of the popular representational similarity analysis (RSA) in cognitive neuroscience. While RSA can be used to compare representations in models, model components, and human brains, ReStA compares instances of the same model, while systematically varying single model parameter. Using ReStA, we study four recent and successful neural language models, and evaluate how sensitive their internal representations are to the amount of prior context. Using RSA, we perform a systematic study of how similar the representational spaces in the first and second (or higher) layers of these models are to each other and to patterns of activation in the human brain. Our results reveal surprisingly strong differences between language models, and give insights into where the deep linguistic processing, that integrates information over multiple sentences, is happening in these models. The combination of ReStA and RSA on models and brains allows us to start addressing the important question of what kind of linguistic processes we can hope to observe in fMRI brain imaging data. In particular, our results suggest that the data on story reading from Wehbe et al. (2014) contains a signal of shallow linguistic processing, but show no evidence on the more interesting deep linguistic processing. △ Less

Submitted 5 June, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

Journal ref: 2nd BlackBoxNLP workshop @ACL2019

arXiv:1904.10820 [pdf, other]

Semantic Drift in Multilingual Representations

Authors: Lisa Beinborn, Rochelle Choenni

Abstract: Multilingual representations have mostly been evaluated based on their performance on specific tasks. In this article, we look beyond engineering goals and analyze the relations between languages in computational representations. We introduce a methodology for comparing languages based on their organization of semantic concepts. We propose to conduct an adapted version of representational similari… ▽ More Multilingual representations have mostly been evaluated based on their performance on specific tasks. In this article, we look beyond engineering goals and analyze the relations between languages in computational representations. We introduce a methodology for comparing languages based on their organization of semantic concepts. We propose to conduct an adapted version of representational similarity analysis of a selected set of concepts in computational multilingual representations. Using this analysis method, we can reconstruct a phylogenetic tree that closely resembles those assumed by linguistic experts. These results indicate that multilingual distributional representations which are only trained on monolingual text and bilingual dictionaries preserve relations between languages without the need for any etymological information. In addition, we propose a measure to identify semantic drift between language families. We perform experiments on word-based and sentence-based multilingual models and provide both quantitative results and qualitative examples. Analyses of semantic drift in multilingual representations can serve two purposes: they can indicate unwanted characteristics of the computational models and they provide a quantitative means to study linguistic phenomena across languages. The code is available at https://github.com/beinborn/SemanticDrift. △ Less

Submitted 16 November, 2020; v1 submitted 24 April, 2019; originally announced April 2019.

Comments: Almost final version. Paper will appear in the Computational Linguistics Journal, Volume 46, Issue 3

arXiv:1904.02547 [pdf, other]

Robust Evaluation of Language-Brain Encoding Experiments

Authors: Lisa Beinborn, Samira Abnar, Rochelle Choenni

Abstract: Language-brain encoding experiments evaluate the ability of language models to predict brain responses elicited by language stimuli. The evaluation scenarios for this task have not yet been standardized which makes it difficult to compare and interpret results. We perform a series of evaluation experiments with a consistent encoding setup and compute the results for multiple fMRI datasets. In addi… ▽ More Language-brain encoding experiments evaluate the ability of language models to predict brain responses elicited by language stimuli. The evaluation scenarios for this task have not yet been standardized which makes it difficult to compare and interpret results. We perform a series of evaluation experiments with a consistent encoding setup and compute the results for multiple fMRI datasets. In addition, we test the sensitivity of the evaluation measures to randomized data and analyze the effect of voxel selection methods. Our experimental framework is publicly available to make modelling decisions more transparent and support reproducibility for future comparisons. △ Less

Submitted 4 April, 2019; originally announced April 2019.

arXiv:1806.06371 [pdf, other]

Multimodal Grounding for Language Processing

Authors: Lisa Beinborn, Teresa Botschen, Iryna Gurevych

Abstract: This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding… ▽ More This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs which play a crucial role for the compositional power of language. △ Less

Submitted 3 July, 2019; v1 submitted 17 June, 2018; originally announced June 2018.

Comments: The paper has been published in the Proceedings of the 27 Conference of Computational Linguistics. Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197/

Showing 1–17 of 17 results for author: Beinborn, L