-
On Linear Representations and Pretraining Data Frequency in Language Models
Authors:
Jack Merullo,
Noah A. Smith,
Sarah Wiegreffe,
Yanai Elazar
Abstract:
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded `linearly' in the r…
▽ More
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded `linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Authors:
Jiacheng Liu,
Taylor Blanton,
Yanai Elazar,
Sewon Min,
YenSung Chen,
Arnavi Chheda-Kothary,
Huy Tran,
Byron Bischoff,
Eric Marsh,
Michael Schmitz,
Cassidy Trier,
Aaron Sarnat,
Jenna James,
Jon Borchardt,
Bailey Kuehl,
Evie Cheng,
Karen Farley,
Sruthi Sreeram,
Taira Anderson,
David Albright,
Carissa Schoenick,
Luca Soldaini,
Dirk Groeneveld,
Rock Yuren Pang,
Pang Wei Koh
, et al. (6 additional authors not shown)
Abstract:
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few second…
▽ More
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
△ Less
Submitted 7 July, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases
Authors:
Shanshan Xu,
T. Y. S. S Santosh,
Yanai Elazar,
Quirin Vogel,
Barbara Plank,
Matthias Grabmair
Abstract:
Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such concern is the emergence of political bias in LLM outputs. In this paper, we investigate the extent to which LLMs' political leanings reflect memorized patterns from…
▽ More
Recent works have shown that Large Language Models (LLMs) have a tendency to memorize patterns and biases present in their training data, raising important questions about how such memorized content influences model behavior. One such concern is the emergence of political bias in LLM outputs. In this paper, we investigate the extent to which LLMs' political leanings reflect memorized patterns from their pretraining corpora. We propose a method to quantitatively evaluate political leanings embedded in the large pretraining corpora. Subsequently we investigate to whom are the LLMs' political leanings more aligned with, their pretrainig corpora or the surveyed human opinions. As a case study, we focus on probing the political leanings of LLMs in 32 US Supreme Court cases, addressing contentious topics such as abortion and voting rights. Our findings reveal that LLMs strongly reflect the political leanings in their training data, and no strong correlation is observed with their alignment to human opinions as expressed in surveys. These results underscore the importance of responsible curation of training data, and the methodology for auditing the memorization in LLMs to ensure human-AI alignment.
△ Less
Submitted 28 June, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
GRADE: Quantifying Sample Diversity in Text-to-Image Models
Authors:
Royi Rassin,
Aviv Slobodkin,
Shauli Ravfogel,
Yanai Elazar,
Yoav Goldberg
Abstract:
We introduce GRADE, an automatic method for quantifying sample diversity in text-to-image models. Our method leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., ``shape'' for the concept ``cookie''). It then estimates frequency distributions of concepts and their attributes and quantifi…
▽ More
We introduce GRADE, an automatic method for quantifying sample diversity in text-to-image models. Our method leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., ``shape'' for the concept ``cookie''). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using entropy. We use GRADE to measure the diversity of 12 models over a total of 720K images, revealing that all models display limited variation, with clear deterioration in stronger models. Further, we find that models often exhibit default behaviors, a phenomenon where a model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Lastly, we show that a key reason for low diversity is underspecified captions in training data. Our work proposes an automatic, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in text-to-image models.
△ Less
Submitted 11 March, 2025; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Authors:
Lester James V. Miranda,
Yizhong Wang,
Yanai Elazar,
Sachin Kumar,
Valentina Pyatkin,
Faeze Brahman,
Noah A. Smith,
Hannaneh Hajishirzi,
Pradeep Dasigi
Abstract:
Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, offering a cost-effective and scalable alternative, albeit susceptible to other biases…
▽ More
Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, offering a cost-effective and scalable alternative, albeit susceptible to other biases and errors. In this work, we introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or LMs, achieving better annotation quality while reducing the cost of human-only annotation. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we (1) train a performance prediction model (PPM) to predict a reward model's (RM) performance on an arbitrary combination of human and LM annotations and (2) employ a routing strategy that selects a combination that maximizes the predicted performance. We train the PPM on MultiPref, a new preference dataset with 10k instances paired with humans and LM labels. We show that the selected hybrid mixture of synthetic and direct human preferences using HyPER achieves better RM performance compared to using either one exclusively by 7-13% on RewardBench and generalizes across unseen preference datasets and other base models. We also observe the same trend in other benchmarks using Best-of-N reranking, where the hybrid mix has 2-3% better performance. Finally, we analyze features from HyPER and find that prompts with moderate safety concerns or complexity benefit the most from human feedback.
△ Less
Submitted 30 May, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Authors:
Sahil Verma,
Royi Rassin,
Arnav Das,
Gantavya Bhatt,
Preethi Seshadri,
Chirag Shah,
Jeff Bilmes,
Hannaneh Hajishirzi,
Yanai Elazar
Abstract:
Text-to-image models are trained using large datasets collected by scraping image-text pairs from the internet. These datasets often include private, copyrighted, and licensed material. Training models on such datasets enables them to generate images with such content, which might violate copyright laws and individual privacy. This phenomenon is termed imitation -- generation of images with conten…
▽ More
Text-to-image models are trained using large datasets collected by scraping image-text pairs from the internet. These datasets often include private, copyrighted, and licensed material. Training models on such datasets enables them to generate images with such content, which might violate copyright laws and individual privacy. This phenomenon is termed imitation -- generation of images with content that has recognizable similarity to its training images. In this work we study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it. We seek to determine the point at which a model was trained on enough instances to imitate a concept -- the imitation threshold. We posit this question as a new problem: Finding the Imitation Threshold (FIT) and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch. We experiment with two domains -- human faces and art styles -- for which we create four datasets, and evaluate three text-to-image models which were trained on two pretraining datasets. Our results reveal that the imitation threshold of these models is in the range of 200-600 images, depending on the domain and the model. The imitation threshold can provide an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers that aim to comply with copyright and privacy laws. We release the code and data at \url{https://github.com/vsahil/MIMETIC-2.git} and the project's website is hosted at \url{https://how-many-van-goghs-does-it-take.github.io}.
△ Less
Submitted 19 October, 2024;
originally announced October 2024.
-
Data Contamination Report from the 2024 CONDA Shared Task
Authors:
Oscar Sainz,
Iker García-Ferrero,
Alon Jacovi,
Jon Ander Campos,
Yanai Elazar,
Eneko Agirre,
Yoav Goldberg,
Wei-Lin Chen,
Jenny Chim,
Leshem Choshen,
Luca D'Amico-Wong,
Melissa Dell,
Run-Ze Fan,
Shahriar Golchin,
Yucheng Li,
Pengfei Liu,
Bhavish Pahwa,
Ameya Prabhu,
Suryansh Sharma,
Emily Silcock,
Kateryna Solonko,
David Stap,
Mihai Surdeanu,
Yu-Min Tseng,
Vishaal Udandarao
, et al. (3 additional authors not shown)
Abstract:
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in cur…
▽ More
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in current available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pool requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.
△ Less
Submitted 4 August, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data
Authors:
Xinyi Wang,
Antonis Antoniades,
Yanai Elazar,
Alfonso Amayuelas,
Alon Albalak,
Kexun Zhang,
William Yang Wang
Abstract:
The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraini…
▽ More
The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency. To effectively capture task-specific pretraining data frequency, we propose a novel task-gram language model, which is built by counting the co-occurrence of semantically related $n$-gram pairs from task inputs and outputs in the pretraining corpus. Using the Pythia models trained on the Pile dataset, we evaluate four distinct tasks: machine translation, factual question answering, world knowledge understanding, and math reasoning. Our findings reveal varying levels of memorization, with the strongest effect observed in factual question answering. Furthermore, while model performance improves across all tasks as LLM size increases, only factual question answering shows an increase in memorization, whereas machine translation and reasoning tasks exhibit greater generalization, producing more novel outputs. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks, providing a scalable method for analyzing large pretraining corpora in greater depth.
△ Less
Submitted 1 March, 2025; v1 submitted 20 July, 2024;
originally announced July 2024.
-
Detection and Measurement of Syntactic Templates in Generated Text
Authors:
Chantal Shaib,
Yanai Elazar,
Junyi Jessy Li,
Byron C. Wallace
Abstract:
Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. W…
▽ More
Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
△ Less
Submitted 6 October, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
-
Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
Authors:
William Merrill,
Noah A. Smith,
Yanai Elazar
Abstract:
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily…
▽ More
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
△ Less
Submitted 4 October, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Authors:
Bar Iluz,
Yanai Elazar,
Asaf Yehudai,
Gabriel Stanovsky
Abstract:
Most works on gender bias focus on intrinsic bias -- removing traces of information about a protected group from the model's internal representation. However, these works are often disconnected from the impact of such debiasing on downstream applications, which is the main motivation for debiasing in the first place. In this work, we systematically test how methods for intrinsic debiasing affect n…
▽ More
Most works on gender bias focus on intrinsic bias -- removing traces of information about a protected group from the model's internal representation. However, these works are often disconnected from the impact of such debiasing on downstream applications, which is the main motivation for debiasing in the first place. In this work, we systematically test how methods for intrinsic debiasing affect neural machine translation models, by measuring the extrinsic bias of such systems under different design choices. We highlight three challenges and mismatches between the debiasing techniques and their end-goal usage, including the choice of embeddings to debias, the mismatch between words and sub-word tokens debiasing, and the effect on different target languages. We find that these considerations have a significant impact on downstream performance and the success of debiasing.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
A Survey on Data Selection for Language Models
Authors:
Alon Albalak,
Yanai Elazar,
Sang Michael Xie,
Shayne Longpre,
Nathan Lambert,
Xinyi Wang,
Niklas Muennighoff,
Bairu Hou,
Liangming Pan,
Haewon Jeong,
Colin Raffel,
Shiyu Chang,
Tatsunori Hashimoto,
William Yang Wang
Abstract:
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the am…
▽ More
A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.
△ Less
Submitted 2 August, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Calibrating Large Language Models with Sample Consistency
Authors:
Qing Lyu,
Kumar Shridhar,
Chaitanya Malaviya,
Li Zhang,
Yanai Elazar,
Niket Tandon,
Marianna Apidianaki,
Mrinmaya Sachan,
Chris Callison-Burch
Abstract:
Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generati…
▽ More
Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency have the potential to enhance model performance. Finally, we offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
OLMo: Accelerating the Science of Language Models
Authors:
Dirk Groeneveld,
Iz Beltagy,
Pete Walsh,
Akshita Bhagia,
Rodney Kinney,
Oyvind Tafjord,
Ananya Harsh Jha,
Hamish Ivison,
Ian Magnusson,
Yizhong Wang,
Shane Arora,
David Atkinson,
Russell Authur,
Khyathi Raghavi Chandu,
Arman Cohan,
Jennifer Dumas,
Yanai Elazar,
Yuling Gu,
Jack Hessel,
Tushar Khot,
William Merrill,
Jacob Morrison,
Niklas Muennighoff,
Aakanksha Naik,
Crystal Nam
, et al. (18 additional authors not shown)
Abstract:
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models…
▽ More
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.
△ Less
Submitted 7 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Authors:
Luca Soldaini,
Rodney Kinney,
Akshita Bhagia,
Dustin Schwenk,
David Atkinson,
Russell Authur,
Ben Bogin,
Khyathi Chandu,
Jennifer Dumas,
Yanai Elazar,
Valentin Hofmann,
Ananya Harsh Jha,
Sachin Kumar,
Li Lucy,
Xinxi Lyu,
Nathan Lambert,
Ian Magnusson,
Jacob Morrison,
Niklas Muennighoff,
Aakanksha Naik,
Crystal Nam,
Matthew E. Peters,
Abhilasha Ravichander,
Kyle Richardson,
Zejiang Shen
, et al. (11 additional authors not shown)
Abstract:
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training dat…
▽ More
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
△ Less
Submitted 6 June, 2024; v1 submitted 31 January, 2024;
originally announced February 2024.
-
Paloma: A Benchmark for Evaluating Language Model Fit
Authors:
Ian Magnusson,
Akshita Bhagia,
Valentin Hofmann,
Luca Soldaini,
Ananya Harsh Jha,
Oyvind Tafjord,
Dustin Schwenk,
Evan Pete Walsh,
Yanai Elazar,
Kyle Lo,
Dirk Groeneveld,
Iz Beltagy,
Hannaneh Hajishirzi,
Noah A. Smith,
Kyle Richardson,
Jesse Dodge
Abstract:
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolate…
▽ More
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best and code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings such as that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains or that loss is dominated by the most frequently occurring strings in the vocabulary.
△ Less
Submitted 7 December, 2024; v1 submitted 16 December, 2023;
originally announced December 2023.
-
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals
Authors:
Yanai Elazar,
Bhargavi Paranjape,
Hao Peng,
Sarah Wiegreffe,
Khyathi Raghavi,
Vivek Srikumar,
Sameer Singh,
Noah A. Smith
Abstract:
The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to correlations between a specific part of the input (e.g., the hypothesis in NLI) and the label; consequently, models trained only on those outperform chance. Are these correlations picked up by models tra…
▽ More
The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to correlations between a specific part of the input (e.g., the hypothesis in NLI) and the label; consequently, models trained only on those outperform chance. Are these correlations picked up by models trained on the full input data? To address this question, we propose a new evaluation method, Counterfactual Attentiveness Test (CAT). CAT uses counterfactuals by replacing part of the input with its counterpart from a different example (subject to some restrictions), expecting an attentive model to change its prediction. Using CAT, we systematically investigate established supervised and in-context learning models on ten datasets spanning four tasks: natural language inference, reading comprehension, paraphrase detection, and visual & language reasoning. CAT reveals that reliance on such correlations is mainly data-dependent. Surprisingly, we find that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves. Our results demonstrate that augmenting training or demonstration data with counterfactuals is effective in improving models' attentiveness. We show that models' attentiveness measured by CAT reveals different conclusions from solely measuring correlations in data.
△ Less
Submitted 7 October, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
What's In My Big Data?
Authors:
Yanai Elazar,
Akshita Bhagia,
Ian Magnusson,
Abhilasha Ravichander,
Dustin Schwenk,
Alane Suhr,
Pete Walsh,
Dirk Groeneveld,
Luca Soldaini,
Sameer Singh,
Hanna Hajishirzi,
Noah A. Smith,
Jesse Dodge
Abstract:
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corp…
▽ More
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them.
△ Less
Submitted 5 March, 2024; v1 submitted 31 October, 2023;
originally announced October 2023.
-
The Bias Amplification Paradox in Text-to-Image Generation
Authors:
Preethi Seshadri,
Sameer Singh,
Yanai Elazar
Abstract:
Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we dis…
▽ More
Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, and highlight confounding factors that impact analyses.
△ Less
Submitted 15 November, 2023; v1 submitted 1 August, 2023;
originally announced August 2023.
-
Estimating the Causal Effect of Early ArXiving on Paper Acceptance
Authors:
Yanai Elazar,
Jiayao Zhang,
David Wadden,
Bo Zhang,
Noah A. Smith
Abstract:
What is the effect of releasing a preprint of a paper before it is submitted for peer review? No randomized controlled trial has been conducted, so we turn to observational data to answer this question. We use data from the ICLR conference (2018--2022) and apply methods from causal inference to estimate the effect of arXiving a paper before the reviewing period (early arXiving) on its acceptance t…
▽ More
What is the effect of releasing a preprint of a paper before it is submitted for peer review? No randomized controlled trial has been conducted, so we turn to observational data to answer this question. We use data from the ICLR conference (2018--2022) and apply methods from causal inference to estimate the effect of arXiving a paper before the reviewing period (early arXiving) on its acceptance to the conference. Adjusting for confounders such as topic, authors, and quality, we may estimate the causal effect. However, since quality is a challenging construct to estimate, we use the negative outcome control method, using paper citation count as a control variable to debias the quality confounding effect. Our results suggest that early arXiving may have a small effect on a paper's chances of acceptance. However, this effect (when existing) does not differ significantly across different groups of authors, as grouped by author citation count and institute rank. This suggests that early arXiving does not provide an advantage to any particular group.
△ Less
Submitted 20 February, 2024; v1 submitted 24 June, 2023;
originally announced June 2023.
-
Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation
Authors:
Marius Mosbach,
Tiago Pimentel,
Shauli Ravfogel,
Dietrich Klakow,
Yanai Elazar
Abstract:
Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the t…
▽ More
Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.
△ Less
Submitted 30 May, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
At Your Fingertips: Extracting Piano Fingering Instructions from Videos
Authors:
Amit Moryossef,
Yanai Elazar,
Yoav Goldberg
Abstract:
Piano fingering -- knowing which finger to use to play each note in a musical piece, is a hard and important skill to master when learning to play the piano. While some sheet music is available with expert-annotated fingering information, most pieces lack this information, and people often resort to learning the fingering from demonstrations in online videos. We consider the AI task of automating…
▽ More
Piano fingering -- knowing which finger to use to play each note in a musical piece, is a hard and important skill to master when learning to play the piano. While some sheet music is available with expert-annotated fingering information, most pieces lack this information, and people often resort to learning the fingering from demonstrations in online videos. We consider the AI task of automating the extraction of fingering information from videos. This is a non-trivial task as fingers are often occluded by other fingers, and it is often not clear from the video which of the keys were pressed, requiring the synchronization of hand position information and knowledge about the notes that were played. We show how to perform this task with high-accuracy using a combination of deep-learning modules, including a GAN-based approach for fine-tuning on out-of-domain data. We extract the fingering information with an f1 score of 97\%. We run the resulting system on 90 videos, resulting in high-quality piano fingering information of 150K notes, the largest available dataset of piano-fingering to date.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
Lexical Generalization Improves with Larger Models and Longer Training
Authors:
Elron Bandel,
Yoav Goldberg,
Yanai Elazar
Abstract:
While fine-tuned language models perform well on many tasks, they were also shown to rely on superficial surface features such as lexical overlap. Excessive utilization of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and…
▽ More
While fine-tuned language models perform well on many tasks, they were also shown to rely on superficial surface features such as lexical overlap. Excessive utilization of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between models size has its source in the pre-trained model
△ Less
Submitted 25 October, 2022; v1 submitted 23 October, 2022;
originally announced October 2022.
-
CIKQA: Learning Commonsense Inference with a Unified Knowledge-in-the-loop QA Paradigm
Authors:
Hongming Zhang,
Yintong Huo,
Yanai Elazar,
Yangqiu Song,
Yoav Goldberg,
Dan Roth
Abstract:
Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a large enough training set for each task to cover all commonsense for learning. Thus we s…
▽ More
Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a large enough training set for each task to cover all commonsense for learning. Thus we should separate the commonsense knowledge acquisition and inference over commonsense knowledge as two separate tasks. In this work, we focus on investigating models' commonsense inference capabilities from two perspectives: (1) Whether models can know if the knowledge they have is enough to solve the task; (2) Whether models can develop commonsense inference capabilities that generalize across commonsense tasks. We first align commonsense tasks with relevant knowledge from commonsense knowledge bases and ask humans to annotate whether the knowledge is enough or not. Then, we convert different commonsense tasks into a unified question answering format to evaluate models' generalization capabilities. We name the benchmark as Commonsense Inference with Knowledge-in-the-loop Question Answering (CIKQA).
△ Less
Submitted 12 October, 2022;
originally announced October 2022.
-
State-of-the-art generalisation research in NLP: A taxonomy and review
Authors:
Dieuwke Hupkes,
Mario Giulianelli,
Verna Dankers,
Mikel Artetxe,
Yanai Elazar,
Tiago Pimentel,
Christos Christodoulopoulos,
Karim Lasri,
Naomi Saphra,
Arabella Sinclair,
Dennis Ulmer,
Florian Schottmann,
Khuyagbaatar Batsuren,
Kaiser Sun,
Koustuv Sinha,
Leila Khalatbari,
Maria Ryskina,
Rita Frieske,
Ryan Cotterell,
Zhijing Jin
Abstract:
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation…
▽ More
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they investigate, the type of data shift they consider, the source of this data shift, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis that maps out the current state of generalisation research in NLP, and we make recommendations for which areas might deserve attention in the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to take steps towards making state-of-the-art generalisation testing the new status quo in NLP.
△ Less
Submitted 12 January, 2024; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions
Authors:
Yanai Elazar,
Nora Kassner,
Shauli Ravfogel,
Amir Feder,
Abhilasha Ravichander,
Marius Mosbach,
Yonatan Belinkov,
Hinrich Schütze,
Yoav Goldberg
Abstract:
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain exp…
▽ More
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.
△ Less
Submitted 24 March, 2023; v1 submitted 28 July, 2022;
originally announced July 2022.
-
Text-based NP Enrichment
Authors:
Yanai Elazar,
Victoria Basmov,
Yoav Goldberg,
Reut Tsarfaty
Abstract:
Understanding the relations between entities denoted by NPs in a text is a critical part of human-like natural language understanding. However, only a fraction of such relations is covered by standard NLP tasks and benchmarks nowadays. In this work, we propose a novel task termed text-based NP enrichment (TNE), in which we aim to enrich each NP in a text with all the preposition-mediated relations…
▽ More
Understanding the relations between entities denoted by NPs in a text is a critical part of human-like natural language understanding. However, only a fraction of such relations is covered by standard NLP tasks and benchmarks nowadays. In this work, we propose a novel task termed text-based NP enrichment (TNE), in which we aim to enrich each NP in a text with all the preposition-mediated relations -- either explicit or implicit -- that hold between it and other NPs in the text. The relations are represented as triplets, each denoted by two NPs related via a preposition. Humans recover such relations seamlessly, while current state-of-the-art models struggle with them due to the implicit nature of the problem. We build the first large-scale dataset for the problem, provide the formal framing and scope of annotation, analyze the data, and report the results of fine-tuned language models on the task, demonstrating the challenge it poses to current technology. A webpage with a data-exploration UI, a demo, and links to the code, models, and leaderboard, to foster further research into this challenging problem can be found at: yanaiela.github.io/TNE/.
△ Less
Submitted 11 April, 2022; v1 submitted 24 September, 2021;
originally announced September 2021.
-
Revisiting Few-shot Relation Classification: Evaluation Data and Classification Schemes
Authors:
Ofer Sabo,
Yanai Elazar,
Yoav Goldberg,
Ido Dagan
Abstract:
We explore Few-Shot Learning (FSL) for Relation Classification (RC). Focusing on the realistic scenario of FSL, in which a test instance might not belong to any of the target categories (none-of-the-above, aka NOTA), we first revisit the recent popular dataset structure for FSL, pointing out its unrealistic data distribution. To remedy this, we propose a novel methodology for deriving more realist…
▽ More
We explore Few-Shot Learning (FSL) for Relation Classification (RC). Focusing on the realistic scenario of FSL, in which a test instance might not belong to any of the target categories (none-of-the-above, aka NOTA), we first revisit the recent popular dataset structure for FSL, pointing out its unrealistic data distribution. To remedy this, we propose a novel methodology for deriving more realistic few-shot test data from available datasets for supervised RC, and apply it to the TACRED dataset. This yields a new challenging benchmark for FSL RC, on which state of the art models show poor performance. Next, we analyze classification schemes within the popular embedding-based nearest-neighbor approach for FSL, with respect to constraints they impose on the embedding space. Triggered by this analysis we propose a novel classification scheme, in which the NOTA category is represented as learned vectors, shown empirically to be an appealing option for FSL.
△ Less
Submitted 17 April, 2021;
originally announced April 2021.
-
Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema
Authors:
Yanai Elazar,
Hongming Zhang,
Yoav Goldberg,
Dan Roth
Abstract:
The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we firs…
▽ More
The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting to account for the commonsense reasoning abilities acquired during the pretraining and observe that popular language models perform randomly in this setting when using our more strict evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is not likely to successfully support all the required commonsense reasoning skills and knowledge.
△ Less
Submitted 13 October, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Contrastive Explanations for Model Interpretability
Authors:
Alon Jacovi,
Swabha Swayamdipta,
Shauli Ravfogel,
Yanai Elazar,
Yejin Choi,
Yoav Goldberg
Abstract:
Contrastive explanations clarify why an event occurred in contrast to another. They are more inherently intuitive to humans to both produce and comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and modifying model behavior to only be based on contrastive reasoning. Our method…
▽ More
Contrastive explanations clarify why an event occurred in contrast to another. They are more inherently intuitive to humans to both produce and comprehend. We propose a methodology to produce contrastive explanations for classification models by modifying the representation to disregard non-contrastive information, and modifying model behavior to only be based on contrastive reasoning. Our method is based on projecting model representation to a latent space that captures only the features that are useful (to the model) to differentiate two potential decisions. We demonstrate the value of contrastive explanations by analyzing two different scenarios, using both high-level abstract concept attribution and low-level input token/span attribution, on two widely used text classification tasks. Specifically, we produce explanations for answering: for which label, and against which alternative label, is some aspect of the input useful? And which aspects of the input are useful for and against particular decisions? Overall, our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
△ Less
Submitted 14 September, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
Measuring and Improving Consistency in Pretrained Language Models
Authors:
Yanai Elazar,
Nora Kassner,
Shauli Ravfogel,
Abhilasha Ravichander,
Eduard Hovy,
Hinrich Schütze,
Yoav Goldberg
Abstract:
Consistency of a model -- that is, the invariance of its behavior under meaning-preserving alternations in its input -- is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases…
▽ More
Consistency of a model -- that is, the invariance of its behavior under meaning-preserving alternations in its input -- is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor -- though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
△ Less
Submitted 29 May, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Authors:
Benjamin Muller,
Yanai Elazar,
Benoît Sagot,
Djamé Seddah
Abstract:
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique an…
▽ More
Multilingual pretrained language models have demonstrated remarkable zero-shot cross-lingual transfer capabilities. Such transfer emerges by fine-tuning on a task of interest in one language and evaluating on a distinct language, not seen during the fine-tuning. Despite promising results, we still lack a proper understanding of the source of this transfer. Using a novel layer ablation technique and analyses of the model's internal representations, we show that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor. While the encoder is crucial for cross-lingual transfer and remains mostly unchanged during fine-tuning, the task predictor has little importance on the transfer and can be reinitialized during fine-tuning. We present extensive experiments with three distinct tasks, seventeen typologically diverse languages and multiple domains to support our hypothesis.
△ Less
Submitted 26 January, 2021;
originally announced January 2021.
-
It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT
Authors:
Hila Gonen,
Shauli Ravfogel,
Yanai Elazar,
Yoav Goldberg
Abstract:
Recent works have demonstrated that multilingual BERT (mBERT) learns rich cross-lingual representations, that allow for transfer across languages. We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning. The results suggest that most of this information is encoded in a non-linear way, while…
▽ More
Recent works have demonstrated that multilingual BERT (mBERT) learns rich cross-lingual representations, that allow for transfer across languages. We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning. The results suggest that most of this information is encoded in a non-linear way, while some of it can also be recovered with purely linear tools. As part of our analysis, we test the hypothesis that mBERT learns representations which contain both a language-encoding component and an abstract, cross-lingual component, and explicitly identify an empirical language-identity subspace within mBERT representations.
△ Less
Submitted 16 October, 2020;
originally announced October 2020.
-
The Extraordinary Failure of Complement Coercion Crowdsourcing
Authors:
Yanai Elazar,
Victoria Basmov,
Shauli Ravfogel,
Yoav Goldberg,
Reut Tsarfaty
Abstract:
Crowdsourcing has eased and scaled up the collection of linguistic annotation in recent years. In this work, we follow known methodologies of collecting labeled data for the complement coercion phenomenon. These are constructions with an implied action -- e.g., "I started a new book I bought last week", where the implied action is reading. We aim to collect annotated data for this phenomenon by re…
▽ More
Crowdsourcing has eased and scaled up the collection of linguistic annotation in recent years. In this work, we follow known methodologies of collecting labeled data for the complement coercion phenomenon. These are constructions with an implied action -- e.g., "I started a new book I bought last week", where the implied action is reading. We aim to collect annotated data for this phenomenon by reducing it to either of two known tasks: Explicit Completion and Natural Language Inference. However, in both cases, crowdsourcing resulted in low agreement scores, even though we followed the same methodologies as in previous work. Why does the same process fail to yield high agreement scores? We specify our modeling schemes, highlight the differences with previous work and provide some insights about the task and possible explanations for the failure. We conclude that specific phenomena require tailored solutions, not only in specialized algorithms, but also in data collection methods.
△ Less
Submitted 12 October, 2020;
originally announced October 2020.
-
Do Language Embeddings Capture Scales?
Authors:
Xikun Zhang,
Deepak Ramachandran,
Ian Tenney,
Yanai Elazar,
Dan Roth
Abstract:
Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense, and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but are short of the capability required for general common-sense r…
▽ More
Pretrained Language Models (LMs) have been shown to possess significant linguistic, common sense, and factual knowledge. One form of knowledge that has not been studied yet in this context is information about the scalar magnitudes of objects. We show that pretrained language models capture a significant amount of this information but are short of the capability required for general common-sense reasoning. We identify contextual information in pre-training and numeracy as two key factors affecting their performance and show that a simple method of canonicalizing numbers can have a significant effect on the results.
△ Less
Submitted 24 November, 2020; v1 submitted 11 October, 2020;
originally announced October 2020.
-
Unsupervised Distillation of Syntactic Information from Contextualized Word Representations
Authors:
Shauli Ravfogel,
Yanai Elazar,
Jacob Goldberger,
Yoav Goldberg
Abstract:
Contextualized word representations, such as ELMo and BERT, were shown to perform well on various semantic and syntactic tasks. In this work, we tackle the task of unsupervised disentanglement between semantics and structure in neural language representations: we aim to learn a transformation of the contextualized vectors, that discards the lexical semantics, but keeps the structural information.…
▽ More
Contextualized word representations, such as ELMo and BERT, were shown to perform well on various semantic and syntactic tasks. In this work, we tackle the task of unsupervised disentanglement between semantics and structure in neural language representations: we aim to learn a transformation of the contextualized vectors, that discards the lexical semantics, but keeps the structural information. To this end, we automatically generate groups of sentences which are structurally similar but semantically different, and use metric-learning approach to learn a transformation that emphasizes the structural component that is encoded in the vectors. We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics. Finally, we demonstrate the utility of our distilled representations by showing that they outperform the original contextualized representations in a few-shot parsing setting.
△ Less
Submitted 11 March, 2021; v1 submitted 11 October, 2020;
originally announced October 2020.
-
Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals
Authors:
Yanai Elazar,
Shauli Ravfogel,
Alon Jacovi,
Yoav Goldberg
Abstract:
A growing body of work makes use of probing to investigate the working of neural models, often considered black boxes. Recently, an ongoing debate emerged surrounding the limitations of the probing paradigm. In this work, we point out the inability to infer behavioral conclusions from probing results and offer an alternative method that focuses on how the information is being used, rather than on…
▽ More
A growing body of work makes use of probing to investigate the working of neural models, often considered black boxes. Recently, an ongoing debate emerged surrounding the limitations of the probing paradigm. In this work, we point out the inability to infer behavioral conclusions from probing results and offer an alternative method that focuses on how the information is being used, rather than on what information is encoded. Our method, Amnesic Probing, follows the intuition that the utility of a property for a given task can be assessed by measuring the influence of a causal intervention that removes it from the representation. Equipped with this new analysis tool, we can ask questions that were not possible before, e.g. is part-of-speech information important for word prediction? We perform a series of analyses on BERT to answer these types of questions. Our findings demonstrate that conventional probing performance is not correlated to task importance, and we call for increased scrutiny of claims that draw behavioral or causal conclusions from probing results.
△ Less
Submitted 19 February, 2021; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Authors:
Shauli Ravfogel,
Yanai Elazar,
Hila Gonen,
Michael Twiton,
Yoav Goldberg
Abstract:
The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we ai…
▽ More
The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations on their null-space. By doing so, the classifiers become oblivious to that target property, making it hard to linearly separate the data according to it. While applicable for multiple uses, we evaluate our method on bias and fairness use-cases, and show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.
△ Less
Submitted 28 April, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.
-
Evaluating Models' Local Decision Boundaries via Contrast Sets
Authors:
Matt Gardner,
Yoav Artzi,
Victoria Basmova,
Jonathan Berant,
Ben Bogin,
Sihao Chen,
Pradeep Dasigi,
Dheeru Dua,
Yanai Elazar,
Ananth Gottumukkala,
Nitish Gupta,
Hanna Hajishirzi,
Gabriel Ilharco,
Daniel Khashabi,
Kevin Lin,
Jiangming Liu,
Nelson F. Liu,
Phoebe Mulcaire,
Qiang Ning,
Sameer Singh,
Noah A. Smith,
Sanjay Subramanian,
Reut Tsarfaty,
Eric Wallace,
Ally Zhang
, et al. (1 additional authors not shown)
Abstract:
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systemati…
▽ More
Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets---up to 25\% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
△ Less
Submitted 1 October, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
oLMpics -- On what Language Model Pre-training Captures
Authors:
Alon Talmor,
Yanai Elazar,
Yoav Goldberg,
Jonathan Berant
Abstract:
Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A…
▽ More
Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to the learning curve of multiple controls, which paints a rich picture of the LM capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., RoBERTa succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while RoBERTa can compare ages, it can do so only when the ages are in the typical range of human ages; (c) On half of our reasoning tasks all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models and objective functions for pre-training.
△ Less
Submitted 19 November, 2020; v1 submitted 31 December, 2019;
originally announced December 2019.
-
How Large Are Lions? Inducing Distributions over Quantitative Attributes
Authors:
Yanai Elazar,
Abhijit Mahabal,
Deepak Ramachandran,
Tania Bedrax-Weiss,
Dan Roth
Abstract:
Most current NLP systems have little knowledge about quantitative attributes of objects and events. We propose an unsupervised method for collecting quantitative information from large amounts of web data, and use it to create a new, very large resource consisting of distributions over physical quantities associated with objects, adjectives, and verbs which we call Distributions over Quantitative…
▽ More
Most current NLP systems have little knowledge about quantitative attributes of objects and events. We propose an unsupervised method for collecting quantitative information from large amounts of web data, and use it to create a new, very large resource consisting of distributions over physical quantities associated with objects, adjectives, and verbs which we call Distributions over Quantitative (DoQ). This contrasts with recent work in this area which has focused on making only relative comparisons such as "Is a lion bigger than a wolf?". Our evaluation shows that DoQ compares favorably with state of the art results on existing datasets for relative comparisons of nouns and adjectives, and on a new dataset we introduce.
△ Less
Submitted 4 June, 2019;
originally announced June 2019.
-
Where's My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution
Authors:
Yanai Elazar,
Yoav Goldberg
Abstract:
We provide the first computational treatment of fused-heads constructions (FH), focusing on the numeric fused-heads (NFH). FHs constructions are noun phrases (NPs) in which the head noun is missing and is said to be `fused' with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a…
▽ More
We provide the first computational treatment of fused-heads constructions (FH), focusing on the numeric fused-heads (NFH). FHs constructions are noun phrases (NPs) in which the head noun is missing and is said to be `fused' with its dependent modifier. This missing information is implicit and is important for sentence understanding. The missing references are easily filled in by humans but pose a challenge for computational models. We formulate the handling of FH as a two stages process: identification of the FH construction and resolution of the missing head. We explore the NFH phenomena in large corpora of English text and create (1) a dataset and a highly accurate method for NFH identification; (2) a 10k examples (1M tokens) crowd-sourced dataset of NFH resolution; and (3) a neural baseline for the NFH resolution task. We release our code and dataset, in hope to foster further research into this challenging problem.
△ Less
Submitted 26 May, 2019;
originally announced May 2019.
-
Adversarial Removal of Demographic Attributes from Text Data
Authors:
Yanai Elazar,
Yoav Goldberg
Abstract:
Recent advances in Representation Learning and Adversarial Training seem to succeed in removing unwanted features from the learned representation. We show that demographic information of authors is encoded in -- and can be recovered from -- the intermediate representations learned by text-based neural classifiers. The implication is that decisions of classifiers trained on textual data are not agn…
▽ More
Recent advances in Representation Learning and Adversarial Training seem to succeed in removing unwanted features from the learned representation. We show that demographic information of authors is encoded in -- and can be recovered from -- the intermediate representations learned by text-based neural classifiers. The implication is that decisions of classifiers trained on textual data are not agnostic to -- and likely condition on -- demographic attributes. When attempting to remove such demographic information using adversarial training, we find that while the adversarial component achieves chance-level development-set accuracy during training, a post-hoc classifier, trained on the encoded sentences from the first part, still manages to reach substantially higher classification accuracies on the same data. This behavior is consistent across several tasks, demographic properties and datasets. We explore several techniques to improve the effectiveness of the adversarial component. Our main conclusion is a cautionary one: do not rely on the adversarial training to achieve invariant representation to sensitive features.
△ Less
Submitted 2 September, 2018; v1 submitted 20 August, 2018;
originally announced August 2018.
-
Privacy and Fairness in Recommender Systems via Adversarial Training of User Representations
Authors:
Yehezkel S. Resheff,
Yanai Elazar,
Moni Shahar,
Oren Sar Shalom
Abstract:
Latent factor models for recommender systems represent users and items as low dimensional vectors. Privacy risks of such systems have previously been studied mostly in the context of recovery of personal information in the form of usage records from the training data. However, the user representations themselves may be used together with external data to recover private user information such as ge…
▽ More
Latent factor models for recommender systems represent users and items as low dimensional vectors. Privacy risks of such systems have previously been studied mostly in the context of recovery of personal information in the form of usage records from the training data. However, the user representations themselves may be used together with external data to recover private user information such as gender and age. In this paper we show that user vectors calculated by a common recommender system can be exploited in this way. We propose the privacy-adversarial framework to eliminate such leakage of private information, and study the trade-off between recommender performance and leakage both theoretically and empirically using a benchmark dataset. An advantage of the proposed method is that it also helps guarantee fairness of results, since all implicit knowledge of a set of attributes is scrubbed from the representations used by the model, and thus can't enter into the decision making. We discuss further applications of this method towards the generation of deeper and more insightful recommendations.
△ Less
Submitted 18 December, 2018; v1 submitted 10 July, 2018;
originally announced July 2018.