-
Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models
Authors:
Anna A. Ivanova,
Aalok Sathe,
Benjamin Lipkin,
Unnathi Kumar,
Setayesh Radkani,
Thomas H. Clark,
Carina Kauf,
Jennifer Hu,
R. T. Pramod,
Gabriel Grand,
Vivian Paulun,
Maria Ryskina,
Ekin Akyürek,
Ethan Wilcox,
Nafisa Rashid,
Leshem Choshen,
Roger Levy,
Evelina Fedorenko,
Joshua Tenenbaum,
Jacob Andreas
Abstract:
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems -- especially those based on language models -- has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper…
▽ More
The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems -- especially those based on language models -- has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models' understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B--70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.
△ Less
Submitted 3 July, 2025; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Queer In AI: A Case Study in Community-Led Participatory AI
Authors:
Organizers Of QueerInAI,
:,
Anaelia Ovalle,
Arjun Subramonian,
Ashwin Singh,
Claas Voelcker,
Danica J. Sutherland,
Davide Locatelli,
Eva Breznik,
Filip Klubička,
Hang Yuan,
Hetvi J,
Huan Zhang,
Jaidev Shriram,
Kruno Lehman,
Luca Soldaini,
Maarten Sap,
Marc Peter Deisenroth,
Maria Leonor Pacheco,
Maria Ryskina,
Martin Mundt,
Milind Agarwal,
Nyx McLean,
Pan Xu,
A Pranav
, et al. (26 additional authors not shown)
Abstract:
We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess th…
▽ More
We present Queer in AI as a case study for community-led participatory design in AI. We examine how participatory design and intersectional tenets started and shaped this community's programs over the years. We discuss different challenges that emerged in the process, look at ways this organization has fallen short of operationalizing participatory and intersectional principles, and then assess the organization's impact. Queer in AI provides important lessons and insights for practitioners and theorists of participatory methods broadly through its rejection of hierarchy in favor of decentralization, success at building aid and programs by and for the queer community, and effort to change actors and institutions outside of the queer community. Finally, we theorize how communities like Queer in AI contribute to the participatory design in AI more broadly by fostering cultures of participation in AI, welcoming and empowering marginalized participants, critiquing poor or exploitative participatory practices, and bringing participation to institutions outside of individual research projects. Queer in AI's work serves as a case study of grassroots activism and participatory methods within AI, demonstrating the potential of community-led participatory methods and intersectional praxis, while also providing challenges, case studies, and nuanced insights to researchers developing and using participatory methods.
△ Less
Submitted 8 June, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
State-of-the-art generalisation research in NLP: A taxonomy and review
Authors:
Dieuwke Hupkes,
Mario Giulianelli,
Verna Dankers,
Mikel Artetxe,
Yanai Elazar,
Tiago Pimentel,
Christos Christodoulopoulos,
Karim Lasri,
Naomi Saphra,
Arabella Sinclair,
Dennis Ulmer,
Florian Schottmann,
Khuyagbaatar Batsuren,
Kaiser Sun,
Koustuv Sinha,
Leila Khalatbari,
Maria Ryskina,
Rita Frieske,
Ryan Cotterell,
Zhijing Jin
Abstract:
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation…
▽ More
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they investigate, the type of data shift they consider, the source of this data shift, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis that maps out the current state of generalisation research in NLP, and we make recommendations for which areas might deserve attention in the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to take steps towards making state-of-the-art generalisation testing the new status quo in NLP.
△ Less
Submitted 12 January, 2024; v1 submitted 6 October, 2022;
originally announced October 2022.
-
UniMorph 4.0: Universal Morphology
Authors:
Khuyagbaatar Batsuren,
Omer Goldman,
Salam Khalifa,
Nizar Habash,
Witold Kieraś,
Gábor Bella,
Brian Leonard,
Garrett Nicolai,
Kyle Gorman,
Yustinus Ghanggo Ate,
Maria Ryskina,
Sabrina J. Mielke,
Elena Budianskaya,
Charbel El-Khaissi,
Tiago Pimentel,
Michael Gasser,
William Lane,
Mohit Raj,
Matt Coler,
Jaime Rafael Montoya Samame,
Delio Siticonatzi Camaiteri,
Benoît Sagot,
Esaú Zumaeta Rojas,
Didier López Francis,
Arturo Oncevay
, et al. (71 additional authors not shown)
Abstract:
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…
▽ More
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
△ Less
Submitted 19 June, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
Two Approaches to Building Collaborative, Task-Oriented Dialog Agents through Self-Play
Authors:
Arkady Arkhangorodsky,
Scot Fang,
Victoria Knight,
Ajay Nagesh,
Maria Ryskina,
Kevin Knight
Abstract:
Task-oriented dialog systems are often trained on human/human dialogs, such as collected from Wizard-of-Oz interfaces. However, human/human corpora are frequently too small for supervised training to be effective. This paper investigates two approaches to training agent-bots and user-bots through self-play, in which they autonomously explore an API environment, discovering communication strategies…
▽ More
Task-oriented dialog systems are often trained on human/human dialogs, such as collected from Wizard-of-Oz interfaces. However, human/human corpora are frequently too small for supervised training to be effective. This paper investigates two approaches to training agent-bots and user-bots through self-play, in which they autonomously explore an API environment, discovering communication strategies that enable them to solve the task. We give empirical results for both reinforcement learning and game-theoretic equilibrium finding.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Learning Mathematical Properties of Integers
Authors:
Maria Ryskina,
Kevin Knight
Abstract:
Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the repre…
▽ More
Embedding words in high-dimensional vector spaces has proven valuable in many natural language applications. In this work, we investigate whether similarly-trained embeddings of integers can capture concepts that are useful for mathematical applications. We probe the integer embeddings for mathematical knowledge, apply them to a set of numerical reasoning tasks, and show that by learning the representations from mathematical sequence data, we can substantially improve over number embeddings learned from English text corpora.
△ Less
Submitted 15 September, 2021;
originally announced September 2021.
-
Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Authors:
Maria Ryskina,
Eduard Hovy,
Taylor Berg-Kirkpatrick,
Matthew R. Gormley
Abstract:
Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find…
▽ More
Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.
△ Less
Submitted 23 June, 2021;
originally announced June 2021.
-
NoiseQA: Challenge Set Evaluation for User-Centric Question Answering
Authors:
Abhilasha Ravichander,
Siddharth Dalmia,
Maria Ryskina,
Florian Metze,
Eduard Hovy,
Alan W Black
Abstract:
When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system. While there has been significant community attention devoted to identifying correct answers in passages assuming a perfectly formed q…
▽ More
When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system. While there has been significant community attention devoted to identifying correct answers in passages assuming a perfectly formed question, we show that components in the pipeline that precede an answering engine can introduce varied and considerable sources of error, and performance can degrade substantially based on these upstream noise sources even for powerful pre-trained QA models. We conclude that there is substantial room for progress before QA systems can be effectively deployed, highlight the need for QA evaluation to expand to consider real-world use, and hope that our findings will spur greater community interest in the issues that arise when our systems actually need to be of utility to humans.
△ Less
Submitted 16 February, 2021;
originally announced February 2021.
-
Phonetic and Visual Priors for Decipherment of Informal Romanization
Authors:
Maria Ryskina,
Matthew R. Gormley,
Taylor Berg-Kirkpatrick
Abstract:
Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages---namely, character pairs are often associated through ph…
▽ More
Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages---namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.
△ Less
Submitted 5 May, 2020;
originally announced May 2020.
-
Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods
Authors:
Maria Ryskina,
Ella Rabinovich,
Taylor Berg-Kirkpatrick,
David R. Mortensen,
Yulia Tsvetkov
Abstract:
We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find…
▽ More
We perform statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence although we find more support for the latter hypothesis. Besides presenting a new linguistic application of distributional semantics, this study tackles the linguistic question of the role of language-internal factors (in our case, sparsity) in language change motivated by language-external factors (reflected in frequency growth).
△ Less
Submitted 21 January, 2020;
originally announced January 2020.
-
Automatic Compositor Attribution in the First Folio of Shakespeare
Authors:
Maria Ryskina,
Hannah Alpert-Abrams,
Dan Garrette,
Taylor Berg-Kirkpatrick
Abstract:
Compositor attribution, the clustering of pages in a historical printed document by the individual who set the type, is a bibliographic task that relies on analysis of orthographic variation and inspection of visual details of the printed page. In this paper, we introduce a novel unsupervised model that jointly describes the textual and visual features needed to distinguish compositors. Applied to…
▽ More
Compositor attribution, the clustering of pages in a historical printed document by the individual who set the type, is a bibliographic task that relies on analysis of orthographic variation and inspection of visual details of the printed page. In this paper, we introduce a novel unsupervised model that jointly describes the textual and visual features needed to distinguish compositors. Applied to images of Shakespeare's First Folio, our model predicts attributions that agree with the manual judgements of bibliographers with an accuracy of 87%, even on text that is the output of OCR.
△ Less
Submitted 25 April, 2017;
originally announced April 2017.