Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race

Authors: Lihao Sun, Chengzhi Mao, Valentin Hofmann, Xuechunzi Bai

Abstract: Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.

Submitted 8 June, 2025; v1 submitted 30 May, 2025; originally announced June 2025.

Comments: Accepted to ACL 2025 (Main)

arXiv:2505.03054 [pdf, other]

BLAB: Brutally Long Audio Bench

Authors: Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar

Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.

Submitted 12 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

arXiv:2503.13423 [pdf, other]

SuperBPE: Space Travel for Language Models

Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi

Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.

Submitted 14 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: updated related work

arXiv:2502.08395 [pdf, other]

IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

Authors: Paul Röttger, Musashi Hinck, Valentin Hofmann, Kobi Hackenburg, Valentina Pyatkin, Faeze Brahman, Dirk Hovy

Abstract: Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.

Submitted 12 February, 2025; originally announced February 2025.

Comments: under review

arXiv:2411.07990 [pdf, other]

Derivational Morphology Reveals Analogical Generalization in Large Language Models

Authors: Valentin Hofmann, Leonie Weissweiler, David Mortensen, Hinrich Schütze, Janet Pierrehumbert

Abstract: What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2410.11005 [pdf, ps, other]

Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Authors: Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael Wooldridge, Janet B. Pierrehumbert, Furu Wei

Abstract: Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce ReDial (Reasoning with Dialect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.

Submitted 9 June, 2025; v1 submitted 14 October, 2024; originally announced October 2024.

Comments: ACL 2025 main

arXiv:2407.08818 [pdf]

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith

Abstract: In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET, multilingual adaptive gradient-based tokenization, to reduce over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization alongside the next token prediction objective. However, this approach still results in over-segmentation for non-Latin script languages in multilingual settings. In contrast, MAGNET offers a customizable architecture where byte-level sequences are routed through language-script-specific predictors, each optimized for its respective language script. This modularity enforces equitable segmentation granularity across different language scripts compared to previous methods. Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility.

Submitted 16 November, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

arXiv:2403.00742 [pdf, other]

Dialect prejudice predicts AI decisions about people's character, employability, and criminality

Authors: Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King

Abstract: Hundreds of millions of people now interact with language models, with uses ranging from serving as a writing aid to informing hiring decisions. Yet these language models are known to perpetuate systematic racial prejudices, making their judgments biased in problematic ways about groups like African Americans. While prior research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice: we extend research showing that Americans hold raciolinguistic stereotypes about speakers of African American English and find that language models have the same prejudice, exhibiting covert stereotypes that are more negative than any human stereotypes about African Americans ever experimentally recorded, although closest to the ones from before the civil rights movement. By contrast, the language models' overt stereotypes about African Americans are much more positive. We demonstrate that dialect prejudice has the potential for harmful consequences by asking language models to make hypothetical decisions about people, based only on how they speak. Language models are more likely to suggest that speakers of African American English be assigned less prestigious jobs, be convicted of crimes, and be sentenced to death. Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level. Our findings have far-reaching implications for the fair and safe employment of language technology.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.16786 [pdf, other]

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

Authors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy

Abstract: Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.

Submitted 5 June, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Accepted at ACL 2024 (Main Conference)

arXiv:2402.02805 [pdf, other]

Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

Authors: Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony Cohn, Janet B. Pierrehumbert

Abstract: Planning is a fundamental property of human intelligence. Reasoning about asynchronous plans is challenging since it requires sequential and parallel planning to optimize time costs. Can large language models (LLMs) succeed at this task? Here, we present the first large-scale study investigating this question. We find that a representative set of closed and open-source LLMs, including GPT-4 and LLaMA-2, behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results. We show that although PLaG can boost model performance, LLMs still suffer from drastic degradation when task complexity increases, highlighting the limits of utilizing LLMs for simulating digital devices. We see our study as an exciting step towards using LLMs as efficient autonomous agents. Our code and data are available at https://github.com/fangru-lin/graph-llm-asynchow-plan.

Submitted 3 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted at ICML-2024

arXiv:2402.00159 [pdf, other]

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, et al. (11 additional authors not shown)

Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

Submitted 6 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

Comments: Accepted at ACL 2024; Dataset: https://hf.co/datasets/allenai/dolma; Code: https://github.com/allenai/dolma

arXiv:2312.10523 [pdf, other]

Paloma: A Benchmark for Evaluating Language Model Fit

Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge

Abstract: Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best and code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings such as that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains or that loss is dominated by the most frequently occurring strings in the vocabulary.

Submitted 7 December, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: Conference: NeurIPS 2024, Project Page: https://paloma.allen.ai/

arXiv:2310.15113 [pdf]

Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

Authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen

Abstract: Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results -- through the lens of morphology -- cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

Submitted 26 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: EMNLP 2023

arXiv:2308.11456 [pdf]

Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users

Authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel

Abstract: The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning based denoising system that runs in real time on iPhone 7 and Samsung Galaxy S10 (25ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40%, for both the listening test and the live conversation compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system that is implemented on a mobile device, streamed directly to users' hearing aids using only a single channel as audio input while improving user satisfaction on all tested aspects, including speech intelligibility. This includes overall preference of the denoised and streamed signal over the hearing aid, thereby accepting the higher latency for the significant improvement in speech understanding.

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2306.03003 [pdf, other]

KSIM: simulating KIDSpec, a Microwave Kinetic Inductance Detector spectrograph for the optical/NIR

Authors: V. Benedict Hofmann, Kieran O'Brien

Abstract: KIDSpec, the Kinetic Inductance Detector Spectrometer, is a proposed optical to near IR Microwave Kinetic Inductance Detector (MKID) spectrograph. MKIDs are superconducting photon counting detectors which are able to resolve the energy of incoming photons and their time of arrival. KIDSpec will use these detectors to separate incoming spectral orders from a grating, thereby not requiring a cross-disperser. In this paper we present a simulation tool for KIDSpec's potential performance upon construction to optimise a given design. This simulation tool is the KIDSpec Simulator (KSIM), a Python package designed to simulate a variety of KIDSpec and observation parameters. A range of astrophysical objects are simulated: stellar objects, an SDSS observed galaxy, a Seyfert galaxy, and a mock galaxy spectrum from the JAGUAR catalogue. Multiple medium spectral resolution designs for KIDSpec are simulated. The possible impact of MKID energy resolution variance and dead pixels was simulated, with impacts to KIDSpec performance observed using the Reduced Chi-Squared (RCS) value. Using dead pixel percentages from current instruments, the RCS result was found to only increase to 1.21 at worst for one of the designs simulated. SNR comparisons of object simulations between KSIM and X-Shooter's ETC were also simulated. KIDSpec offers a particular improvement over X-Shooter for short and faint observations. For a Seyfert galaxy ($m_{R}=21$) simulation with a 180s exposure, KIDSpec had an average SNR of 4.8, in contrast to 1.5 for X-Shooter. Using KSIM the design of KIDSpec can be optimised to improve the instrument further.

Submitted 5 June, 2023; originally announced June 2023.

Comments: 17 pages, 13 figures, accepted to RASTI

arXiv:2212.07547 [pdf, other]

Unsupervised Detection of Contextualized Embedding Bias with Application to Ideology

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: We propose a fully unsupervised method to detect bias in contextualized embeddings. The method leverages the assortative information latently encoded by social networks and combines orthogonality regularization, structured sparsity learning, and graph neural networks to find the embedding subspace capturing this information. As a concrete example, we focus on the phenomenon of ideological bias: we introduce the concept of an ideological subspace, show how it can be found by applying our method to online discussion forums, and present techniques to probe it. Our experiments suggest that the ideological subspace encodes abstract evaluative semantics and reflects changes in the political left-right spectrum during the presidency of Donald Trump.

Submitted 14 December, 2022; originally announced December 2022.

Comments: ICML 2022

arXiv:2210.13181 [pdf, other]

The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative

Authors: Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, Hinrich Schütze

Abstract: Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models' behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

Submitted 24 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2209.05144 [pdf, other]

doi 10.1117/12.2628889

What could KIDSpec, a new MKID spectrograph, do on the ELT?

Authors: V. Benedict Hofmann, Kieran O'Brien, Deli Geng

Abstract: Microwave Kinetic Inductance Detectors (MKIDs) are beginning to become more prominent in astronomical instrumentation, due to their sensitivity, low noise, high pixel count for superconducting detectors, and inherent energy and time resolving capability. The Kinetic Inductance Detector Spectrometer (KIDSpec) will take advantage of these features, KIDSpec is a medium resolution MKID spectrograph for the optical/near infrared. KIDSpec will contribute to many science areas particularly those involving short and/or faint observations. When short period binary systems are found, typical CCD detectors will struggle to characterise these systems due to the very short exposures required, causing errors as large as the estimated parameter itself. The KIDSpec Simulator (KSIM) has been developed to investigate how much KIDSpec could improve on this. KIDSpec was simulated on an ELT class telescope to find the extent of its potential, and it was found that KIDSpec could observe a $m_{V}\approx{24}$ with an SNR of 5 for a 10s exposure at 1420 spectral resolution. This would mean that KIDSpec on an ELT class telescope could spectroscopically follow up on any LSST photometric discoveries of LISA verification sources.

Submitted 12 September, 2022; originally announced September 2022.

Comments: Presented at SPIE Astronomical Telescopes & Instrumentation 2022. 7 pages, 4 figures

Journal ref: Proc. SPIE 12184, Ground-based and Airborne Instrumentation for Astronomy IX, 1218419 (29 August 2022)

arXiv:2206.11567 [pdf]

Restoring speech intelligibility for hearing aid users with deep learning

Authors: Peter Udo Diehl, Yosef Singer, Hannes Zilly, Uwe Schönfeld, Paul Meyer-Rachner, Mark Berry, Henning Sprekeler, Elias Sprengel, Annett Pudszuhn, Veit M. Hofmann

Abstract: Almost half a billion people world-wide suffer from disabling hearing loss. While hearing aids can partially compensate for this, a large proportion of users struggle to understand speech in situations with background noise. Here, we present a deep learning-based algorithm that selectively suppresses noise while maintaining speech signals. The algorithm restores speech intelligibility for hearing aid users to the level of control subjects with normal hearing. It consists of a deep network that is trained on a large custom database of noisy speech signals and is further optimized by a neural architecture search, using a novel deep learning-based metric for speech intelligibility. The network achieves state-of-the-art denoising on a range of human-graded assessments, generalizes across different noise categories and - in contrast to classic beamforming approaches - operates on a single microphone. The system runs in real time on a laptop, suggesting that large-scale deployment on hearing aid chips could be achieved within a few years. Deep learning-based denoising therefore holds the potential to improve the quality of life of millions of hearing impaired people soon.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2203.10010 [pdf, other]

CaMEL: Case Marker Extraction without Labels

Authors: Leonie Weissweiler, Valentin Hofmann, Masoud Jalili Sabet, Hinrich Schütze

Abstract: We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.

Submitted 28 March, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: ACL 2022

arXiv:2203.08769 [pdf, other]

Room temperature donor incorporation for quantum devices: arsine on germanium

Authors: Emily V. S. Hofmann, Taylor J. Z. Stock, Oliver Warschkow, Rebecca Conybeare, Neil J. Curson, Steven R. Schofield

Abstract: Germanium has emerged as an exceptionally promising material for spintronics and quantum information applications, with significant fundamental advantages over silicon. However, efforts to create atomic-scale devices using donor atoms as qubits have largely focussed on phosphorus in silicon. Positioning phosphorus in silicon with atomic-scale precision requires a thermal incorporation anneal, but the low success rate for this step has been shown to be a fundamental limitation prohibiting the scale-up to large-scale devices. Here, we present a comprehensive study of arsine (AsH$_3$) on the germanium (001) surface. We show that, unlike any previously studied dopant precursor on silicon or germanium, arsenic atoms fully incorporate into substitutional surface lattice sites at room temperature. Our results pave the way for the next generation of atomic-scale donor devices combining the superior electronic properties of germanium with the enhanced properties of arsine/germanium chemistry that promises scale-up to large numbers of deterministically-placed qubits.

Submitted 16 March, 2022; originally announced March 2022.

Comments: 8 pages, 4 figures, plus 2 pages supplementary information and 1 supplementary figure

arXiv:2203.08565 [pdf, other]

Geographic Adaptation of Pretrained Language Models

Authors: Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: the geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.

Submitted 28 January, 2024; v1 submitted 16 March, 2022; originally announced March 2022.

Comments: TACL 2024 (pre-MIT Press publication version)

arXiv:2104.08829 [pdf, other]

Modeling Ideological Salience and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity

Authors: Valentin Hofmann, Xiaowen Dong, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing upon insights from moral psychology. Our architecture combines graph neural networks with structured sparsity learning and results in representations for concepts and subreddits that capture temporal ideological dynamics such as right-wing and left-wing radicalization.

Submitted 14 December, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: NAACL 2022 (Findings)

arXiv:2101.00403 [pdf, other]

Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

Submitted 2 June, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

Comments: ACL 2021

arXiv:2010.12684 [pdf, other]

Dynamic Contextualized Word Embeddings

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.

Submitted 8 June, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: ACL 2021

arXiv:2005.00672 [pdf, other]

DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.

Submitted 7 October, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

arXiv:1910.06685 [pdf]

Atomic-Scale Patterning of Arsenic in Silicon by Scanning Tunneling Microscopy

Authors: Taylor J. Z. Stock, Oliver Warschkow, Procopios C. Constantinou, Juerong Li, Sarah Fearn, Eleanor Crane, Emily V. S. Hofmann, Alexander Kölker, David R. McKenzie, Steven R. Schofield, Neil J. Curson

Abstract: Over the last two decades, prototype devices for future classical and quantum computing technologies have been fabricated, by using scanning tunneling microscopy and hydrogen resist lithography to position phosphorus atoms in silicon with atomic-scale precision. Despite these successes, phosphine remains the only donor precursor molecule to have been demonstrated as compatible with the hydrogen resist lithography technique. The potential benefits of atomic-scale placement of alternative dopant species have, until now, remained unexplored. In this work, we demonstrate successful fabrication of atomic-scale structures of arsenic-in-silicon. Using a scanning tunneling microscope tip, we pattern a monolayer hydrogen mask to selectively place arsenic atoms on the Si(001) surface using arsine as the precursor molecule. We fully elucidate the surface chemistry and reaction pathways of arsine on Si(001), revealing significant differences to phosphine. We explain how these differences result in enhanced surface immobilization and in-plane confinement of arsenic compared to phosphorus, and a dose-rate independent arsenic saturation density of $0.24{\pm}0.04$ ML. We demonstrate the successful encapsulation of arsenic delta-layers using silicon molecular beam epitaxy, and find electrical characteristics that are competitive with equivalent structures fabricated with phosphorus. Arsenic delta-layers are also found to offer improvement in out-of-plane confinement compared to similarly prepared phosphorus layers, while still retaining >80% carrier activation and sheet resistances of $<2 kΩ/{\square}$. These excellent characteristics of arsenic represent opportunities to enhance existing capabilities of atomic-scale fabrication of dopant structures in silicon, and are particularly important for three-dimensional devices, where vertical control of the position of device components is critical.

Submitted 15 October, 2019; originally announced October 2019.

Showing 1–27 of 27 results for author: Hofmann, V