-
The Surprising Soupability of Documents in State Space Models
Authors:
Yasaman Jafari,
Zixian Wang,
Leon Bergen,
Taylor Berg-Kirkpatrick
Abstract:
We investigate whether hidden states from Structured State Space Models (SSMs) can be merged post-hoc to support downstream reasoning. Inspired by model souping, we propose a strategy where documents are encoded independently and their representations are pooled -- via simple operations like averaging -- into a single context state. This approach, which we call document souping, enables modular encoding and reuse without reprocessing the full input for each query. We finetune Mamba2 models to produce soupable representations and find that they support multi-hop QA, sparse retrieval, and long-document reasoning with strong accuracy. On HotpotQA, souping ten independently encoded documents nearly matches the performance of a cross-encoder trained on the same inputs.
Submitted 29 May, 2025;
originally announced May 2025.
-
Quiet Feature Learning in Algorithmic Tasks
Authors:
Prudhviraj Naidu,
Zixian Wang,
Leon Bergen,
Ramamohan Paturi
Abstract:
We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models' internal representations reveals the learning of quiet features during the stagnant phase, followed by sudden acquisition of loud features that coincide with the sharp drop in loss. Our ablation experiments show that disrupting a single learned feature can dramatically degrade performance, providing evidence of their causal role in task performance. These findings challenge the prevailing assumption that next-token predictive loss reliably tracks incremental progress; instead, key internal features may be developing below the surface until they coalesce, triggering a rapid performance gain.
Submitted 6 May, 2025;
originally announced May 2025.
-
EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers
Authors:
Jianyou Wang,
Weili Cao,
Kaicheng Wang,
Xiaoyue Wang,
Ashish Dalvi,
Gino Prasad,
Qishan Liang,
Hsuan-lin Her,
Ming Wang,
Qin Yang,
Gene W. Yeo,
David E. Neal,
Maxim Khan,
Christopher D. Rosin,
Ramamohan Paturi,
Leon Bergen
Abstract:
We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure models' performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performances still fall significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
Submitted 25 April, 2025;
originally announced April 2025.
-
Single-Pass Document Scanning for Question Answering
Authors:
Weili Cao,
Jianyou Wang,
Youze Zheng,
Longtian Bao,
Qirui Zheng,
Taylor Berg-Kirkpatrick,
Ramamohan Paturi,
Leon Bergen
Abstract:
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
Submitted 3 April, 2025;
originally announced April 2025.
-
Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
Authors:
Jianyou Wang,
Weili Cao,
Longtian Bao,
Youze Zheng,
Gil Pasternak,
Kaicheng Wang,
Xiaoyue Wang,
Ramamohan Paturi,
Leon Bergen
Abstract:
Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. The four benchmark tasks, drawn from more than 500 papers, cover the analysis of research study methodology, followed by evaluation of risk of bias in these studies. The benchmark contains 2000 expert-generated bias annotations, and a human-validated pipeline for fine-grained alignment with research paper content. We evaluate a range of large language models on the benchmark, and find that these models fall significantly short of expert-level performance. By providing a standardized tool for measuring judgments of study quality, the benchmark can help to guide systems that perform large-scale aggregation of scientific data. The dataset is available at https://github.com/RoBBR-Benchmark/RoBBR.
Submitted 27 November, 2024;
originally announced November 2024.
-
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Authors:
Bohan Lyu,
Yadi Cao,
Duncan Watson-Parris,
Leon Bergen,
Taylor Berg-Kirkpatrick,
Rose Yu
Abstract:
Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations for complex ones. While integrating LLMs with tools can mitigate this reliability issue, models fine-tuned only on tool usage often over-rely on them, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess the complexity of the problem before choosing the solutions, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model's accuracy, and train it to maintain direct reasoning for simple problems while switching to tools for challenging ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27% higher answer accuracy and 13.76% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4 and Claude-3.5 on 4 custom-created datasets.
Submitted 5 February, 2025; v1 submitted 1 November, 2024;
originally announced November 2024.
-
ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models
Authors:
Veeramakali Vignesh Manivannan,
Yasaman Jafari,
Srikar Eranky,
Spencer Ho,
Rose Yu,
Duncan Watson-Parris,
Yian Ma,
Leon Bergen,
Taylor Berg-Kirkpatrick
Abstract:
The use of Large Language Models (LLMs) in climate science has recently gained significant attention. However, a critical issue remains: the lack of a comprehensive evaluation framework capable of assessing the quality and scientific validity of model outputs. To address this issue, we develop ClimaGen (Climate QA Generator), an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. As a result, we present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science. Finally, we develop evaluation strategies and compare different LLMs on our benchmarks. Our results offer novel insights into various approaches used to enhance knowledge of climate LLMs. The source code is publicly available at https://github.com/Rose-STL-Lab/genie-climaqa
Submitted 9 March, 2025; v1 submitted 22 October, 2024;
originally announced October 2024.
-
Dissociation of Faithful and Unfaithful Reasoning in LLMs
Authors:
Evelyn Yee,
Alice Li,
Chenyu Tang,
Yeon Ho Jung,
Ramamohan Paturi,
Leon Bergen
Abstract:
Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.
Submitted 2 September, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
IR2: Information Regularization for Information Retrieval
Authors:
Jianyou Wang,
Kaicheng Wang,
Xiaoyue Wang,
Weili Cao,
Ramamohan Paturi,
Leon Bergen
Abstract:
Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline (input, prompt, and output), each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.
Submitted 1 April, 2025; v1 submitted 25 February, 2024;
originally announced February 2024.
-
BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives
Authors:
Xiaoyue Wang,
Jianyou Wang,
Weili Cao,
Kaicheng Wang,
Ramamohan Paturi,
Leon Bergen
Abstract:
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO). BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives. The benchmark's complexity and compact size make it suitable for evaluating large language model (LLM)-based information retrieval systems. We present a modular framework for investigating factors that may influence LLM performance on retrieval tasks, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.
Submitted 3 April, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries
Authors:
Jianyou Wang,
Kaicheng Wang,
Xiaoyue Wang,
Prudhviraj Naidu,
Leon Bergen,
Ramamohan Paturi
Abstract:
In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research. We developed a benchmark dataset within the field of computer science, consisting of 100 human-authored complex query cases. For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them. Recognizing the significant labor of expert annotation, we also introduce Anno-GPT, a scalable framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, without compromising quality. Furthermore, due to the multi-tiered structure of these complex queries, the DORIS-MAE dataset can be extended to over 4,000 sub-query test cases without requiring additional annotation. We evaluated 17 recent retrieval methods on DORIS-MAE, observing notable performance drops compared to traditional datasets. This highlights the need for better approaches to handle complex, multifaceted queries in scientific research. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset.
Submitted 28 October, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Systematic Generalization with Edge Transformers
Authors:
Leon Bergen,
Timothy J. O'Donnell,
Dzmitry Bahdanau
Abstract:
Recent research suggests that systematic generalization in natural language understanding remains a challenge for state-of-the-art neural models such as Transformers and Graph Neural Networks. To tackle this challenge, we propose Edge Transformer, a new model that combines inspiration from Transformers and rule-based symbolic AI. The first key idea in Edge Transformers is to associate vector states with every edge, that is, with every pair of input nodes -- as opposed to just every node, as it is done in the Transformer model. The second major innovation is a triangular attention mechanism that updates edge representations in a way that is inspired by unification from logic programming. We evaluate Edge Transformer on compositional generalization benchmarks in relational reasoning, semantic parsing, and dependency parsing. In all three settings, the Edge Transformer outperforms Relation-aware, Universal and classical Transformer baselines.
Submitted 1 December, 2021;
originally announced December 2021.
-
Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention
Authors:
Leon Bergen,
Dzmitry Bahdanau,
Timothy J. O'Donnell
Abstract:
We present a model that jointly learns the denotations of words together with their groundings using a truth-conditional semantics. Our model builds on the neurosymbolic approach of Mao et al. (2019), learning to ground objects in the CLEVR dataset (Johnson et al., 2017) using a novel parallel attention mechanism. The model achieves state-of-the-art performance on visual question answering, learning to detect and ground objects with question performance as the only training signal. We also show that the model is able to learn flexible non-canonical groundings just by adjusting answers to questions in the training set.
Submitted 14 April, 2021;
originally announced April 2021.
-
Word Frequency Does Not Predict Grammatical Knowledge in Language Models
Authors:
Charles Yu,
Ryan Sie,
Nico Tedeschi,
Leon Bergen
Abstract:
Neural language models learn, to varying degrees of accuracy, the grammatical properties of natural languages. In this work, we investigate whether there are systematic sources of variation in the language models' accuracy. Focusing on subject-verb agreement and reflexive anaphora, we find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models. Surprisingly, we find that across four orders of magnitude, corpus frequency is unrelated to a noun's performance on grammatical tasks. Finally, we find that a novel noun's grammatical properties can be few-shot learned from various types of training data. The results present a paradox: there should be less variation in grammatical performance than is actually observed.
Submitted 26 October, 2020;
originally announced October 2020.
-
Unusual magnetoelectric effect in paramagnetic rare-earth langasite
Authors:
L. Weymann,
L. Bergen,
Th. Kain,
Anna Pimenov,
A. Shuvaev,
E. Constable,
D. Szaller,
A. Pimenov,
B. V. Mill,
A. M. Kuzmenko,
V. Yu. Ivanov,
N. V. Kostyuchenko,
A. I. Popov,
A. K. Zvezdin,
A. A. Mukhin,
M. Mostovoy
Abstract:
Violation of time reversal and spatial inversion symmetries has profound consequences for elementary particles and cosmology. Spontaneous breaking of these symmetries at phase transitions gives rise to unconventional physical phenomena in condensed matter systems, such as ferroelectricity induced by magnetic spirals, electromagnons, non-reciprocal propagation of light and spin waves, and the linear magnetoelectric (ME) effect - the electric polarization proportional to the applied magnetic field and the magnetization induced by the electric field. Here, we report the experimental study of the holmium-doped langasite, Ho$_{x}$La$_{3-x}$Ga$_5$SiO$_{14}$, showing a puzzling combination of linear and highly non-linear ME responses in the disordered paramagnetic state: its electric polarization grows linearly with the magnetic field but oscillates many times upon rotation of the magnetic field vector. We propose a simple phenomenological Hamiltonian describing this unusual behavior and derive it microscopically using the coupling of magnetic multipoles of the rare-earth ions to the electric field.
Submitted 11 April, 2020;
originally announced April 2020.
-
Grammar Induction for Minimalist Grammars using Variational Bayesian Inference : A Technical Report
Authors:
Eva Portelance,
Amelia Bruno,
Daniel Harasim,
Leon Bergen,
Timothy J. O'Donnell
Abstract:
The following technical report presents a formal approach to probabilistic minimalist grammar parameter estimation. We describe a formalization of a minimalist grammar. We then present an algorithm for the application of variational Bayesian inference to this formalization.
Submitted 28 August, 2019; v1 submitted 31 October, 2017;
originally announced October 2017.
-
Non-uniform mixing of quantum walk on cycles
Authors:
William Adamczak,
Kevin Andrew,
Leon Bergen,
Dillon Ethier,
Peter Hernberg,
Jennifer Lin,
Christino Tamon
Abstract:
A classical lazy random walk on cycles is known to mix to the uniform distribution. In contrast, we show that a continuous-time quantum walk on cycles exhibits strong non-uniform mixing properties. Our results include the following:
- The instantaneous distribution of a quantum walk on most even-length cycles is never uniform.
- The average distribution of a quantum walk on any Abelian circulant graph is never uniform. As a corollary, the average distribution of a quantum walk on any standard circulant graph, such as the cycles, complete graphs, and even hypercubes, is never uniform.
△ Less
Submitted 15 August, 2007;
originally announced August 2007.