-
Transformers Struggle to Learn to Search
Authors:
Abulhair Saparov,
Srushti Pawar,
Shreyas Pimpalgaonkar,
Nitish Joshi,
Richard Yuanzhe Pang,
Vishakh Padmakumar,
Seyed Mehran Kazemi,
Najoung Kim,
He He
Abstract:
Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to gener…
▽ More
Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search.
We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that transformers perform search at every vertex in parallel: For each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in $n_{\text{layers}}$.
However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
△ Less
Submitted 16 March, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Authors:
Andreas Opedal,
Haruki Shirakami,
Bernhard Schölkopf,
Abulhair Saparov,
Mrinmaya Sachan
Abstract:
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various…
▽ More
Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
△ Less
Submitted 14 February, 2025; v1 submitted 17 October, 2024;
originally announced October 2024.
-
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Authors:
Daking Rai,
Yilun Zhou,
Shi Feng,
Abulhair Saparov,
Ziyu Yao
Abstract:
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensivel…
▽ More
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers to this field. To fill this gap, we provide a comprehensive survey from a task-centric perspective, organizing the taxonomy of MI research around specific research questions or tasks. We outline the fundamental objects of study in MI, along with the techniques, evaluation methods, and key findings for each task in the taxonomy. In particular, we present a task-centric taxonomy as a roadmap for beginners to navigate the field by helping them quickly identify impactful problems in which they are most interested and leverage MI for their benefit. Finally, we discuss the current gaps in the field and suggest potential future directions for MI research.
△ Less
Submitted 15 March, 2025; v1 submitted 2 July, 2024;
originally announced July 2024.
-
LLMs Are Prone to Fallacies in Causal Inference
Authors:
Nitish Joshi,
Abulhair Saparov,
Yixin Wang,
He He
Abstract:
Recent work shows that causal facts can be effectively extracted from LLMs through prompting, facilitating the creation of causal graphs for causal inference tasks. However, it is unclear if this success is limited to explicitly-mentioned causal facts in the pretraining data which the model can memorize. Thus, this work investigates: Can LLMs infer causal relations from other relational data in te…
▽ More
Recent work shows that causal facts can be effectively extracted from LLMs through prompting, facilitating the creation of causal graphs for causal inference tasks. However, it is unclear if this success is limited to explicitly-mentioned causal facts in the pretraining data which the model can memorize. Thus, this work investigates: Can LLMs infer causal relations from other relational data in text? To disentangle the role of memorized causal facts vs inferred causal relations, we finetune LLMs on synthetic data containing temporal, spatial and counterfactual relations, and measure whether the LLM can then infer causal relations. We find that: (a) LLMs are susceptible to inferring causal relations from the order of two entity mentions in text (e.g. X mentioned before Y implies X causes Y); (b) if the order is randomized, LLMs still suffer from the post hoc fallacy, i.e. X occurs before Y (temporal relation) implies X causes Y. We also find that while LLMs can correctly deduce the absence of causal relations from temporal and spatial relations, they have difficulty inferring causal relations from counterfactuals, questioning their understanding of causality.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Authors:
Usman Anwar,
Abulhair Saparov,
Javier Rando,
Daniel Paleka,
Miles Turpin,
Peter Hase,
Ekdeep Singh Lubana,
Erik Jenner,
Stephen Casper,
Oliver Sourbut,
Benjamin L. Edelman,
Zhaowei Zhang,
Mario Günther,
Anton Korinek,
Jose Hernandez-Orallo,
Lewis Hammond,
Eric Bigelow,
Alexander Pan,
Lauro Langosco,
Tomasz Korbak,
Heidi Zhang,
Ruiqi Zhong,
Seán Ó hÉigeartaigh,
Gabriel Recchia,
Giulio Corsi
, et al. (17 additional authors not shown)
Abstract:
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
△ Less
Submitted 5 September, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
Authors:
Andreas Opedal,
Alessandro Stolfo,
Haruki Shirakami,
Ying Jiao,
Ryan Cotterell,
Bernhard Schölkopf,
Abulhair Saparov,
Mrinmaya Sachan
Abstract:
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the…
▽ More
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.
△ Less
Submitted 17 June, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis
Authors:
Hongyi Zheng,
Abulhair Saparov
Abstract:
Recent advances in prompt engineering enable large language models (LLMs) to solve multi-hop logical reasoning problems with impressive accuracy. However, there is little existing work investigating the robustness of LLMs with few-shot prompting techniques. Therefore, we introduce a systematic approach to test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic perturbations. W…
▽ More
Recent advances in prompt engineering enable large language models (LLMs) to solve multi-hop logical reasoning problems with impressive accuracy. However, there is little existing work investigating the robustness of LLMs with few-shot prompting techniques. Therefore, we introduce a systematic approach to test the robustness of LLMs in multi-hop reasoning tasks via domain-agnostic perturbations. We include perturbations at multiple levels of abstractions (e.g. lexical perturbations such as typos, and semantic perturbations such as the inclusion of intermediate reasoning steps in the questions) to conduct behavioral analysis on the LLMs. Throughout our experiments, we find that models are more sensitive to certain perturbations such as replacing words with their synonyms. We also demonstrate that increasing the proportion of perturbed exemplars in the prompts improves the robustness of few-shot prompting methods.
△ Less
Submitted 31 October, 2023;
originally announced November 2023.
-
Personas as a Way to Model Truthfulness in Language Models
Authors:
Nitish Joshi,
Javier Rando,
Abulhair Saparov,
Najoung Kim,
He He
Abstract:
Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being…
▽ More
Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features, and they form a (un)truthful persona. By training on this data, LMs can infer and represent the persona in its activation space. This allows the model to separate truth from falsehoods and controls the truthfulness of its generation. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetics as a synthetic environment, we show that structures of the pretraining data are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
△ Less
Submitted 6 February, 2024; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Retrieval-Augmented Chain-of-Thought in Semi-structured Domains
Authors:
Vaibhav Mavi,
Abulhair Saparov,
Chen Zhao
Abstract:
Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background kno…
▽ More
Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.
△ Less
Submitted 22 October, 2023;
originally announced October 2023.
-
World Models for Math Story Problems
Authors:
Andreas Opedal,
Niklas Stoehr,
Abulhair Saparov,
Mrinmaya Sachan
Abstract:
Solving math story problems is a complex task for students and NLP models alike, requiring them to understand the world as described in the story and reason over it to compute an answer. Recent years have seen impressive performance on automatically solving these problems with large pre-trained language models and innovative techniques to prompt them. However, it remains unclear if these models po…
▽ More
Solving math story problems is a complex task for students and NLP models alike, requiring them to understand the world as described in the story and reason over it to compute an answer. Recent years have seen impressive performance on automatically solving these problems with large pre-trained language models and innovative techniques to prompt them. However, it remains unclear if these models possess accurate representations of mathematical concepts. This leads to lack of interpretability and trustworthiness which impedes their usefulness in various applications. In this paper, we consolidate previous work on categorizing and representing math story problems and develop MathWorld, which is a graph-based semantic formalism specific for the domain of math story problems. With MathWorld, we can assign world models to math story problems which represent the situations and actions introduced in the text and their mathematical relationships. We combine math story problems from several existing datasets and annotate a corpus of 1,019 problems and 3,204 logical forms with MathWorld. Using this data, we demonstrate the following use cases of MathWorld: (1) prompting language models with synthetically generated question-answer pairs to probe their reasoning and world modeling abilities, and (2) generating new problems by using the world models as a design space.
△ Less
Submitted 26 February, 2025; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Authors:
Abulhair Saparov,
Richard Yuanzhe Pang,
Vishakh Padmakumar,
Nitish Joshi,
Seyed Mehran Kazemi,
Najoung Kim,
He He
Abstract:
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, an…
▽ More
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
△ Less
Submitted 3 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Authors:
Abulhair Saparov,
He He
Abstract:
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than th…
▽ More
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts (examples with intermediate reasoning steps). Existing benchmarks measure reasoning ability indirectly, by evaluating accuracy on downstream tasks such as mathematical reasoning. However, it is unclear how these models obtain the answers and whether they rely on simple heuristics rather than the generated chain-of-thought. To enable systematic exploration of the reasoning ability of LLMs, we present a new synthetic question-answering dataset called PrOntoQA, where each example is generated from a synthetic world model represented in first-order logic. This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis. Our analysis on InstructGPT and GPT-3 shows that LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: When multiple valid deduction steps are available, they are not able to systematically explore the different options.
△ Less
Submitted 1 March, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Towards General Natural Language Understanding with Probabilistic Worldbuilding
Authors:
Abulhair Saparov,
Tom M. Mitchell
Abstract:
We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations which greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of senten…
▽ More
We introduce the Probabilistic Worldbuilding Model (PWM), a new fully-symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and task-general NLU and AI. Humans create internal mental models of their observations which greatly aid in their ability to understand and reason about a large variety of problems. In PWM, the meanings of sentences, acquired facts about the world, and intermediate steps in reasoning are all expressed in a human-readable formal language, with the design goal of interpretability. PWM is Bayesian, designed specifically to be able to generalize to new domains and new tasks. We derive and implement an inference algorithm that reads sentences by parsing and abducing updates to its latent world model that capture the semantics of those sentences, and evaluate it on two out-of-domain question-answering datasets: (1) ProofWriter and (2) a new dataset we call FictionalGeoQA, designed to be more representative of real language but still simple enough to focus on evaluating reasoning ability, while being robust against heuristics. Our method outperforms baselines on both, thereby demonstrating its value as a proof-of-concept.
△ Less
Submitted 20 December, 2021; v1 submitted 6 May, 2021;
originally announced May 2021.
-
Jelly Bean World: A Testbed for Never-Ending Learning
Authors:
Emmanouil Antonios Platanios,
Abulhair Saparov,
Tom Mitchell
Abstract:
Machine learning has shown growing success in recent years. However, current machine learning systems are highly specialized, trained for particular problems or domains, and typically on a single narrow dataset. Human learning, on the other hand, is highly general and adaptable. Never-ending learning is a machine learning paradigm that aims to bridge this gap, with the goal of encouraging research…
▽ More
Machine learning has shown growing success in recent years. However, current machine learning systems are highly specialized, trained for particular problems or domains, and typically on a single narrow dataset. Human learning, on the other hand, is highly general and adaptable. Never-ending learning is a machine learning paradigm that aims to bridge this gap, with the goal of encouraging researchers to design machine learning systems that can learn to perform a wider variety of inter-related tasks in more complex environments. To date, there is no environment or testbed to facilitate the development and evaluation of never-ending learning systems. To this end, we propose the Jelly Bean World testbed. The Jelly Bean World allows experimentation over two-dimensional grid worlds which are filled with items and in which agents can navigate. This testbed provides environments that are sufficiently complex and where more generally intelligent algorithms ought to perform better than current state-of-the-art reinforcement learning approaches. It does so by producing non-stationary environments and facilitating experimentation with multi-task, multi-agent, multi-modal, and curriculum learning settings. We hope that this new freely-available software will prompt new research and interest in the development and evaluation of never-ending learning systems and more broadly, general intelligence systems.
△ Less
Submitted 14 February, 2020;
originally announced February 2020.
-
A Probabilistic Generative Grammar for Semantic Parsing
Authors:
Abulhair Saparov
Abstract:
Domain-general semantic parsing is a long-standing goal in natural language processing, where the semantic parser is capable of robustly parsing sentences from domains outside of which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and…
▽ More
Domain-general semantic parsing is a long-standing goal in natural language processing, where the semantic parser is capable of robustly parsing sentences from domains outside of which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and demonstrate its application to semantic parsing. Our approach relies on domain-independent supervision to generalize to new domains. We derive and implement efficient algorithms for training, parsing, and sentence generation. The work relies on a novel application of hierarchical Dirichlet processes (HDPs) for structured prediction, which we also present in this manuscript.
This manuscript is an excerpt of chapter 4 from the Ph.D. thesis of Saparov (2022), where the model plays a central role in a larger natural language understanding system.
This manuscript provides a new simplified and more complete presentation of the work first introduced in Saparov, Saraswat, and Mitchell (2017). The description and proofs of correctness of the training algorithm, parsing algorithm, and sentence generation algorithm are much simplified in this new presentation. We also describe the novel application of hierarchical Dirichlet processes for structured prediction. In addition, we extend the earlier work with a new model of word morphology, which utilizes the comprehensive morphological data from Wiktionary.
△ Less
Submitted 9 February, 2022; v1 submitted 20 June, 2016;
originally announced June 2016.