-
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Authors:
Sohee Yang,
Sang-Woo Lee,
Nora Kassner,
Daniela Gottesman,
Sebastian Riedel,
Mor Geva
Abstract:
Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general "meta-cognitive" awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.
Submitted 12 June, 2025;
originally announced June 2025.
-
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Authors:
Ido Cohen,
Daniela Gottesman,
Mor Geva,
Raja Giryes
Abstract:
Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model performance when answering factual questions about an entity described in text versus depicted in an image. Our results reveal a significant accuracy drop - reaching 18% for some models - when the entity is presented visually instead of textually. To study this gap we present PopVQA, a dataset which allows separating entity recognition and question answering, and use it to benchmark several models. We hypothesize that this decline arises from limitations in how information flows from image tokens to query tokens. Thus, we use mechanistic interpretability tools to reveal that, although image tokens are preprocessed by the vision encoder, meaningful information flow from these tokens occurs only in the much deeper layers. Furthermore, critical image processing happens in the language model's middle layers, allowing few layers for consecutive reasoning, highlighting a potential inefficiency in how the model utilizes its layers for reasoning. These insights shed light on the internal mechanics of VLMs and offer pathways for enhancing their reasoning capabilities. PopVQA can be found at https://huggingface.co/datasets/idoco/PopVQA.
Submitted 7 June, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
A Criterion for Quantum Advantage
Authors:
Chaitanya Karamchedu,
Matthew Fox,
Daniel Gottesman
Abstract:
Assuming the polynomial hierarchy is infinite, we prove a sufficient condition for determining if uniform and polynomial size quantum circuits over a non-universal gate set are not efficiently classically simulable in the weak multiplicative sense. Our criterion exploits the fact that subgroups of $\mathrm{SL}(2;\mathbb{C})$ are essentially either discrete or dense in $\mathrm{SL}(2;\mathbb{C})$. Using our criterion, we give a new proof that both instantaneous quantum polynomial (IQP) circuits and conjugated Clifford circuits (CCCs) afford a quantum advantage. We also prove that both commuting CCCs and CCCs over various fragments of the Clifford group afford a quantum advantage, which settles two questions of Bouland, Fitzsimons, and Koh. Our results imply that circuits over just $(U^\dagger \otimes U^\dagger) \mathrm{CZ} (U \otimes U)$ afford a quantum advantage for almost all $U \in \mathrm{U}(2)$.
Submitted 4 November, 2024;
originally announced November 2024.
-
Eliciting Textual Descriptions from Representations of Continuous Prompts
Authors:
Dana Ramati,
Daniela Gottesman,
Mor Geva
Abstract:
Continuous prompts, or "soft prompts", are a widely-adopted parameter-efficient tuning strategy for large language models, but are often less favorable due to their opaque nature. Prior attempts to interpret continuous prompts relied on projecting individual prompt tokens onto the vocabulary space. However, this approach is problematic as performant prompts can yield arbitrary or contradictory text, and it interprets prompt tokens individually. In this work, we propose a new approach to interpret continuous prompts that elicits textual descriptions from their representations during model inference. Using a Patchscopes variant (Ghandeharioun et al., 2024) called InSPEcT over various tasks, we show our method often yields accurate task descriptions which become more faithful as task performance increases. Moreover, an elaborated version of InSPEcT reveals biased features in continuous prompts, whose presence correlates with biased model predictions. Providing an effective interpretability solution, InSPEcT can be leveraged to debug unwanted properties in continuous prompts and inform developers on ways to mitigate them.
Submitted 15 October, 2024;
originally announced October 2024.
-
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
Authors:
Eden Biran,
Daniela Gottesman,
Sohee Yang,
Mor Geva,
Amir Globerson
Abstract:
Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity (John Lennon), and another for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution, the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 66% of previously incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes lack the needed functionality. Overall, our methods and findings open further opportunities for understanding and improving latent reasoning in transformer-based LLMs.
Submitted 14 October, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Estimating Knowledge in Large Language Models Without Generating a Single Token
Authors:
Daniela Gottesman,
Mor Geva
Abstract:
To evaluate knowledge in large language models (LLMs), current methods query the model and then evaluate its generated responses. In this work, we ask whether evaluation can be done before the model has generated any text. Concretely, is it possible to estimate how knowledgeable a model is about a certain entity, only from its internal computation? We study this question with two tasks: given a subject entity, the goal is to predict (a) the ability of the model to answer common questions about the entity, and (b) the factuality of open-ended responses generated by the model about the entity. Experiments with a variety of LLMs show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks - correlating with both the QA accuracy of the model per-subject and FActScore, a recent factuality metric in open-ended generation. Moreover, KEEN naturally aligns with the model's hedging behavior and faithfully reflects changes in the model's knowledge after fine-tuning. Lastly, we show a more interpretable yet equally performant variant of KEEN, which highlights a small set of tokens indicative of clusters and gaps in the model's knowledge. Being simple and lightweight, KEEN can be leveraged to guide decisions such as when it is appropriate to apply further training or augment queries with retrieval.
Submitted 29 October, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Using Discretization for Extending the Set of Predictive Features
Authors:
Avi Rosenfeld,
Ron Illuz,
Dovid Gottesman,
Mark Last
Abstract:
To date, attribute discretization is typically performed by replacing the original set of continuous features with a transposed set of discrete ones. This paper provides support for a new idea that discretized features should often be used in addition to existing features and as such, datasets should be extended, and not replaced, by discretization. We also claim that discretization algorithms should be developed with the explicit purpose of enriching a non-discretized dataset with discretized values. We present such an algorithm, D-MIAT, a supervised algorithm that discretizes data based on Minority Interesting Attribute Thresholds. D-MIAT only generates new features when strong indications exist for one of the target values needing to be learned and thus is intended to be used in addition to the original data. We present extensive empirical results demonstrating the success of using D-MIAT on 28 benchmark datasets. We also demonstrate that 10 other discretization algorithms can also be used to generate features that yield improved performance when used in combination with the original non-discretized data. Our results show that the best predictive performance is attained using a combination of the original dataset with added features from a "standard" supervised discretization algorithm and D-MIAT.
Submitted 9 February, 2018;
originally announced February 2018.
-
The Quantum and Classical Complexity of Translationally Invariant Tiling and Hamiltonian Problems
Authors:
Daniel Gottesman,
Sandy Irani
Abstract:
We study the complexity of a class of problems involving satisfying constraints which remain the same under translations in one or more spatial directions. In this paper, we show hardness of a classical tiling problem on an N x N 2-dimensional grid and a quantum problem involving finding the ground state energy of a 1-dimensional quantum system of N particles. In both cases, the only input is N, provided in binary. We show that the classical problem is NEXP-complete and the quantum problem is QMA_EXP-complete. Thus, an algorithm for these problems which runs in time polynomial in N (exponential in the input size) would imply that EXP = NEXP or BQEXP = QMA_EXP, respectively. Although tiling in general is already known to be NEXP-complete, to our knowledge, all previous reductions require that either the set of tiles and their constraints or some varying boundary conditions be given as part of the input. In the problem considered here, these are fixed, constant-sized parameters of the problem. Instead, the problem instance is encoded solely in the size of the system.
Submitted 23 August, 2010; v1 submitted 14 May, 2009;
originally announced May 2009.
-
Approximate Quantum Error-Correcting Codes and Secret Sharing Schemes
Authors:
Claude Crepeau,
Daniel Gottesman,
Adam Smith
Abstract:
It is a standard result in the theory of quantum error-correcting codes that no code of length n can fix more than n/4 arbitrary errors, regardless of the dimension of the coding and encoded Hilbert spaces. However, this bound only applies to codes which recover the message exactly. Naively, one might expect that correcting errors to very high fidelity would only allow small violations of this bound. This intuition is incorrect: in this paper we describe quantum error-correcting codes capable of correcting up to (n-1)/2 arbitrary errors with fidelity exponentially close to 1, at the price of increasing the size of the registers (i.e., the coding alphabet). This demonstrates a sharp distinction between exact and approximate quantum error correction. The codes have the property that any $t$ components reveal no information about the message, and so they can also be viewed as error-tolerant secret sharing schemes.
The construction has several interesting implications for cryptography and quantum information theory. First, it suggests that secret sharing is a better classical analogue to quantum error correction than is classical error correction. Second, it highlights an error in a purported proof that verifiable quantum secret sharing (VQSS) is impossible when the number of cheaters t is n/4. More generally, the construction illustrates a difference between exact and approximate requirements in quantum cryptography and (yet again) the delicacy of security proofs and impossibility results in the quantum model.
Submitted 15 March, 2005;
originally announced March 2005.
-
Improved Simulation of Stabilizer Circuits
Authors:
Scott Aaronson,
Daniel Gottesman
Abstract:
The Gottesman-Knill theorem says that a stabilizer circuit -- that is, a quantum circuit consisting solely of CNOT, Hadamard, and phase gates -- can be simulated efficiently on a classical computer. This paper improves that theorem in several directions. First, by removing the need for Gaussian elimination, we make the simulation algorithm much faster at the cost of a factor-2 increase in the number of bits needed to represent a state. We have implemented the improved algorithm in a freely-available program called CHP (CNOT-Hadamard-Phase), which can handle thousands of qubits easily. Second, we show that the problem of simulating stabilizer circuits is complete for the classical complexity class ParityL, which means that stabilizer circuits are probably not even universal for classical computation. Third, we give efficient algorithms for computing the inner product between two stabilizer states, putting any n-qubit stabilizer circuit into a "canonical form" that requires at most O(n^2/log n) gates, and other useful tasks. Fourth, we extend our simulation algorithm to circuits acting on mixed states, circuits containing a limited number of non-stabilizer gates, and circuits acting on general tensor-product initial states but containing only a limited number of measurements.
△ Less
Submitted 18 June, 2008; v1 submitted 25 June, 2004;
originally announced June 2004.
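To make the classical simulability described in the abstract above concrete, here is a minimal sketch of a stabilizer tableau updated under CNOT, Hadamard, and phase gates. It is not the CHP program: it tracks only the n stabilizer generators (no destabilizer rows, which is where the factor-2 increase in bits comes from) and omits measurement. The class name `StabilizerTableau` and its methods are hypothetical, and the update rules are the standard binary-tableau rules for these three gates.

```python
class StabilizerTableau:
    """Tracks the n stabilizer generators of an n-qubit state starting in |0...0>."""

    def __init__(self, n):
        self.n = n
        # Row i starts as the generator +Z_i: empty X part, Z part on qubit i.
        self.x = [[0] * n for _ in range(n)]
        self.z = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
        self.r = [0] * n  # sign bit per generator: 0 means +, 1 means -

    def cnot(self, a, b):
        """Apply CNOT with control a and target b to every generator."""
        for i in range(self.n):
            x, z = self.x[i], self.z[i]
            self.r[i] ^= x[a] & z[b] & (x[b] ^ z[a] ^ 1)
            x[b] ^= x[a]
            z[a] ^= z[b]

    def hadamard(self, a):
        """Apply Hadamard on qubit a: swaps X and Z, and Y picks up a sign."""
        for i in range(self.n):
            x, z = self.x[i], self.z[i]
            self.r[i] ^= x[a] & z[a]
            x[a], z[a] = z[a], x[a]

    def phase(self, a):
        """Apply the phase gate on qubit a: X -> Y, Y -> -X, Z unchanged."""
        for i in range(self.n):
            x, z = self.x[i], self.z[i]
            self.r[i] ^= x[a] & z[a]
            z[a] ^= x[a]

    def generators(self):
        """Return the generators as signed Pauli strings such as '+XX'."""
        names = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
        return [
            ("-" if self.r[i] else "+")
            + "".join(names[(self.x[i][q], self.z[i][q])] for q in range(self.n))
            for i in range(self.n)
        ]


# Prepare a Bell pair: Hadamard on qubit 0, then CNOT from qubit 0 to qubit 1.
bell = StabilizerTableau(2)
bell.hadamard(0)
bell.cnot(0, 1)
print(bell.generators())  # ['+XX', '+ZZ']
```

Each gate costs O(n) bit operations in this sketch, since it touches one or two columns of n rows, which is why thousands of qubits are within reach for this representation.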
-
Authentication of Quantum Messages
Authors:
Howard Barnum,
Claude Crepeau,
Daniel Gottesman,
Adam Smith,
Alain Tapp
Abstract:
Authentication is a well-studied area of classical cryptography: a sender S and a receiver R sharing a classical private key want to exchange a classical message with the guarantee that the message has not been modified by any third party with control of the communication line. In this paper we define and investigate the authentication of messages composed of quantum states. Assuming S and R have access to an insecure quantum channel and share a private, classical random key, we provide a non-interactive scheme that enables S both to encrypt and to authenticate (with unconditional security) an m qubit message by encoding it into m+s qubits, where the failure probability decreases exponentially in the security parameter s. The classical private key is 2m+O(s) bits. To achieve this, we give a highly efficient protocol for testing the purity of shared EPR pairs. We also show that any scheme to authenticate quantum messages must also encrypt them. (In contrast, one can authenticate a classical message while leaving it publicly readable.) This has two important consequences: On one hand, it allows us to give a lower bound of 2m key bits for authenticating m qubits, which makes our protocol asymptotically optimal. On the other hand, we use it to show that digitally signing quantum states is impossible, even with only computational security.
Submitted 20 May, 2002;
originally announced May 2002.