-
Mitigating Goal Misgeneralization with Minimax Regret
Authors:
Karim Abdel Sadek,
Matthew Farrugia-Roberts,
Usman Anwar,
Hannah Erlebach,
Christian Schroeder de Witt,
David Krueger,
Michael Dennis
Abstract:
Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer's intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment -- a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don't find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.
Submitted 3 July, 2025;
originally announced July 2025.
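The contrast between the MEV and MMER objectives described in the abstract above can be sketched on a toy payoff table. This is an illustrative example only, not the paper's formalism or experiments: the policy names, return values, and level distribution are all hypothetical, and the policy set is restricted to two fixed policies. With an adversary free to choose any distribution over levels, the worst-case expected regret of a fixed policy equals its largest per-level regret, which is what `max_regret` computes.

```python
# Hypothetical toy setup (not from the paper): each policy maps to its
# expected return on four environment levels. The proxy policy matches the
# best return on the three common levels but fails on the rare fourth one.
returns = {
    "proxy_policy": [1.0, 1.0, 1.0, 0.0],
    "intended_policy": [0.9, 0.9, 0.9, 0.9],
}
# Assumed fixed training distribution over levels (the fourth level is rare).
level_probs = [0.32, 0.32, 0.32, 0.04]


def expected_value(policy):
    """Expected return under the fixed training distribution (MEV objective)."""
    return sum(p * r for p, r in zip(level_probs, returns[policy]))


def max_regret(policy):
    """Largest per-level gap to the best available return on that level."""
    n_levels = len(level_probs)
    best = [max(returns[q][i] for q in returns) for i in range(n_levels)]
    return max(best[i] - returns[policy][i] for i in range(n_levels))


mev_choice = max(returns, key=expected_value)  # maximise expected value
mmer_choice = min(returns, key=max_regret)     # minimise worst-case regret

print("MEV selects: ", mev_choice)
print("MMER selects:", mmer_choice)
```

In this toy table the expected-value criterion prefers the proxy policy because the level on which it fails is rare, while the regret criterion prefers the intended policy because its worst-case shortfall is small.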
-
You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation
Authors:
Simon Pepin Lehalleur,
Jesse Hoogland,
Matthew Farrugia-Roberts,
Susan Wei,
Alexander Gietelink Oldenziel,
George Wang,
Liam Carroll,
Daniel Murfet
Abstract:
In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.
Submitted 8 February, 2025;
originally announced February 2025.
-
Dynamics of Transient Structure in In-Context Linear Regression Transformers
Authors:
Liam Carroll,
Jesse Hoogland,
Matthew Farrugia-Roberts,
Daniel Murfet
Abstract:
Modern deep neural networks display striking examples of rich internal computational structure. Uncovering principles governing the development of such structure is a priority for the science of deep learning. In this paper, we explore the transient ridge phenomenon: when transformers are trained on in-context linear regression tasks with intermediate task diversity, they initially behave like ridge regression before specializing to the tasks in their training distribution. This transition from a general solution to a specialized solution is revealed by joint trajectory principal component analysis. Further, we draw on the theory of Bayesian internal model selection to suggest a general explanation for the phenomena of transient structure in transformers, based on an evolving tradeoff between loss and complexity. We empirically validate this explanation by measuring the model complexity of our transformers as defined by the local learning coefficient.
Submitted 31 January, 2025; v1 submitted 29 January, 2025;
originally announced January 2025.
-
Loss Landscape Degeneracy Drives Stagewise Development in Transformers
Authors:
Jesse Hoogland,
George Wang,
Matthew Farrugia-Roberts,
Liam Carroll,
Susan Wei,
Daniel Murfet
Abstract:
Deep learning involves navigating a high-dimensional loss landscape over the neural network parameter space. Over the course of training, complex computational structures form and re-form inside the neural network, leading to shifts in input/output behavior. It is a priority for the science of deep learning to uncover principles governing the development of neural network structure and behavior. Drawing on the framework of singular learning theory, we propose that model development is deeply linked to degeneracy in the local geometry of the loss landscape. We investigate this link by monitoring loss landscape degeneracy throughout training, as quantified by the local learning coefficient, for a transformer language model and an in-context linear regression transformer. We show that training can be divided into distinct periods of change in loss landscape degeneracy, and that these changes in degeneracy coincide with significant changes in the internal computational structure and the input/output behavior of the transformers. This finding underscores the potential of a degeneracy-based perspective for understanding modern deep learning.
Submitted 13 February, 2025; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Proximity to Losslessly Compressible Parameters
Authors:
Matthew Farrugia-Roberts
Abstract:
To better understand complexity in neural networks, we theoretically investigate the idealised phenomenon of lossless network compressibility, whereby an identical function can be implemented with fewer hidden units. In the setting of single-hidden-layer hyperbolic tangent networks, we define the rank of a parameter as the minimum number of hidden units required to implement the same function. We give efficient formal algorithms for optimal lossless compression and computing the rank of a parameter. Losslessly compressible parameters are atypical, but their existence has implications for nearby parameters. We define the proximate rank of a parameter as the rank of the most compressible parameter within a small L-infinity neighbourhood. We give an efficient greedy algorithm for bounding the proximate rank of a parameter, and show that the problem of tightly bounding the proximate rank is NP-complete. These results lay a foundation for future theoretical and empirical work on losslessly compressible parameters and their neighbours.
Submitted 23 May, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.
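The idea of lossless compressibility in the abstract above can be illustrated with a small numerical check. This is a minimal sketch under stated assumptions, not the paper's compression algorithm: the scalar-input network form, the `network` helper, and all weights are hypothetical. It relies only on two elementary facts: units with identical incoming weights and bias can be merged by summing their outgoing weights, and tanh is odd, so a unit with negated incoming weights and bias is the same unit with a negated outgoing weight.

```python
import math


def network(x, units, out_bias=0.0):
    """Single-hidden-layer tanh network with scalar input and output.

    Each unit is a tuple (a, w, b) contributing a * tanh(w * x + b).
    """
    return out_bias + sum(a * math.tanh(w * x + b) for (a, w, b) in units)


# Hypothetical parameter with three hidden units. The second unit shares the
# first unit's incoming weight and bias; the third has them negated.
original = [
    (0.5, 1.2, -0.3),
    (0.7, 1.2, -0.3),
    (0.4, -1.2, 0.3),
]

# Merge into one unit: outgoing weight 0.5 + 0.7 - 0.4 = 0.8.
compressed = [(0.8, 1.2, -0.3)]

# Compare the two networks on a grid of inputs in [-3, 3].
xs = [i / 10 for i in range(-30, 31)]
max_gap = max(abs(network(x, original) - network(x, compressed)) for x in xs)
print("largest output difference on the grid:", max_gap)
```

Here one hidden unit suffices to implement the function of the three-unit parameter, so the difference on the grid is zero up to floating-point rounding.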
-
Functional Equivalence and Path Connectivity of Reducible Hyperbolic Tangent Networks
Authors:
Matthew Farrugia-Roberts
Abstract:
Understanding the learning process of artificial neural networks requires clarifying the structure of the parameter space within which learning takes place. A neural network parameter's functional equivalence class is the set of parameters implementing the same input--output function. For many architectures, almost all parameters have a simple and well-documented functional equivalence class. However, there is also a vanishing minority of reducible parameters, with richer functional equivalence classes caused by redundancies among the network's units.
In this paper, we give an algorithmic characterisation of unit redundancies and reducible functional equivalence classes for a single-hidden-layer hyperbolic tangent architecture. We show that such functional equivalence classes are piecewise-linear path-connected sets, and that for parameters with a majority of redundant units, the sets have a diameter of at most 7 linear segments.
Submitted 7 June, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
-
Teaching Simple Constructive Proofs with Haskell Programs
Authors:
Matthew Farrugia-Roberts,
Bryn Jeffries,
Harald Søndergaard
Abstract:
In recent years we have explored using Haskell alongside a traditional mathematical formalism in our large-enrolment university course on topics including logic and formal languages, aiming to offer our students a programming perspective on these mathematical topics. We have found it possible to offer almost all formative and summative assessment through an interactive learning platform, using Haskell as a lingua franca for digital exercises across our broad syllabus. Among the hardest exercises to convert into this format are traditional written proofs conveying constructive arguments. In this paper we reflect on the digitisation of this kind of exercise. We share many examples of Haskell exercises designed to target similar skills to written proof exercises across topics in propositional logic and formal languages, discussing various aspects of the design of such exercises. We also catalogue a sample of student responses to such exercises. This discussion contributes to our broader exploration of programming problems as a flexible digital medium for learning and assessment.
Submitted 26 July, 2022;
originally announced August 2022.
-
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
Authors:
Joar Skalse,
Matthew Farrugia-Roberts,
Stuart Russell,
Alessandro Abate,
Adam Gleave
Abstract:
It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally characterise the partial identifiability of the reward function given several popular reward learning data sources, including expert demonstrations and trajectory comparisons. We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation. We unify our results in a framework for comparing data sources and downstream tasks by their invariances, with implications for the design and selection of data sources for reward learning.
Submitted 7 June, 2023; v1 submitted 14 March, 2022;
originally announced March 2022.
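The notion that different reward functions can be equivalent for a downstream task such as policy optimisation, as discussed in the abstract above, can be illustrated with one well-known invariance: potential-based reward shaping. The sketch below is a hypothetical example and is not claimed to reproduce the paper's results; the three-state deterministic MDP, the rewards, and the potential `PHI` are all made up for illustration. It runs value iteration on the original and the shaped reward and compares the resulting greedy policies.

```python
GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]

# Hypothetical deterministic MDP: NEXT[s][a] is the successor state.
NEXT = {0: [1, 2], 1: [0, 2], 2: [2, 0]}
# Hypothetical reward for each (state, action) pair.
REWARD = {
    (0, 0): 0.0, (0, 1): 1.0,
    (1, 0): 0.5, (1, 1): 0.0,
    (2, 0): 0.2, (2, 1): 0.0,
}
# Arbitrary potential function over states used for shaping.
PHI = {0: 3.0, 1: -1.0, 2: 0.5}


def original(s, a):
    return REWARD[(s, a)]


def shaped(s, a):
    """Original reward plus the shaping term GAMMA * PHI(s') - PHI(s)."""
    return REWARD[(s, a)] + GAMMA * PHI[NEXT[s][a]] - PHI[s]


def greedy_policy(reward_fn, iters=500):
    """Run value iteration, then act greedily with respect to the values."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        v = {
            s: max(reward_fn(s, a) + GAMMA * v[NEXT[s][a]] for a in ACTIONS)
            for s in STATES
        }
    return {
        s: max(ACTIONS, key=lambda a: reward_fn(s, a) + GAMMA * v[NEXT[s][a]])
        for s in STATES
    }


policy_original = greedy_policy(original)
policy_shaped = greedy_policy(shaped)
print(policy_original, policy_shaped)
```

The two reward functions differ on every state-action pair, yet they induce the same optimal policy in this example, so optimal behaviour alone could not distinguish them.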