-
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
Authors:
Chen Yueh-Han,
Guy Davidson,
Brenden M. Lake
Abstract:
Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to…
▽ More
Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Do different prompting methods yield a common task representation in language models?
Authors:
Guy Davidson,
Todd M. Gureckis,
Brenden M. Lake,
Adina Williams
Abstract:
Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through \textit{function vecto…
▽ More
Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through \textit{function vectors} (FVs), recently proposed as a mechanism to extract few-shot ICL task representations. We generalize FVs to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task promptings forms do not induce a common task representation through FVs but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.
△ Less
Submitted 21 May, 2025; v1 submitted 17 May, 2025;
originally announced May 2025.
-
PAARS: Persona Aligned Agentic Retail Shoppers
Authors:
Saab Mansour,
Leonardo Perelli,
Lorenzo Mainetti,
George Davidson,
Stefano D'Amato
Abstract:
In e-commerce, behavioral data is collected for decision making which can be costly and slow. Simulation with LLM powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully be…
▽ More
In e-commerce, behavioral data is collected for decision making which can be costly and slow. Simulation with LLM powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional "individual" level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges setting the stage for impactful future work.
△ Less
Submitted 31 March, 2025;
originally announced March 2025.
-
Goals as Reward-Producing Programs
Authors:
Guy Davidson,
Graham Todd,
Julian Togelius,
Todd M. Gureckis,
Brenden M. Lake
Abstract:
People are remarkably capable of generating their own goals, beginning with child's play and continuing into adulthood. Despite considerable empirical and computational work on goals and goal-oriented behavior, models are still far from capturing the richness of everyday human goals. Here, we bridge this gap by collecting a dataset of human-generated playful goals (in the form of scorable, single-…
▽ More
People are remarkably capable of generating their own goals, beginning with child's play and continuing into adulthood. Despite considerable empirical and computational work on goals and goal-oriented behavior, models are still far from capturing the richness of everyday human goals. Here, we bridge this gap by collecting a dataset of human-generated playful goals (in the form of scorable, single-player games), modeling them as reward-producing programs, and generating novel human-like goals through program synthesis. Reward-producing programs capture the rich semantics of goals through symbolic operations that compose, add temporal constraints, and allow for program execution on behavioral traces to evaluate progress. To build a generative model of goals, we learn a fitness function over the infinite set of possible goal programs and sample novel goals with a quality-diversity algorithm. Human evaluators found that model-generated goals, when sampled from partitions of program space occupied by human examples, were indistinguishable from human-created games. We also discovered that our model's internal fitness scores predict games that are evaluated as more fun to play and more human-like.
△ Less
Submitted 10 September, 2024; v1 submitted 21 May, 2024;
originally announced May 2024.
-
Toward Human-AI Alignment in Large-Scale Multi-Player Games
Authors:
Sugandha Sharma,
Guy Davidson,
Khimya Khetarpal,
Anssi Kanervisto,
Udit Arora,
Katja Hofmann,
Ida Momennejad
Abstract:
Achieving human-AI alignment in complex multi-agent games is crucial for creating trustworthy AI agents that enhance gameplay. We propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. Our approach has three components. First, we analyze extensive human gameplay data from Xbox's Bleeding Edge (1…
▽ More
Achieving human-AI alignment in complex multi-agent games is crucial for creating trustworthy AI agents that enhance gameplay. We propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. Our approach has three components. First, we analyze extensive human gameplay data from Xbox's Bleeding Edge (100K+ games), uncovering behavioral patterns in a complex task space. This task space serves as a basis set for a behavior manifold capturing interpretable axes: fight-flight, explore-exploit, and solo-multi-agent. Second, we train an AI agent to play Bleeding Edge using a Generative Pretrained Causal Transformer and measure its behavior. Third, we project human and AI gameplay to the proposed behavior manifold to compare and contrast. This allows us to interpret differences in policy as higher-level behavioral concepts, e.g., we find that while human players exhibit variability in fight-flight and explore-exploit behavior, AI players tend towards uniformity. Furthermore, AI agents predominantly engage in solo play, while humans often engage in cooperative and competitive multi-agent patterns. These stark differences underscore the need for interpretable evaluation, design, and integration of AI in human-aligned applications. Our study advances the alignment discussion in AI and especially generative AI research, offering a measurable framework for interpretable human-agent alignment in multiplayer gaming.
△ Less
Submitted 18 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
High Quality Audio Coding with MDCTNet
Authors:
Grant Davidson,
Mark Vinton,
Per Ekstrand,
Cong Zhou,
Lars Villemoes,
Lie Lu
Abstract:
We propose a neural audio generative model, MDCTNet, operating in the perceptually weighted domain of an adaptive modified discrete cosine transform (MDCT). The architecture of the model captures correlations in both time and frequency directions with recurrent layers (RNNs). An audio coding system is obtained by training MDCTNet on a diverse set of fullband monophonic audio signals at 48 kHz samp…
▽ More
We propose a neural audio generative model, MDCTNet, operating in the perceptually weighted domain of an adaptive modified discrete cosine transform (MDCT). The architecture of the model captures correlations in both time and frequency directions with recurrent layers (RNNs). An audio coding system is obtained by training MDCTNet on a diverse set of fullband monophonic audio signals at 48 kHz sampling, conditioned by a perceptual audio encoder. In a subjective listening test with ten excerpts chosen to be balanced across content types, yet stressful for both codecs, the mean performance of the proposed system for 24 kb/s variable bitrate (VBR) is similar to that of Opus at twice the bitrate.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
Investigating Simple Object Representations in Model-Free Deep Reinforcement Learning
Authors:
Guy Davidson,
Brenden M. Lake
Abstract:
We explore the benefits of augmenting state-of-the-art model-free deep reinforcement algorithms with simple object representations. Following the Frostbite challenge posited by Lake et al. (2017), we identify object representations as a critical cognitive capacity lacking from current reinforcement learning agents. We discover that providing the Rainbow model (Hessel et al.,2018) with simple, feat…
▽ More
We explore the benefits of augmenting state-of-the-art model-free deep reinforcement algorithms with simple object representations. Following the Frostbite challenge posited by Lake et al. (2017), we identify object representations as a critical cognitive capacity lacking from current reinforcement learning agents. We discover that providing the Rainbow model (Hessel et al.,2018) with simple, feature-engineered object representations substantially boosts its performance on the Frostbite game from Atari 2600. We then analyze the relative contributions of the representations of different types of objects, identify environment states where these representations are most impactful, and examine how these representations aid in generalizing to novel situations.
△ Less
Submitted 28 May, 2020; v1 submitted 16 February, 2020;
originally announced February 2020.
-
Sequential mastery of multiple visual tasks: Networks naturally learn to learn and forget to forget
Authors:
Guy Davidson,
Michael C. Mozer
Abstract:
We explore the behavior of a standard convolutional neural net in a continual-learning setting that introduces visual classification tasks sequentially and requires the net to master new tasks while preserving mastery of previously learned tasks. This setting corresponds to that which human learners face as they acquire domain expertise serially, for example, as an individual studies a textbook. T…
▽ More
We explore the behavior of a standard convolutional neural net in a continual-learning setting that introduces visual classification tasks sequentially and requires the net to master new tasks while preserving mastery of previously learned tasks. This setting corresponds to that which human learners face as they acquire domain expertise serially, for example, as an individual studies a textbook. Through simulations involving sequences of ten related visual tasks, we find reason for optimism that nets will scale well as they advance from having a single skill to becoming multi-skill domain experts. We observe two key phenomena. First, \emph{forward facilitation}---the accelerated learning of task $n+1$ having learned $n$ previous tasks---grows with $n$. Second, \emph{backward interference}---the forgetting of the $n$ previous tasks when learning task $n+1$---diminishes with $n$. Amplifying forward facilitation is the goal of research on metalearning, and attenuating backward interference is the goal of research on catastrophic forgetting. We find that both of these goals are attained simply through broader exposure to a domain.
△ Less
Submitted 30 March, 2020; v1 submitted 26 May, 2019;
originally announced May 2019.
-
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Authors:
Wim Vanderbauwhede,
Gavin Davidson
Abstract:
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source to source compilation approach with…
▽ More
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source to source compilation approach with whole-program analysis to automatically transform single-threaded FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized kernels.
The main contributions of our work are: (1) whole-source refactoring to allow any subroutine in the code to be offloaded to an accelerator. (2) Minimization of the data transfer between the host and the accelerator by eliminating redundant transfers. (3) Pragmatic auto-parallelization of the code to be offloaded to the accelerator by identification of parallelizable maps and reductions.
We have validated the code transformation performance of the compiler on the NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow water component of the ocean model Gmodel; the Linear Baroclinic Model, an atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on as 2-D Shallow Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the original code on CPU, in both cases this is the same performance as manually ported code.
△ Less
Submitted 13 November, 2017;
originally announced November 2017.
-
Rayleigh Quotient Iteration with a Multigrid in Energy Preconditioner for Massively Parallel Neutron Transport
Authors:
R. N. Slaybaugh,
T. M. Evans,
G. G. Davidson,
P. P. H. Wilson
Abstract:
Three complementary methods have been implemented in the code Denovo that accelerate neutral particle transport calculations with methods that use leadership-class computers fully and effectively: a multigroup block (MG) Krylov solver, a Rayleigh quotient iteration (RQI) eigenvalue solver, and a multigrid in energy preconditioner. The multigroup Krylov solver converges more quickly than Gauss Seid…
▽ More
Three complementary methods have been implemented in the code Denovo that accelerate neutral particle transport calculations with methods that use leadership-class computers fully and effectively: a multigroup block (MG) Krylov solver, a Rayleigh quotient iteration (RQI) eigenvalue solver, and a multigrid in energy preconditioner. The multigroup Krylov solver converges more quickly than Gauss Seidel and enables energy decomposition such that Denovo can scale to hundreds of thousands of cores. The new multigrid in energy preconditioner reduces iteration count for many problem types and takes advantage of the new energy decomposition such that it can scale efficiently. These two tools are useful on their own, but together they enable the RQI eigenvalue solver to work. Each individual method has been described before, but this is the first time they have been demonstrated to work together effectively.
RQI should converge in fewer iterations than power iteration (PI) for large and challenging problems. RQI creates shifted systems that would not be tractable without the MG Krylov solver. It also creates ill-conditioned matrices that cannot converge without the multigrid in energy preconditioner. Using these methods together, RQI converged in fewer iterations and in less time than all PI calculations for a full pressurized water reactor core. It also scaled reasonably well out to 275,968 cores.
△ Less
Submitted 7 February, 2017;
originally announced February 2017.