-
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
Authors:
Akshat Naik,
Patrick Quinn,
Guillermo Bosch,
Emma Gouné,
Francisco Javier Campos Zabala,
Jason Ross Brown,
Edward James Young
Abstract:
As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly unde…
▽ More
As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly understood. We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios in which LLM agents have the opportunity to display misaligned behaviour. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models. Finally, we systematically vary agent personalities through different system prompts. We find that persona characteristics can dramatically and unpredictably influence misalignment tendencies -- occasionally far more than the choice of model itself -- highlighting the importance of careful system prompt engineering for deployed AI agents. Our work highlights the failure of current alignment methods to generalise to LLM agents, and underscores the need for further propensity evaluations as autonomous systems become more prevalent.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Large language models can learn and generalize steganographic chain-of-thought under process supervision
Authors:
Joey Skaf,
Luis Ibanez-Lissen,
Robert McCarthy,
Connor Watts,
Vasil Georgiv,
Hannes Whittingham,
Lorena Gonzalez-Manzano,
David Lindner,
Cameron Tice,
Edward James Young,
Puria Radmard
Abstract:
Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT indicating misaligned or harmful intent, CoT monitoring can be used to reduce risks associated with deploying models. However,…
▽ More
Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT indicating misaligned or harmful intent, CoT monitoring can be used to reduce risks associated with deploying models. However, developers may be incentivized to train away the appearance of harmful intent from CoT traces, by either customer preferences or regulatory requirements. Recent works have shown that banning mention of a specific example of reward hacking, which may be done either to make CoT presentable to users or as a naive attempt to prevent the behavior, causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior. Such obfuscation threatens the reliability of CoT monitoring. However, obfuscation of reasoning can be due to its internalization to latent space computation, or its encoding within the CoT. Here, we provide an extension to these results. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Reinforcement Learning applied to Insurance Portfolio Pursuit
Authors:
Edward James Young,
Alistair Rogers,
Elliott Tong,
James Jordon
Abstract:
When faced with a new customer, many factors contribute to an insurance firm's decision of what offer to make to that customer. In addition to the expected cost of providing the insurance, the firm must consider the other offers likely to be made to the customer, and how sensitive the customer is to differences in price. Moreover, firms often target a specific portfolio of customers that could dep…
▽ More
When faced with a new customer, many factors contribute to an insurance firm's decision of what offer to make to that customer. In addition to the expected cost of providing the insurance, the firm must consider the other offers likely to be made to the customer, and how sensitive the customer is to differences in price. Moreover, firms often target a specific portfolio of customers that could depend on, e.g., age, location, and occupation. Given such a target portfolio, firms may choose to modulate an individual customer's offer based on whether the firm desires the customer within their portfolio. We term the problem of modulating offers to achieve a desired target portfolio the portfolio pursuit problem. Having formulated the portfolio pursuit problem as a sequential decision making problem, we devise a novel reinforcement learning algorithm for its solution. We test our method on a complex synthetic market environment, and demonstrate that it outperforms a baseline method which mimics current industry approaches to portfolio pursuit.
△ Less
Submitted 2 August, 2024; v1 submitted 1 August, 2024;
originally announced August 2024.