-
Intent Factored Generation: Unleashing the Diversity in Your Language Model
Authors:
Eltayeb Ahmed,
Uljad Berdica,
Martha Elliott,
Danijela Horak,
Jakob N. Foerster
Abstract:
Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method's effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.
Submitted 11 June, 2025;
originally announced June 2025.
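The two-stage sampling described in the IFG abstract above (a high-temperature intent step followed by a low-temperature response step) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy "model" is a pair of hypothetical logit tables (`INTENT_LOGITS`, `RESPONSE_LOGITS`), and the function names and default temperatures are assumptions chosen only for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical toy "model": logits over intents, then logits over
# responses conditioned on the chosen intent. A real system would query
# an LLM twice instead of reading from these tables.
INTENT_LOGITS = {"induction": 2.0, "contradiction": 1.0, "direct computation": 0.5}
RESPONSE_LOGITS = {
    "induction": {
        "Check the base case, then the inductive step.": 1.5,
        "Assume the claim for n and derive it for n + 1.": 1.0,
    },
    "contradiction": {
        "Suppose the claim is false and derive an absurdity.": 1.0,
    },
    "direct computation": {
        "Expand both sides and compare terms.": 1.2,
        "Evaluate the expression numerically.": 0.4,
    },
}

def intent_factored_sample(rng, intent_temperature=1.5, response_temperature=0.3):
    """Stage 1: sample an intent at high temperature.
    Stage 2: sample a response conditioned on that intent at low temperature."""
    intents = list(INTENT_LOGITS)
    i = sample_with_temperature(list(INTENT_LOGITS.values()), intent_temperature, rng)
    intent = intents[i]
    responses = list(RESPONSE_LOGITS[intent])
    j = sample_with_temperature(
        list(RESPONSE_LOGITS[intent].values()), response_temperature, rng
    )
    return intent, responses[j]

rng = random.Random(0)
samples = [intent_factored_sample(rng) for _ in range(5)]
```

Raising `intent_temperature` spreads the samples across more intents, while a low `response_temperature` keeps each response close to the most likely continuation for its intent.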
-
High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning
Authors:
Tim Franzmeyer,
Archie Sravankumar,
Lijuan Liu,
Yuning Mao,
Rui Hou,
Sinong Wang,
Jakob N. Foerster,
Luke Zettlemoyer,
Madian Khabsa
Abstract:
Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.
Submitted 4 June, 2025;
originally announced June 2025.
-
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Authors:
Anya Sims,
Thom Foster,
Klara Kaleb,
Tuan-Duy H. Nguyen,
Joseph Lee,
Jakob N. Foerster,
Yee Whye Teh,
Cong Lu
Abstract:
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.
Submitted 10 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
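The idea of randomly splitting tokens during training, as described in the StochasTok abstract above, can be sketched as follows. This is an illustrative guess at the mechanism, not the code from the linked repository: the function name `stochastic_split`, the string-based vocabulary, and the rule of splitting a token into two pieces that are both in the vocabulary are assumptions.

```python
import random

def stochastic_split(tokens, vocab, split_prob, rng):
    """With probability split_prob, replace a token by two vocabulary
    tokens whose concatenation equals the original token."""
    out = []
    for tok in tokens:
        # Every way to cut the token into two pieces that both exist in the vocabulary.
        splits = [
            (tok[:i], tok[i:])
            for i in range(1, len(tok))
            if tok[:i] in vocab and tok[i:] in vocab
        ]
        if splits and rng.random() < split_prob:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)  # no valid split, or the coin flip said keep it
    return out

# Hypothetical toy vocabulary for illustration only.
vocab = {"strawberry", "straw", "berry", "jam"}
tokens = ["strawberry", "jam"]
split_tokens = stochastic_split(tokens, vocab, split_prob=1.0, rng=random.Random(0))
```

Because every split preserves the concatenated text, the model sees the same string under different segmentations, which is the property the abstract credits for improved subword understanding.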
-
SOReL and TOReL: Two Methods for Fully Offline Reinforcement Learning
Authors:
Mattie Fellows,
Clarisse Wibault,
Uljad Berdica,
Johannes Forkel,
Michael A. Osborne,
Jakob N. Foerster
Abstract:
Sample efficiency remains a major obstacle for real world adoption of reinforcement learning (RL): success has been limited to settings where simulators provide access to essentially unlimited environment interactions, which in reality are typically costly or dangerous to obtain. Offline RL in principle offers a solution by exploiting offline data to learn a near-optimal policy before deployment. In practice, however, current offline RL methods rely on extensive online interactions for hyperparameter tuning, and have no reliable bound on their initial online performance. To address these two issues, we introduce two algorithms. Firstly, SOReL: an algorithm for safe offline reinforcement learning. Using only offline data, our Bayesian approach infers a posterior over environment dynamics to obtain a reliable estimate of the online performance via the posterior predictive uncertainty. Crucially, all hyperparameters are also tuned fully offline. Secondly, we introduce TOReL: a tuning for offline reinforcement learning algorithm that extends our information rate based offline hyperparameter tuning methods to general offline RL approaches. Our empirical evaluation confirms SOReL's ability to accurately estimate regret in the Bayesian setting whilst TOReL's offline hyperparameter tuning achieves competitive performance with the best online hyperparameter tuning methods using only offline data. Thus, SOReL and TOReL make a significant step towards safe and reliable offline RL, unlocking the potential for RL in the real world. Our implementations are publicly available: https://github.com/CWibault/sorel_torel.
Submitted 29 May, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
-
An Optimisation Framework for Unsupervised Environment Design
Authors:
Nathan Monette,
Alistair Letcher,
Michael Beukman,
Matthew T. Jackson,
Alexander Rutherford,
Alexander D. Goldie,
Jakob N. Foerster
Abstract:
For reinforcement learning agents to be deployed in high-risk settings, they must achieve a high level of robustness to unfamiliar scenarios. One method for improving robustness is unsupervised environment design (UED), a suite of methods aiming to maximise an agent's generalisability across configurations of an environment. In this work, we study UED from an optimisation perspective, providing stronger theoretical guarantees for practical settings than prior work. Whereas previous methods relied on guarantees if they reach convergence, our framework employs a nonconvex-strongly-concave objective for which we provide a provably convergent algorithm in the zero-sum setting. We empirically verify the efficacy of our method, outperforming prior methods in a number of environments with varying difficulties.
Submitted 26 May, 2025;
originally announced May 2025.
-
A Clean Slate for Offline Reinforcement Learning
Authors:
Matthew Thomas Jackson,
Uljad Berdica,
Jarek Liesen,
Shimon Whiteson,
Jakob Nicolaus Foerster
Abstract:
Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches within a single, comprehensive hyperparameter space, enabling algorithm development in a shared hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms - TD3-AWR (model-free) and MoBRAC (model-based) - which substantially outperform established baselines. Our implementation is publicly available at https://github.com/EmptyJackson/unifloral.
Submitted 15 April, 2025;
originally announced April 2025.
-
OvercookedV2: Rethinking Overcooked for Zero-Shot Coordination
Authors:
Tobias Gessler,
Tin Dizdarevic,
Ani Calinescu,
Benjamin Ellis,
Andrei Lupu,
Jakob Nicolaus Foerster
Abstract:
AI agents hold the potential to transform everyday life by helping humans achieve their goals. To do this successfully, agents need to be able to coordinate with novel partners without prior interaction, a setting known as zero-shot coordination (ZSC). Overcooked has become one of the most popular benchmarks for evaluating coordination capabilities of AI agents and learning algorithms. In this work, we investigate the origins of ZSC challenges in Overcooked. We introduce a state augmentation mechanism which mixes states that might be encountered when paired with unknown partners into the training distribution, reducing the out-of-distribution challenge associated with ZSC. We show that independently trained agents under this algorithm coordinate successfully in Overcooked. Our results suggest that ZSC failure can largely be attributed to poor state coverage under self-play rather than more sophisticated coordination challenges. The Overcooked environment is therefore not suitable as a ZSC benchmark. To address these shortcomings, we introduce OvercookedV2, a new version of the benchmark, which includes asymmetric information and stochasticity, facilitating the creation of interesting ZSC scenarios. To validate OvercookedV2, we conduct experiments demonstrating that mere exhaustive state coverage is insufficient to coordinate well. Finally, we use OvercookedV2 to build a new range of coordination challenges, including ones that require test time protocol formation, and we demonstrate the need for new coordination algorithms that can adapt online. We hope that OvercookedV2 will help benchmark the next generation of ZSC algorithms and advance collaboration between AI agents and humans.
Submitted 22 March, 2025;
originally announced March 2025.
-
SensPS: Sensing Personal Space Comfortable Distance between Human-Human Using Multimodal Sensors
Authors:
Ko Watanabe,
Nico Förster,
Shoya Ishimaru
Abstract:
Personal space, also known as peripersonal space, is crucial in human social interaction, influencing comfort, communication, and social stress. Estimating and respecting personal space is essential for enhancing human-computer interaction (HCI) and smart environments. Personal space preferences vary due to individual traits, cultural background, and contextual factors. Advanced multimodal sensing technologies, including eye-tracking and wristband sensors, offer opportunities to develop adaptive systems that dynamically adjust to user comfort levels. Integrating physiological and behavioral data enables a deeper understanding of spatial interactions. This study develops a sensor-based model to estimate comfortable personal space and identifies key features influencing spatial preferences. Our findings show that multimodal sensors, particularly eye-tracking and physiological wristband data, can effectively predict personal space preferences, with eye-tracking data playing a more significant role. An experimental study involving controlled human interactions demonstrates that a Transformer-based model achieves the highest predictive accuracy (F1 score: 0.87) for estimating personal space. Eye-tracking features, such as gaze point and pupil diameter, emerge as the most significant predictors, while physiological signals from wristband sensors contribute marginally. These results highlight the potential for AI-driven personalization of social space in adaptive environments, suggesting that multimodal sensing can be leveraged to develop intelligent systems that optimize spatial arrangements in workplaces, educational institutions, and public settings. Future work should explore larger datasets, real-world applications, and additional physiological markers to enhance model robustness.
Submitted 11 February, 2025;
originally announced February 2025.
-
AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement
Authors:
J Rosser,
Jakob Nicolaus Foerster
Abstract:
Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In 'blue' mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In 'red' mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/J-Rosser-UK/AgentBreeder.
Submitted 14 April, 2025; v1 submitted 2 February, 2025;
originally announced February 2025.
-
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Authors:
Davide Paglieri,
Bartłomiej Cupiał,
Samuel Coward,
Ulyana Piterbarg,
Maciej Wolczyk,
Akbir Khan,
Eduardo Pignatelli,
Łukasz Kuciński,
Lerrel Pinto,
Rob Fergus,
Jakob Nicolaus Foerster,
Jack Parker-Holder,
Tim Rocktäschel
Abstract:
Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as several models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community. Code and Leaderboard at balrogai.com.
Submitted 1 April, 2025; v1 submitted 20 November, 2024;
originally announced November 2024.
-
Meta-Learning Objectives for Preference Optimization
Authors:
Carlo Alfano,
Silvia Sapora,
Jakob Nicolaus Foerster,
Patrick Rebeschini,
Yee Whye Teh
Abstract:
Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithms on much simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a novel PO algorithm that significantly outperforms existing baselines in an LLM alignment task.
Submitted 4 February, 2025; v1 submitted 10 November, 2024;
originally announced November 2024.
-
Beyond the Boundaries of Proximal Policy Optimization
Authors:
Charlie B. Tan,
Edan Toledo,
Benjamin Ellis,
Jakob N. Foerster,
Ferenc Huszár
Abstract:
Proximal policy optimization (PPO) is a widely-used algorithm for on-policy reinforcement learning. This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors, and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO); a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
Submitted 1 November, 2024;
originally announced November 2024.
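The decomposition in the outer-PPO abstract above (an inner loop that estimates an update vector, and an outer loop that applies it) can be sketched in a few lines. This is a hedged illustration rather than the paper's algorithm: `outer_update` is a hypothetical name, parameters are plain Python lists, and the heavy-ball form of momentum is an assumption. The one property taken from the abstract is that a unity learning rate with no momentum recovers the standard PPO result.

```python
def outer_update(params, inner_params, velocity, outer_lr=1.0, momentum=0.0):
    """Apply the inner-loop update vector with an outer learning rate and momentum.

    params:       parameters before the inner PPO optimisation
    inner_params: parameters after the inner PPO optimisation
    velocity:     running momentum buffer (same length as params)
    """
    # The inner loop's contribution is treated as an update vector.
    update = [new - old for old, new in zip(params, inner_params)]
    # Heavy-ball momentum on the update vectors (assumed form).
    velocity = [momentum * v + u for v, u in zip(velocity, update)]
    # Gradient-ascent style application of the accumulated update.
    new_params = [p + outer_lr * v for p, v in zip(params, velocity)]
    return new_params, velocity

params = [1.0, 2.0]
inner_params = [1.5, 1.0]  # hypothetical result of the inner PPO loop
zero_velocity = [0.0, 0.0]

# Unity learning rate, no momentum: identical to adopting the inner result.
standard, _ = outer_update(params, inner_params, zero_velocity)

# Non-unity learning rate: the same update direction, scaled.
scaled, _ = outer_update(params, inner_params, zero_velocity, outer_lr=2.0)
```

Keeping the update vector separate from its application is what lets the outer step be swapped for any gradient-based optimizer.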
-
ADIOS: Antibody Development via Opponent Shaping
Authors:
Sebastian Towers,
Aleksandra Kalisz,
Philippe A. Robert,
Alicia Higueruelo,
Francesca Vianello,
Ming-Han Chloe Tsai,
Harrison Steel,
Jakob N. Foerster
Abstract:
Anti-viral therapies are typically designed to target only the current strains of a virus, a myopic response. However, therapy-induced selective pressures drive the emergence of new viral strains, against which the original myopic therapies are no longer effective. This evolutionary response presents an opportunity: our therapies could both defend against and actively influence viral evolution. This motivates our method ADIOS: Antibody Development vIa Opponent Shaping. ADIOS is a meta-learning framework where the process of antibody therapy design, the outer loop, accounts for the virus's adaptive response, the inner loop. With ADIOS, antibodies are not only robust against potential future variants, they also influence, i.e., shape, which future variants emerge. In line with the opponent shaping literature, we refer to our optimised antibodies as shapers. To demonstrate the value of ADIOS, we build a viral evolution simulator using the Absolut! framework, in which shapers successfully target both current and future viral variants, outperforming myopic antibodies. Furthermore, we show that shapers modify the distribution over viral evolutionary trajectories to result in weaker variants. We believe that our ADIOS paradigm will facilitate the discovery of long-lived vaccines and antibody therapies while also generalising to other domains with evolutionarily adaptive opponents, such as antimicrobial resistance and cancer treatment. Our code is available at https://github.com/olakalisz/adios.
Submitted 6 June, 2025; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Can Learned Optimization Make Reinforcement Learning Less Difficult?
Authors:
Alexander David Goldie,
Chris Lu,
Matthew Thomas Jackson,
Shimon Whiteson,
Jakob Nicolaus Foerster
Abstract:
While reinforcement learning (RL) holds great potential for decision making in the real world, it suffers from a number of unique difficulties which often need specific consideration. In particular: it is highly non-stationary; suffers from high degrees of plasticity loss; and requires exploration to prevent premature convergence to local optima and maximize return. In this paper, we consider whether learned optimization can help overcome these problems. Our method, Learned Optimization for Plasticity, Exploration and Non-stationarity (OPEN), meta-learns an update rule whose input features and output structure are informed by previously proposed solutions to these difficulties. We show that our parameterization is flexible enough to enable meta-learning in diverse learning contexts, including the ability to use stochasticity for exploration. Our experiments demonstrate that when meta-trained on single and small sets of environments, OPEN outperforms or equals traditionally used optimizers. Furthermore, OPEN shows strong generalization characteristics across a range of environments and agent architectures.
Submitted 15 April, 2025; v1 submitted 9 July, 2024;
originally announced July 2024.
-
Simplifying Deep Temporal Difference Learning
Authors:
Matteo Gallici,
Mattie Fellows,
Benjamin Ellis,
Bartomeu Pou,
Ivan Masmitja,
Jakob Nicolaus Foerster,
Mario Martin
Abstract:
Q-learning played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or nonlinear function approximation like deep neural networks require several additional tricks to stabilise training, primarily a large replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms the sample efficiency and, similarly, the large replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify off-policy TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network or replay buffer, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a large replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-Learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods like Rainbow in Atari, PPO-RNN in Craftax, and QMix in Smax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN reestablishes off-policy Q-learning as a viable alternative.
Submitted 21 April, 2025; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Discovering Minimal Reinforcement Learning Environments
Authors:
Jarek Liesen,
Chris Lu,
Andrei Lupu,
Jakob N. Foerster,
Henning Sprekeler,
Robert T. Lange
Abstract:
Reinforcement learning (RL) agents are commonly trained and evaluated in the same environment. In contrast, humans often train in a specialized environment before being evaluated, such as studying a book before taking an exam. The potential of such specialized training environments is still vastly underexplored, despite their capacity to dramatically speed up training.
The framework of synthetic environments takes a first step in this direction by meta-learning neural network-based Markov decision processes (MDPs). The initial approach was limited to toy problems and produced environments that did not transfer to unseen RL algorithms. We extend this approach in three ways: Firstly, we modify the meta-learning algorithm to discover environments invariant towards hyperparameter configurations and learning algorithms. Secondly, by leveraging hardware parallelism and introducing a curriculum on an agent's evaluation episode horizon, we can achieve competitive results on several challenging continuous control problems. Thirdly, we surprisingly find that contextual bandits enable training RL agents that transfer well to their evaluation environment, even if it is a complex MDP. Hence, we set up our experiments to train synthetic contextual bandits, which perform on par with synthetic MDPs, yield additional insights into the evaluation environment, and can speed up downstream applications.
Submitted 18 June, 2024;
originally announced June 2024.
-
EvIL: Evolution Strategies for Generalisable Imitation Learning
Authors:
Silvia Sapora,
Gokul Swamy,
Chris Lu,
Yee Whye Teh,
Jakob Nicolaus Foerster
Abstract:
Often in imitation learning (IL), the environment we collect expert demonstrations in and the environment we want to deploy our learned policy in aren't exactly the same (e.g. demonstrations collected in simulation but deployment in the real world). Compared to policy-centric approaches to IL like behavioural cloning, reward-centric approaches like inverse reinforcement learning (IRL) often better replicate expert behaviour in new environments. This transfer is usually performed by optimising the recovered reward under the dynamics of the target environment. However, (a) we find that modern deep IL algorithms frequently recover rewards which induce policies far weaker than the expert, even in the same environment the demonstrations were collected in. Furthermore, (b) these rewards are often quite poorly shaped, necessitating extensive environment interaction to optimise effectively. We provide simple and scalable fixes to both of these concerns. For (a), we find that reward model ensembles combined with a slightly different training objective significantly improve re-training and transfer performance. For (b), we propose a novel evolution-strategies-based method, EvIL, to optimise for a reward-shaping term that speeds up re-training in the target environment, closing a gap left open by the classical theory of IRL. On a suite of continuous control tasks, we are able to re-train policies in target (and source) environments more interaction-efficiently than prior work.
Submitted 15 June, 2024;
originally announced June 2024.
-
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Authors:
Tim Franzmeyer,
Aleksandar Shtedritski,
Samuel Albanie,
Philip Torr,
João F. Henriques,
Jakob N. Foerster
Abstract:
Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users. Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web. We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking. To enable continuous evaluation on HelloFresh, we host a public leaderboard and periodically updated evaluation data at https://tinyurl.com/hello-fresh-LLM.
Submitted 5 June, 2024;
originally announced June 2024.
-
Discovering Temporally-Aware Reinforcement Learning Algorithms
Authors:
Matthew Thomas Jackson,
Chris Lu,
Louis Kirsch,
Robert Tjarko Lange,
Shimon Whiteson,
Jakob Nicolaus Foerster
Abstract:
Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or "training horizon". In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent's training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent's lifetime.
Submitted 8 February, 2024;
originally announced February 2024.
-
HARDCORE: H-field and power loss estimation for arbitrary waveforms with residual, dilated convolutional neural networks in ferrite cores
Authors:
Wilhelm Kirchgässner,
Nikolas Förster,
Till Piepenbrock,
Oliver Schweins,
Oliver Wallscheid
Abstract:
The MagNet Challenge 2023 calls upon competitors to develop data-driven models for the material-specific, waveform-agnostic estimation of steady-state power losses in toroidal ferrite cores. The following HARDCORE (H-field and power loss estimation for Arbitrary waveforms with Residual, Dilated convolutional neural networks in ferrite COREs) approach shows that a residual convolutional neural network with physics-informed extensions can serve this task efficiently when trained on observational data beforehand. One key solution element is an intermediate model layer which first reconstructs the B-H curve and then estimates the power losses based on the curve's area, rendering the proposed topology physically interpretable. In addition, emphasis was placed on expert-based feature engineering and information-rich inputs in order to enable a lean model architecture. A model is trained from scratch for each material, while the topology remains the same. A Pareto-style trade-off between model size and estimation accuracy is demonstrated, which yields an optimum at as low as 1755 parameters and down to below 8% for the 95th percentile of the relative error for the worst-case material with sufficient samples.
Submitted 23 January, 2024; v1 submitted 21 January, 2024;
originally announced January 2024.
-
JaxMARL: Multi-Agent RL Environments and Algorithms in JAX
Authors:
Alexander Rutherford,
Benjamin Ellis,
Matteo Gallici,
Jonathan Cook,
Andrei Lupu,
Gardar Ingvarsson,
Timon Willi,
Ravi Hammond,
Akbir Khan,
Christian Schroeder de Witt,
Alexandra Souly,
Saptarashmi Bandyopadhyay,
Mikayel Samvelyan,
Minqi Jiang,
Robert Tjarko Lange,
Shimon Whiteson,
Bruno Lacerda,
Nick Hawes,
Tim Rocktaschel,
Chris Lu,
Jakob Nicolaus Foerster
Abstract:
Benchmarks are crucial in the development of machine learning algorithms, with available environments significantly influencing reinforcement learning (RL) research. Traditionally, RL environments run on the CPU, which limits their scalability with typical academic compute. However, recent advancements in JAX have enabled the wider use of hardware acceleration, enabling massively parallel RL training pipelines and environments. While this has been successfully applied to single-agent RL, it has not yet been widely adopted for multi-agent scenarios. In this paper, we present JaxMARL, the first open-source, Python-based library that combines GPU-enabled efficiency with support for a large number of commonly used MARL environments and popular baseline algorithms. Our experiments show that, in terms of wall clock time, our JAX-based training pipeline is around 14 times faster than existing approaches, and up to 12500x when multiple training runs are vectorized. This enables efficient and thorough evaluations, potentially alleviating the evaluation crisis in the field. We also introduce and benchmark SMAX, a JAX-based approximate reimplementation of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. The code is available at https://github.com/flairox/jaxmarl.
Submitted 2 November, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design
Authors:
Matthew Thomas Jackson,
Minqi Jiang,
Jack Parker-Holder,
Risto Vuorio,
Chris Lu,
Gregory Farquhar,
Shimon Whiteson,
Jakob Nicolaus Foerster
Abstract:
The past decade has seen vast progress in deep reinforcement learning (RL) on the back of algorithms manually designed by human researchers. Recently, it has been shown that it is possible to meta-learn update rules, with the hope of discovering algorithms that can perform well on a wide range of RL tasks. Despite impressive initial results from algorithms such as Learned Policy Gradient (LPG), there remains a generalization gap when these algorithms are applied to unseen environments. In this work, we examine how characteristics of the meta-training distribution impact the generalization performance of these algorithms. Motivated by this analysis and building on ideas from Unsupervised Environment Design (UED), we propose a novel approach for automatically generating curricula to maximize the regret of a meta-learned optimizer, in addition to a novel approximation of regret, which we name algorithmic regret (AR). The result is our method, General RL Optimizers Obtained Via Environment Design (GROOVE). In a series of experiments, we show that GROOVE achieves superior generalization to LPG, and evaluate AR against baseline metrics from UED, identifying it as a critical component of environment design in this setting. We believe this approach is a step towards the discovery of truly general RL algorithms, capable of solving a wide range of real-world environments.
Submitted 4 October, 2023;
originally announced October 2023.
-
ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
Authors:
Andrew Jesson,
Chris Lu,
Gunshi Gupta,
Nicolas Beltran-Velez,
Angelos Filos,
Jakob Nicolaus Foerster,
Yarin Gal
Abstract:
This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
Submitted 10 October, 2024; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Cheap Talk Discovery and Utilization in Multi-Agent Reinforcement Learning
Authors:
Yat Long Lo,
Christian Schroeder de Witt,
Samuel Sokota,
Jakob Nicolaus Foerster,
Shimon Whiteson
Abstract:
By enabling agents to communicate, recent cooperative multi-agent reinforcement learning (MARL) methods have demonstrated better task performance and more coordinated behavior. Most existing approaches facilitate inter-agent communication by allowing agents to send messages to each other through free communication channels, i.e., cheap talk channels. Current methods require these channels to be constantly accessible and known to the agents a priori. In this work, we lift these requirements such that the agents must discover the cheap talk channels and learn how to use them. Hence, the problem has two main parts: cheap talk discovery (CTD) and cheap talk utilization (CTU). We introduce a novel conceptual framework for both parts and develop a new algorithm based on mutual information maximization that outperforms existing algorithms in CTD/CTU settings. We also release a novel benchmark suite to stimulate future research in CTD/CTU.
Submitted 19 March, 2023;
originally announced March 2023.
-
SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning
Authors:
Benjamin Ellis,
Jonathan Cook,
Skander Moalla,
Mikayel Samvelyan,
Mingfei Sun,
Anuj Mahajan,
Jakob N. Foerster,
Shimon Whiteson
Abstract:
The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for centralised training with decentralised execution. However, after years of sustained improvement on SMAC, algorithms now achieve near-perfect performance. In this work, we conduct new analysis demonstrating that SMAC lacks the stochasticity and partial observability to require complex *closed-loop* policies. In particular, we show that an *open-loop* policy conditioned only on the timestep can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We also introduce the extended partial observability challenge (EPO), which augments SMACv2 to ensure meaningful partial observability. We show that these changes ensure the benchmark requires the use of *closed-loop* policies. We evaluate state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods. Videos of training are available at https://sites.google.com/view/smacv2.
Submitted 17 October, 2023; v1 submitted 14 December, 2022;
originally announced December 2022.
-
Game-Theoretical Perspectives on Active Equilibria: A Preferred Solution Concept over Nash Equilibria
Authors:
Dong-Ki Kim,
Matthew Riemer,
Miao Liu,
Jakob N. Foerster,
Gerald Tesauro,
Jonathan P. How
Abstract:
Multiagent learning settings are inherently more difficult than single-agent learning because each agent interacts with other simultaneously learning agents in a shared environment. An effective approach in multiagent reinforcement learning is to consider the learning process of agents and influence their future policies toward desirable behaviors from each agent's perspective. Importantly, if each agent maximizes its long-term rewards by accounting for the impact of its behavior on the set of convergence policies, the resulting multiagent system reaches an active equilibrium. While this new solution concept is general such that standard solution concepts, such as a Nash equilibrium, are special cases of active equilibria, it is unclear when an active equilibrium is a preferred equilibrium over other solution concepts. In this paper, we analyze active equilibria from a game-theoretic perspective by closely studying examples where Nash equilibria are known. By directly comparing active equilibria to Nash equilibria in these examples, we find that active equilibria find more effective solutions than Nash equilibria, concluding that an active equilibrium is the desired solution for multiagent learning settings.
Submitted 28 October, 2022;
originally announced October 2022.
-
Proximal Learning With Opponent-Learning Awareness
Authors:
Stephen Zhao,
Chris Lu,
Roger Baker Grosse,
Jakob Nicolaus Foerster
Abstract:
Learning With Opponent-Learning Awareness (LOLA) (Foerster et al. [2018a]) is a multi-agent reinforcement learning algorithm that typically learns reciprocity-based cooperation in partially competitive environments. However, LOLA often fails to learn such behaviour on more complex policy spaces parameterized by neural networks, partly because the update rule is sensitive to the policy parameterization. This problem is especially pronounced in the opponent modeling setting, where the opponent's policy is unknown and must be inferred from observations; in such settings, LOLA is ill-specified because behaviorally equivalent opponent policies can result in non-equivalent updates. To address this shortcoming, we reinterpret LOLA as approximating a proximal operator, and then derive a new algorithm, proximal LOLA (POLA), which uses the proximal formulation directly. Unlike LOLA, the POLA updates are parameterization invariant, in the sense that when the proximal objective has a unique optimum, behaviorally equivalent policies result in behaviorally equivalent updates. We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA.
Submitted 18 October, 2022;
originally announced October 2022.
-
Learning to Optimize Quasi-Newton Methods
Authors:
Isaac Liao,
Rumen R. Dangovski,
Jakob N. Foerster,
Marin Soljačić
Abstract:
Fast gradient-based optimization algorithms have become increasingly essential for the computationally efficient training of machine learning models. One technique is to multiply the gradient by a preconditioner matrix to produce a step, but it is unclear what the best preconditioner matrix is. This paper introduces a novel machine learning optimizer called LODO, which tries to online meta-learn the best preconditioner during optimization. Specifically, our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn preconditioners parameterized as neural networks; they are more flexible than preconditioners in other quasi-Newton methods. Unlike other L2O methods, LODO does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify that our algorithm can optimize in noisy settings, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters at speeds comparable to those of standard neural network optimizers.
Submitted 11 September, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Self-Explaining Deviations for Coordination
Authors:
Hengyuan Hu,
Samuel Sokota,
David Wu,
Anton Bakhtin,
Andrei Lupu,
Brandon Cui,
Jakob N. Foerster
Abstract:
Fully cooperative, partially observable multi-agent problems are ubiquitous in the real world. In this paper, we focus on a specific subclass of coordination problems in which humans are able to discover self-explaining deviations (SEDs). SEDs are actions that deviate from the common understanding of what reasonable behavior would be in normal circumstances. They are taken with the intention of causing another agent or other agents to realize, using theory of mind, that the circumstance must be abnormal. We first motivate SEDs with a real-world example and formalize their definition. Next, we introduce a novel algorithm, improvement maximizing self-explaining deviations (IMPROVISED), to perform SEDs. Lastly, we evaluate IMPROVISED both in an illustrative toy setting and the popular benchmark setting Hanabi, where it is the first method to produce so-called finesse plays, which are regarded as one of the more iconic examples of human theory of mind.
Submitted 13 July, 2022;
originally announced July 2022.
-
Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks
Authors:
Tim Franzmeyer,
Stephen McAleer,
João F. Henriques,
Jakob N. Foerster,
Philip H. S. Torr,
Adel Bibi,
Christian Schroeder de Witt
Abstract:
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them detectable using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce ε-illusory, a novel form of adversarial attack on sequential decision-makers that is both effective and of ε-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end. Compared to existing attacks, we empirically find ε-illusory to be significantly harder to detect with automated methods, and a small study with human participants (IRB approval under reference R84123/RE001) suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses. The project website can be found at https://tinyurl.com/illusory-attacks.
Submitted 6 May, 2024; v1 submitted 20 July, 2022;
originally announced July 2022.
-
K-level Reasoning for Zero-Shot Coordination in Hanabi
Authors:
Brandon Cui,
Hengyuan Hu,
Luis Pineda,
Jakob N. Foerster
Abstract:
The standard problem setting in cooperative multi-agent settings is self-play (SP), where the goal is to train a team of agents that works well together. However, optimal SP policies commonly contain arbitrary conventions ("handshakes") and are not compatible with other, independently trained agents or humans. This latter desideratum was recently formalized by Hu et al. (2020) as the zero-shot coordination (ZSC) setting and partially addressed with their Other-Play (OP) algorithm, which showed improved ZSC and human-AI performance in the card game Hanabi. OP assumes access to the symmetries of the environment and prevents agents from breaking these in a mutually incompatible way during training. However, as the authors point out, discovering symmetries for a given environment is a computationally hard problem. Instead, we show that through a simple adaptation of k-level reasoning (KLR; Costa Gomes et al. 2006), synchronously training all levels, we can obtain competitive ZSC and ad-hoc teamplay performance in Hanabi, including when paired with a human-like proxy bot. We also introduce a new method, synchronous-k-level reasoning with a best response (SyKLRBR), which further improves performance on our synchronous KLR by co-training a best response.
Submitted 14 July, 2022;
originally announced July 2022.
-
Influencing Long-Term Behavior in Multiagent Reinforcement Learning
Authors:
Dong-Ki Kim,
Matthew Riemer,
Miao Liu,
Jakob N. Foerster,
Michael Everett,
Chuangchuang Sun,
Gerald Tesauro,
Jonathan P. How
Abstract:
The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.
Submitted 15 October, 2022; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Reinforcement Learning Enhanced Quantum-inspired Algorithm for Combinatorial Optimization
Authors:
Dmitrii Beloborodov,
A. E. Ulanov,
Jakob N. Foerster,
Shimon Whiteson,
A. I. Lvovsky
Abstract:
Quantum hardware and quantum-inspired algorithms are becoming increasingly popular for combinatorial optimization. However, these algorithms may require careful hyperparameter tuning for each problem instance. We use a reinforcement learning agent in conjunction with a quantum-inspired algorithm to solve the Ising energy minimization problem, which is equivalent to the Maximum Cut problem. The agent controls the algorithm by tuning one of its parameters with the goal of improving recently seen solutions. We propose a new Rescaled Ranked Reward (R3) method that enables a stable single-player version of self-play training and helps the agent to escape local optima. The training on any problem instance can be accelerated by applying transfer learning from an agent trained on randomly generated problems. Our approach allows sampling high-quality solutions to the Ising problem with high probability and outperforms both baseline heuristics and a black-box hyperparameter optimization approach.
Submitted 14 February, 2020; v1 submitted 11 February, 2020;
originally announced February 2020.
-
Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning
Authors:
Hengyuan Hu,
Jakob N. Foerster
Abstract:
In recent years we have seen fast progress on a number of benchmark problems in AI, with modern methods achieving near- or superhuman performance in Go, Poker and Dota. One common aspect of all of these challenges is that they are by design adversarial or, technically speaking, zero-sum. In contrast to these settings, success in the real world commonly requires humans to collaborate and communicate with others, in settings that are, at least partially, cooperative. In the last year, the card game Hanabi has been established as a new benchmark environment for AI to fill this gap. In particular, Hanabi is interesting to humans since it is entirely focused on theory of mind, i.e., the ability to effectively reason over the intentions, beliefs and point of view of other agents when observing their actions. Learning to be informative when observed by others is an interesting challenge for Reinforcement Learning (RL): Fundamentally, RL requires agents to explore in order to discover good policies. However, when done naively, this randomness will inherently make their actions less informative to others during training. We present a new deep multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this contradiction by exploiting the centralized training phase. During training, SAD allows other agents to observe not only the (exploratory) action chosen but also the greedy action of their teammates. By combining this simple intuition with best practices for multi-agent learning, SAD establishes a new SOTA for learning methods for 2-5 players on the self-play part of the Hanabi challenge. Our ablations show the contributions of SAD compared with the best practice components. All of our code and trained agents are available at https://github.com/facebookresearch/Hanabi_SAD.
Submitted 12 May, 2021; v1 submitted 4 December, 2019;
originally announced December 2019.
-
Robust Visual Domain Randomization for Reinforcement Learning
Authors:
Reda Bahi Slaoui,
William R. Clements,
Jakob N. Foerster,
Sébastien Toth
Abstract:
Producing agents that can generalize to a wide range of visually different environments is a significant challenge in reinforcement learning. One method for overcoming this issue is visual domain randomization, whereby at the start of each training episode some visual aspects of the environment are randomized so that the agent is exposed to many possible variations. However, domain randomization is highly inefficient and may lead to policies with high variance across domains. Instead, we propose a regularization method whereby the agent is only trained on one variation of the environment, and its learned state representations are regularized during training to be invariant across domains. We conduct experiments that demonstrate that our technique leads to more efficient and robust learning than standard domain randomization, while achieving equal generalization scores.
Submitted 6 March, 2020; v1 submitted 23 October, 2019;
originally announced October 2019.
-
Exploratory Combinatorial Optimization with Reinforcement Learning
Authors:
Thomas D. Barrett,
William R. Clements,
Jakob N. Foerster,
A. I. Lvovsky
Abstract:
Many real-world problems can be reduced to combinatorial optimization on a graph, where the subset or ordering of vertices that maximizes some objective function must be found. With such tasks often NP-hard and analytically intractable, reinforcement learning (RL) has shown promise as a framework with which efficient heuristic methods to tackle these problems can be learned. Previous works construct the solution subset incrementally, adding one element at a time; however, the irreversible nature of this approach prevents the agent from revising its earlier decisions, which may be necessary given the complexity of the optimization task. We instead propose that the agent should seek to continuously improve the solution by learning to explore at test time. Our approach of exploratory combinatorial optimization (ECO-DQN) is, in principle, applicable to any combinatorial problem that can be defined on a graph. Experimentally, we show our method to produce state-of-the-art RL performance on the Maximum Cut problem. Moreover, because ECO-DQN can start from any arbitrary configuration, it can be combined with other search methods to further improve performance, which we demonstrate using a simple random search.
Submitted 31 January, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
The Hanabi Challenge: A New Frontier for AI Research
Authors:
Nolan Bard,
Jakob N. Foerster,
Sarath Chandar,
Neil Burch,
Marc Lanctot,
H. Francis Song,
Emilio Parisotto,
Vincent Dumoulin,
Subhodeep Moitra,
Edward Hughes,
Iain Dunning,
Shibl Mourad,
Hugo Larochelle,
Marc G. Bellemare,
Michael Bowling
Abstract:
From the early days of computing, games have been important testbeds for studying how well machines can do sophisticated decision making. In recent years, machine learning has made dramatic advances with artificial agents reaching superhuman performance in challenge domains like Go, Atari, and some variants of poker. As with their predecessors of chess, checkers, and backgammon, these game domains have driven research by providing sophisticated yet well-defined challenges for artificial intelligence practitioners. We continue this tradition by proposing the game of Hanabi as a new challenge domain with novel problems that arise from its combination of purely cooperative gameplay with two to five players and imperfect information. In particular, we argue that Hanabi elevates reasoning about the beliefs and intentions of other agents to the foreground. We believe developing novel techniques for such theory of mind reasoning will not only be crucial for success in Hanabi, but also in broader collaborative efforts, especially those with human partners. To facilitate future research, we introduce the open-source Hanabi Learning Environment, propose an experimental framework for the research community to evaluate algorithmic advances, and assess the performance of current state-of-the-art techniques.
Submitted 6 December, 2019; v1 submitted 1 February, 2019;
originally announced February 2019.
-
Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
Authors:
Jakob N. Foerster,
Francis Song,
Edward Hughes,
Neil Burch,
Iain Dunning,
Shimon Whiteson,
Matthew Botvinick,
Michael Bowling
Abstract:
When observing the actions of others, humans make inferences about why they acted as they did, and what this implies about the world; humans also use the fact that their actions will be interpreted in this manner, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. BAD introduces a new Markov decision process, the public belief MDP, in which the action space consists of all deterministic partial policies, and exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over all partial policies mapping private information into environment actions. The Bayesian update is closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms policy gradient methods; we then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where, in the two-player setting, it surpasses all previously published learning and hand-coded approaches, establishing a new state of the art.
Submitted 10 September, 2019; v1 submitted 4 November, 2018;
originally announced November 2018.
-
Multi-Agent Common Knowledge Reinforcement Learning
Authors:
Christian A. Schroeder de Witt,
Jakob N. Foerster,
Gregory Farquhar,
Philip H. S. Torr,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
Cooperative multi-agent reinforcement learning often requires decentralised policies, which severely limit the agents' ability to coordinate their behaviour. In this paper, we show that common knowledge between agents allows for complex decentralised coordination. Common knowledge arises naturally in a large number of decentralised cooperative multi-agent tasks, for example, when agents can reconstruct parts of each other's observations. Since agents can independently agree on their common knowledge, they can execute complex coordinated policies that condition on this knowledge in a fully decentralised fashion. We propose multi-agent common knowledge reinforcement learning (MACKRL), a novel stochastic actor-critic algorithm that learns a hierarchical policy tree. Higher levels in the hierarchy coordinate groups of agents by conditioning on their common knowledge, or delegate to lower levels with smaller subgroups but potentially richer common knowledge. The entire policy tree can be executed in a fully decentralised fashion. As the lowest policy tree level consists of independent policies for each agent, MACKRL reduces to independently learnt decentralised policies as a special case. We demonstrate that our method can exploit common knowledge for superior performance on complex decentralised coordination tasks, including a stochastic matrix game and challenging problems in StarCraft II unit micromanagement.
Submitted 11 January, 2020; v1 submitted 27 October, 2018;
originally announced October 2018.
-
Complete event-by-event $α$/$γ(β)$ separation in a full-size TeO$_2$ CUORE bolometer by Neganov-Luke-magnified light detection
Authors:
L. Bergé,
M. Chapellier,
M. de Combarieu,
L. Dumoulin,
A. Giuliani,
M. Gros,
P. de Marcillac,
S. Marnieros,
C. Nones,
V. Novati,
E. Olivieri,
B. Paul,
D. V. Poda,
T. Redon,
B. Siebenborn,
A. S. Zolotarova,
E. Armengaud,
C. Augier,
A. Benoît,
J. Billard,
A. Broniatowski,
P. Camus,
A. Cazes,
F. Charlieux,
M. De Jesus
, et al. (19 additional authors not shown)
Abstract:
In the present work, we describe the results obtained with a large ($\approx 133$ cm$^3$) TeO$_2$ bolometer, with a view to a search for neutrinoless double-beta decay ($0νββ$) of $^{130}$Te. We demonstrate an efficient $α$ particle discrimination (99.9\%) with a high acceptance of the $0νββ$ signal (about 96\%), expected at $\approx 2.5$ MeV. This unprecedented result was possible thanks to the superior performance (10 eV rms baseline noise) of a Neganov-Luke-assisted germanium bolometer used to detect a tiny (70 eV) light signal from the TeO$_2$ detector, dominated by $γ$($β$)-induced Cherenkov radiation but also exhibiting a clear scintillation component. The obtained results represent a major breakthrough towards the TeO$_2$-based version of CUORE Upgrade with Particle IDentification (CUPID), a ton-scale cryogenic $0νββ$ experiment proposed as a follow-up to the CUORE project with particle identification. The CUORE experiment recently began a search for neutrinoless double-beta decay of $^{130}$Te with an array of 988 125-cm$^3$ TeO$_2$ bolometers. The lack of $α$ discrimination in CUORE makes $α$ decays at the detector surface the dominant background component, at the level of $\approx 0.01$ counts/(keV kg y) in the region of interest. We show here, for the first time with a CUORE-size bolometer and using the same technology as CUORE for the readout of both heat and light signals, that surface $α$ background can be fully rejected.
Submitted 25 April, 2018; v1 submitted 10 October, 2017;
originally announced October 2017.
-
Learning with Opponent-Learning Awareness
Authors:
Jakob N. Foerster,
Richard Y. Chen,
Maruan Al-Shedivat,
Shimon Whiteson,
Pieter Abbeel,
Igor Mordatch
Abstract:
Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also can be extended to hierarchical RL, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma, while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round robin tournament we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid world task with an embedded social dilemma using recurrent policies and opponent modelling. By explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest. The code is at github.com/alshedivat/lola.
Submitted 19 September, 2018; v1 submitted 13 September, 2017;
originally announced September 2017.
-
Optimizing EDELWEISS detectors for low-mass WIMP searches
Authors:
EDELWEISS Collaboration,
Q. Arnaud,
E. Armengaud,
C. Augier,
A. Benoît,
L. Bergé,
J. Billard,
A. Broniatowski,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
M. De Jésus,
L. Dumoulin,
K. Eitel,
N. Foerster,
J. Gascon,
A. Giuliani,
M. Gros,
L. Hehn,
Y. Jin,
A. Juillard,
M. Kleifges,
V. Kozlov,
H. Kraus
, et al. (18 additional authors not shown)
Abstract:
The physics potential of EDELWEISS detectors for the search of low-mass Weakly Interacting Massive Particles (WIMPs) is studied. Using a data-driven background model, projected exclusion limits are computed using frequentist and multivariate analysis approaches, namely profile likelihood and boosted decision tree. Both current and achievable experimental performance are considered. The optimal strategy for detector optimization depends critically on whether the emphasis is put on WIMP masses below or above $\sim$ 5 GeV/c$^2$. The projected sensitivity for the next phase of the EDELWEISS-III experiment at the Modane Underground Laboratory (LSM) for low-mass WIMP search is presented. By 2018 an upper limit on the spin-independent WIMP-nucleon cross-section of $σ_{SI} = 7 \times 10^{-42}$ cm$^2$ is expected for a WIMP mass in the range 2$-$5 GeV/c$^2$. The requirements for a future hundred-kilogram scale experiment designed to reach the bounds imposed by the coherent scattering of solar neutrinos are also described. By improving the ionization resolution down to 50 eV$_{ee}$, we show that such an experiment installed in an even lower background environment (e.g. at SNOLAB) should allow the observation of about 80 $^8$B neutrino events after discrimination.
Submitted 11 July, 2017;
originally announced July 2017.
-
Performance of the EDELWEISS-III experiment for direct dark matter searches
Authors:
E. Armengaud,
Q. Arnaud,
C. Augier,
A. Benoît,
L. Bergé,
T. Bergmann,
J. Billard,
T. de Boissière,
G. Bres,
A. Broniatowski,
V. Brudanin,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
M. De Jésus,
L. Dumoulin,
K. Eitel,
D. Filosofov,
N. Foerster,
N. Fourches,
G. Garde,
J. Gascon,
A. Giuliani,
M. Grollier
, et al. (38 additional authors not shown)
Abstract:
We present the results of measurements demonstrating the efficiency of the EDELWEISS-III array of cryogenic germanium detectors for direct dark matter searches. The experimental setup and the FID (Fully Inter-Digitized) detector array are described, as well as the efficiency of the double measurement of heat and ionization signals in background rejection. For the whole set of 24 FID detectors used for coincidence studies, the baseline resolutions for the fiducial ionization energy are mainly below 0.7 keV$_{ee}$ (FWHM) whereas the baseline resolutions for heat energies are mainly below 1.5 keV$_{ee}$ (FWHM). The response to nuclear recoils as well as the very good discrimination capability of the FID design has been measured with an AmBe source. The surface $β$- and $α$-decay rejection power of $R_{\rm surf} < 4 \times 10^{-5}$ per $α$ at 90% C.L. has been determined with a $^{210}$Pb source, and the rejection of bulk $γ$-ray events has been demonstrated using $γ$-calibrations with $^{133}$Ba sources leading to a value of $R_{γ{\rm -mis-fid}} < 2.5 \times 10^{-6}$ at 90% C.L. The current levels of natural radioactivity measured in the detector array are shown as the rate of single $γ$ background. The fiducial volume fraction of the FID detectors has been measured to a weighted average value of $(74.6 \pm 0.4)\%$ using the cosmogenic activation of the $^{65}$Zn and $^{68,71}$Ge isotopes. The stability and uniformity of the detector response are also discussed. The achieved resolutions, thresholds and background levels of the upgraded EDELWEISS-III detectors in their setup are thus well suited to the direct search of WIMP dark matter over a large mass range.
Submitted 4 June, 2017;
originally announced June 2017.
-
Development of $^{100}$Mo-containing scintillating bolometers for a high-sensitivity neutrinoless double-beta decay search
Authors:
E. Armengaud,
C. Augier,
A. S. Barabash,
J. W. Beeman,
T. B. Bekker,
F. Bellini,
A. Benoît,
L. Bergé,
T. Bergmann,
J. Billard,
R. S. Boiko,
A. Broniatowski,
V. Brudanin,
P. Camus,
S. Capelli,
L. Cardani,
N. Casali,
A. Cazes,
M. Chapellier,
F. Charlieux,
D. M. Chernyak,
M. de Combarieu,
N. Coron,
F. A. Danevich,
I. Dafinei
, et al. (77 additional authors not shown)
Abstract:
This paper reports on the development of a technology involving $^{100}$Mo-enriched scintillating bolometers, compatible with the goals of CUPID, a proposed next-generation bolometric experiment to search for neutrinoless double-beta decay. Large mass ($\sim$1~kg), high optical quality, radiopure $^{100}$Mo-containing zinc and lithium molybdate crystals have been produced and used to develop high performance single detector modules based on 0.2--0.4~kg scintillating bolometers. In particular, the energy resolution of the lithium molybdate detectors near the $Q$-value of the double-beta transition of $^{100}$Mo (3034~keV) is 4--6~keV FWHM. The rejection of the $α$-induced dominant background above 2.6~MeV is better than 8$σ$. Less than 10~$μ$Bq/kg activity of $^{232}$Th ($^{228}$Th) and $^{226}$Ra in the crystals is ensured by boule recrystallization. The potential of $^{100}$Mo-enriched scintillating bolometers to perform high sensitivity double-beta decay searches has been demonstrated with only 10~kg$\times$d exposure: the two neutrino double-beta decay half-life of $^{100}$Mo has been measured with the highest accuracy to date as $T_{1/2}$ = [6.90 $\pm$ 0.15(stat.) $\pm$ 0.37(syst.)] $\times$ 10$^{18}$~yr. Both crystallization and detector technologies favor lithium molybdate, which has been selected for the ongoing construction of the CUPID-0/Mo demonstrator, containing several kg of $^{100}$Mo.
Submitted 4 October, 2017; v1 submitted 6 April, 2017;
originally announced April 2017.
-
Input Switched Affine Networks: An RNN Architecture Designed for Interpretability
Authors:
Jakob N. Foerster,
Justin Gilmer,
Jan Chorowski,
Jascha Sohl-Dickstein,
David Sussillo
Abstract:
There exist many problem domains where the interpretability of neural network models is essential for deployment. Here we introduce a recurrent architecture composed of input-switched affine transformations - in other words an RNN without any explicit nonlinearities, but with input-dependent recurrent weights. This simple form allows the RNN to be analyzed via straightforward linear methods: we can exactly characterize the linear contribution of each input to the model predictions; we can use a change-of-basis to disentangle input, output, and computational hidden unit subspaces; we can fully reverse-engineer the architecture's solution to a simple task. Despite this ease of interpretation, the input switched affine network achieves reasonable performance on a text modeling task, and allows greater computational efficiency than networks with standard nonlinearities.
Submitted 12 June, 2017; v1 submitted 28 November, 2016;
originally announced November 2016.
-
Measurement of the cosmogenic activation of germanium detectors in EDELWEISS-III
Authors:
The EDELWEISS Collaboration,
E. Armengaud,
Q. Arnaud,
C. Augier,
A. Benoît,
L. Bergé,
J. Billard,
J. Blümer,
T. de Boissière,
A. Broniatowski,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
M. De Jésus,
L. Dumoulin,
K. Eitel,
N. Foerster,
J. Gascon,
A. Giuliani,
M. Gros,
L. Hehn,
G. Heuermann,
Y. Jin,
A. Juillard
, et al. (24 additional authors not shown)
Abstract:
We present a measurement of the cosmogenic activation in the germanium cryogenic detectors of the EDELWEISS III direct dark matter search experiment. The decay rates measured in detectors with different exposures to cosmic rays above ground are converted into production rates of different isotopes. The measured production rates in units of nuclei/kg/day are 82 $\pm$ 21 for $^3$H, 2.8 $\pm$ 0.6 for $^{49}$V, 4.6 $\pm$ 0.7 for $^{55}$Fe, and 106 $\pm$ 13 for $^{65}$Zn. These results are the most accurate for these isotopes. A lower limit on the production rate of $^{68}$Ge of 74 nuclei/kg/day is also presented. They are compared to model predictions present in literature and to estimates calculated with the ACTIVIA code.
Submitted 15 July, 2016;
originally announced July 2016.
-
Improved EDELWEISS-III sensitivity for low-mass WIMPs using a profile likelihood approach
Authors:
EDELWEISS Collaboration,
L. Hehn,
E. Armengaud,
Q. Arnaud,
C. Augier,
A. Benoît,
L. Bergé,
J. Billard,
J. Blümer,
T. de Boissière,
A. Broniatowski,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
M. De Jésus,
L. Dumoulin,
K. Eitel,
N. Foerster,
J. Gascon,
A. Giuliani,
M. Gros,
G. Heuermann,
Y. Jin,
A. Juillard
, et al. (24 additional authors not shown)
Abstract:
We report on a dark matter search for a Weakly Interacting Massive Particle (WIMP) in the mass range $m_χ\in [4, 30]\,\mathrm{GeV}/c^2$ with the EDELWEISS-III experiment. A 2D profile likelihood analysis is performed on data from eight selected detectors with the lowest energy thresholds leading to a combined fiducial exposure of 496 kg-days. External backgrounds from $γ$- and $β$-radiation, recoils from $^{206}$Pb and neutrons as well as detector intrinsic backgrounds were modelled from data outside the region of interest and constrained in the analysis. The basic data selection and most of the background models are the same as those used in a previously published analysis based on Boosted Decision Trees (BDT). For the likelihood approach applied in the analysis presented here, a larger signal efficiency and a subtraction of the expected background lead to a higher sensitivity, especially for the lowest WIMP masses probed. No statistically significant signal was found and upper limits on the spin-independent WIMP-nucleon scattering cross section can be set with a hypothesis test based on the profile likelihood test statistics. The 90% C.L. exclusion limit set for WIMPs with $m_χ= 4\,\mathrm{GeV/}c^2$ is $1.6 \times 10^{-39}\,\mathrm{cm^2}$, which is an improvement of a factor of seven with respect to the BDT-based analysis. For WIMP masses above $15\,\mathrm{GeV/}c^2$ the exclusion limits found with both analyses are in good agreement.
Submitted 20 September, 2016; v1 submitted 12 July, 2016;
originally announced July 2016.
-
Signals induced by charge-trapping in EDELWEISS FID detectors: analytical modeling and applications
Authors:
The EDELWEISS Collaboration,
Q. Arnaud,
E. Armengaud,
C. Augier,
A. Benoît,
L. Bergé,
J. Billard,
J. Blümer,
T. de Boissière,
A. Broniatowski,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
L. Dumoulin,
K. Eitel,
N. Foerster,
N. Fourches,
J. Gascon,
A. Giuliani,
M. Gros,
L. Hehn,
G. Heuermann,
M. De Jésus,
Y. Jin
, et al. (25 additional authors not shown)
Abstract:
The EDELWEISS-III direct dark matter search experiment uses cryogenic HP-Ge detectors Fully covered with Inter-Digitized electrodes (FID). They are operated at low fields ($<1\;\mathrm{V/cm}$), and as a consequence charge-carrier trapping significantly affects both the ionization and heat energy measurements. This paper describes an analytical model of the signals induced by trapped charges in FID detectors based on the Shockley-Ramo theorem. It is used to demonstrate that veto electrodes, initially designed for the sole purpose of surface event rejection, can be used to provide a sensitivity to the depth of the energy deposits, characterize the trapping in the crystals, perform heat and ionization energy corrections and improve the ionization baseline resolutions. These procedures are applied successfully to actual data.
Submitted 29 June, 2016; v1 submitted 26 June, 2016;
originally announced June 2016.
-
Learning to Communicate with Deep Multi-Agent Reinforcement Learning
Authors:
Jakob N. Foerster,
Yannis M. Assael,
Nando de Freitas,
Shimon Whiteson
Abstract:
We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability. We propose two approaches for learning in these domains: Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The former uses deep Q-learning, while the latter exploits the fact that, during learning, agents can backpropagate error derivatives through (noisy) communication channels. Hence, this approach uses centralised learning but decentralised execution. Our experiments introduce new environments for studying the learning of communication protocols and present a set of engineering innovations that are essential for success in these domains.
Submitted 24 May, 2016; v1 submitted 21 May, 2016;
originally announced May 2016.
-
Constraints on low-mass WIMPs from the EDELWEISS-III dark matter search
Authors:
EDELWEISS Collaboration,
E. Armengaud,
Q. Arnaud,
C. Augier,
A. Benoît,
A. Benoît,
L. Bergé,
T. Bergmann,
J. Billard,
J. Blümer,
T. de Boissière,
G. Bres,
A. Broniatowski,
V. Brudanin,
P. Camus,
A. Cazes,
M. Chapellier,
F. Charlieux,
L. Dumoulin,
K. Eitel,
D. Filosofov,
N. Foerster,
N. Fourches,
G. Garde,
J. Gascon
, et al. (42 additional authors not shown)
Abstract:
We present the results of a search for elastic scattering from galactic dark matter in the form of Weakly Interacting Massive Particles (WIMPs) in the 4-30 GeV/$c^2$ mass range. We use a 582 kg-day fiducial exposure from an array of 800 g germanium bolometers equipped with a set of interleaved electrodes with full surface coverage. We searched specifically for $\sim 2.5-20$ keV nuclear recoils inside the detector fiducial volume. As an illustration, the number of observed events in the search for 5 (resp. 20) GeV/$c^2$ WIMPs is 9 (resp. 4), compared to an expected background of 6.1 (resp. 1.4). A 90% CL limit of $4.3\times 10^{-40}$ cm$^2$ (resp. $9.4\times 10^{-44}$ cm$^2$) is set on the spin-independent WIMP-nucleon scattering cross-section for 5 (resp. 20) GeV/$c^2$ WIMPs. This result represents a 41-fold improvement with respect to the previous EDELWEISS-II low-mass WIMP search for 7 GeV/$c^2$ WIMPs. The derived constraint is in tension with hints of WIMP signals from some recent experiments, thus confirming results obtained with different detection techniques.
Submitted 9 May, 2016; v1 submitted 16 March, 2016;
originally announced March 2016.