Search | arXiv e-print repository

On Efficient Bayesian Exploration in Model-Based Reinforcement Learning

Authors: Alberto Caron, Chris Hicks, Vasilios Mavroudis

Abstract: In this work, we address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. Specifically, we focus on a class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic infor… ▽ More In this work, we address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. Specifically, we focus on a class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards, thereby aligning exploration with genuine knowledge gaps. Our analysis provides formal guarantees for IG-based approaches, which previously lacked theoretical grounding. To enable practical use, we also discuss tractable approximations via sparse variational Gaussian Processes, Deep Kernels and Deep Ensemble models. We then outline a general framework - Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) - which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration. We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks. △ Less

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2507.02083 [pdf, ps, other]

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Authors: Haonan Duan, Stephen Zhewen Lu, Caitlin Fiona Harrigan, Nishkrit Desai, Jiarui Lu, Michał Koziarski, Leonardo Cotta, Chris J. Maddison

Abstract: Designing experiments and result interpretations are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive: in expertise, time and equipment. We… ▽ More Designing experiments and result interpretations are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive: in expertise, time and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These models, encoded in Systems Biology Markup Language, are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models' performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.02068 [pdf, ps, other]

How do Software Engineering Candidates Prepare for Technical Interviews?

Authors: Brian Bell, Teresa Thomas, Sang Won Lee, Chris Brown

Abstract: To obtain employment, aspiring software engineers must complete technical interviews -- a hiring process which involves candidates writing code while communicating to an audience. However, the complexities of tech interviews are difficult to prepare for and seldom faced in computing curricula. To this end, we seek to understand how candidates prepare for technical interviews, investigating the eff… ▽ More To obtain employment, aspiring software engineers must complete technical interviews -- a hiring process which involves candidates writing code while communicating to an audience. However, the complexities of tech interviews are difficult to prepare for and seldom faced in computing curricula. To this end, we seek to understand how candidates prepare for technical interviews, investigating the effects of preparation methods and the role of education. We distributed a survey to candidates (n = 131) actively preparing for technical interviews. Our results suggest candidates rarely train in authentic settings and courses fail to support preparation efforts -- leading to stress and unpreparedness. Based on our findings, we provide implications for stakeholders to enhance tech interview preparation for candidates pursuing software engineering roles. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01968 [pdf, ps, other]

doi 10.1108/JM2-11-2023-0263

Optimising task allocation to balance business goals and worker well-being for financial service workforces

Authors: Chris Duckworth, Zlatko Zlatev, James Sciberras, Peter Hallett, Enrico Gerding

Abstract: Purpose: Financial service companies manage huge volumes of data which requires timely error identification and resolution. The associated tasks to resolve these errors frequently put financial analyst workforces under significant pressure leading to resourcing challenges and increased business risk. To address this challenge, we introduce a formal task allocation model which considers both busine… ▽ More Purpose: Financial service companies manage huge volumes of data which requires timely error identification and resolution. The associated tasks to resolve these errors frequently put financial analyst workforces under significant pressure leading to resourcing challenges and increased business risk. To address this challenge, we introduce a formal task allocation model which considers both business orientated goals and analyst well-being. Methodology: We use a Genetic Algorithm (GA) to optimise our formal model to allocate and schedule tasks to analysts. The proposed solution is able to allocate tasks to analysts with appropriate skills and experience, while taking into account staff well-being objectives. Findings: We demonstrate our GA model outperforms baseline heuristics, current working practice, and is applicable to a range of single and multi-objective real-world scenarios. We discuss the potential for metaheuristics (such as GAs) to efficiently find sufficiently good allocations which can provide recommendations for financial service managers in-the-loop. Originality: A key gap in existing allocation and scheduling models, is fully considering worker well-being. This paper presents an allocation model which explicitly optimises for well-being while still improving on current working practice for efficiency. △ Less

Submitted 18 June, 2025; originally announced July 2025.

Comments: Accepted in Journal of Modelling in Management

Journal ref: ISSN 1746-5664 (2025)

arXiv:2507.01352 [pdf, ps, other]

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Authors: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou

Abstract: Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesi… ▽ More Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality. △ Less

Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01134 [pdf, ps, other]

Animated Visual Encoding and Layer Blending for Identification of Educational Game Strategies

Authors: Braden Roper, William Thompson, Chris Weaver

Abstract: Game-Based Learning has proven to be an effective method for enhancing engagement with educational material. However, gaining a deeper understanding of player strategies remains challenging. Sequential game-state and action-based tracking tools often gather extensive data that can be difficult to interpret as long-term strategy. This data presents unique problems to visualization, as it can be fai… ▽ More Game-Based Learning has proven to be an effective method for enhancing engagement with educational material. However, gaining a deeper understanding of player strategies remains challenging. Sequential game-state and action-based tracking tools often gather extensive data that can be difficult to interpret as long-term strategy. This data presents unique problems to visualization, as it can be fairly natural, noisy data but is constrained within synthetic, controlled environments, leading to issues such as overplotting which can make interpretation complicated. We propose an animated visual encoding tool that utilizes kinetic visualization to address these issues. This tool enables researchers to construct animated data narratives through the configuration of parameter interpolation curves and blending layers. Finally, we demonstrate the usefulness of the tool while addressing specific interests as outlined by a domain expert collaborator. △ Less