Search | arXiv e-print repository

Configurable Preference Tuning with Rubric-Guided Synthetic Data

Abstract: Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets, and fine-tuned models, are released at https://github.com/vicgalle/configurable-preference-tuning

Submitted 13 June, 2025; originally announced June 2025.

Comments: Accepted to ICML 2025 Workshop on Models of Human Feedback for AI Alignment

arXiv:2502.07985 [pdf, other]

MetaSC: Test-Time Safety Specification Optimization for Language Models

Authors: Víctor Gallego

Abstract: We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also in diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code released at https://github.com/vicgalle/meta-self-critique.git

Submitted 7 April, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: Published at ICLR 2025 Workshop on Foundation Models in the Wild

Journal ref: ICLR 2025 Workshop on Foundation Models in the Wild

arXiv:2406.07188 [pdf, other]

Merging Improves Self-Critique Against Jailbreak Attacks

Authors: Victor Gallego

Abstract: The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can reduce the attack success rate of adversaries significantly, thus offering a promising defense mechanism against jailbreak attacks. Code, data and models released at https://github.com/vicgalle/merging-self-critique-jailbreaks

Submitted 14 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Published at ICML 2024 Workshop on Foundation Models in the Wild

arXiv:2404.00495 [pdf, other]

Configurable Safety Tuning of Language Models with Synthetic Preference Data

Authors: Victor Gallego

Abstract: State-of-the-art language model fine-tuning techniques, such as Direct Preference Optimization (DPO), restrict user control by hard-coding predefined behaviors into the model. To address this, we propose a novel method, Configurable Safety Tuning (CST), that augments DPO using synthetic preference data to facilitate flexible safety configuration of LLMs at inference time. CST overcomes the constraints of vanilla DPO by introducing a system prompt specifying safety configurations, enabling LLM deployers to disable or enable safety preferences based on their needs, just by changing the system prompt. Our experimental evaluations indicate that CST successfully manages different safety configurations and retains the original functionality of LLMs, showing it is a robust method for configurable deployment. Data and models available at https://github.com/vicgalle/configurable-safety-tuning

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2402.08005 [pdf, other]

Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs

Authors: Víctor Gallego

Abstract: In this paper, we introduce refined Direct Preference Optimization (rDPO), a method for improving the behavioral alignment of Large Language Models (LLMs) without the need for human-annotated data. The method involves creating synthetic data using self-critique prompting by a teacher LLM and then utilising a generalized DPO loss function to distil to a student LLM. The loss function incorporates an additional external reward model to improve the quality of synthetic data, making rDPO robust to potential noise in the synthetic dataset. rDPO is shown to be effective in a diverse set of behavioural alignment tasks, such as improved safety, robustness against role-playing, and reduced sycophancy. Code to be released at https://github.com/vicgalle/refined-dpo.

Submitted 12 February, 2024; originally announced February 2024.

Comments: Pre-print. Submitted to the ICLR 2024 Workshop on Representational Alignment (Re-Align)

arXiv:2312.01957 [pdf, other]

Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective

Authors: Victor Gallego

Abstract: This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of an LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Only requiring synthetic data, dSC is exercised in experiments regarding safety, sentiment, and privacy control, showing it can be a viable and cheap alternative to align LLMs. Code released at https://github.com/vicgalle/distilled-self-critique.

Submitted 11 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted to ICLR 2024 (TinyPapers track)

Journal ref: The Second Tiny Papers Track at ICLR 2024

arXiv:2308.07929 [pdf, other]

Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation

Authors: Victor Gallego

Abstract: Recently, large multimodal models, such as CLIP and Stable Diffusion, have experienced tremendous success in both foundations and applications. However, as these models increase in parameter size and computational requirements, it becomes more challenging for users to personalize them for specific tasks or preferences. In this work, we address the problem of adapting these models towards sets of particular human preferences, aligning the retrieved or generated images with the preferences of the user. We leverage the Bradley-Terry preference model to develop a fast adaptation method that efficiently fine-tunes the original model, with few examples and with minimal computing resources. Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding, including preference prediction as a reward model, and generation tasks.

Submitted 21 September, 2023; v1 submitted 15 July, 2023; originally announced August 2023.

Comments: Accepted to Proceedings of the 23rd European Young Statisticians Meeting (EYSM)

arXiv:2308.06385 [pdf, other]

ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF

Authors: Victor Gallego

Abstract: In this work, we address the problem of directing the text generation of a language model (LM) towards a desired behavior, aligning the generated text with the preferences of the human operator. We propose using another, instruction-tuned language model as a critic reward model in a zero-shot way, by prompting it with a Yes-No question that represents the user preferences, without requiring further labeled data. This zero-shot reward model provides the learning signal to further fine-tune the base LM using Reinforcement Learning from AI Feedback (RLAIF); yet our approach is also compatible with other contexts such as quality-diversity search. Extensive evidence of the capabilities of the proposed ZYN framework is provided through experiments in different domains related to text generation, including detoxification; optimizing sentiment of movie reviews, or any other attribute; steering the opinion about a particular topic the model may have; and personalizing prompt generators for text-to-image tasks. Code available at https://github.com/vicgalle/zero-shot-reward-models/.

Submitted 14 December, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: pre-print, work in progress

arXiv:2302.06427 [pdf]

HERMES: qualification of High pErformance pRogrammable Microprocessor and dEvelopment of Software ecosystem

Authors: Nadia Ibellaatti, Edouard Lepape, Alp Kilic, Kaya Akyel, Kassem Chouayakh, Fabrizio Ferrandi, Claudio Barone, Serena Curzel, Michele Fiorito, Giovanni Gozzi, Miguel Masmano, Ana Risquez Navarro, Manuel Muñoz, Vicente Nicolau Gallego, Patricia Lopez Cueva, Jean-noel Letrillard, Franck Wartel

Abstract: European efforts to boost competitiveness in the sector of space services promote the research and development of advanced software and hardware solutions. The EU-funded HERMES project contributes to the effort by qualifying radiation-hardened, high-performance programmable microprocessors, and by developing a software ecosystem that facilitates the deployment of complex applications on such platforms. The main objectives of the project include reaching a technology readiness level of 6 (i.e., validated and demonstrated in relevant environment) for the rad-hard NG-ULTRA FPGA with its ceramic hermetic package CGA 1752, developed within projects of the European Space Agency, French National Centre for Space Studies and the European Union. An equally important share of the project is dedicated to the development and validation of tools that support multicore software programming and FPGA acceleration, including Bambu for High-Level Synthesis and the XtratuM hypervisor with a level one boot loader for virtualization.

Submitted 9 February, 2023; originally announced February 2023.

Comments: Accepted for publication at DATE 2023

arXiv:2209.12330 [pdf, other]

Personalizing Text-to-Image Generation via Aesthetic Gradients

Authors: Victor Gallego

Abstract: This work proposes aesthetic gradients, a method to personalize a CLIP-conditioned diffusion model by guiding the generative process towards custom aesthetics defined by the user from a set of images. The approach is validated with qualitative and quantitative experiments, using the recent stable diffusion model and several aesthetically-filtered datasets. Code is released at https://github.com/vicgalle/stable-diffusion-aesthetic-gradients

Submitted 25 September, 2022; originally announced September 2022.

Comments: Submitted to NeurIPS 2022 Machine Learning for Creativity and Design Workshop

arXiv:2208.01740 [pdf, other]

From Single Aircraft to Communities: A Neutral Interpretation of Air Traffic Complexity Dynamics

Authors: Ralvi Isufaj, Marsel Omeri, Miquel Angel Piera, Jaume Saez Valls, Christian Eduardo Verdonk Gallego

Abstract: Present air traffic complexity metrics are defined considering the interests of different management layers of ATM. These layers have different objectives which in practice compete to maximize their own goals, which leads to fragmented decision making. This fragmentation together with competing KPAs requires transparent and neutral air traffic information to pave the way for an explainable set of actions. In this paper, we introduce the concept of single aircraft complexity, to determine the contribution of each aircraft to the overall complexity of air traffic. Furthermore, we describe a methodology extending this concept to define complex communities, which are groups of interdependent aircraft that contribute the majority of the complexity in a certain airspace. In order to showcase the methodology, a tool that visualizes different outputs of the algorithm is developed. Through use-cases based on synthetic and real historical traffic, we first show that the algorithm can serve to formalize controller decisions as well as guide controllers to better decisions. Further, we investigate how the provided information can be used to increase transparency of the decision makers towards different airspace users, which serves also to increase fairness and equity. Lastly, a sensitivity analysis is conducted in order to systematically analyse how each input affects the methodology.

Submitted 15 July, 2022; originally announced August 2022.

Comments: 21 pages, 30 figures, 2 tables, submitted to Transportation Research Part C

arXiv:2207.07049 [pdf, other]

How do tuna schools associate to dFADs? A study using echo-sounder buoys to identify global patterns

Authors: Manuel Navarro-García, Daniel Precioso, Kathryn Gavira-O'Neill, Alberto Torres-Barrán, David Gordo, Víctor Gallego, David Gómez-Ullate

Abstract: Based on the data gathered by echo-sounder buoys attached to drifting Fish Aggregating Devices (dFADs) across tropical oceans, the current study applies a Machine Learning protocol to examine the temporal trends of tuna schools' association to drifting objects. Using a binary output, metrics typically used in the literature were adapted to account for the fact that the entire tuna aggregation under the dFAD was considered. The median time it took tuna to colonize the dFADs for the first time varied between 25 and 43 days, depending on the ocean, and the longest soak and colonization times were registered in the Pacific Ocean. The tuna schools' Continuous Residence Times were generally shorter than Continuous Absence Times (median values between 5 and 7 days, and 9 and 11 days, respectively), in line with the results found by previous studies. Using a regression output, two novel metrics, namely aggregation time and disaggregation time, were estimated to obtain further insight into the symmetry of the aggregation process. Across all oceans, the time it took for the tuna aggregation to depart from the dFADs was not significantly longer than the time it took for the aggregation to form. The value of these results in the context of the "ecological trap" hypothesis is discussed, and further analyses to enrich and make use of this data source are proposed.

Submitted 14 July, 2022; originally announced July 2022.

arXiv:2109.13232 [pdf, other]

Contributions to Large Scale Bayesian Inference and Adversarial Machine Learning

Authors: Víctor Gallego

Abstract: The rampant adoption of ML methodologies has revealed that models are usually adopted to make decisions without taking into account the uncertainties in their predictions. More critically, they can be vulnerable to adversarial examples. Thus, we believe that developing ML systems that take into account predictive uncertainties and are robust against adversarial examples is a must for critical, real-world tasks. We start with a case study in retailing. We propose a robust implementation of the Nerlove-Arrow model using a Bayesian structural time series model. Its Bayesian nature facilitates incorporating prior information reflecting the manager's views, which can be updated with relevant data. However, this case adopted classical Bayesian techniques, such as the Gibbs sampler. Nowadays, the ML landscape is pervaded with neural networks and this chapter also surveys current developments in this sub-field. Then, we tackle the problem of scaling Bayesian inference to complex models and large data regimes. In the first part, we propose a unifying view of two different Bayesian inference algorithms, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) and Stein Variational Gradient Descent (SVGD), leading to improved and efficient novel sampling schemes. In the second part, we develop a framework to boost the efficiency of Bayesian inference in probabilistic models by embedding a Markov chain sampler within a variational posterior approximation.
After that, we present an alternative perspective on adversarial classification based on adversarial risk analysis, and leveraging the scalable Bayesian approaches from chapter 2. In chapter 4 we turn to reinforcement learning, introducing Threatened Markov Decision Processes, showing the benefits of accounting for adversaries in RL while the agent learns.

Submitted 25 September, 2021; originally announced September 2021.

Comments: PhD thesis

arXiv:2101.10721 [pdf, other]

Data sharing games

Authors: Víctor Gallego, Roi Naveiro, David Ríos Insua, Wolfram Rozas

Abstract: Data sharing issues pervade online social and economic environments. To foster social progress, it is important to develop models of the interaction between data producers and consumers that can promote the rise of cooperation between the involved parties. We formalize this interaction as a game, the data sharing game, based on the Iterated Prisoner's Dilemma and deal with it through multi-agent reinforcement learning techniques. We consider several strategies for how the citizens may behave, depending on the degree of centralization sought. Simulations suggest mechanisms for cooperation to take place and, thus, achieve maximum social utility: data consumers should perform some kind of opponent modeling, or a regulator should transfer utility between both players and incentivise them.

Submitted 26 January, 2021; originally announced January 2021.

arXiv:2007.02613 [pdf, ps, other]

Adversarial Risk Analysis (Overview)

Authors: David Banks, Víctor Gallego, Roi Naveiro, David Ríos Insua

Abstract: Adversarial risk analysis (ARA) is a relatively new area of research that informs decision-making when facing intelligent opponents and uncertain outcomes. It enables an analyst to express her Bayesian beliefs about an opponent's utilities, capabilities, probabilities and the type of strategic calculation that the opponent is using. Within that framework, the analyst then solves the problem from the perspective of the opponent while placing subjective probability distributions on all unknown quantities. This produces a distribution over the actions of the opponent that permits the analyst to maximize her expected utility. This overview covers conceptual, modeling, computational and applied issues in ARA.

Submitted 6 July, 2020; originally announced July 2020.

arXiv:2004.08705 [pdf, other]

Protecting Classifiers From Attacks. A Bayesian Approach

Authors: Victor Gallego, Roi Naveiro, Alberto Redondo, David Rios Insua, Fabrizio Ruggeri

Abstract: Classification problems in security settings are usually modeled as confrontations in which an adversary tries to fool a classifier by manipulating the covariates of instances to obtain a benefit. Most approaches to such problems have focused on game-theoretic ideas with strong underlying common knowledge assumptions, which are not realistic in the security realm. We provide an alternative Bayesian framework that accounts for the lack of precise knowledge about the attacker's behavior using adversarial risk analysis. A key ingredient required by our framework is the ability to sample from the distribution of originating instances given the possibly attacked observed one. We propose a sampling procedure based on approximate Bayesian computation, in which we simulate the attacker's problem taking into account our uncertainty about his elements. For large scale problems, we propose an alternative, scalable approach that could be used when dealing with differentiable classifiers. Within it, we move the computational load to the training phase, simulating attacks from an adversary, adapting the framework to obtain a classifier robustified against attacks.

Submitted 18 April, 2020; originally announced April 2020.

arXiv:2003.03546 [pdf, other]

doi 10.1080/01621459.2023.2183129

Adversarial Machine Learning: Bayesian Perspectives

Authors: David Rios Insua, Roi Naveiro, Victor Gallego, Jason Poulos

Abstract: Adversarial Machine Learning (AML) is emerging as a major field aimed at protecting machine learning (ML) systems against security threats: in certain scenarios there may be adversaries that actively manipulate input data to fool learning systems. This creates a new class of security vulnerabilities that ML systems may face, and a new desirable property called adversarial robustness essential to trust operations based on ML outputs. Most work in AML is built upon a game-theoretic modelling of the conflict between a learning system and an adversary, ready to manipulate input data. This assumes that each agent knows their opponent's interests and uncertainty judgments, facilitating inferences based on Nash equilibria. However, such a common knowledge assumption is not realistic in the security scenarios typical of AML. After reviewing such game-theoretic approaches, we discuss the benefits that Bayesian perspectives provide when defending ML-based systems. We demonstrate how the Bayesian approach allows us to explicitly model our uncertainty about the opponent's beliefs and interests, relaxing unrealistic assumptions, and providing more robust inferences. We illustrate this approach in supervised learning settings, and identify relevant future research problems.

Submitted 22 February, 2024; v1 submitted 7 March, 2020; originally announced March 2020.

Journal ref: Journal of the American Statistical Association. Volume 118, 2023 - Issue 543

arXiv:1908.09744 [pdf, other]

Variationally Inferred Sampling Through a Refined Bound for Probabilistic Programs

Authors: Victor Gallego, David Rios Insua

Abstract: A framework to boost the efficiency of Bayesian inference in probabilistic programs is introduced by embedding a sampler inside a variational posterior approximation. We call it the refined variational approximation. Its strength lies both in ease of implementation and in automatic tuning of the sampler parameters to speed up mixing time using automatic differentiation. Several strategies to approximate evidence lower bound (ELBO) computation are introduced. Experimental evidence of its efficient performance is shown solving an influence diagram in a high-dimensional space using a conditional variational autoencoder (cVAE) as a deep Bayes classifier; an unconditional VAE on density estimation tasks; and state-space models for time-series data.

Submitted 22 February, 2020; v1 submitted 26 August, 2019; originally announced August 2019.

arXiv:1908.08773 [pdf, other]

Opponent Aware Reinforcement Learning

Authors: Victor Gallego, Roi Naveiro, David Rios Insua, David Gomez-Ullate Oteiza

Abstract: We introduce Threatened Markov Decision Processes (TMDPs) as an extension of the classical Markov Decision Process framework for Reinforcement Learning (RL). TMDPs allow supporting a decision maker against potential opponents in an RL context. We also propose a level-k thinking scheme resulting in a novel learning approach to deal with TMDPs. After introducing our framework and deriving theoretical results, relevant empirical evidence is given via extensive experiments, showing the benefits of accounting for adversaries in RL while the agent learns.

Submitted 26 August, 2019; v1 submitted 22 August, 2019; originally announced August 2019.

Comments: Substantially extends the previous work: https://www.aaai.org/ojs/index.php/AAAI/article/view/5106. This article draws heavily from arXiv:1809.01560

arXiv:1812.00071 [pdf, other]

Stochastic Gradient MCMC with Repulsive Forces

Authors: Victor Gallego, David Rios Insua

Abstract: We propose a unifying view of two different Bayesian inference algorithms, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) and Stein Variational Gradient Descent (SVGD), leading to improved and efficient novel sampling schemes. We show that SVGD combined with a noise term can be framed as a multiple chain SG-MCMC method. Instead of treating each parallel chain independently from others, our proposed algorithm implements a repulsive force between particles, avoiding collapse and facilitating a better exploration of the parameter space. We also show how the addition of this noise term is necessary to obtain a valid SG-MCMC sampler, a significant difference with SVGD. Experiments with both synthetic distributions and real datasets illustrate the benefits of the proposed scheme.

Submitted 22 February, 2020; v1 submitted 30 November, 2018; originally announced December 2018.

Comments: Extends the workshop version

arXiv:1809.01560 [pdf, other]

doi 10.1609/aaai.v33i01.33019939

Reinforcement Learning under Threats

Authors: Victor Gallego, Roi Naveiro, David Rios Insua

Abstract: In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward generating process. In this paper, we introduce Threatened Markov Decision Processes (TMDPs), which provide a framework to support a decision maker against a potential adversary in RL. Furthermore, we propose a level-$k$ thinking scheme resulting in a new learning framework to deal with TMDPs. After introducing our framework and deriving theoretical results, relevant empirical evidence is given via extensive experiments, showing the benefits of accounting for adversaries while the agent learns.

Submitted 30 July, 2019; v1 submitted 5 September, 2018; originally announced September 2018.

Comments: Extends the version published in the Proceedings of the AAAI Conference on Artificial Intelligence 33, https://www.aaai.org/ojs/index.php/AAAI/article/view/5106

arXiv:1801.03050 [pdf, other]

doi 10.1002/asmb.2460

Assessing the effect of advertising expenditures upon sales: a Bayesian structural time series model

Authors: Víctor Gallego, Pablo Suárez-García, Pablo Angulo, David Gómez-Ullate

Abstract: We propose a robust implementation of the Nerlove-Arrow model using a Bayesian structural time series model to explain the relationship between the advertising expenditures of a country-wide fast-food franchise network and its weekly sales. Thanks to the flexibility and modularity of the model, it is well suited to generalization to other markets or situations. Its Bayesian nature facilitates incorporating a priori information (the manager's views), which can be updated with relevant data. This aspect of the model will be used to present a strategy of budget scheduling across time and channels.

Submitted 29 May, 2019; v1 submitted 9 January, 2018; originally announced January 2018.

Comments: Published at Applied Stochastic Models in Business and Industry, https://onlinelibrary.wiley.com/doi/full/10.1002/asmb.2460

Journal ref: Appl Stochastic Models Bus Ind. 2019; 1-13

Showing 1–22 of 22 results for author: Gallego, V