
Showing 1–50 of 65 results for author: Valko, M

  1. arXiv:2505.19731

    stat.ML cs.LG

    Accelerating Nash Learning from Human Feedback via Mirror Prox

    Authors: Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard

    Abstract: Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a gam…

    Submitted 26 May, 2025; originally announced May 2025.

  2. arXiv:2410.17055

    cs.LG stat.ML

    Optimal Design for Reward Modeling in RLHF

    Authors: Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre Ménard, Eric Moulines, Michal Valko

    Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align a LM with it. Howe…

    Submitted 23 October, 2024; v1 submitted 22 October, 2024; originally announced October 2024.

  3. arXiv:2403.08635

    cs.LG cs.AI stat.ML

    Human Alignment of Large Language Models through Online Preference Optimisation

    Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

    Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contributio…

    Submitted 13 March, 2024; originally announced March 2024.

  4. arXiv:2312.00886

    stat.ML cs.AI cs.GT cs.LG cs.MA

    Nash Learning from Human Feedback

    Authors: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

    Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to…

    Submitted 11 June, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

  5. arXiv:2310.18186

    stat.ML cs.LG

    Model-free Posterior Sampling via Learning Rate Randomization

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

    Abstract: In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieve…

    Submitted 7 July, 2025; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: This revision fixed an error connected to an incorrect use of Proposition 7 inside of Lemma 4, and a misprint in Lemma 12. In the current version, we modified the martingale construction and applied the same argument as before; no results need to be modified as a result of these fixes

    Journal ref: Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

  6. arXiv:2310.17303

    stat.ML cs.LG

    Demonstration-Regularized RL

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

    Abstract: Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavio…

    Submitted 10 June, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: This revision fixes an error due to use of some incorrect results (Lemma 32, Corollary 11 by Talebi & Maillard, 2018) in the proof of Theorem 8. The condition for the RLHF results have slightly changed

  7. arXiv:2310.12036

    cs.AI cs.LG stat.ML

    A General Theoretical Paradigm to Understand Learning from Human Preferences

    Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

    Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direc…

    Submitted 21 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  8. arXiv:2309.00656

    cs.GT cs.LG stat.ML

    Local and adaptive mirror descents in extensive-form games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer e…

    Submitted 1 September, 2023; originally announced September 2023.

  9. arXiv:2305.01521

    cs.LG stat.ML

    Unlocking the Power of Representations in Long-term Novelty-based Exploration

    Authors: Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell, Pablo Sprechmann, Leopoldo Sarra, Oliver Groth, Michal Valko, Bilal Piot

    Abstract: We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of e…

    Submitted 2 May, 2023; originally announced May 2023.

  10. arXiv:2304.03056

    math.PR math.ST stat.ML

    Sharp Deviations Bounds for Dirichlet Weighted Sums with Application to analysis of Bayesian algorithms

    Authors: Denis Belomestny, Pierre Menard, Alexey Naumov, Daniil Tiapkin, Michal Valko

    Abstract: In this work, we derive sharp non-asymptotic deviation bounds for weighted sums of Dirichlet random variables. These bounds are based on a novel integral representation of the density of a weighted Dirichlet sum. This representation allows us to obtain a Gaussian-like approximation for the sum distribution using geometry and complex analysis methods. Our results generalize similar bounds for the B…

    Submitted 6 April, 2023; originally announced April 2023.

  11. arXiv:2303.08059

    stat.ML cs.LG

    Fast Rates for Maximum Entropy Exploration

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Menard

    Abstract: We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose a g…

    Submitted 6 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: ICML-2023

  12. arXiv:2212.12567

    stat.ML cs.LG

    Adapting to game trees in zero-sum imperfect information games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with hi…

    Submitted 15 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

  13. arXiv:2211.10515

    stat.ML cs.LG

    Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments

    Authors: Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko

    Abstract: Consider the problem of exploration in sparse-reward or reward-free environments, such as in Montezuma's Revenge. In the curiosity-driven paradigm, the agent is rewarded for how much each realized outcome differs from their predicted outcome. But using predictive error as intrinsic motivation is fragile in stochastic environments, as the agent may become trapped by high-entropy areas of the state-…

    Submitted 14 July, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Journal ref: In Proc. 40th International Conference on Machine Learning (ICML 2023)

  14. arXiv:2209.14414

    stat.ML cs.LG

    Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

    Abstract: We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of poste…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07704

  15. arXiv:2206.08332

    cs.LG cs.AI stat.ML

    BYOL-Explore: Exploration by Bootstrapped Prediction

    Authors: Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, Bilal Piot

    Abstract: We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challeng…

    Submitted 16 June, 2022; originally announced June 2022.

  16. arXiv:2205.14211

    cs.LG cs.AI stat.ML

    KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

    Authors: Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári

    Abstract: In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for fi…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 29 pages, 6 figures

  17. arXiv:2205.07704

    stat.ML cs.LG

    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

    Authors: Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard

    Abstract: We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order…

    Submitted 22 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

  18. arXiv:2201.12909

    stat.ML cs.LG

    Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times

    Authors: Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco

    Abstract: Computing a Gaussian process (GP) posterior has a computational cost cubical in the number of historical points. A reformulation of the same GP posterior highlights that this complexity mainly depends on how many \emph{unique} historical points are considered. This can have important implication in active learning settings, where the set of historical points is constructed sequentially by the lear…

    Submitted 30 January, 2022; originally announced January 2022.

  19. arXiv:2106.06279

    stat.ML cs.LG

    Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

    Authors: Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko

    Abstract: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamic of the IIG is not known -- we can only access it by sampling or interacting with a g…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: 20 pages

  20. arXiv:2103.01312

    stat.ML cs.LG

    UCB Momentum Q-learning: Correcting the bias without forgetting

    Authors: Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, Michal Valko

    Abstract: We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision process. UCBMQ is based on Q-learning where we add a momentum term and rely on the principle of optimism in face of uncertainty to deal with exploration. Our new technical ingredient of UCBMQ is the use of momentum to correct the…

    Submitted 18 March, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

  21. arXiv:2103.00107

    cs.LG cs.AI stat.ML

    Revisiting Peng's Q($λ$) for Modern Reinforcement Learning

    Authors: Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel

    Abstract: Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonethel…

    Submitted 26 February, 2021; originally announced March 2021.

    Comments: 26 pages, 7 figures, 2 tables

  22. arXiv:2102.06514

    cs.LG cs.SI stat.ML

    Large-Scale Representation Learning on Graphs via Bootstrapping

    Authors: Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko

    Abstract: Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootst…

    Submitted 20 February, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

    Comments: Published as a conference paper at ICLR 2022

  23. arXiv:2012.14755

    cs.LG stat.ML

    Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $ε$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we intro…

    Submitted 29 December, 2020; originally announced December 2020.

    Comments: NeurIPS 2020

  24. arXiv:2010.10241

    stat.ML cs.CV cs.LG

    BYOL works even without batch statistics

    Authors: Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, Michal Valko

    Abstract: Bootstrap Your Own Latent (BYOL) is a self-supervised learning approach for image representation. From an augmented view of an image, BYOL trains an online network to predict a target network representation of a different augmented view of the same image. Unlike contrastive methods, BYOL does not explicitly use a repulsion term built from negative pairs in its training objective. Yet, it avoids co…

    Submitted 20 October, 2020; originally announced October 2020.

  25. arXiv:2010.03531

    cs.LG stat.ML

    Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

    Authors: Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko

    Abstract: In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $Ω((H^3SA/ε^2)\log(1/δ))$ on the sample complexity of an $(\varepsilon,δ)$-PAC algorithm for best poli…

    Submitted 7 October, 2020; originally announced October 2020.

  26. arXiv:2007.13442

    cs.LG stat.ML

    Fast active learning for pure exploration in reinforcement learning

    Authors: Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko

    Abstract: Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort on exploring efficiently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one sid…

    Submitted 10 October, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

  27. arXiv:2007.12509

    cs.LG stat.ML

    Monte-Carlo Tree Search as Regularized Policy Optimization

    Authors: Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos

    Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approxima…

    Submitted 24 July, 2020; originally announced July 2020.

    Comments: Accepted to International Conference on Machine Learning (ICML), 2020

  28. arXiv:2007.06437

    cs.LG stat.ML

    A Provably Efficient Sample Collection Strategy for Reinforcement Learning

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the explora…

    Submitted 18 November, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: NeurIPS 2021

  29. arXiv:2007.05078

    cs.LG stat.ML

    A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which qu…

    Submitted 23 March, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Update following the publication in AISTATS 2021. Fixed typos and lemma about runtime

  30. arXiv:2007.00953

    stat.ML cs.LG

    Gamification of Pure Exploration for Linear Bandits

    Authors: Rémy Degenne, Pierre Ménard, Xuedong Shang, Michal Valko

    Abstract: We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-arm bandits, the existence of such algorithms for the best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: 11+25 pages. To be published in the proceedings of ICML 2020

  31. arXiv:2006.16947

    cs.LG cs.DS stat.ML

    Sampling from a $k$-DPP without looking at all items

    Authors: Daniele Calandriello, Michał Dereziński, Michal Valko

    Abstract: Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more. Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by…

    Submitted 30 June, 2020; originally announced June 2020.

  32. arXiv:2006.10459

    stat.ML cs.LG

    Stochastic bandits with arm-dependent delays

    Authors: Anne Gael Manegueu, Claire Vernade, Alexandra Carpentier, Michal Valko

    Abstract: Significant work has been recently dedicated to the stochastic delayed bandit setting because of its relevance in applications. The applicability of existing algorithms is however restricted by the fact that strong assumptions are often made on the delay distributions, such as full observability, restrictive shape constraints, or uniformity over arms. In this work, we weaken them significantly and…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: 19 Pages, 4 figures

    MSC Class: 62L10

  33. arXiv:2006.07733

    cs.LG cs.CV stat.ML

    Bootstrap your own latent: A new approach to self-supervised Learning

    Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

    Abstract: We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the…

    Submitted 10 September, 2020; v1 submitted 13 June, 2020; originally announced June 2020.

  34. arXiv:2006.06613

    stat.ML cs.LG

    Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits

    Authors: Pierre Perrault, Etienne Boursier, Vianney Perchet, Michal Valko

    Abstract: We investigate stochastic combinatorial multi-armed bandit with semi-bandit feedback (CMAB). In CMAB, the question of the existence of an efficient policy with an optimal asymptotic regret (up to a factor poly-logarithmic with the action size) is still open for many families of distributions, including mutually independent outcomes, and more generally the multivariate sub-Gaussian family. We propo…

    Submitted 3 January, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: accepted to NeurIPS 2020

  35. arXiv:2006.06294

    cs.LG stat.ML

    Adaptive Reward-Free Exploration

    Authors: Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

    Abstract: Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be…

    Submitted 7 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

  36. arXiv:2006.05879

    cs.LG stat.ML

    Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

    Authors: Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, Michal Valko

    Abstract: We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optima…

    Submitted 10 June, 2020; originally announced June 2020.

  37. arXiv:2004.06248

    cs.LG stat.ML

    Improved Sleeping Bandits with Stochastic Actions Sets and Adversarial Rewards

    Authors: Aadirupa Saha, Pierre Gaillard, Michal Valko

    Abstract: In this paper, we consider the problem of sleeping bandits with stochastic action sets and adversarial rewards. In this setting, in contrast to most work in bandits, the actions may not be available at all times. For instance, some products might be out of stock in item recommendation. The best existing efficient (i.e., polynomial-time) algorithms for this problem only guarantee an $O(T^{2/3})$ up…

    Submitted 8 August, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

    Comments: Accepted to ICML 2020

  38. arXiv:2004.05599

    cs.LG stat.ML

    Kernel-Based Reinforcement Learning: A Finite-Time Analysis

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ epi…

    Submitted 23 March, 2022; v1 submitted 12 April, 2020; originally announced April 2020.

    Comments: Update following the publication in ICML 2021, including fixed typos

  39. arXiv:2003.06259

    cs.LG stat.ML

    Taylor Expansion Policy Optimization

    Authors: Yunhao Tang, Michal Valko, Rémi Munos

    Abstract: In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifica…

    Submitted 13 March, 2020; originally announced March 2020.

  40. Fast sampling from $β$-ensembles

    Authors: Guillaume Gautier, Rémi Bardenet, Michal Valko

    Abstract: We study sampling algorithms for $β$-ensembles with time complexity less than cubic in the cardinality of the ensemble. Following Dumitriu & Edelman (2002), we see the ensemble as the eigenvalues of a random tridiagonal matrix, namely a random Jacobi matrix. First, we provide a unifying and elementary treatment of the tridiagonal models associated to the three classical Hermite, Laguerre and Jacob…

    Submitted 4 March, 2020; originally announced March 2020.

    Comments: 37 pages, 8 figures, code at https://github.com/guilgautier/DPPy

    MSC Class: 60K35 (Primary) 65C40; 60B20; 33C45 (Secondary)

    Journal ref: Stat. Comput. 31 (2021) 7

  41. arXiv:2002.09954  [pdf, other]

    stat.ML cs.LG

    Near-linear Time Gaussian Process Optimization with Adaptive Batching and Resparsification

    Authors: Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco

    Abstract: Gaussian processes (GP) are one of the most successful frameworks to model uncertainty. However, GP optimization (e.g., GP-UCB) suffers from major scalability issues. Experimental time grows linearly with the number of evaluations, unless candidates are selected in batches (e.g., using GP-BUCB) and evaluated in parallel. Furthermore, computational cost is often prohibitive since algorithms such as…

    Submitted 26 February, 2020; v1 submitted 23 February, 2020; originally announced February 2020.

  42. arXiv:1912.03517  [pdf, other]

    stat.ML cs.LG

    No-Regret Exploration in Goal-Oriented Reinforcement Learning

    Authors: Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric

    Abstract: Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP pro…

    Submitted 17 August, 2020; v1 submitted 7 December, 2019; originally announced December 2019.

    Journal ref: International Conference on Machine Learning (ICML 2020)

  43. arXiv:1910.10945  [pdf, other]

    cs.LG stat.ML

    Fixed-Confidence Guarantees for Bayesian Best-Arm Identification

    Authors: Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko

    Abstract: We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T…

    Submitted 28 October, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

  44. arXiv:1910.04034  [pdf, ps, other]

    cs.LG stat.ML

    Derivative-Free & Order-Robust Optimisation

    Authors: Victor Gabillon, Rasul Tutunov, Michal Valko, Haitham Bou Ammar

    Abstract: In this paper, we formalise order-robust optimisation as an instance of online learning minimising simple regret, and propose Vroom, a zeroth-order optimisation algorithm capable of achieving vanishing regret in non-stationary environments, while recovering favorable rates under stochastic reward-generating processes. Our results are the first to target simple regret definitions in adversarial sc…

    Submitted 22 October, 2019; v1 submitted 9 October, 2019; originally announced October 2019.

  45. arXiv:1906.08509  [pdf, other]

    stat.ML cs.LG math.OC

    Online A-Optimal Design and Active Linear Regression

    Authors: Xavier Fontaine, Pierre Perrault, Michal Valko, Vianney Perchet

    Abstract: We consider in this paper the problem of optimal experiment design where a decision maker can choose which points to sample to obtain an estimate $\hat{β}$ of the hidden parameter $β^{\star}$ of an underlying linear model. The key challenge of this work lies in the heteroscedasticity assumption that we make, meaning that each covariate has a different and unknown variance. The goal of the decision m…

    Submitted 30 December, 2020; v1 submitted 20 June, 2019; originally announced June 2019.

    Comments: 29 pages, 5 figures

  46. arXiv:1905.13476  [pdf, other]

    cs.LG stat.ML

    Exact sampling of determinantal point processes with sublinear time preprocessing

    Authors: Michał Dereziński, Daniele Calandriello, Michal Valko

    Abstract: We study the complexity of sampling from a distribution over all index subsets of the set $\{1,...,n\}$ with the probability of a subset $S$ proportional to the determinant of the submatrix $\mathbf{L}_S$ of some $n\times n$ p.s.d. matrix $\mathbf{L}$, where $\mathbf{L}_S$ corresponds to the entries of $\mathbf{L}$ indexed by $S$. Known as a determinantal point process, this distribution is used i…

    Submitted 8 July, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

  47. arXiv:1903.05594  [pdf, other]

    stat.ML cs.LG

    Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret

    Authors: Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco

    Abstract: Gaussian processes (GP) are a well-studied Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations $t$. Given a set of $A$ alternatives to choose from, the overall runti…

    Submitted 27 August, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

    Comments: Accepted at COLT 2019. Corrected typos and improved comparison with existing methods

    Journal ref: Proceedings of Machine Learning Research, vol. 99 (COLT 2019)

  48. arXiv:1902.03794  [pdf, other]

    stat.ML cs.LG

    Exploiting Structure of Uncertainty for Efficient Matroid Semi-Bandits

    Authors: Pierre Perrault, Vianney Perchet, Michal Valko

    Abstract: We improve the efficiency of algorithms for stochastic \emph{combinatorial semi-bandits}. In most interesting problems, state-of-the-art algorithms take advantage of structural properties of rewards, such as \emph{independence}. However, while being optimal in terms of asymptotic regret, these algorithms are inefficient. In our paper, we first reduce their implementation to a specific \emph{submod…

    Submitted 20 June, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

    Comments: Accepted to ICML 2019, Long Beach

  49. arXiv:1901.04884  [pdf, other]

    stat.ML cs.LG stat.AP stat.CO

    Optimistic optimization of a Brownian

    Authors: Jean-Bastien Grill, Michal Valko, Rémi Munos

    Abstract: We address the problem of optimizing a Brownian motion. We consider a (random) realization $W$ of a Brownian motion with input space in $[0,1]$. Given $W$, our goal is to return an $ε$-approximation of its maximum using the smallest possible number of function evaluations, the sample complexity of the algorithm. We provide an algorithm with sample complexity of order $\log^2(1/ε)$. This improves o…

    Submitted 15 January, 2019; originally announced January 2019.

    Comments: 10 pages, 2 figures

    Journal ref: Neural Information Processing Systems (NeurIPS 2018)

  50. arXiv:1811.11043  [pdf, other]

    stat.ML cs.LG

    Rotting bandits are not harder than stochastic ones

    Authors: Julien Seznec, Andrea Locatelli, Alexandra Carpentier, Alessandro Lazaric, Michal Valko

    Abstract: In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering…

    Submitted 9 May, 2020; v1 submitted 27 November, 2018; originally announced November 2018.

    Journal ref: International Conference on Artificial Intelligence and Statistics (AISTATS 2019)