
Showing 1–28 of 28 results for author: Pirotta, M

  1. arXiv:2503.09817  [pdf, other]

    cs.LG cs.AI stat.ML

    Temporal Difference Flows

    Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati

    Abstract: Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learne…

    Submitted 12 March, 2025; originally announced March 2025.

  2. arXiv:2212.09429  [pdf, ps, other]

    cs.LG stat.ML

    On the Complexity of Representation Learning in Contextual Linear Bandits

    Authors: Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

    Abstract: In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g.…

    Submitted 19 December, 2022; originally announced December 2022.

  3. arXiv:2210.09957  [pdf, other]

    cs.LG cs.AI cs.CY cs.IR stat.ML

    Contextual bandits with concave rewards, and an application to fair ranking

    Authors: Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric, Nicolas Usunier

    Abstract: We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restri…

    Submitted 28 February, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

  4. arXiv:2210.04946  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

    Authors: Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

    Abstract: We study the sample complexity of learning an $ε$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any al…

    Submitted 10 October, 2022; originally announced October 2022.

  5. arXiv:2112.06517  [pdf, other]

    cs.LG stat.ML

    Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations

    Authors: Evrard Garcelon, Vashist Avadhanula, Alessandro Lazaric, Matteo Pirotta

    Abstract: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy independent, and possibly biased, evaluations of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we…

    Submitted 12 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

  6. arXiv:2106.11692  [pdf, ps, other]

    cs.LG stat.ML

    A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning

    Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du

    Abstract: In this paper, we present a reduction-based framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, we improve the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP, through a…

    Submitted 16 March, 2022; v1 submitted 22 June, 2021; originally announced June 2021.

  7. arXiv:2012.14755  [pdf, other]

    cs.LG stat.ML

    Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $ε$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we intro…

    Submitted 29 December, 2020; originally announced December 2020.

    Comments: NeurIPS 2020

  8. arXiv:2010.12247  [pdf, other]

    cs.LG stat.ML

    An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

    Authors: Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric

    Abstract: In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow recent approaches of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds and we introduce a novel algorithm improving over the state-of-the-art along multiple…

    Submitted 20 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: To appear at NeurIPS 2020. V2: clarified dependencies in the worst-case regret bound

  9. arXiv:2007.06437  [pdf, other]

    cs.LG stat.ML

    A Provably Efficient Sample Collection Strategy for Reinforcement Learning

    Authors: Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

    Abstract: One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the explora…

    Submitted 18 November, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: NeurIPS 2021

  10. arXiv:2007.05456  [pdf, ps, other]

    cs.LG stat.ML

    Improved Analysis of UCRL2 with Empirical Bernstein Inequality

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $Γ\leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{DΓS A T})$.

    Submitted 10 July, 2020; originally announced July 2020.

    Comments: Document in support of the tutorial at ALT 2019

  11. arXiv:2007.05078  [pdf, other]

    cs.LG stat.ML

    A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which qu…

    Submitted 23 March, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Update following the publication in AISTATS 2021. Fixed typos and lemma about runtime

  12. arXiv:2005.02934  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

    Authors: Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer

    Abstract: We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a…

    Submitted 6 May, 2020; originally announced May 2020.

    Comments: 18 pages

    MSC Class: 68T99

  13. arXiv:2004.05599  [pdf, other]

    cs.LG stat.ML

    Kernel-Based Reinforcement Learning: A Finite-Time Analysis

    Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ epi…

    Submitted 23 March, 2022; v1 submitted 12 April, 2020; originally announced April 2020.

    Comments: Update following the publication in ICML 2021, including fixed typos

  14. arXiv:2003.03297  [pdf, other]

    stat.ML cs.LG

    Active Model Estimation in Markov Decision Processes

    Authors: Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric

    Abstract: We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first a…

    Submitted 22 June, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

  15. arXiv:2003.02189  [pdf, ps, other]

    cs.LG stat.ML

    Exploration-Exploitation in Constrained MDPs

    Authors: Yonathan Efroni, Shie Mannor, Matteo Pirotta

    Abstract: In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade-off exploration to discov…

    Submitted 4 March, 2020; originally announced March 2020.

  16. arXiv:2002.03839  [pdf, other]

    cs.LG stat.ML

    Adversarial Attacks on Linear Contextual Bandits

    Authors: Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta

    Abstract: Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a…

    Submitted 23 October, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

  17. arXiv:2002.03221  [pdf, other]

    cs.LG stat.ML

    Improved Algorithms for Conservative Exploration in Bandits

    Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

    Abstract: In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better…

    Submitted 8 February, 2020; originally announced February 2020.

  18. arXiv:2002.03218  [pdf, other]

    cs.LG stat.ML

    Conservative Exploration in Reinforcement Learning

    Authors: Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

    Abstract: While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world application…

    Submitted 15 July, 2020; v1 submitted 8 February, 2020; originally announced February 2020.

    Comments: AISTATS 2020

  19. arXiv:2001.11595  [pdf, ps, other]

    cs.LG stat.ML

    Concentration Inequalities for Multinoulli Random Variables

    Authors: Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We investigate concentration inequalities for Dirichlet and Multinomial random variables.

    Submitted 30 January, 2020; originally announced January 2020.

    Comments: Tutorial at ALT'19 on Regret Minimization in Infinite-Horizon Finite Markov Decision Processes

  20. arXiv:1912.03517  [pdf, other]

    stat.ML cs.LG

    No-Regret Exploration in Goal-Oriented Reinforcement Learning

    Authors: Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric

    Abstract: Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP pro…

    Submitted 17 August, 2020; v1 submitted 7 December, 2019; originally announced December 2019.

    Journal ref: International Conference on Machine Learning (ICML 2020)

  21. arXiv:1911.00567  [pdf, ps, other]

    cs.LG stat.ML

    Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

    Authors: Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, Alessandro Lazaric

    Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are unfeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where…

    Submitted 8 September, 2023; v1 submitted 1 November, 2019; originally announced November 2019.

    Comments: Minor bug fixes

  22. arXiv:1905.03231  [pdf, other]

    cs.LG stat.ML

    Smoothing Policies and Safe Policy Gradients

    Authors: Matteo Papini, Matteo Pirotta, Marcello Restelli

    Abstract: Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address…

    Submitted 17 June, 2022; v1 submitted 8 May, 2019; originally announced May 2019.

  23. arXiv:1812.04363  [pdf, ps, other]

    cs.LG stat.ML

    Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes

    Authors: Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound C on the span of the optimal bias function is known. For an MDP with $S$ sta…

    Submitted 11 December, 2018; originally announced December 2018.

  24. arXiv:1807.02373  [pdf, other]

    cs.LG stat.ML

    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

    Abstract: While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient explo…

    Submitted 20 March, 2019; v1 submitted 6 July, 2018; originally announced July 2018.

  25. arXiv:1806.05618  [pdf, other]

    cs.LG stat.ML

    Stochastic Variance-Reduced Policy Gradient

    Authors: Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, Marcello Restelli

    Abstract: In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-con…

    Submitted 14 June, 2018; originally announced June 2018.

    Journal ref: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018

  26. arXiv:1805.10886  [pdf, other]

    cs.LG stat.ML

    Importance Weighted Transfer of Samples in Reinforcement Learning

    Authors: Andrea Tirinzoni, Andrea Sessa, Matteo Pirotta, Marcello Restelli

    Abstract: We consider the transfer of experience samples (i.e., tuples < s, a, s', r >) in reinforcement learning (RL), collected from a set of source tasks to improve the learning process in a given target task. Most of the related approaches focus on selecting the most relevant source samples for solving the target task, but then all the transferred samples are used without considering anymore the discrep…

    Submitted 28 May, 2018; originally announced May 2018.

    Comments: Accepted at ICML 2018

  27. arXiv:1802.04020  [pdf, other]

    cs.LG stat.ML

    Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

    Authors: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner

    Abstract: We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $Γ\leq S$ possible next states, we prove a regret bound of $\widetilde{O}(c\sqrt{ΓSAT})$, which significantly improves over…

    Submitted 6 July, 2018; v1 submitted 12 February, 2018; originally announced February 2018.

  28. arXiv:1712.03428  [pdf, other]

    cs.LG stat.ML

    Cost-Sensitive Approach to Batch Size Adaptation for Gradient Descent

    Authors: Matteo Pirotta, Marcello Restelli

    Abstract: In this paper, we propose a novel approach to automatically determine the batch size in stochastic gradient descent methods. The choice of the batch size induces a trade-off between the accuracy of the gradient estimate and the cost in terms of samples of each update. We propose to determine the batch size by optimizing the ratio between a lower bound to a linear or quadratic Taylor approximation…

    Submitted 9 December, 2017; originally announced December 2017.

    Comments: Presented at the NIPS workshop on Optimizing the Optimizers. Barcelona, Spain, 2016