Showing 1–13 of 13 results for author: Cassel, A

  1. arXiv:2409.08570  [pdf, other]

    cs.LG stat.ML

    Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

    Authors: Asaf Cassel, Orin Levy, Yishay Mansour

    Abstract: Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic…

    Submitted 13 September, 2024; originally announced September 2024.

  2. arXiv:2407.03065  [pdf, ps, other]

    cs.LG stat.ML

    Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

    Authors: Asaf Cassel, Aviv Rosenberg

    Abstract: Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates…

    Submitted 3 July, 2024; originally announced July 2024.

  3. arXiv:2405.14655  [pdf, other]

    cs.LG

    Multi-turn Reinforcement Learning from Preference Human Feedback

    Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

    Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to ach…

    Submitted 2 December, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  4. arXiv:2405.07637  [pdf, ps, other]

    cs.LG

    Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

    Authors: Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov

    Abstract: In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work…

    Submitted 14 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  5. arXiv:2303.01464  [pdf, ps, other]

    cs.LG

    Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

    Authors: Orin Levy, Alon Cohen, Asaf Cassel, Yishay Mansour

    Abstract: We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an…

    Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

  6. arXiv:2211.14932  [pdf, ps, other]

    cs.LG

    Eluder-based Regret for Stochastic Contextual MDPs

    Authors: Orin Levy, Asaf Cassel, Alon Cohen, Yishay Mansour

    Abstract: We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to \emph{offline} least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of…

    Submitted 29 May, 2024; v1 submitted 27 November, 2022; originally announced November 2022.

  7. arXiv:2206.01426  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Rate-Optimal Online Convex Optimization in Adaptive Linear Control

    Authors: Asaf Cassel, Alon Cohen, Tomer Koren

    Abstract: We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal $\smash{\sqrt{T}}$-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs su…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: text overlap with arXiv:2203.01170

  8. arXiv:2203.01170  [pdf, ps, other]

    math.OC cs.LG stat.ML

    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics

    Authors: Asaf Cassel, Alon Cohen, Tomer Koren

    Abstract: We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate compared to the best stabilizing linear controller in hindsight. In contrast to previous work, our algorithm is based on the Optimism in the Fac…

    Submitted 22 June, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

  9. arXiv:2102.12608  [pdf, ps, other]

    cs.LG stat.ML

    Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret

    Authors: Asaf Cassel, Tomer Koren

    Abstract: We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon T. We presen…

    Submitted 24 February, 2021; originally announced February 2021.

  10. arXiv:2007.13232  [pdf, other]

    math.PR cs.DM math.CO math.OC physics.data-an

    The Pendulum Arrangement: Maximizing the Escape Time of Heterogeneous Random Walks

    Authors: Asaf Cassel, Shie Mannor, Guy Tennenholtz

    Abstract: We identify a fundamental phenomenon of heterogeneous one dimensional random walks: the escape (traversal) time is maximized when the heterogeneity in transition probabilities forms a pyramid-like potential barrier. This barrier corresponds to a distinct arrangement of transition probabilities, sometimes referred to as the pendulum arrangement. We reduce this problem to a sum over products, combin…

    Submitted 28 July, 2020; v1 submitted 26 July, 2020; originally announced July 2020.

    Comments: Names ordered alphabetically

  11. arXiv:2007.00759  [pdf, ps, other]

    cs.LG stat.ML

    Bandit Linear Control

    Authors: Asaf Cassel, Tomer Koren

    Abstract: We consider the problem of controlling a known linear dynamical system under stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the full feedback setting where the entire cost function is revealed after each decision, here only the cost incurred by the learner is observed. We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret tha…

    Submitted 1 July, 2020; originally announced July 2020.

  12. arXiv:2002.08095  [pdf, ps, other]

    cs.LG stat.ML

    Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently

    Authors: Asaf Cassel, Alon Cohen, Tomer Koren

    Abstract: We consider the problem of learning in Linear Quadratic Control systems whose transition parameters are initially unknown. Recent results in this setting have demonstrated efficient learning algorithms with regret growing with the square root of the number of decision steps. We present new efficient algorithms that achieve, perhaps surprisingly, regret that scales only (poly)logarithmically with t…

    Submitted 1 July, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: Accepted for presentation at International Conference on Machine Learning (ICML) 2020

  13. arXiv:1806.01380  [pdf, ps, other]

    stat.ML cs.LG

    A General Framework for Bandit Problems Beyond Cumulative Objectives

    Authors: Asaf Cassel, Shie Mannor, Assaf Zeevi

    Abstract: The stochastic multi-armed bandit (MAB) problem is a common model for sequential decision problems. In the standard setup, a decision maker has to choose at every instant between several competing arms, each of them provides a scalar random variable, referred to as a "reward." Nearly all research on this topic considers the total cumulative reward as the criterion of interest. This work focuses on…

    Submitted 26 October, 2021; v1 submitted 4 June, 2018; originally announced June 2018.

    Comments: Preliminary version accepted for presentation at Conference on Learning Theory (COLT) 2018