Showing 1–50 of 223 results for author: Mannor, S

  1. arXiv:2506.07085  [pdf, ps, other]

    cs.LG stat.ML

    State Entropy Regularization for Robust Reinforcement Learning

    Authors: Yonatan Ashlag, Uri Koren, Mirco Mutti, Esther Derman, Pierre-Luc Bacon, Shie Mannor

    Abstract: State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by stand…

    Submitted 29 June, 2025; v1 submitted 8 June, 2025; originally announced June 2025.

  2. arXiv:2506.07054  [pdf, ps, other]

    cs.LG cs.AI

    Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead

    Authors: Uri Koren, Navdeep Kumar, Uri Gadot, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

    Abstract: Classical policy gradient (PG) methods in reinforcement learning frequently converge to suboptimal local optima, a challenge exacerbated in large or complex environments. This work investigates Policy Gradient with Tree Search (PGTS), an approach that integrates an $m$-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree se…

    Submitted 8 June, 2025; originally announced June 2025.

  3. arXiv:2505.21478  [pdf, ps, other]

    cs.CV cs.AI

    Policy Optimized Text-to-Image Pipeline Design

    Authors: Uri Gadot, Rinon Gal, Yftah Ziser, Gal Chechik, Shie Mannor

    Abstract: Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language…

    Submitted 27 May, 2025; originally announced May 2025.

  4. arXiv:2505.19061  [pdf, other]

    cs.LG cs.MA stat.ML

    Adversarial Bandit over Bandits: Hierarchical Bandits for Online Configuration Management

    Authors: Chen Avin, Zvi Lotker, Shie Mannor, Gil Shabat, Hanan Shteingart, Roey Yadgar

    Abstract: Motivated by dynamic parameter optimization in finite, but large action (configurations) spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB, a hierarchical Adversarial Bandit over Bandits algorithm that can use state-of-the-art existing "flat" algorithms, but additionally clusters similar config…

    Submitted 25 May, 2025; originally announced May 2025.

  5. arXiv:2505.18269  [pdf, ps, other]

    cs.LG math.OC math.PR stat.ML

    Representative Action Selection for Large Action-Space Meta-Bandits

    Authors: Quan Zhou, Mark Kozdoba, Shie Mannor

    Abstract: We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. We assume that similar actions tend to have related payoffs, modeled by a Gaussian process. To exploit this structure, we propose a simple epsilon-net algorithm to select a representative subset. We provide t…

    Submitted 23 May, 2025; originally announced May 2025.

  6. arXiv:2504.06126  [pdf]

    cs.LG cs.NE

    Accelerating Vehicle Routing via AI-Initialized Genetic Algorithms

    Authors: Ido Greenberg, Piotr Sielski, Hugo Linsenmaier, Rajesh Gandham, Shie Mannor, Alex Fender, Gal Chechik, Eli Meirom

    Abstract: Vehicle Routing Problems (VRP) are an extension of the Traveling Salesperson Problem and are a fundamental NP-hard challenge in combinatorial optimization. Solving VRP in real-time at large scale has become critical in numerous applications, from growing markets like last-mile delivery to emerging use-cases like interactive logistics planning. Such applications involve solving similar problem inst…

    Submitted 8 April, 2025; originally announced April 2025.

  7. arXiv:2504.04505  [pdf, other]

    cs.LG

    A Classification View on Meta Learning Bandits

    Authors: Mirco Mutti, Jeongyeol Kwon, Shie Mannor, Aviv Tamar

    Abstract: Contextual multi-armed bandits are a popular choice to model sequential decision-making. For example, in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be fast, since the patient's health is at stake, and easy to interpret for a phy…

    Submitted 6 April, 2025; originally announced April 2025.

  8. arXiv:2503.22886  [pdf, other]

    cs.LG cs.RO

    Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

    Authors: Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler

    Abstract: Recent advancements in imitation learning have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. While excelling at zero-shot generation of robust behaviors, BFMs often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. We introduce "Task Tokens", a method to effectively tailor…

    Submitted 28 March, 2025; originally announced March 2025.

  9. arXiv:2502.11537  [pdf, other]

    cs.LG cs.AI

    Uncovering Untapped Potential in Sample-Efficient World Model Agents

    Authors: Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor

    Abstract: World model (WM) agents enable sample-efficient reinforcement learning by learning policies entirely from simulated experience. However, existing token-based world models (TBWMs) are limited to visual inputs and discrete actions, restricting their adoption and applicability. Moreover, although both intrinsic motivation and prioritized WM replay have shown promise in improving WM performance and ge…

    Submitted 20 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  10. arXiv:2502.09432  [pdf, other]

    cs.AI cs.LG

    Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes

    Authors: Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

    Abstract: We study robust Markov decision processes (RMDPs) with non-rectangular uncertainty sets, which capture interdependencies across states unlike traditional rectangular models. While non-rectangular robust policy evaluation is generally NP-hard, even in approximation, we identify a powerful class of $L_p$-bounded uncertainty sets that avoid these complexity barriers due to their structural simplicity…

    Submitted 13 February, 2025; originally announced February 2025.

  11. arXiv:2502.01876  [pdf, ps, other]

    cs.LG

    Reinforcement Learning with Segment Feedback

    Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant

    Abstract: Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In th…

    Submitted 17 June, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

  12. arXiv:2501.12216  [pdf, other]

    cs.LG cs.CV eess.IV

    RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression

    Authors: Uri Gadot, Assaf Shocher, Shie Mannor, Gal Chechik, Assaf Hallak

    Abstract: Video encoders optimize compression for human perception by minimizing reconstruction error under bit-rate constraints. In many modern applications such as autonomous driving, an overwhelming majority of videos serve as input for AI systems performing tasks like object recognition or segmentation, rather than being watched by humans. It is therefore useful to optimize the encoder for a downstream…

    Submitted 25 March, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

  13. arXiv:2411.02158  [pdf, other]

    cs.LG cs.AI cs.RO eess.SY

    Learning Multiple Initial Solutions to Optimization Problems

    Authors: Elad Sharony, Heng Yang, Tong Che, Marco Pavone, Shie Mannor, Peter Karkus

    Abstract: Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propo…

    Submitted 3 February, 2025; v1 submitted 4 November, 2024; originally announced November 2024.

    Comments: Under Review

  14. arXiv:2410.19471  [pdf, other]

    cs.LG cs.AI

    Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

    Authors: Ryan Park, Darren J. Hsu, C. Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, Bruno Trentini

    Abstract: Inverse folding models play an important role in structure-based design by predicting amino acid sequences that fold into desired reference structures. Models like ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. However, when applied to peptides, these models are prone to generating repetitive sequences that do not fol…

    Submitted 25 October, 2024; originally announced October 2024.

    Comments: Preprint. 10 pages plus appendices

  15. arXiv:2410.08868  [pdf, ps, other]

    cs.LG stat.ML

    On the Convergence of Single-Timescale Actor-Critic

    Authors: Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

    Abstract: We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for the infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $ε$-close \textbf{…

    Submitted 4 June, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: Updated version, 27 pages

  16. arXiv:2409.17643  [pdf, other]

    stat.ML cs.LG

    Efficient Fairness-Performance Pareto Front Computation

    Authors: Mark Kozdoba, Binyamin Perets, Shie Mannor

    Abstract: There is a well-known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. Due to the complexity of optimisation algorithms in most modern representation learning approaches, for a given method it may be non-trivial to decide whether the obtained fairness-performance curve of the method is optimal, i.e., whether it is clos…

    Submitted 26 September, 2024; originally announced September 2024.

  17. arXiv:2408.11876  [pdf]

    q-bio.QM cs.AI cs.LG

    From Glucose Patterns to Health Outcomes: A Generalizable Foundation Model for Continuous Glucose Monitor Data Analysis

    Authors: Guy Lutsker, Gal Sapir, Smadar Shilo, Jordi Merino, Anastasia Godneva, Jerry R Greenfield, Dorit Samocha-Bonet, Raja Dhir, Francisco Gude, Shie Mannor, Eli Meirom, Gal Chechik, Hagai Rossman, Eran Segal

    Abstract: Recent advances in SSL-enabled novel medical AI models, known as foundation models, offer great potential for better characterizing health from diverse biomedical data. CGM provides rich, temporal data on glycemic patterns, but its full potential for predicting broader health outcomes remains underutilized. Here, we present GluFormer, a generative foundation model for CGM data that learns nuanced…

    Submitted 7 January, 2025; v1 submitted 20 August, 2024; originally announced August 2024.

  18. arXiv:2406.18237  [pdf, other]

    cs.AI cs.GR cs.RO

    PlaMo: Plan and Move in Rich 3D Physical Environments

    Authors: Assaf Hallak, Gal Dalal, Chen Tessler, Kelly Guo, Shie Mannor, Gal Chechik

    Abstract: Controlling humanoids in complex physically simulated worlds is a long-standing challenge with numerous applications in gaming, simulation, and visual content creation. In our setup, given a rich and complex 3D scene, the user provides a list of instructions composed of target locations and locomotion types. To solve this task we present PlaMo, a scene-aware path planner and a robust physics-based…

    Submitted 26 June, 2024; originally announced June 2024.

  19. arXiv:2406.01389  [pdf, other]

    cs.LG cs.AI eess.SY

    RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

    Authors: Jeongyeol Kwon, Shie Mannor, Constantine Caramanis, Yonathan Efroni

    Abstract: In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMD…

    Submitted 26 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Fixed typos + alpha

  20. arXiv:2405.16581  [pdf, other]

    cs.LG

    On Bits and Bandits: Quantifying the Regret-Information Trade-off

    Authors: Itai Shufaro, Nadav Merlis, Nir Weinberger, Shie Mannor

    Abstract: In many sequential decision problems, an agent performs a repeated task. The agent then suffers regret and obtains information that it may use in the following rounds. However, sometimes the agent may also obtain information and avoid suffering regret by querying external sources. We study the trade-off between the information an agent accumulates and the regret it suffers. We invoke information-theoreti…

    Submitted 23 February, 2025; v1 submitted 26 May, 2024; originally announced May 2024.

  21. arXiv:2404.05440  [pdf, other]

    cs.AI cs.LG

    Tree Search-Based Policy Optimization under Stochastic Execution Delay

    Authors: David Valensi, Esther Derman, Shie Mannor, Gal Dalal

    Abstract: The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state a…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Published in ICLR 2024

  22. arXiv:2403.06806  [pdf, other]

    cs.LG eess.SY

    On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

    Authors: Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y. Levy, R. Srikant, Shie Mannor

    Abstract: We present the first finite time global convergence analysis of policy gradient in the context of infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left({\frac{1}{T}}\right),$ which translat…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 29 pages, 5 figures

  23. arXiv:2403.05732   

    cs.AI cs.LG

    Conservative DDPG -- Pessimistic RL without Ensemble

    Authors: Nitsan Soffair, Shie Mannor

    Abstract: DDPG is hindered by the overestimation bias problem, wherein its $Q$-estimates tend to overstate the actual $Q$-values. Traditional solutions to this bias involve ensemble-based methods, which require significant computational resources, or complex log-policy-based approaches, which are difficult to understand and implement. In contrast, we propose a straightforward solution using a $Q$-target and…

    Submitted 2 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: Paper is not ready

  24. arXiv:2402.10342  [pdf, other]

    cs.LG cs.AI cs.CL

    Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

    Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant

    Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based…

    Submitted 15 July, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

  25. arXiv:2402.05951   

    cs.LG cs.AI

    MinMaxMin $Q$-learning

    Authors: Nitsan Soffair, Shie Mannor

    Abstract: MinMaxMin $Q$-learning is a novel optimistic Actor-Critic algorithm that addresses the problem of overestimation bias ($Q$-estimations are overestimating the real $Q$-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among $Q$-networks in the form of the min-batch MaxMin $Q$-networks distance which is added to the $Q$-target and used as the priority experi…

    Submitted 2 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

    Comments: Paper is not ready

  26. arXiv:2402.05950  [pdf, other]

    cs.LG cs.AI

    SQT -- std $Q$-target

    Authors: Nitsan Soffair, Dotan Di-Castro, Orly Avner, Shie Mannor

    Abstract: Std $Q$-target is a conservative, actor-critic, ensemble, $Q$-learning-based algorithm built on a single key $Q$-formula: the standard deviation of the $Q$-networks, which acts as an "uncertainty penalty" and serves as a minimalistic solution to the problem of overestimation bias. We implement SQT on top of TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms, DDPG, TD3…

    Submitted 2 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  27. arXiv:2402.05643  [pdf, other]

    cs.LG cs.AI

    Improving Token-Based World Models with Parallel Observation Prediction

    Authors: Lior Cohen, Kaixin Wang, Bingyi Kang, Shie Mannor

    Abstract: Motivated by the success of Transformers when applied to sequences of discrete symbols, token-based world models (TBWMs) were recently proposed as sample-efficient methods. In TBWMs, the world model consumes agent experience as a language-like sequence of tokens, where each observation constitutes a sub-sequence. However, during imagination, the sequential token-by-token generation of next observa…

    Submitted 29 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  28. arXiv:2310.07596  [pdf, other]

    cs.LG cs.IT

    Prospective Side Information for Latent MDPs

    Authors: Jeongyeol Kwon, Yonathan Efroni, Shie Mannor, Constantine Caramanis

    Abstract: In many interactive decision-making settings, there is latent and unobserved information that remains fixed. Consider, for example, a dialogue system, where complete information about a user, such as the user's preferences, is not given. In such an environment, the latent information remains fixed throughout each episode, since the identity of the user does not change during an interaction. This t…

    Submitted 11 October, 2023; originally announced October 2023.

  29. arXiv:2310.00675  [pdf, other]

    cs.LG eess.SP

    Optimization or Architecture: How to Hack Kalman Filtering

    Authors: Ido Greenberg, Netanel Yannay, Shie Mannor

    Abstract: In non-linear filtering, it is traditional to compare non-linear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this mixes the evaluation of two separate components: the non-linear architecture, and the parameter optimization method. In particular, the non-linear model is often optimized, whereas the reference KF model is not. We argue that both s…

    Submitted 1 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023

  30. arXiv:2309.01107  [pdf, other]

    cs.LG

    Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

    Authors: Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, Shie Mannor

    Abstract: In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state…

    Submitted 12 February, 2024; v1 submitted 3 September, 2023; originally announced September 2023.

    Comments: Accepted at AAAI 2024

  31. arXiv:2307.13763  [pdf, other]

    stat.ML cs.AI cs.LG

    Sobolev Space Regularised Pre Density Models

    Authors: Mark Kozdoba, Binyamin Perets, Shie Mannor

    Abstract: We propose a new approach to non-parametric density estimation that is based on regularizing a Sobolev norm of the density. This method is statistically consistent, and makes the inductive bias of the model clear and interpretable. While there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. The optimization problem needed to determine the d…

    Submitted 13 February, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

  32. arXiv:2306.14020  [pdf, other]

    cs.LG

    Individualized Dosing Dynamics via Neural Eigen Decomposition

    Authors: Stav Belogolovsky, Ido Greenberg, Danny Eytan, Shie Mannor

    Abstract: Dosing models often use differential equations to model biological dynamics. Neural differential equations in particular can learn to predict the derivative of a process, which permits predictions at irregular points of time. However, this temporal flexibility often comes with a high sensitivity to noise, whereas medical problems often present high noise and limited data. Moreover, medical dosing…

    Submitted 24 June, 2023; originally announced June 2023.

    Comments: arXiv admin note: text overlap with arXiv:2202.00117

  33. arXiv:2306.05859  [pdf, other]

    cs.LG

    Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel

    Authors: Kaixin Wang, Uri Gadot, Navdeep Kumar, Kfir Levy, Shie Mannor

    Abstract: Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, current RMDP methods are often limited to small-scale problems, hindering their use in high-dimensional domains. To bridge this gap, we present EWoK, a novel online approach to solve RMDP that Estimates the Worst transition Kernel to learn r…

    Submitted 12 February, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

  34. arXiv:2305.19922  [pdf, other]

    cs.LG cs.AI

    Representation-Driven Reinforcement Learning

    Authors: Ofir Nabati, Guy Tennenholtz, Shie Mannor

    Abstract: We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where go…

    Submitted 17 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted to ICML 2023

  35. arXiv:2305.02195  [pdf, other]

    cs.CV cs.AI cs.RO

    CALM: Conditional Adversarial Latent Models for Directable Virtual Characters

    Authors: Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, Xue Bin Peng

    Abstract: In this work, we present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters. Using imitation learning, CALM learns a representation of movement that captures the complexity and diversity of human motion, and enables direct control over character movements. The approach jointly learns a control…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: Accepted to SIGGRAPH 2023

  36. arXiv:2303.06654  [pdf, other]

    cs.LG cs.AI

    Twice Regularized Markov Decision Processes: The Equivalence between Robustness and Regularization

    Authors: Esther Derman, Yevgeniy Men, Matthieu Geist, Shie Mannor

    Abstract: Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet,…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Extended version of NeurIPS paper: arXiv:2110.06267

  37. arXiv:2301.13642  [pdf, other]

    cs.LG math.OC

    An Efficient Solution to s-Rectangular Robust Markov Decision Processes

    Authors: Navdeep Kumar, Kfir Levy, Kaixin Wang, Shie Mannor

    Abstract: We present an efficient robust value iteration for \texttt{s}-rectangular robust Markov Decision Processes (MDPs) with a time complexity comparable to standard (non-robust) MDPs which is significantly faster than any existing method. We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma. We unveil the exact form of the optimal policies, whic…

    Submitted 31 January, 2023; originally announced January 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2205.14327

  38. arXiv:2301.13589  [pdf, ps, other]

    cs.LG cs.AI

    Policy Gradient for Rectangular Robust Markov Decision Processes

    Authors: Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Levy, Shie Mannor

    Abstract: Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (…

    Submitted 10 December, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: Accepted to NeurIPS 2023

  39. arXiv:2301.13236  [pdf, other]

    cs.LG cs.AI

    Policy Gradient with Tree Expansion

    Authors: Gal Dalal, Assaf Hallak, Gugan Thoppe, Shie Mannor, Gal Chechik

    Abstract: Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce…

    Submitted 25 May, 2025; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: arXiv admin note: text overlap with arXiv:2209.13966

  40. arXiv:2301.11147  [pdf, other]

    cs.LG

    Train Hard, Fight Easy: Robust Meta Reinforcement Learning

    Authors: Ido Greenberg, Shie Mannor, Gal Chechik, Eli Meirom

    Abstract: A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks…

    Submitted 1 October, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

    Comments: NeurIPS 2023

  41. arXiv:2301.01320  [pdf, ps, other]

    cs.LG stat.ML

    Towards Deployable RL -- What's Broken with RL Research and a Potential Fix

    Authors: Shie Mannor, Aviv Tamar

    Abstract: Reinforcement learning (RL) has demonstrated great potential, but is currently full of overhyping and pipe dreams. We point to some difficulties with current research which we feel are endemic to the direction taken by the community. To us, the current direction is not likely to lead to "deployable" RL: RL that works in practice and can work in practical situations yet still is economically viable…

    Submitted 3 January, 2023; originally announced January 2023.

  42. arXiv:2212.06437  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    DiffStack: A Differentiable and Modular Control Stack for Autonomous Vehicles

    Authors: Peter Karkus, Boris Ivanovic, Shie Mannor, Marco Pavone

    Abstract: Autonomous vehicle (AV) stacks are typically built in a modular fashion, with explicit components performing detection, tracking, prediction, planning, control, etc. While modularity improves reusability, interpretability, and generalizability, it also suffers from compounding errors, information bottlenecks, and integration challenges. To overcome these challenges, a prominent approach is to conv…

    Submitted 13 December, 2022; originally announced December 2022.

    Comments: CoRL 2022 camera ready

  43. arXiv:2210.03528  [pdf, other]

    cs.LG cs.IT stat.ML

    Tractable Optimality in Episodic Latent MABs

    Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

    Abstract: We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to estimate accurately the latent context. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural…

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022

  44. arXiv:2210.02594  [pdf, ps, other]

    cs.LG cs.IT stat.ML

    Reward-Mixing MDPs with a Few Latent Contexts are Learnable

    Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

    Abstract: We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode nature randomly picks a latent reward model among $M$ candidates and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previo…

    Submitted 5 October, 2022; originally announced October 2022.

  45. arXiv:2210.00991  [pdf, ps, other]

    cs.LG

    Policy Gradient for Reinforcement Learning with General Utilities

    Authors: Navdeep Kumar, Kaixin Wang, Kfir Levy, Shie Mannor

    Abstract: In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards. This objective may also be viewed as finding a policy that optimizes a linear function of its state-action occupancy measure, hereafter referred to as Linear RL. However, many supervised and unsupervised RL problems are not covered in the Linear RL framework, such as app…

    Submitted 29 August, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

  46. arXiv:2209.13966  [pdf, other]

    cs.LG

    SoftTreeMax: Policy Gradient with Tree Search

    Authors: Gal Dalal, Assaf Hallak, Shie Mannor, Gal Chechik

    Abstract: Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and subsequently suffer from high sample complexity since they aggregate gradients over entire trajectories. At the other extreme, planning methods, like tree search, optimize the pol…

    Submitted 28 September, 2022; originally announced September 2022.

  47. arXiv:2207.09090  [pdf, other]

    cs.LG cs.AI eess.SY

    Actor-Critic based Improper Reinforcement Learning

    Authors: Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, Shie Mannor

    Abstract: We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2102.08201

  48. Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs

    Authors: Benjamin Fuhrer, Yuval Shpigelman, Chen Tessler, Shie Mannor, Gal Chechik, Eitan Zahavi, Gal Dalal

    Abstract: As communication protocols evolve, datacenter network utilization increases. As a result, congestion is more frequent, causing higher latency and packet loss. Combined with the increasing complexity of workloads, manual design of congestion control (CC) algorithms becomes extremely difficult. This calls for the development of AI approaches to replace the human effort. Unfortunately, it is currentl…

    Submitted 1 June, 2024; v1 submitted 5 July, 2022; originally announced July 2022.

  49. arXiv:2206.12848  [pdf, ps, other]

    cs.LG

    Analysis of Stochastic Processes through Replay Buffers

    Authors: Shirli Di Castro Shashua, Shie Mannor, Dotan Di-Castro

    Abstract: Replay buffers are a key component in many reinforcement learning schemes. Yet, their theoretical properties are not fully understood. In this paper, we analyze a system where a stochastic process X is pushed into a replay buffer and then randomly sampled to generate a stochastic process Y from the replay buffer. We provide an analysis of the properties of the sampled process such as stationarity,…

    Submitted 26 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: text overlap with arXiv:2110.00445

  50. arXiv:2205.15376  [pdf, other]

    cs.LG cs.AI

    Reinforcement Learning with a Terminator

    Authors: Guy Tennenholtz, Nadav Merlis, Lior Shani, Shie Mannor, Uri Shalit, Gal Chechik, Assaf Hallak, Gal Dalal

    Abstract: We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We lea…

    Submitted 5 October, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2022