Search | arXiv e-print repository

Nonparametric Sparse Online Learning of the Koopman Operator

Authors: Boya Hou, Sina Sanjari, Nathan Dahlin, Alec Koppel, Subhonmesh Bose

Abstract: The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman operator via its action on the reproducing kernel Hilbert space (RKHS), and explore the mis-specified scenari… ▽ More The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman operator via its action on the reproducing kernel Hilbert space (RKHS), and explore the mis-specified scenario where the dynamics may escape the chosen function space. We relate the Koopman operator to the conditional mean embeddings (CME) operator and then present an operator stochastic approximation algorithm to learn the Koopman operator iteratively with control over the complexity of the representation. We provide both asymptotic and finite-time last-iterate guarantees of the online sparse learning algorithm with trajectory-based sampling with an analysis that is substantially more involved than that for finite-dimensional stochastic approximation. Numerical examples confirm the effectiveness of the proposed algorithm. △ Less

Submitted 4 February, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

Comments: This work was intended as a replacement of arXiv:2405.07432 and any subsequent updates will appear there

arXiv:2406.13992 [pdf, ps, other]

Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game Perspective

Authors: Muhammad Aneeq uz Zaman, Mathieu Laurière, Alec Koppel, Tamer Başar

Abstract: In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainti… ▽ More In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted for publication in L4DC 2024

arXiv:2405.07432 [pdf, other]

Nonparametric Sparse Online Learning of the Koopman Operator

Authors: Boya Hou, Sina Sanjari, Nathan Dahlin, Alec Koppel, Subhonmesh Bose

Abstract: The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman operator via its action on the reproducing kernel Hilbert space (RKHS), and explore the mis-specified scenari… ▽ More The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman operator via its action on the reproducing kernel Hilbert space (RKHS), and explore the mis-specified scenario where the dynamics may escape the chosen function space. We relate the Koopman operator to the conditional mean embeddings (CME) operator and then present an operator stochastic approximation algorithm to learn the Koopman operator iteratively with control over the complexity of the representation. We provide both asymptotic and finite-time last-iterate guarantees of the online sparse learning algorithm with trajectory-based sampling with an analysis that is substantially more involved than that for finite-dimensional stochastic approximation. Numerical examples confirm the effectiveness of the proposed algorithm. △ Less

Submitted 4 February, 2025; v1 submitted 12 May, 2024; originally announced May 2024.

Comments: 49 pages, 6 figures

arXiv:2310.07320 [pdf, ps, other]

Byzantine-Resilient Decentralized Multi-Armed Bandits

Authors: Jingxuan Zhu, Alec Koppel, Alvaro Velasquez, Ji Liu

Abstract: In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient b… ▽ More In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be Byzantine, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least 3f+1 neighbors where f is the maximum possible Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs, and minimax lower bounds are further established on the achievable regret. Experiments corroborate the merits of this framework in practice. △ Less

Submitted 11 June, 2025; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: add a disclaimer

arXiv:2206.05652 [pdf, other]

Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies

Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Pratap Tokekar, Dinesh Manocha

Abstract: In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse reward is common in continuous control robotics tasks such as manipulation and navigation, and makes the learning problem hard due to non-trivial estimation of value functions over the state space. This demands either rewa… ▽ More In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse reward is common in continuous control robotics tasks such as manipulation and navigation, and makes the learning problem hard due to non-trivial estimation of value functions over the state space. This demands either reward shaping or expert demonstrations for the sparse reward environment. However, obtaining high-quality demonstrations is quite expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme (HT-SPG) to induce a stable exploratory behavior to the algorithm. The proposed algorithm does not require access to expert demonstrations. We test the performance of HT-SPG on various benchmark tasks of continuous control with sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2). We show consistent performance improvement across all tasks in terms of high average cumulative reward. HT-SPG also demonstrates improved convergence speed with minimum samples, thereby emphasizing the sample efficiency of our proposed algorithm. △ Less

Submitted 12 June, 2022; originally announced June 2022.

arXiv:2111.10933 [pdf, other]

Decentralized Upper Confidence Bound Algorithms for Homogeneous Multi-Agent Multi-Armed Bandits

Authors: Jingxuan Zhu, Ethan Mulle, Christopher S. Smith, Alec Koppel, Ji Liu

Abstract: This paper studies a decentralized homogeneous multi-armed bandit problem in a multi-agent network. The problem is simultaneously solved by $N$ agents assuming they face a common set of $M$ arms and share the same arms' reward distributions. Each agent can receive information only from its neighbors, where the neighbor relationships among the agents are described by a fixed graph. Two fully decent… ▽ More This paper studies a decentralized homogeneous multi-armed bandit problem in a multi-agent network. The problem is simultaneously solved by $N$ agents assuming they face a common set of $M$ arms and share the same arms' reward distributions. Each agent can receive information only from its neighbors, where the neighbor relationships among the agents are described by a fixed graph. Two fully decentralized upper confidence bound (UCB) algorithms are proposed for undirected graphs, respectively based on the classic algorithm and the state-of-the-art Kullback-Leibler upper confidence bound (KL-UCB) algorithm. The proposed decentralized UCB1 and KL-UCB algorithms permit each agent in the network to achieve a better logarithmic asymptotic regret than their single-agent counterparts, provided that the agent has at least one neighbor, and the more neighbors an agent has, the better regret it will have, meaning that the sum is more than its component parts. The same algorithm design framework is also extended to directed graphs through the design of a variant of the decentralized UCB1 algorithm, which outperforms the single-agent UCB1 algorithm. △ Less

Submitted 28 December, 2024; v1 submitted 21 November, 2021; originally announced November 2021.

Comments: [v3] and [v4] are different works with different algorithm designs. A shortened version of [v3] was published in the 63rd IEEE Conference on Decision and Control, and a shortened version of [v4] was accepted in IEEE Transactions on Automatic Control

arXiv:2106.08414 [pdf, other]

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

Authors: Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel

Abstract: Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and sui… ▽ More Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter alpha, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Holder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Levy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned. △ Less

Submitted 2 January, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2007.01219 [pdf, ps, other]

Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems

Authors: Zhan Gao, Alec Koppel, Alejandro Ribeiro

Abstract: Stochastic gradient descent is a canonical tool for addressing stochastic optimization problems, and forms the bedrock of modern machine learning and statistics. In this work, we seek to balance the fact that attenuating step-size is required for exact asymptotic convergence with the fact that constant step-size learns faster in finite time up to an error. To do so, rather than fixing the mini-bat… ▽ More Stochastic gradient descent is a canonical tool for addressing stochastic optimization problems, and forms the bedrock of modern machine learning and statistics. In this work, we seek to balance the fact that attenuating step-size is required for exact asymptotic convergence with the fact that constant step-size learns faster in finite time up to an error. To do so, rather than fixing the mini-batch and the step-size at the outset, we propose a strategy to allow parameters to evolve adaptively. Specifically, the batch-size is set to be a piecewise-constant increasing sequence where the increase occurs when a suitable error criterion is satisfied. Moreover, the step-size is selected as that which yields the fastest convergence. The overall algorithm, two scale adaptive (TSA) scheme, is developed for both convex and non-convex stochastic optimization problems. It inherits the exact asymptotic convergence of stochastic gradient method. More importantly, the optimal error decreasing rate is achieved theoretically, as well as an overall reduction in computational cost. Experimentally, we observe that TSA attains a favorable tradeoff relative to standard SGD that fixes the mini-batch and the step-size, or simply allowing one to increase or decrease respectively. △ Less

Submitted 9 July, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

arXiv:2004.04843 [pdf, other]

Policy Gradient using Weak Derivatives for Reinforcement Learning

Authors: Sujay Bhatt, Alec Koppel, Vikram Krishnamurthy

Abstract: This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results:(i) an alternative policy gradient… ▽ More This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results:(i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be $O(1/\sqrt(k))$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than those obtained using the popular score-function approach. Experiments on OpenAI gym pendulum environment show superior performance of the proposed algorithm. △ Less

Submitted 9 April, 2020; originally announced April 2020.

Comments: 1 figure

arXiv:2003.12637 [pdf, other]

Collaborative Beamforming Under Localization Errors: A Discrete Optimization Approach

Authors: Erfaun Noorani, Yagiz Savas, Alec Koppel, John Baras, Ufuk Topcu, Brian M. Sadler

Abstract: We consider a network of agents that locate themselves in an environment through sensor measurements and aim to transmit a message signal to a base station via collaborative beamforming. The agents' sensor measurements result in localization errors, which degrade the quality of service at the base station due to unknown phase offsets that arise in the agents' communication channels. Assuming that… ▽ More We consider a network of agents that locate themselves in an environment through sensor measurements and aim to transmit a message signal to a base station via collaborative beamforming. The agents' sensor measurements result in localization errors, which degrade the quality of service at the base station due to unknown phase offsets that arise in the agents' communication channels. Assuming that each agent's localization error follows a Gaussian distribution, we study the problem of forming a reliable communication link between the agents and the base station despite the localization errors. In particular, we formulate a discrete optimization problem to choose only a subset of agents to transmit the message signal so that the variance of the signal-to-noise ratio (SNR) received by the base station is minimized while the expected SNR exceeds a desired threshold. When the variances of the localization errors are below a certain threshold characterized in terms of the carrier frequency, we show that greedy algorithms can be used to globally minimize the variance of the received SNR. On the other hand, when some agents have localization errors with large variances, we show that the variance of the received SNR can be locally minimized by exploiting the supermodularity of the mean and variance of the received SNR. In numerical simulations, we demonstrate that the proposed algorithms have the potential to synthesize beamformers orders of magnitude faster than convex optimization-based approaches while achieving comparable performances using less number of agents. △ Less

Submitted 17 March, 2021; v1 submitted 27 March, 2020; originally announced March 2020.

arXiv:2002.12475 [pdf, other]

Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

Authors: Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel

Abstract: We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we cal… ▽ More We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, which is a function of the policy's long-term state occupancy distribution. To solve this problem in an online model-free manner, we propose a stochastic variant of primal-dual method that uses Kullback-Lieber (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens. △ Less

Submitted 27 February, 2020; originally announced February 2020.

arXiv:1909.11555 [pdf, other]

Optimally Compressed Nonparametric Online Learning

Authors: Alec Koppel, Amrit Singh Bedi, Ketan Rajawat, Brian M. Sadler

Abstract: Batch training of machine learning models based on neural networks is now well established, whereas to date streaming methods are largely based on linear models. To go beyond linear in the online setting, nonparametric methods are of interest due to their universality and ability to stably incorporate new information via convexity or Bayes' Rule. Unfortunately, when used online, nonparametric meth… ▽ More Batch training of machine learning models based on neural networks is now well established, whereas to date streaming methods are largely based on linear models. To go beyond linear in the online setting, nonparametric methods are of interest due to their universality and ability to stably incorporate new information via convexity or Bayes' Rule. Unfortunately, when used online, nonparametric methods suffer a "curse of dimensionality" which precludes their use: their complexity scales at least with the time index. We survey online compression tools which bring their memory under control and attain approximate convergence. The asymptotic bias depends on a compression parameter that trades off memory and accuracy. Further, the applications to robotics, communications, economics, and power are discussed, as well as extensions to multi-agent systems. △ Less

Submitted 17 January, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

arXiv:1909.05442 [pdf, other]

Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony

Authors: Amrit Singh Bedi, Alec Koppel, Ketan Rajawat, Brian M. Sadler

Abstract: An open challenge in supervised learning is \emph{conceptual drift}: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \… ▽ More An open challenge in supervised learning is \emph{conceptual drift}: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \emph{online} universal function approximator based on compressed kernel methods. Our approach hinges upon viewing non-stationary learning as online convex optimization with dynamic comparators, for which performance is quantified by dynamic regret. Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize actions belong to reproducing kernel Hilbert spaces (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary to surmount the fact that RKHS functions have complexity proportional to time. For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, and that the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data. △ Less

Submitted 11 September, 2019; originally announced September 2019.

arXiv:1908.00510 [pdf, ps, other]

Adaptive Kernel Learning in Heterogeneous Networks

Authors: Hrusikesha Pradhan, Amrit Singh Bedi, Alec Koppel, Ketan Rajawat

Abstract: We consider learning in decentralized heterogeneous networks: agents seek to minimize a convex functional that aggregates data across the network, while only having access to their local data streams. We focus on the case where agents seek to estimate a regression \emph{function} that belongs to a reproducing kernel Hilbert space (RKHS). To incentivize coordination while respecting network heterog… ▽ More We consider learning in decentralized heterogeneous networks: agents seek to minimize a convex functional that aggregates data across the network, while only having access to their local data streams. We focus on the case where agents seek to estimate a regression \emph{function} that belongs to a reproducing kernel Hilbert space (RKHS). To incentivize coordination while respecting network heterogeneity, we impose nonlinear proximity constraints. To solve the constrained stochastic program, we propose applying a functional variant of stochastic primal-dual (Arrow-Hurwicz) method which yields a decentralized algorithm. To handle the fact that agents' functions have complexity proportional to time (owing to the RKHS parameterization), we project the primal iterates onto subspaces greedily constructed from kernel evaluations of agents' local observations. The resulting scheme, dubbed Heterogeneous Adaptive Learning with Kernels (HALK), when used with constant step-sizes, yields $\mathcal{O}(\sqrt{T})$ attenuation in sub-optimality and exactly satisfies the constraints in the long run, which improves upon the state of the art rates for vector-valued problems. △ Less

Submitted 1 June, 2021; v1 submitted 1 August, 2019; originally announced August 2019.

arXiv:1906.08383 [pdf, other]

Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies

Authors: Kaiqing Zhang, Alec Koppel, Hao Zhu, Tamer Başar

Abstract: Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of its empirical success, a rigorous understanding of the global convergence of PG methods is lacking in the literature. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, w… ▽ More Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of its empirical success, a rigorous understanding of the global convergence of PG methods is lacking in the literature. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient. This method then yields an unbiased estimate of the policy gradient with bounded variance, which enables the tools from nonconvex optimization to be applied to establish global convergence. Employing this perspective, we first recover the convergence results with rates to the stationary-point policies in the literature. More interestingly, motivated by advances in nonconvex optimization, we modify the proposed PG method by introducing periodically enlarged stepsizes. The modified algorithm is shown to escape saddle points under mild assumptions on the reward and the policy parameterization. Under a further strict saddle points assumption, this result establishes convergence to essentially locally-optimal policies of the underlying problem, and thus bridges the gap in existing literature on the convergence of PG methods. Results from experiments on the inverted pendulum are then provided to corroborate our theory, namely, by slightly reshaping the reward function to satisfy our assumption, unfavorable saddle points can be avoided and better limit points can be attained. Intriguingly, this empirical finding justifies the benefit of reward-reshaping from a nonconvex optimization perspective. △ Less

Submitted 28 June, 2020; v1 submitted 19 June, 2019; originally announced June 2019.

Comments: Initially submitted in Jan. 2019. Accepted to SIAM Journal on Control and Optimization (SICON)

arXiv:1804.07323 [pdf, other]

Nonparametric Stochastic Compositional Gradient Descent for Q-Learning in Continuous Markov Decision Problems

Authors: Alec Koppel, Ekaterina Tolstaya, Ethan Stump, Alejandro Ribeiro

Abstract: We consider Markov Decision Problems defined over continuous state and action spaces, where an autonomous agent seeks to learn a map from its states to actions so as to maximize its long-term discounted accumulation of rewards. We address this problem by considering Bellman's optimality equation defined over action-value functions, which we reformulate into a nested non-convex stochastic optimizat… ▽ More We consider Markov Decision Problems defined over continuous state and action spaces, where an autonomous agent seeks to learn a map from its states to actions so as to maximize its long-term discounted accumulation of rewards. We address this problem by considering Bellman's optimality equation defined over action-value functions, which we reformulate into a nested non-convex stochastic optimization problem defined over a Reproducing Kernel Hilbert Space (RKHS). We develop a functional generalization of stochastic quasi-gradient method to solve it, which, owing to the structure of the RKHS, admits a parameterization in terms of scalar weights and past state-action pairs which grows proportionately with the algorithm iteration index. To ameliorate this complexity explosion, we apply Kernel Orthogonal Matching Pursuit to the sequence of kernel weights and dictionaries, which yields a controllable error in the descent direction of the underlying optimization method. We prove that the resulting algorithm, called KQ-Learning, converges with probability 1 to a stationary point of this problem, yielding a fixed point of the Bellman optimality operator under the hypothesis that it belongs to the RKHS. Under constant learning rates, we further obtain convergence to a small Bellman error that depends on the chosen learning rates. Numerical evaluation on the Continuous Mountain Car and Inverted Pendulum tasks yields convergent parsimonious learned action-value functions, policies that are competitive with the state of the art, and exhibit reliable, reproducible learning behavior. △ Less

Submitted 19 April, 2018; originally announced April 2018.

arXiv:1606.05578 [pdf, other]

doi 10.1109/TSP.2017.2686368

Proximity Without Consensus in Online Multi-Agent Optimization

Authors: Alec Koppel, Brian M. Sadler, Alejandro Ribeiro

Abstract: We consider stochastic optimization problems in multi-agent settings, where a network of agents aims to learn parameters which are optimal in terms of a global objective, while giving preference to locally observed streaming information. To do so, we depart from the canonical decentralized optimization framework where agreement constraints are enforced, and instead formulate a problem where each a… ▽ More We consider stochastic optimization problems in multi-agent settings, where a network of agents aims to learn parameters which are optimal in terms of a global objective, while giving preference to locally observed streaming information. To do so, we depart from the canonical decentralized optimization framework where agreement constraints are enforced, and instead formulate a problem where each agent minimizes a global objective while enforcing network proximity constraints. This formulation includes online consensus optimization as a special case, but allows for the more general hypothesis that there is data heterogeneity across the network. To solve this problem, we propose using a stochastic saddle point algorithm inspired by Arrow and Hurwicz. This method yields a decentralized algorithm for processing observations sequentially received at each node of the network. Using Lagrange multipliers to penalize the discrepancy between them, only neighboring nodes exchange model information. We establish that under a constant step-size regime the time-average suboptimality and constraint violation are contained in a neighborhood whose radius vanishes with increasing number of iterations. As a consequence, we prove that the time-average primal vectors converge to the optimal objective while satisfying the network proximity constraints. We apply this method to the problem of sequentially estimating a correlated random field in a sensor network, as well as an online source localization problem, both of which demonstrate the empirical validity of the aforementioned convergence results. △ Less

Submitted 17 June, 2016; originally announced June 2016.

Showing 1–17 of 17 results for author: Koppel, A