-
Accelerated Multi-Time-Scale Stochastic Approximation: Optimal Complexity and Applications in Reinforcement Learning and Multi-Agent Games
Authors:
Sihan Zeng,
Thinh T. Doan
Abstract:
Multi-time-scale stochastic approximation is an iterative algorithm for finding the fixed point of a set of $N$ coupled operators given their noisy samples. It has been observed that due to the coupling between the decision variables and noisy samples of the operators, the performance of this method decays as $N$ increases. In this work, we develop a new accelerated variant of multi-time-scale stochastic approximation, which significantly improves the convergence rates of its standard counterpart. Our key idea is to introduce auxiliary variables to dynamically estimate the operators from their samples, which are then used to update the decision variables. These auxiliary variables help not only to control the variance of the operator estimates but also to decouple the sampling noise and the decision variables. This allows us to select more aggressive step sizes to achieve an optimal convergence rate. Specifically, under a strong monotonicity condition, we show that for any value of $N$ the $t^{\text{th}}$ iterate of the proposed algorithm converges to the desired solution at a rate $\widetilde{O}(1/t)$ when the operator samples are generated from a single Markov process trajectory.
A second contribution of this work is to demonstrate that the objective of a range of problems in reinforcement learning and multi-agent games can be expressed as a system of fixed-point equations. As such, the proposed approach can be used to design new learning algorithms for solving these problems. We illustrate this observation with numerical simulations in a multi-agent game and show the advantage of the proposed method over the standard multi-time-scale stochastic approximation algorithm.
Submitted 12 September, 2024;
originally announced September 2024.
-
Resilient Two-Time-Scale Local Stochastic Gradient Descent for Byzantine Federated Learning
Authors:
Amit Dutta,
Thinh T. Doan
Abstract:
We study local stochastic gradient descent methods for solving federated optimization over a network of agents communicating indirectly through a centralized coordinator. We are interested in the Byzantine setting where there is a subset of $f$ malicious agents that could observe the entire network and send arbitrary values to the coordinator to disrupt the performance of other non-faulty agents. The objective of the non-faulty agents is to collaboratively compute the optimizer of their respective local functions in the presence of Byzantine agents. In this setting, prior works show that the local stochastic gradient descent method can only return an approximation of the desired solutions due to the impact of Byzantine agents. Whether this method can find an exact solution remains an open question. In this paper, we address this open question by proposing a new variant of the local stochastic gradient descent method. Under conditions similar to those considered in existing works, we show that the proposed method converges exactly to the desired solutions. We provide theoretical results to characterize the convergence properties of our method; in particular, the proposed method converges at an optimal rate $\mathcal{O}(1/k)$ in both strongly convex and non-convex settings, where $k$ is the number of iterations. Finally, we present a number of simulations to illustrate our theoretical results.
Submitted 4 September, 2024;
originally announced September 2024.
-
Fast Two-Time-Scale Stochastic Gradient Method with Applications in Reinforcement Learning
Authors:
Sihan Zeng,
Thinh T. Doan
Abstract:
Two-time-scale optimization is a framework introduced in Zeng et al. (2024) that abstracts a range of policy evaluation and policy optimization problems in reinforcement learning (RL). Akin to bi-level optimization under a particular type of stochastic oracle, the two-time-scale optimization framework has an upper level objective whose gradient evaluation depends on the solution of a lower level problem, which is to find the root of a strongly monotone operator. In this work, we propose a new method for solving two-time-scale optimization that achieves significantly faster convergence than prior art. The key idea of our approach is to leverage an averaging step to improve the estimates of the operators in both lower and upper levels before using them to update the decision variables. These additional averaging steps eliminate the direct coupling between the main variables, enabling the accelerated performance of our algorithm. We characterize the finite-time convergence rates of the proposed algorithm under various conditions of the underlying objective function, including strong convexity, the Polyak-Łojasiewicz condition, and general non-convexity. These rates significantly improve over the best-known complexity of the standard two-time-scale stochastic approximation algorithm. When applied to RL, we show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art. Finally, we support our theoretical results with numerical simulations in RL.
Submitted 2 March, 2025; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Natural Policy Gradient and Actor Critic Methods for Constrained Multi-Task Reinforcement Learning
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks at the same time. This paper presents a constrained formulation for multi-task RL where the goal is to maximize the average performance of the policy across tasks subject to bounds on the performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each given one task and observing local information, cooperate to find the solution of the globally constrained objective using local communication.
We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sample-based actor-critic algorithm that finds the optimal policy using online samples of state, action, and reward. Finally, we study the extension of the algorithm to the linear function approximation setting.
Submitted 3 May, 2024;
originally announced May 2024.
-
Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity
Authors:
Thinh T. Doan
Abstract:
This paper develops a new variant of the two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators, assuming only noisy samples of these operators can be observed. Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples. The estimated values of these averaging steps are then used in the two-time-scale stochastic approximation updates to find the desired solution. Our main theoretical result shows that, under a strong monotonicity condition on the underlying nonlinear operators, the mean-squared errors of the iterates generated by the proposed method converge to zero at an optimal rate $O(1/k)$, where $k$ is the number of iterations. Our result significantly improves the existing result of two-time-scale stochastic approximation, where the best known finite-time convergence rate is $O(1/k^{2/3})$. We illustrate this result by applying the proposed method to develop new reinforcement learning algorithms with improved performance.
Submitted 22 March, 2024; v1 submitted 23 January, 2024;
originally announced January 2024.
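The averaging idea in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy linear operators $F(x, y) = x - 0.5y$ and $G(x, y) = x + y - 3$ (common root at $(1, 2)$), the Gaussian noise, the step-size constants, and the function name `averaged_two_time_scale_sa`. The sketch keeps running estimates `f` and `g` of the two operators and uses those estimates, rather than the raw noisy samples, to update the decision variables.

```python
import random


def averaged_two_time_scale_sa(num_iters=50000, noise_std=0.5, seed=0):
    """Two-time-scale stochastic approximation with averaged operator estimates.

    Toy operators (assumed for illustration only):
        F(x, y) = x - 0.5 * y
        G(x, y) = x + y - 3
    Their common root is (x, y) = (1, 2).
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0  # decision variables
    f, g = 0.0, 0.0  # running (averaged) estimates of F and G

    for k in range(num_iters):
        # Hypothetical step sizes, all of order 1/k with different constants.
        base = 1.0 / (k + 10)
        lam, alpha, beta = 10.0 * base, 4.0 * base, 2.0 * base

        # Noisy operator samples evaluated at the current iterate.
        f_sample = (x - 0.5 * y) + rng.gauss(0.0, noise_std)
        g_sample = (x + y - 3.0) + rng.gauss(0.0, noise_std)

        # Decision variables move along the current averaged estimates,
        # not along the raw noisy samples.
        x_next = x - alpha * f
        y_next = y - beta * g

        # Averaging step: blend the new samples into the running estimates.
        f = (1.0 - lam) * f + lam * f_sample
        g = (1.0 - lam) * g + lam * g_sample

        x, y = x_next, y_next

    return x, y
```

Calling `averaged_two_time_scale_sa()` returns a pair that should lie close to the root $(1, 2)$ of the toy problem; the sketch says nothing about the rates proved in the paper.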
-
Resilient Federated Learning under Byzantine Attack in Distributed Nonconvex Optimization with 2-f Redundancy
Authors:
Amit Dutta,
Thinh T. Doan,
Jeffrey H. Reed
Abstract:
We study the problem of Byzantine fault tolerance in a distributed optimization setting, where there is a group of $N$ agents communicating with a trusted centralized coordinator. Among these agents, there is a subset of $f$ agents that may not follow a prescribed algorithm and may share arbitrarily incorrect information with the coordinator. The goal is to find the optimizer of the aggregate cost functions of the honest agents. We will be interested in studying the local gradient descent method, also known as federated learning, to solve this problem. However, this method often returns an approximate value of the underlying optimal solution in the Byzantine setting. Recent work showed that by incorporating the so-called comparative elimination (CE) filter at the coordinator, one can provably mitigate the detrimental impact of Byzantine agents and precisely compute the true optimizer in the convex setting. The focus of the present work is to provide theoretical results to show the convergence of local gradient methods with the CE filter in a nonconvex setting. We will also provide a number of numerical simulations to support our theoretical results.
Submitted 15 December, 2023;
originally announced December 2023.
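A minimal sketch of a comparative-elimination style filter at the coordinator follows. The abstract above does not spell out the rule, so this sketch assumes that the coordinator ranks the agents' estimates by Euclidean distance to its own current estimate, discards the $f$ farthest, and averages the rest. The function name `comparative_elimination` and its arguments are hypothetical.

```python
import math


def comparative_elimination(center, estimates, num_faulty):
    """Average the estimates closest to `center`, dropping the farthest ones.

    center:      coordinator's current estimate (list of floats)
    estimates:   one estimate per agent (list of lists of floats)
    num_faulty:  assumed upper bound f on the number of Byzantine agents
    """
    if not 0 <= num_faulty < len(estimates):
        raise ValueError("num_faulty must be in [0, number of agents)")

    # Rank agents by distance from the coordinator's current estimate.
    ranked = sorted(estimates, key=lambda v: math.dist(center, v))

    # Keep the N - f closest estimates and discard the f farthest.
    kept = ranked[: len(estimates) - num_faulty]

    # Coordinate-wise average of the surviving estimates.
    dim = len(center)
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]
```

For example, with `center = [0.0, 0.0]`, one outlier far from the other estimates, and `num_faulty = 1`, the outlier is dropped before averaging.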
-
Connected Superlevel Set in (Deep) Reinforcement Learning and its Application to Minimax Theorems
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property. To the best of our knowledge, these are novel and previously unknown discoveries.
We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and is equiconnected on the other side observes the minimax equality (i.e. has a Nash equilibrium). We find that this exact structure is exhibited by an interesting robust reinforcement learning problem under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result is established in the literature.
Submitted 30 September, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Regularized Gradient Descent Ascent for Two-Player Zero-Sum Markov Games
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization program, a natural approach to solve the problem is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understanding of this method is limited. In our paper, we consider solving an entropy-regularized variant of the Markov game. The regularization introduces structure into the optimization landscape that makes the solutions more identifiable and allows the problem to be solved more efficiently. Our main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem. We explicitly characterize the finite-time performance of the last iterate of our algorithm, which vastly improves over the existing convergence bound of the gradient descent ascent algorithm without regularization. Finally, we complement the analysis with numerical simulations that illustrate the accelerated convergence of the algorithm.
Submitted 12 October, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Convergence Rates of Two-Time-Scale Gradient Descent-Ascent Dynamics for Solving Nonconvex Min-Max Problems
Authors:
Thinh T. Doan
Abstract:
There has been much recent interest in solving nonconvex min-max optimization problems due to their broad applications in many areas including machine learning, networked resource allocation, and distributed optimization. Perhaps the most popular first-order method for solving min-max optimization is the so-called simultaneous (or single-loop) gradient descent-ascent algorithm, due to its simplicity of implementation. However, theoretical guarantees on the convergence of this algorithm are very sparse since it can diverge even in a simple bilinear problem.
In this paper, our focus is to characterize the finite-time performance (or convergence rates) of the continuous-time variant of the simultaneous gradient descent-ascent algorithm. In particular, we derive the rates of convergence of this method under a number of different conditions on the underlying objective function, namely, two-sided Polyak-Łojasiewicz (PL), one-sided PL, nonconvex-strongly concave, and strongly convex-nonconcave conditions. Our convergence results improve the ones in prior works under the same conditions on the objective functions. The key idea in our analysis is to use the classic singular perturbation theory and coupling Lyapunov functions to address the time-scale difference and interactions between the gradient descent and ascent dynamics. Our results on the behavior of the continuous-time algorithm may be used to enhance the convergence properties of its discrete-time counterpart.
Submitted 17 December, 2021;
originally announced December 2021.
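The divergence on a bilinear problem mentioned in the abstract above can be checked with a short sketch. It assumes the objective $f(x, y) = xy$ (minimize over $x$, maximize over $y$) and a constant step size $\eta$; both choices, and the function name `simultaneous_gda`, are illustrative and not taken from the paper. One discrete simultaneous step maps $(x, y)$ to $(x - \eta y,\; y + \eta x)$, so $x^2 + y^2$ grows by the factor $1 + \eta^2$ at every step and the iterates spiral away from the saddle point at the origin.

```python
def simultaneous_gda(x, y, step, num_steps):
    """Discrete simultaneous gradient descent-ascent on f(x, y) = x * y."""
    for _ in range(num_steps):
        grad_x, grad_y = y, x  # partial derivatives of x * y
        # Both players update from the same (old) iterate.
        x, y = x - step * grad_x, y + step * grad_y
    return x, y


def distance_from_saddle(x, y):
    """Euclidean distance to the saddle point (0, 0)."""
    return (x * x + y * y) ** 0.5
```

Starting from any nonzero point, `distance_from_saddle(*simultaneous_gda(x0, y0, step, n))` equals the initial distance multiplied by $(1 + \eta^2)^{n/2}$, up to floating-point rounding, which is strictly increasing in $n$ for every $\eta > 0$.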
-
Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes) and they are all intertwined with each other. Our main contribution is to derive a finite-time analysis for the convergence of this algorithm to the global optimum of a CMDP problem. Specifically, we show that with a proper choice of step sizes the optimality gap and constraint violation converge to zero in expectation at a rate $\mathcal{O}(1/K^{1/6})$, where $K$ is the number of iterations. To our knowledge, this paper is the first to study the finite-time complexity of an online primal-dual actor-critic method for solving a CMDP problem. We also validate the effectiveness of this algorithm through numerical simulations.
Submitted 19 November, 2024; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Convergence Rates of Decentralized Gradient Methods over Cluster Networks
Authors:
Amit Dutta,
Nila Masrourisaadat,
Thinh T. Doan
Abstract:
We present an analysis of the performance of decentralized consensus-based gradient (DCG) methods for solving optimization problems over a cluster network of nodes. This type of network is composed of a number of densely connected clusters with a sparse connection between them. Decentralized algorithms over cluster networks have been observed to constitute two-time-scale dynamics, where information within any cluster is mixed much faster than information across clusters. Based on this observation, we present a novel analysis to study the convergence of the DCG methods over cluster networks. In particular, we show that these methods converge at a rate $\ln(T)/T$ and only scale with the number of clusters, which is small relative to the size of the network. Our result improves the existing analysis, where these methods are shown to scale with the size of the network. The key technique in our analysis is to consider a novel Lyapunov function that captures the impact of multiple time-scale dynamics on the convergence of this method. We also illustrate our theoretical results by a number of numerical simulations using DCG methods over different cluster networks.
Submitted 13 October, 2021;
originally announced October 2021.
-
A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying MDPs controlled by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" than the latter. Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, PL condition, and general non-convexity.
We apply our framework to various policy optimization problems. First, we look at the infinite-horizon average-reward MDP with finite state and action spaces and derive a convergence rate of $O(k^{-2/5})$ for the online actor-critic algorithm under function approximation, which recovers the best known rate derived specifically for this problem. Second, we study the linear-quadratic regulator and show that an online actor-critic method converges with rate $O(k^{-2/3})$. Third, we use the actor-critic algorithm to solve the policy optimization problem in an entropy regularized Markov decision process, where we also establish a convergence of $O(k^{-2/3})$. The results we derive for both the second and third problem are novel and previously unknown in the literature. Finally, we briefly present the application of our framework to gradient-based policy evaluation algorithms in reinforcement learning.
Submitted 23 August, 2024; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Distributed Dual Subgradient Methods with Averaging and Applications to Grid Optimization
Authors:
Subhonmesh Bose,
Hoa Dinh Nguyen,
Haitian Liu,
Ye Guo,
Thinh T. Doan,
Carolyn L. Beck
Abstract:
We study finite-time performance of a recently proposed distributed dual subgradient (DDSG) method for convex constrained multi-agent optimization problems. The algorithm enjoys performance guarantees on the last primal iterate, as opposed to those derived for ergodic means for vanilla DDSG algorithms. Our work improves the recently published convergence rate of $\mathcal{O}(\log T/\sqrt{T})$ with decaying step-sizes to $\mathcal{O}(1/\sqrt{T})$ with constant step-size on a metric that combines suboptimality and constraint violation. We then numerically evaluate the algorithm on three grid optimization problems. Namely, these are tie-line scheduling in multi-area power systems, coordination of distributed energy resources in radial distribution networks, and joint dispatch of transmission and distribution assets. The DDSG algorithm applies to each problem with various relaxations and linearizations of the power flow equations. The numerical experiments illustrate various properties of the DDSG algorithm: comparison with vanilla DDSG, impact of the number of agents, and why Nesterov-style acceleration fails in DDSG settings.
Submitted 26 July, 2023; v1 submitted 14 July, 2021;
originally announced July 2021.
-
Convergence Rates of Distributed Consensus over Cluster Networks: A Two-Time-Scale Approach
Authors:
Amit Dutta,
Almuatazbellah M. Boker,
Thinh T. Doan
Abstract:
We study the popular distributed consensus method over networks composed of a number of densely connected clusters with a sparse connection between them. In these cluster networks, the method often constitutes two-time-scale dynamics, where the internal nodes within each cluster reach consensus quickly relative to the aggregate nodes across clusters. Our main contribution is to provide the rate of the distributed consensus method, which characterizes explicitly the impacts of the internal and external graphs on the performance of this method. Our main result shows that the method converges exponentially and that its rate scales only with a small number of nodes, which is small relative to the size of the network.
The key technique in our analysis is to consider a Lyapunov function which captures the impacts of different time-scale dynamics on the convergence of the method. Our approach avoids using model reduction, which is the typical way according to singular perturbation theory and relies on relatively simple definitions of the slow and fast variables. In addition, Lyapunov analysis allows us to derive the rate of distributed consensus methods over cluster networks, which is missing from the existing works using singular perturbation theory. We illustrate our theoretical results by a number of numerical simulations over different cluster networks.
Submitted 12 September, 2022; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Finite-Time Convergence Rates of Nonlinear Two-Time-Scale Stochastic Approximation under Markovian Noise
Authors:
Thinh T. Doan
Abstract:
We study the so-called two-time-scale stochastic approximation, a simulation-based approach for finding the roots of two coupled nonlinear operators. Our focus is to characterize its finite-time performance in a Markov setting, which often arises in stochastic control and reinforcement learning problems. In particular, we consider the scenario where the data in the method are generated by Markov processes and are therefore dependent. Such dependent data result in biased observations of the underlying operators. Under some fairly standard assumptions on the operators and the Markov processes, we provide a formula that characterizes the rate at which the mean square errors generated by the method converge to zero. Our result shows that the method converges in expectation at a rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. Our analysis is mainly motivated by the classic singular perturbation theory for studying the asymptotic convergence of two-time-scale systems, that is, we consider a Lyapunov function that carefully characterizes the coupling between the two iterates. In addition, we utilize the geometric mixing time of the underlying Markov process to handle the bias and dependence in the data. Our theoretical result complements the existing literature, where the rate of nonlinear two-time-scale stochastic approximation under Markovian noise is unknown.
Submitted 4 April, 2021;
originally announced April 2021.
-
Nonlinear Two-Time-Scale Stochastic Approximation: Convergence and Finite-Time Performance
Authors:
Thinh T. Doan
Abstract:
Two-time-scale stochastic approximation, a generalized version of the popular stochastic approximation, has found broad applications in many areas including stochastic control, optimization, and machine learning. Despite its popularity, theoretical guarantees of this method, especially its finite-time performance, are mostly achieved for the linear case while the results for the nonlinear counterpart are very sparse. Motivated by the classic control theory for singularly perturbed systems, we study in this paper the asymptotic convergence and finite-time analysis of the nonlinear two-time-scale stochastic approximation. Under some fairly standard assumptions, we provide a formula that characterizes the rate of convergence of the main iterates to the desired solutions. In particular, we show that the method achieves a convergence in expectation at a rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. The key idea in our analysis is to properly choose the two step sizes to characterize the coupling between the fast and slow-time-scale iterates.
Submitted 23 March, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Finite-Time Convergence Rates of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning
Authors:
Sihan Zeng,
Thinh T. Doan,
Justin Romberg
Abstract:
We study a decentralized variant of stochastic approximation, a data-driven approach for finding the root of an operator under noisy measurements. A network of agents, each with its own operator and data observations, cooperatively finds the fixed point of the aggregate operator over a decentralized communication graph. Our main contribution is to provide a finite-time analysis of this decentralized stochastic approximation method when the data observed at each agent are sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under fairly standard assumptions, we show that the convergence rate of the proposed method is essentially the same as if the samples were independent, differing only by a log factor that accounts for the mixing time of the Markov processes. The key idea in our analysis is to introduce a novel Razumikhin-Lyapunov function, motivated by the one used in analyzing the stability of delayed ordinary differential equations. We also discuss applications of the proposed method to a number of interesting learning problems in multi-agent systems.
Submitted 16 June, 2022; v1 submitted 28 October, 2020;
originally announced October 2020.
-
Local Stochastic Approximation: A Unified View of Federated Learning and Distributed Multi-Task Reinforcement Learning Algorithms
Authors:
Thinh T. Doan
Abstract:
Motivated by broad applications in reinforcement learning and federated learning, we study local stochastic approximation over a network of agents whose goal is to find the root of an operator composed of the local operators at the agents. Our focus is to characterize the finite-time performance of this method when the data at each agent are generated from Markov processes and hence are dependent. In particular, we provide the convergence rates of local stochastic approximation for both constant and time-varying step sizes. Our results show that these rates are within a logarithmic factor of the ones under independent data. We then illustrate the applications of these results to different interesting problems in multi-task reinforcement learning and federated learning.
Submitted 24 June, 2020;
originally announced June 2020.
-
Finite-Time Analysis of Stochastic Gradient Descent under Markov Randomness
Authors:
Thinh T. Doan,
Lam M. Nguyen,
Nhan H. Pham,
Justin Romberg
Abstract:
Motivated by broad applications in reinforcement learning and machine learning, this paper considers the popular stochastic gradient descent (SGD) when the gradients of the underlying objective function are sampled from Markov processes. This Markov sampling leads to the gradient samples being biased and not independent. The existing results for the convergence of SGD under Markov randomness are often established under the assumptions on the boundedness of either the iterates or the gradient samples. Our main focus is to study the finite-time convergence of SGD for different types of objective functions, without requiring these assumptions. We show that SGD converges nearly at the same rate with Markovian gradient samples as with independent gradient samples. The only difference is a logarithmic factor that accounts for the mixing time of the Markov chain.
Submitted 1 April, 2020; v1 submitted 24 March, 2020;
originally announced March 2020.
-
Convergence Rates of Accelerated Markov Gradient Descent with Applications in Reinforcement Learning
Authors:
Thinh T. Doan,
Lam M. Nguyen,
Nhan H. Pham,
Justin Romberg
Abstract:
Motivated by broad applications in machine learning, we study the popular accelerated stochastic gradient descent (ASGD) algorithm for solving (possibly nonconvex) optimization problems. We characterize the finite-time performance of this method when the gradients are sampled from Markov processes, and hence biased and dependent from time step to time step; in contrast, the analysis in existing work relies heavily on the stochastic gradients being independent and sometimes unbiased. Our main contributions show that under certain (standard) assumptions on the underlying Markov chain generating the gradients, ASGD converges at nearly the same rate with Markovian gradient samples as with independent gradient samples. The only difference is a logarithmic factor that accounts for the mixing time of the Markov chain. One of the key motivations for this study is the class of complicated control problems that can be modeled by a Markov decision process and solved using reinforcement learning. We apply the accelerated method to several challenging problems in the OpenAI Gym and MuJoCo, and show that acceleration can significantly improve the performance of the classic temporal difference learning and REINFORCE algorithms.
Submitted 19 October, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
Finite-Time Analysis and Restarting Scheme for Linear Two-Time-Scale Stochastic Approximation
Authors:
Thinh T. Doan
Abstract:
Motivated by its broad applications in reinforcement learning, we study the linear two-time-scale stochastic approximation, an iterative method using two different step sizes for finding the solutions of a system of two equations. Our main focus is to characterize the finite-time complexity of this method under time-varying step sizes and Markovian noise. In particular, we show that the mean square errors of the variables generated by the method converge to zero at a sublinear rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations. We then improve the performance of this method by considering a restarting scheme, where we restart the algorithm after every predetermined number of iterations. We show that with this restarting scheme the complexity of the algorithm under time-varying step sizes is as good as the one using constant step sizes, while still achieving exact convergence to the desired solution. Moreover, the restarting scheme also helps to prevent the step sizes from getting too small, which is useful for the practical implementation of the linear two-time-scale stochastic approximation.
Submitted 9 January, 2020; v1 submitted 22 December, 2019;
originally announced December 2019.
-
Finite-Time Performance of Distributed Two-Time-Scale Stochastic Approximation
Authors:
Thinh T. Doan,
Justin Romberg
Abstract:
Two-time-scale stochastic approximation is a popular iterative method for finding the solution of a system of two equations. Such methods have found broad applications in many areas, especially in machine learning and reinforcement learning. In this paper, we propose a distributed variant of this method over a network of agents, where the agents use two graphs representing their communication at different speeds due to the nature of their two-time-scale updates. Our main contribution is to provide a finite-time analysis for the performance of the proposed method. In particular, we establish an upper bound for the convergence rates of the mean square errors at the agents to zero as a function of the step sizes and the network topology.
Submitted 20 December, 2019;
originally announced December 2019.
-
Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation
Authors:
Thinh T. Doan,
Siva Theja Maguluri,
Justin Romberg
Abstract:
We study the policy evaluation problem in multi-agent reinforcement learning, modeled by a Markov decision process. In this problem, the agents operate in a common environment under a fixed control policy, working together to discover the value (global discounted accumulative reward) associated with each environmental state. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed variant of the popular temporal difference learning methods TD($λ$).
Our main contribution is to provide a finite-time analysis of the performance of this distributed TD($λ$) algorithm for both constant and time-varying step sizes. The key idea in our analysis is to use the geometric mixing time $τ$ of the underlying Markov chain; that is, although the "noise" in our algorithm is Markovian, its dependence is very weak between samples spaced $τ$ steps apart. We provide an explicit upper bound on the convergence rate of the proposed method as a function of the network topology, the discount factor, the constant $λ$, and the mixing time $τ$.
Our results also provide a mathematical explanation for observations that have appeared previously in the literature about the choice of $λ$. Our upper bound illustrates the trade-off between approximation accuracy and convergence speed implicit in the choice of $λ$. When $λ=1$, the solution will correspond to the best possible approximation of the value function, while choosing $λ= 0$ leads to faster convergence when the noise in the algorithm has large variance.
Submitted 9 January, 2020; v1 submitted 25 July, 2019;
originally announced July 2019.
-
Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning
Authors:
Zaiwei Chen,
Sheng Zhang,
Thinh T. Doan,
John-Paul Clarke,
Siva Theja Maguluri
Abstract:
Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., $α_k\equiv α$), the algorithm achieves exponentially fast convergence to a neighborhood (with radius $O(α\log(1/α))$) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain.
To demonstrate the generality of our theoretical results on Markovian SA, we use it to derive the finite-sample bounds of the popular $Q$-learning with linear function approximation algorithm, under a condition on the behavior policy. Importantly, we do not need to make the assumption that the samples are i.i.d., and do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.
Submitted 26 January, 2022; v1 submitted 27 May, 2019;
originally announced May 2019.
-
Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning
Authors:
Thinh T. Doan,
Siva Theja Maguluri,
Justin Romberg
Abstract:
We study the policy evaluation problem in multi-agent reinforcement learning. In this problem, a group of agents works cooperatively to evaluate the value function for the global discounted accumulative reward problem, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed consensus-based variant of the popular temporal difference learning algorithm TD(0).
While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rate. Our main contribution is providing a finite-time analysis for the convergence of the distributed TD(0) algorithm. We do this when the communication network between the agents is time-varying in general. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what we would expect from using distributed stochastic gradient descent for solving convex optimization problems.
Submitted 1 June, 2019; v1 submitted 19 February, 2019;
originally announced February 2019.
-
Fast Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization
Authors:
Thinh T. Doan,
Siva Theja Maguluri,
Justin Romberg
Abstract:
We study distributed optimization problems over a network when the communication between the nodes is constrained, and so information that is exchanged between the nodes must be quantized. Recent advances using the distributed gradient algorithm with a quantization scheme at a fixed resolution have established convergence, but at rates significantly slower than when the communications are unquantized.
In this paper, we introduce a novel quantization method, which we refer to as adaptive quantization, that allows us to match the convergence rates under perfect communications. Our approach adjusts the quantization scheme used by each node as the algorithm progresses: as we approach the solution, we become more certain about where the state variables are localized, and adapt the quantizer codebook accordingly.
We bound the convergence rates of the proposed method as a function of the communication bandwidth, the underlying network topology, and structural properties of the constituent objective functions. In particular, we show that if the objective functions are convex or strongly convex, then using adaptive quantization does not affect the rate of convergence of the distributed subgradient methods when the communications are quantized, except for a constant that depends on the resolution of the quantizer. To the best of our knowledge, the rates achieved in this paper are better than any existing work in the literature for distributed gradient methods under finite communication bandwidths. We also provide numerical simulations that compare convergence properties of the distributed gradient methods with and without quantization for solving distributed regression problems for both quadratic and absolute loss functions.
Submitted 10 May, 2019; v1 submitted 30 October, 2018;
originally announced October 2018.
-
Distributed Stochastic Approximation for Solving Network Optimization Problems Under Random Quantization
Authors:
Thinh T. Doan,
Siva Theja Maguluri,
Justin Romberg
Abstract:
We study distributed optimization problems over a network when the communication between the nodes is constrained, and so information that is exchanged between the nodes must be quantized. This imperfect communication poses a fundamental challenge and, if not properly accounted for, prevents the convergence of these algorithms. Our first contribution in this paper is to propose a modified consensus-based gradient method for solving such problems using random (dithered) quantization. This algorithm can be interpreted as a distributed variant of a well-known two-time-scale stochastic algorithm. We then study the convergence and derive upper bounds on the rates of convergence of the proposed method as a function of the bandwidths available between the nodes and the underlying network topology, for both convex and strongly convex objective functions. Our results complement the existing literature, where such convergence and explicit formulas of the convergence rates are missing. Finally, we provide numerical simulations to compare the convergence properties of the distributed gradient methods with and without quantization for solving the well-known regression problems over networks, for both quadratic and absolute loss functions.
Submitted 26 October, 2018;
originally announced October 2018.
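
The abstract above does not spell out the quantizer, so the following is only a minimal sketch of one common form of random quantization: each value is rounded to one of its two neighbouring grid points, with probabilities chosen so that the quantized value is unbiased. The function name `random_quantize` and the parameter `resolution` are hypothetical and are not taken from the paper.

```python
import math
import random


def random_quantize(value, resolution, rng):
    """Round `value` to a neighbouring multiple of `resolution` at random.

    The probability of rounding up equals the fractional position of `value`
    between the two grid points, so the expected output equals `value`.
    """
    lower = math.floor(value / resolution) * resolution
    prob_up = (value - lower) / resolution
    return lower + resolution if rng.random() < prob_up else lower


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [random_quantize(0.6, 0.25, rng) for _ in range(20000)]
    print("distinct outputs:", sorted(set(samples)))
    print("sample mean:", sum(samples) / len(samples))
```

Because the quantization error has zero mean, it behaves like additional noise in the exchanged values, which is consistent with the abstract's interpretation of the method as a stochastic algorithm.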
-
Convergence of the Iterates in Mirror Descent Methods
Authors:
Thinh T. Doan,
Subhonmesh Bose,
D. Hoa Nguyen,
Carolyn L. Beck
Abstract:
We consider centralized and distributed mirror descent algorithms over a finite-dimensional Hilbert space, and prove that the problem variables converge to an optimizer of a possibly nonsmooth function when the step sizes are square summable but not summable. Prior literature has focused on the convergence of the function value to its optimum. However, applications from distributed optimization and learning in games require the convergence of the variables to an optimizer, which is generally not guaranteed without assuming strong convexity of the objective function. We provide numerical simulations comparing entropic mirror descent and standard subgradient methods for the robust regression problem.
Submitted 3 May, 2018;
originally announced May 2018.
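
To make the setting of the abstract above concrete, here is a small sketch of centralized entropic mirror descent on the probability simplex, applied to a toy least-absolute-deviations regression problem. The data `A` and `B`, the uniform starting point, and the step sizes $1/(k+1)$ (square summable but not summable) are assumptions chosen for illustration; they are not the experiments reported in the paper.

```python
import math

# Toy problem (hypothetical data): minimize sum_i |a_i . x - b_i| over the
# probability simplex. B is generated from x_true = (0.7, 0.2, 0.1).
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
B = [0.7, 0.2, 0.1, 0.9]


def objective(x):
    return sum(abs(sum(a * xi for a, xi in zip(row, x)) - b)
               for row, b in zip(A, B))


def subgradient(x):
    """Return one subgradient of the nonsmooth objective at x."""
    g = [0.0] * len(x)
    for row, b in zip(A, B):
        residual = sum(a * xi for a, xi in zip(row, x)) - b
        sign = (residual > 0) - (residual < 0)
        for j, a in enumerate(row):
            g[j] += sign * a
    return g


def entropic_mirror_descent(num_iters=2000):
    n = len(A[0])
    x = [1.0 / n] * n  # start at the centre of the simplex
    for k in range(num_iters):
        alpha = 1.0 / (k + 1)  # square summable but not summable
        g = subgradient(x)
        # Multiplicative (exponentiated-gradient) update, then renormalize.
        weights = [xi * math.exp(-alpha * gi) for xi, gi in zip(x, g)]
        total = sum(weights)
        x = [w / total for w in weights]
    return x


if __name__ == "__main__":
    x = entropic_mirror_descent()
    print("last iterate:", [round(xi, 3) for xi in x])
    print("objective:", round(objective(x), 4))
```

The sketch returns the last iterate rather than an averaged or best iterate, since the abstract is concerned with convergence of the variables themselves.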
-
Distributed Resource Allocation Over Dynamic Networks with Uncertainty
Authors:
Thinh T. Doan,
Carolyn L. Beck
Abstract:
Motivated by broad applications in various fields of engineering, we study a network resource allocation problem where the goal is to optimally allocate a fixed quantity of resources over a network of nodes. We consider large-scale networks with complex interconnection structures; thus, any solution must be implemented in parallel and based only on local data, resulting in a need for distributed algorithms. In this paper, we study a distributed Lagrangian method for such problems. By utilizing the so-called distributed subgradient methods to solve the dual problem, our approach eliminates the need for central coordination in updating the dual variables, which is often required in classic Lagrangian methods. Our focus is to understand the performance of this distributed algorithm when the number of resources is unknown and may be time-varying. In particular, we obtain an upper bound on the convergence rate of the algorithm to the optimal value, in expectation, as a function of the size and the topology of the underlying network. The effectiveness of the proposed method is demonstrated by its application to the economic dispatch problem in power systems, with simulations completed on the benchmark IEEE-14 and IEEE-118 bus test systems.
Submitted 2 August, 2018; v1 submitted 11 August, 2017;
originally announced August 2017.
-
On the convergence rate of distributed gradient methods for finite-sum optimization under communication delays
Authors:
Thinh T. Doan,
Carolyn L. Beck,
R. Srikant
Abstract:
Motivated by applications in machine learning and statistics, we study distributed optimization problems over a network of processors, where the goal is to optimize a global objective composed of a sum of local functions. In these problems, due to the large scale of the data sets, the data and computation must be distributed over processors, resulting in the need for distributed algorithms. In this paper, we consider a popular distributed gradient-based consensus algorithm, which only requires local computation and communication. An important problem in this area is to analyze the convergence rate of such algorithms in the presence of communication delays, which are inevitable in distributed systems. We prove the convergence of the gradient-based consensus algorithm in the presence of uniform, but possibly arbitrarily large, communication delays between the processors. Moreover, we obtain an upper bound on the rate of convergence of the algorithm as a function of the network size, topology, and the inter-processor communication delays.
Submitted 11 May, 2019; v1 submitted 10 August, 2017;
originally announced August 2017.
-
On the geometric convergence rate of distributed economic dispatch/demand response in power networks
Authors:
Thinh T. Doan,
Alex Olshevsky
Abstract:
Motivated by potential applications in power systems, we study a problem of optimizing a sum of $n$ convex functions on dynamic networks of $n$ nodes when each function is known to only a single node. The nodes' variables, while satisfying their local constraints, are coupled through a linear constraint. Our main contribution is to design a fully distributed primal-dual method for this problem. Under some fairly standard assumptions on the objective functions, namely strong convexity and smoothness, we provide an explicit analysis of the convergence rate of our method on different networks. In particular, the nodes' variables achieve geometric convergence to the optimal solution, with the associated convergence time scaling quartically in the number of nodes on any sequence of time-varying undirected graphs satisfying a long-term connectivity condition. Moreover, this convergence time is a constant independent of the number of nodes when the network is a b-regular simple graph with $b\geq 3$. Finally, to show the effectiveness of our method, we also present a number of simulation studies on economic dispatch problems and demand response problems in power systems.
Submitted 30 September, 2016; v1 submitted 21 September, 2016;
originally announced September 2016.
-
Distributed Lagrangian Methods for Network Resource Allocation
Authors:
Thinh T. Doan,
Carolyn L. Beck
Abstract:
Motivated by a variety of applications in control engineering and information sciences, we study network resource allocation problems where the goal is to optimally allocate a fixed amount of resource over a network of nodes. In these problems, due to the large scale of the network and complicated inter-connections between nodes, any solution must be implemented in parallel and based only on local data, resulting in a need for distributed algorithms. In this paper, we propose a novel distributed Lagrangian method, which requires only local computation and communication. Our focus is to understand how the performance of this algorithm depends on the underlying network topology. Specifically, we obtain an upper bound on the rate of convergence of the algorithm as a function of the size and the topology of the underlying network. The effectiveness and applicability of the proposed method are demonstrated by its use in solving the important economic dispatch problem in power systems, specifically on the benchmark IEEE-14 and IEEE-118 bus systems.
Submitted 24 August, 2017; v1 submitted 20 September, 2016;
originally announced September 2016.
-
Distributed Resource Allocation on Dynamic Networks in Quadratic Time
Authors:
Thinh T. Doan,
Alex Olshevsky
Abstract:
We consider the problem of allocating a fixed amount of resource among nodes in a network when each node suffers a cost which is a convex function of the amount of resource allocated to it. We propose a new deterministic and distributed protocol for this problem. Our main result is that the associated convergence time for the global objective scales quadratically in the number of nodes on any sequence of time-varying undirected graphs satisfying a long-term connectivity condition.
Submitted 12 June, 2016; v1 submitted 28 July, 2015;
originally announced July 2015.