-
Deep Reinforcement Learning: A Convex Optimization Approach
Authors:
Ather Gattami
Abstract:
In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm where, for each episode, we use convex optimization to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights calculated at each episode are optimal with respect to the given sampled states and actions of the current episode. For stable nonlinear systems, we show that the algorithm converges and that the converging parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. In particular, if the regularization parameter in the training phase is given by $ρ$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ and the optimal parameters $w^\star$ is bounded by $\mathcal{O}(ρ)$. That is, when the number of episodes goes to infinity, there exists a constant $C$ such that
\[ \|w-w^\star\| \le Cρ. \]
In particular, our algorithm converges arbitrarily close to the optimal neural network parameters as the regularization parameter goes to zero. As a consequence, our algorithm converges fast due to the polynomial-time convergence of convex optimization algorithms.
Submitted 24 June, 2024; v1 submitted 29 February, 2024;
originally announced February 2024.
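As a loose illustration of the $\mathcal{O}(ρ)$ statement in the abstract above, the sketch below uses scalar ridge regression. It is a toy analogue and not the paper's algorithm: the data, the function names `ridge_weight` and `bound_constant`, and the choice of ridge regression are all assumptions made for illustration. In this toy setting the regularized weight $w$ and the unregularized weight $w^\star$ satisfy $|w-w^\star| \le Cρ$ with $C = |w^\star| / \sum_i x_i^2$.

```python
# Toy analogue only: scalar ridge regression, NOT the algorithm from the paper.

def ridge_weight(xs, ys, rho):
    """Minimizer of sum((y - w*x)^2) + rho * w^2 over a scalar w."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + rho)

def bound_constant(xs, ys):
    """Constant C such that |w_rho - w_star| <= C * rho in this toy problem."""
    sxx = sum(x * x for x in xs)
    w_star = ridge_weight(xs, ys, 0.0)
    return abs(w_star) / sxx

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8]   # hypothetical targets
w_star = ridge_weight(xs, ys, 0.0)
C = bound_constant(xs, ys)

# The gap to the unregularized weight shrinks with rho and stays below C * rho.
for rho in (1.0, 0.1, 0.01):
    gap = abs(ridge_weight(xs, ys, rho) - w_star)
    print(f"rho={rho:<5} gap={gap:.6f} C*rho={C * rho:.6f}")
```

In this toy problem the gap equals $ρ\,|w^\star| / (\sum_i x_i^2 + ρ)$, so it vanishes linearly as $ρ \to 0$.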
-
Decentralized Online Bandit Optimization on Directed Graphs with Regret Bounds
Authors:
Johan Östman,
Ather Gattami,
Daniel Gillblad
Abstract:
We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit-reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.
Submitted 27 January, 2023;
originally announced January 2023.
-
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
Authors:
Qinbo Bai,
Vaneet Aggarwal,
Ather Gattami
Abstract:
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any $γ\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve an $O(T^{1/2+γ})$ regret bound for the obtained reward and an $O(T^{1-γ/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDPs with long-term constraints where the transition probabilities are not known a priori.
Submitted 30 January, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
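The trade-off between the two regret bounds quoted in the abstract above can be tabulated with a few lines of Python. This is only a sketch of the stated exponents $1/2+γ$ and $1-γ/2$; the function name `regret_exponents` is hypothetical, and the constants hidden in the $O(\cdot)$ notation are not modeled.

```python
def regret_exponents(gamma):
    """Exponents of T in the reward and constraint-violation bounds."""
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma must lie in the open interval (0, 1/2)")
    reward_exp = 0.5 + gamma          # from O(T^{1/2 + gamma})
    violation_exp = 1.0 - gamma / 2   # from O(T^{1 - gamma/2})
    return reward_exp, violation_exp

for gamma in (0.1, 0.25, 0.4):
    r, v = regret_exponents(gamma)
    print(f"gamma={gamma}: reward ~ T^{r:.3f}, violation ~ T^{v:.3f}")
```

Both exponents stay below 1 for every admissible $γ$, so both bounds are sublinear in $T$. Increasing $γ$ loosens the reward bound and tightens the constraint-violation bound.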
-
Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints
Authors:
Qinbo Bai,
Vaneet Aggarwal,
Ather Gattami
Abstract:
In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) for the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(ε,p)$-PAC policy when the number of episodes satisfies $K\geqΩ(\frac{I^2H^6SA\ell}{ε^2})$, where $S$ and $A$ are the number of states and actions, respectively, $H$ is the number of epochs per episode, $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first result on PAC-type analysis for PCMDPs with peak constraints where the transition dynamics are not known a priori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.
Submitted 13 June, 2022; v1 submitted 11 March, 2020;
originally announced March 2020.
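To get a feel for how the episode requirement in the abstract above scales, the following sketch evaluates only the term inside the $Ω(\cdot)$, namely $I^2H^6SA\ell/ε^2$ with $\ell=\log(SAT/p)$. The function name `pac_episode_scaling` and all numeric inputs are hypothetical. The abstract does not define $T$, so it is treated here as a free input, and the hidden constant is unknown, so the result is a scaling rather than an episode count.

```python
import math

def pac_episode_scaling(I, H, S, A, T, eps, p):
    """Return I^2 * H^6 * S * A * l / eps^2 with l = log(S*A*T/p).

    Only the term inside Omega(.) is computed; the hidden constant is not
    given in the abstract, so this is a scaling, not a number of episodes.
    """
    ell = math.log(S * A * T / p)
    return (I ** 2) * (H ** 6) * S * A * ell / (eps ** 2)

# Hypothetical problem sizes, chosen only to show relative scaling.
base = pac_episode_scaling(I=1, H=5, S=10, A=4, T=10_000, eps=0.1, p=0.05)
halved_eps = pac_episode_scaling(I=1, H=5, S=10, A=4, T=10_000, eps=0.05, p=0.05)
doubled_H = pac_episode_scaling(I=1, H=10, S=10, A=4, T=10_000, eps=0.1, p=0.05)

print(halved_eps / base)  # effect of halving eps
print(doubled_H / base)   # effect of doubling H
```

With $T$ held fixed, halving $ε$ multiplies the term by 4 and doubling $H$ multiplies it by 64, which shows how strongly the horizon length dominates the requirement.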
-
Conditional Mutual information-based Contrastive Loss for Financial Time Series Forecasting
Authors:
Hanwei Wu,
Ather Gattami,
Markus Flierl
Abstract:
We present a representation learning framework for financial time series forecasting. One challenge of using deep learning models for financial forecasting is the shortage of available training data. Direct trend classification using deep neural networks trained on small datasets is susceptible to the overfitting problem. In this paper, we propose to first learn compact representations from time series data, then use the learned representations to train a simpler model for predicting time series movements. We consider a class-conditioned latent variable model. We train an encoder network to maximize the mutual information between the latent variables and the trend information conditioned on the encoded observed variables. We show that conditional mutual information maximization can be approximated by a contrastive loss. Then, the problem is transformed into a classification task of determining whether two encoded representations are sampled from the same class or not. This is equivalent to performing pairwise comparisons of the training datapoints, and thus improves the generalization ability of the encoder network. We use deep autoregressive models as our encoder to capture long-term dependencies of the sequence data. Empirical experiments indicate that our proposed method has the potential to advance state-of-the-art performance.
Submitted 7 May, 2021; v1 submitted 18 February, 2020;
originally announced February 2020.
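The pairwise view described in the abstract above, where the task is to decide whether two samples come from the same class, can be sketched as follows. This shows only how same-class targets might be built from trend labels. The function name `same_class_pairs` and the example labels are hypothetical, and the encoder and the contrastive loss themselves are not implemented.

```python
from itertools import combinations

def same_class_pairs(labels):
    """Yield (i, j, target) for every unordered pair of sample indices.

    target is 1 when both samples share a class label, else 0.
    """
    for i, j in combinations(range(len(labels)), 2):
        yield i, j, int(labels[i] == labels[j])

trend_labels = ["up", "down", "up", "down"]  # hypothetical trend classes
pairs = list(same_class_pairs(trend_labels))
print(len(pairs), "pairs from", len(trend_labels), "samples")
print(pairs)
```

A dataset of $n$ labeled samples yields $n(n-1)/2$ such pairs.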
-
Reinforcement Learning of Markov Decision Processes with Peak Constraints
Authors:
Ather Gattami
Abstract:
In this paper, we consider reinforcement learning of Markov Decision Processes (MDPs) with peak constraints, where an agent chooses a policy to optimize an objective and at the same time satisfy additional constraints. The agent has to take actions based on the observed states, reward outputs, and constraint outputs, without any knowledge of the dynamics, the reward functions, or the constraint functions. We introduce a game-theoretic approach to construct reinforcement learning algorithms where the agent maximizes an unconstrained objective that depends on the simulated action of a minimizing opponent, which acts on a finite set of actions and the output data of the constraint functions (rewards). We show that the policies obtained from maximin Q-learning converge to the optimal policies. To the best of our knowledge, this is the first time learning algorithms guarantee convergence to optimal stationary policies for the MDP problem with peak constraints for both discounted and expected average rewards.
Submitted 6 December, 2019; v1 submitted 23 January, 2019;
originally announced January 2019.
-
Communicating One Bit over a Delay Constrained Gaussian MIMO Channel with Feedback
Authors:
Bo Bernhardsson,
Ather Gattami
Abstract:
The energy-optimal scheme is found for communicating one bit over a memoryless Gaussian channel with an ideal feedback channel. It is assumed that the channel is allowed to be used at most N times before decoding. The optimal coding/decoding strategy is derived by dynamic programming. It is found that feedback gives a significant performance gain and that the optimal strategies are discontinuous. It is also shown that most of the performance increase can be obtained even with a one-bit feedback channel. The optimal scheme is compared with the strategy by Kailath-Schalkwijk and is found to be significantly more effective. For the case of a diagonal MIMO channel where measurement noise variances are equal along the subchannels, we also show that the problem can be reduced to the previous case of transmitting one bit over a scalar feedback channel.
Submitted 15 May, 2016;
originally announced May 2016.
-
Feedback Capacity of Gaussian Channels Revisited
Authors:
Ather Gattami
Abstract:
In this paper, we revisit the problem of finding the average capacity of the Gaussian feedback channel. First, we consider the problem of finding the average capacity of the analog Gaussian noise channel where the noise has an arbitrary spectral density. We introduce a new approach to the problem where we solve the problem over a finite number of transmissions and then consider the limit of an infinite number of transmissions. Further, we consider the important special case of stationary Gaussian noise with finite memory. We show that the channel capacity at stationarity can be found by solving a semi-definite program, and is hence computationally tractable. We also give new proofs and results for the nonstationary solution, which bridge the gap between results in the literature for the stationary and nonstationary feedback channel capacities. It is shown that a linear communication feedback strategy is optimal. Similar to the solution of the stationary problem, it is shown that the optimal linear strategy is to transmit a linear combination of the information symbols to be communicated and the innovations for the estimation error of the state of the noise process.
Submitted 23 January, 2019; v1 submitted 21 November, 2015;
originally announced November 2015.
-
Optimal Communication of States of Dynamical Systems over Gaussian Channels with Noisy Feedback: The Scalar Case
Authors:
Ather Gattami
Abstract:
We consider the problem of communicating the state of a dynamical system via a Shannon Gaussian channel. The receiver, which acts as both a decoder and estimator, observes the noisy measurement of the channel output and makes an optimal estimate of the state of the dynamical system in the minimum mean square sense. Noisy feedback from the receiver to the transmitter is present. The transmitter observes the noise-corrupted feedback message from the receiver together with a possibly noisy measurement of the state of the dynamical system. These measurements are then used to encode the message to be transmitted over a noisy Gaussian channel, where a per-symbol power constraint is imposed on the transmitted message. Thus, we get a mixed problem of Shannon's source-channel coding problem and a sort of Kalman filtering problem. In particular, we consider two feedback instances, one being feedback of receiver measurements and the second being the receiver's state estimates. We show that optimal encoders and decoders are linear filters with a finite memory, and we give explicitly the state space realizations of the optimal filters. For the case where the transmitter has access to noisy measurements of the state, we derive a separation principle for the optimal communication scheme. Furthermore, we investigate the presence of noiseless feedback or no feedback from the receiver to the transmitter. Necessary and sufficient conditions for the existence of a stationary solution are also given for the feedback cases considered.
Submitted 1 June, 2015;
originally announced June 2015.
-
Time Localization and Capacity of Faster-Than-Nyquist Signaling
Authors:
Ather Gattami,
Emil Ringh,
Johan Karlsson
Abstract:
In this paper, we consider communication over the bandwidth-limited analog white Gaussian noise channel using non-orthogonal pulses. In particular, we consider non-orthogonal transmission by signaling samples at a rate higher than the Nyquist rate. Using the faster-than-Nyquist (FTN) framework, Mazo showed that one may transmit symbols carried by sinc pulses at a higher rate than that dictated by Nyquist without losing bit error rate. However, as we will show in this paper, such pulses are not necessarily well localized in time. In fact, assuming that signals in the FTN framework are well localized in time, one can construct a signaling scheme that violates the Shannon capacity bound. We also show directly that FTN signals are in general not well localized in time. Therefore, the results of Mazo do not imply that one can transmit more data per time unit without degrading performance in terms of error probability.
We also consider FTN signaling in the case of pulses that are different from the sinc pulses. We show that one can use a precoding scheme of low complexity to remove the inter-symbol interference. This leads to the possibility of increasing the number of transmitted samples per time unit and compensating for spectral inefficiency due to signaling at the Nyquist rate of the non-sinc pulses. We demonstrate the power of the precoding scheme by simulations.
Submitted 7 December, 2015; v1 submitted 13 May, 2015;
originally announced May 2015.
-
Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels
Authors:
Ather Gattami
Abstract:
We consider the problem of determining the power ratio between the training symbols and data symbols in order to maximize the channel capacity for transmission over uncertain channels with a channel estimate available at both the transmitter and receiver. The receiver makes an estimate of the channel by using a known sequence of training symbols. This channel estimate is then transmitted back to the transmitter. The capacity that the transceiver maximizes is the worst-case capacity, in the sense that given a noise covariance, the transceiver maximizes the minimal capacity over all distributions of the measurement noise under a fixed covariance matrix known at both the transmitter and receiver. We give an exact expression of the channel capacity as a function of the channel covariance matrix and the number of training symbols used during a coherence time interval. This expression determines the number of training symbols that need to be used by finding the optimal integer number of training symbols that maximizes the channel capacity. As a by-product, we show that linear filters are optimal at both the transmitter and receiver.
Submitted 12 May, 2015;
originally announced May 2015.
-
Kalman meets Shannon
Authors:
Ather Gattami
Abstract:
We consider the problem of communicating the state of a dynamical system via a Shannon Gaussian channel. The receiver, which acts as both a decoder and estimator, observes the noisy measurement of the channel output and makes an optimal estimate of the state of the dynamical system in the minimum mean square sense. The transmitter observes a possibly noisy measurement of the state of the dynamical system. These measurements are then used to encode the message to be transmitted over a noisy Gaussian channel, where a per-sample power constraint is imposed on the transmitted message. Thus, we get a mixed problem of Shannon's source-channel coding problem and a sort of Kalman filtering problem. We first consider the problem of communication with full state measurements at the transmitter and show that optimal linear encoders do not need memory and that the optimal linear decoders have an order of at most the state dimension. We also give explicitly the structure of the optimal linear filters. For the case where the transmitter has access to noisy measurements of the state, we derive a separation principle for the optimal communication scheme, where the transmitter needs a filter with an order of at most the dimension of the state of the dynamical system. The results are derived for first-order linear dynamical systems, but may be extended to MIMO systems with arbitrary order.
Submitted 12 May, 2015; v1 submitted 16 April, 2014;
originally announced April 2014.