-
Reward-Safety Balance in Offline Safe RL via Diffusion Regularization
Authors:
Junyu Guo,
Zhi Zheng,
Donghao Ying,
Ming Jin,
Shangding Gu,
Costas Spanos,
Javad Lavaei
Abstract:
Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints. We focus on an offline setting where the agent has only a fixed dataset -- common in realistic tasks to prevent unsafe exploration. To address this, we propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL), which first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference. We further apply gradient manipulation for safety adaptation, balancing the reward objective and constraint satisfaction. This approach leverages high-quality offline data while incorporating safety requirements. Empirical results show that DRCORL achieves reliable safety performance, fast inference, and strong reward outcomes across robot learning tasks. Compared to existing safe offline RL methods, it consistently meets cost limits and performs well with the same hyperparameters, indicating practical applicability in real-world scenarios.
Submitted 17 February, 2025;
originally announced February 2025.
-
Subsampled Ensemble Can Improve Generalization Tail Exponentially
Authors:
Huajie Qian,
Donghao Ying,
Henry Lam,
Wotao Yin
Abstract:
Ensemble learning is a popular technique to improve the accuracy of machine learning models. It traditionally hinges on the rationale that aggregating multiple weak models can lead to better models with lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on ensembling. By selecting the best model trained on subsamples via majority voting, we can attain exponentially decaying tails for the excess risk, even if the base learner suffers from slow (i.e., polynomial) decay rates. This tail enhancement power of ensembling is agnostic to the underlying base learner and is stronger than variance reduction in the sense of exhibiting rate improvement. We demonstrate how our ensemble methods can substantially improve out-of-sample performances in a range of numerical examples involving heavy-tailed data or intrinsically slow rates. Code for the proposed methods is available at https://github.com/mickeyhqian/VoteEnsemble.
Submitted 1 February, 2025; v1 submitted 23 May, 2024;
originally announced May 2024.
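The majority-vote selection described in the abstract above can be illustrated with a small sketch. This is not the API of the VoteEnsemble package; the names `base_learner` and `majority_vote_ensemble`, the finite candidate set, and the absolute-deviation loss are all assumptions made for illustration.

```python
import random
from collections import Counter


def base_learner(sample, candidates):
    """Hypothetical base learner: return the candidate with the lowest
    mean absolute deviation from the sample (an assumed loss)."""
    return min(candidates, key=lambda c: sum(abs(x - c) for x in sample) / len(sample))


def majority_vote_ensemble(data, candidates, num_subsamples, subsample_size, rng):
    """Train the base learner on random subsamples and return the model
    that is selected most often (majority vote over subsample outputs)."""
    votes = Counter()
    for _ in range(num_subsamples):
        # Draw a subsample without replacement from the full dataset.
        subsample = rng.sample(data, subsample_size)
        votes[base_learner(subsample, candidates)] += 1
    # most_common(1) returns a list with the single (model, count) pair.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative data only: mostly near 1.0 with a few large outliers.
    data = [1.0] * 45 + [50.0] * 5
    print(majority_vote_ensemble(data, [0.0, 1.0, 2.0], 20, 10, rng))
```

Voting requires that subsample outputs can coincide, which is why the sketch assumes a finite candidate set; how the paper handles continuous model spaces is not stated in the abstract.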
-
Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities
Authors:
Donghao Ying,
Yunkai Zhang,
Yuhao Ding,
Alec Koppel,
Javad Lavaei
Abstract:
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by general utilities, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitation. The exponential growth of the state-action space size with the number of agents presents challenges for global observability, further exacerbated by the global coupling arising from agents' safety constraints. To tackle this issue, we propose a primal-dual method utilizing shadow reward and $κ$-hop neighbor truncation under a form of correlation decay property, where $κ$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(ε^{-3.5}\right)$ samples to achieve an $ε$-FOSP with an approximation error of $\mathcal{O}(φ_0^{2κ})$, where $φ_0\in (0,1)$. Finally, we demonstrate the effectiveness of our model through extensive numerical experiments.
Submitted 27 May, 2023;
originally announced May 2023.
-
No-Regret Learning in Dynamic Competition with Reference Effects Under Logit Demand
Authors:
Mengzi Amy Guo,
Donghao Ying,
Javad Lavaei,
Zuo-Jun Max Shen
Abstract:
This work addresses algorithm design in a competitive framework, with the primary goal of learning a stable equilibrium. We consider the dynamic price competition between two firms operating within an opaque marketplace, where each firm lacks information about its competitor. The demand follows the multinomial logit (MNL) choice model, which depends on the consumers' observed price and their reference price, and consecutive periods in the repeated games are connected by reference price updates. We use the notion of stationary Nash equilibrium (SNE), defined as the fixed point of the equilibrium pricing policy for the single-period game, to simultaneously capture the long-run market equilibrium and stability. We propose the online projected gradient ascent algorithm (OPGA), where the firms adjust prices using the first-order derivatives of their log-revenues that can be obtained from the market feedback mechanism. Despite the absence of typical properties required for the convergence of online games, such as strong monotonicity and variational stability, we demonstrate that under diminishing step-sizes, the price and reference price paths generated by OPGA converge to the unique SNE, thereby achieving no-regret learning and a stable market. Moreover, with appropriate step-sizes, we prove that this convergence exhibits a rate of $\mathcal{O}(1/t)$.
Submitted 27 May, 2023;
originally announced May 2023.
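The price-update loop described in the abstract above (projected gradient ascent on each firm's log-revenue, with a reference price linking periods) might look roughly like the sketch below. The abstract does not give the model's functional forms, so the utility `a - b*p + c*(r - p)`, the exponential-smoothing reference update toward the average posted price, the step size `1/t`, and all parameter names are assumptions for illustration only.

```python
import math


def mnl_demand(prices, ref_price, a, b, c):
    """Assumed MNL demand for two firms with an outside option of utility 0."""
    utils = [a[i] - b[i] * prices[i] + c[i] * (ref_price - prices[i]) for i in range(2)]
    denom = 1.0 + sum(math.exp(u) for u in utils)
    return [math.exp(u) / denom for u in utils]


def opga(p0, r0, a, b, c, alpha, p_lo, p_hi, steps):
    """Sketch of online projected gradient ascent on log-revenues."""
    prices, ref = list(p0), r0
    for t in range(1, steps + 1):
        eta = 1.0 / t  # diminishing step size (assumed schedule)
        demand = mnl_demand(prices, ref, a, b, c)
        # Under the assumed utility, with the reference price held fixed:
        # d/dp_i log(p_i * d_i) = 1/p_i - (b_i + c_i) * (1 - d_i).
        # Each firm uses only its own price and its own realized demand.
        grads = [1.0 / prices[i] - (b[i] + c[i]) * (1.0 - demand[i]) for i in range(2)]
        # Gradient step followed by projection onto the interval [p_lo, p_hi].
        new_prices = [min(p_hi, max(p_lo, prices[i] + eta * grads[i])) for i in range(2)]
        # Assumed reference update: smooth toward the average price just posted.
        ref = alpha * ref + (1.0 - alpha) * sum(prices) / 2.0
        prices = new_prices
    return prices, ref


if __name__ == "__main__":
    print(opga([1.0, 1.0], 1.0, [1.0, 1.0], [1.0, 1.0], [0.5, 0.5], 0.8, 0.1, 10.0, 500))
```

The lower bound `p_lo` must be strictly positive so that the `1/p_i` term stays finite; the sketch makes no claim about convergence under these assumed forms.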
-
Scalable Multi-Agent Reinforcement Learning with General Utilities
Authors:
Donghao Ying,
Yuhao Ding,
Alec Koppel,
Javad Lavaei
Abstract:
We study scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring full observability of each agent in the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. Our algorithm converges, with high probability, to $ε$-stationarity with $\widetilde{\mathcal{O}}(ε^{-2})$ samples up to some approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
Submitted 26 August, 2023; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction
Authors:
Donghao Ying,
Mengzi Amy Guo,
Hyunin Lee,
Yuhao Ding,
Javad Lavaei,
Zuo-Jun Max Shen
Abstract:
We study Concave Constrained Markov Decision Processes (Concave CMDPs) where both the objective and constraints are defined as concave functions of the state-action occupancy measure. We propose the Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG), which updates the primal variable via policy gradient ascent and the dual variable via projected sub-gradient descent. Despite the challenges posed by the loss of additivity structure and the nonconcave nature of the problem, we establish the global convergence of VR-PDPG by exploiting a form of hidden concavity. In the exact setting, we prove an $O(T^{-1/3})$ convergence rate for both the average optimality gap and constraint violation, which further improves to $O(T^{-1/2})$ under strong concavity of the objective in the occupancy measure. In the sample-based setting, we demonstrate that VR-PDPG achieves an $\widetilde{O}(ε^{-4})$ sample complexity for $ε$-global optimality. Moreover, by incorporating a diminishing pessimistic term into the constraint, we show that VR-PDPG can attain a zero constraint violation without compromising the convergence rate of the optimality gap. Finally, we validate the effectiveness of our methods through numerical experiments.
Submitted 26 May, 2024; v1 submitted 21 May, 2022;
originally announced May 2022.
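The update pattern named in the abstract above (gradient ascent on the primal variable, projected sub-gradient descent on the dual variable) can be seen on a one-dimensional toy problem. This is not VR-PDPG: there is no policy, occupancy measure, or variance reduction here. The problem (maximize $-(x-2)^2$ subject to $1 - x \ge 0$), the function name `toy_primal_dual`, and the step size are assumptions chosen only to show the two alternating updates.

```python
def toy_primal_dual(steps=2000, eta=0.05):
    """Primal ascent / projected dual descent on the Lagrangian
    L(x, lam) = f(x) + lam * g(x), with f(x) = -(x - 2)^2 and g(x) = 1 - x."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the Lagrangian in x: f'(x) + lam * g'(x).
        grad_x = -2.0 * (x - 2.0) - lam
        # Constraint value at the current iterate (feasible when g >= 0).
        g = 1.0 - x
        # Primal variable: gradient ascent step.
        x = x + eta * grad_x
        # Dual variable: descent step, then projection onto lam >= 0.
        lam = max(0.0, lam - eta * g)
    return x, lam


if __name__ == "__main__":
    print(toy_primal_dual())
```

For this toy problem the saddle point is $x = 1$, $\lambda = 2$: the multiplier grows while the constraint is violated and settles where the objective's pull toward $x = 2$ is balanced.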
-
A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization
Authors:
Donghao Ying,
Yuhao Ding,
Javad Lavaei
Abstract:
We study entropy-regularized constrained Markov decision processes (CMDPs) under the soft-max parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging the entropy regularization, our theoretical analysis shows that its Lagrangian dual function is smooth and the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an accelerated dual-descent method for entropy-regularized CMDPs. We prove that our method achieves the global convergence rate $\widetilde{\mathcal{O}}(1/T)$ for both the optimality gap and the constraint violation for entropy-regularized CMDPs. A discussion about a linear convergence rate for CMDPs with a single constraint is also provided.
Submitted 7 April, 2023; v1 submitted 17 October, 2021;
originally announced October 2021.
-
Kronecker Product Correlation Model and Limited Feedback Codebook Design in a 3D Channel Model
Authors:
Dawei Ying,
Frederick W. Vook,
Timothy A. Thomas,
David J. Love,
Amitava Ghosh
Abstract:
A 2D antenna array introduces a new level of control and additional degrees of freedom in multiple-input-multiple-output (MIMO) systems, particularly for the so-called "massive MIMO" systems. To accurately assess the performance gains of these large arrays, existing azimuth-only channel models have been extended to handle 3D channels by modeling both the elevation and azimuth dimensions. In this paper, we study the channel correlation matrix of a generic ray-based 3D channel model, and our analysis and simulation results demonstrate that the 3D correlation matrix can be well approximated by a Kronecker product of azimuth and elevation correlations. This finding provides theoretical support for the use of a product codebook for reduced-complexity feedback from the receiver to the transmitter. We also present the design of a product codebook based on Grassmannian line packing.
Submitted 13 January, 2014;
originally announced January 2014.
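To make the Kronecker structure mentioned in the abstract above concrete, the sketch below builds a full correlation matrix from two small per-dimension correlation matrices. The 2x2 example values, the ordering (azimuth factor first), and the helper name `kron` are assumptions for illustration; the paper's actual antenna indexing and correlation values are not given in the abstract.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows.
    Entry (i*p + k, j*q + l) of the result equals a[i][j] * b[k][l],
    where b has p rows and q columns."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


if __name__ == "__main__":
    # Illustrative correlation matrices (made-up values).
    r_azimuth = [[1.0, 0.5], [0.5, 1.0]]
    r_elevation = [[1.0, 0.2], [0.2, 1.0]]
    r_full = kron(r_azimuth, r_elevation)
    for row in r_full:
        print(row)
```

With this structure, a 4x4 correlation matrix is described by two 2x2 factors, which is the kind of reduction that motivates a product codebook.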