Policy Regularization with Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Hu, Siyue; Hu, Jian; Wang, Weixun; Liao, Shih-wei

arXiv:2106.14334v6 (cs)

[Submitted on 27 Jun 2021 (v1), revised 2 Aug 2021 (this version, v6), latest version 8 Jun 2023 (v14)]

Abstract: Multi-Agent Reinforcement Learning (MARL) has seen major breakthroughs through its application to cooperative multi-agent tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent work has applied Proximal Policy Optimization (PPO) to multi-agent tasks, an approach called Multi-agent PPO (MAPPO). However, previous literature shows that vanilla MAPPO with a shared value function may not perform as well as Independent PPO (IPPO) or finetuned QMIX. MAPPO-agent-specific (MAPPO-AS) improves on vanilla MAPPO and IPPO by adding artificial agent-specific features. In addition, no prior work gives a theoretical analysis of how MAPPO works. In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, showing that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Second, we find that vanilla MAPPO suffers from Policies Overfitting in Multi-agent Cooperation (POMAC), because the agents learn their policies from sampled centralized advantage values. POMAC may cause the policies of some agents to be updated in a suboptimal direction and prevent the agents from exploring better trajectories. To solve the POMAC problem, we propose novel policy regularization methods, namely Noisy-MAPPO and Advantage-Noisy-MAPPO, which smooth out the advantage values with noise. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX and MAPPO-AS, and much better than that of vanilla MAPPO. We open-source the code at this https URL.
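
The idea of smoothing advantage values with noise can be illustrated with a minimal sketch. The abstract does not specify how the noise is injected, so everything below is an assumption made for illustration: the helper names `noisy_advantages` and `ppo_clip_objective`, the Gaussian noise, the mixing weight `alpha`, and the scale `sigma` are hypothetical and are not taken from the paper or its released code. The sketch turns one shared (centralized) advantage per timestep into slightly different per-agent advantages, then feeds them to the standard PPO clipped surrogate.

```python
import numpy as np


def noisy_advantages(adv, n_agents, sigma=1.0, alpha=0.05, rng=None):
    """Hypothetical helper: perturb shared advantages into per-agent values.

    adv      : sequence of T shared (centralized) advantage estimates
    n_agents : number of cooperating agents
    sigma    : std of the Gaussian noise (assumed, not from the paper)
    alpha    : mixing weight of the noise (assumed, not from the paper)
    Returns an array of shape (T, n_agents).
    """
    rng = np.random.default_rng() if rng is None else rng
    adv = np.asarray(adv, dtype=np.float64)
    # Normalize the shared advantages, as is common in PPO implementations.
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    # Independent noise per timestep and per agent.
    noise = rng.normal(0.0, sigma, size=(adv.shape[0], n_agents))
    # Each agent sees a slightly different advantage for the same sample.
    return (1.0 - alpha) * adv[:, None] + alpha * noise


def ppo_clip_objective(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate, averaged over timesteps and agents."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared_adv = [0.5, -1.0, 2.0, 0.1]
    per_agent_adv = noisy_advantages(shared_adv, n_agents=3, rng=rng)
    ratio = np.ones_like(per_agent_adv)  # pi_new / pi_old before any update
    print(per_agent_adv.shape, ppo_clip_objective(ratio, per_agent_adv))
```

With `alpha=0` every agent receives the same normalized advantage, which corresponds to the shared-advantage setting the abstract associates with POMAC; a small positive `alpha` gives each agent a different training signal for the same sampled trajectory.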

Comments:	fix some errors
Subjects:	Multiagent Systems (cs.MA)
Cite as:	arXiv:2106.14334 [cs.MA]
	(or arXiv:2106.14334v6 [cs.MA] for this version)
	https://doi.org/10.48550/arXiv.2106.14334

Submission history

From: Jian Hu [view email]
[v1] Sun, 27 Jun 2021 22:50:35 UTC (732 KB)
[v2] Thu, 1 Jul 2021 06:02:59 UTC (733 KB)
[v3] Tue, 13 Jul 2021 12:24:26 UTC (761 KB)
[v4] Mon, 19 Jul 2021 13:10:15 UTC (759 KB)
[v5] Thu, 22 Jul 2021 13:32:39 UTC (768 KB)
[v6] Mon, 2 Aug 2021 14:01:18 UTC (799 KB)
[v7] Sun, 22 Aug 2021 23:48:07 UTC (1,716 KB)
[v8] Wed, 1 Sep 2021 23:56:31 UTC (1,715 KB)
[v9] Wed, 8 Sep 2021 21:12:49 UTC (1,715 KB)
[v10] Mon, 13 Sep 2021 00:46:27 UTC (1,712 KB)
[v11] Fri, 1 Oct 2021 00:51:10 UTC (1,713 KB)
[v12] Sat, 23 Oct 2021 11:21:34 UTC (1,714 KB)
[v13] Thu, 11 Nov 2021 09:35:41 UTC (1,713 KB)
[v14] Thu, 8 Jun 2023 07:44:23 UTC (1,714 KB)
