-
Learning from Delayed Feedback in Games via Extra Prediction
Authors:
Yuma Fujimoto,
Kenshi Abe,
Kaito Ariu
Abstract:
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study first proves that even a single-step delay worsens the performance of OFTRL in terms of both regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the favorable guarantees: the regret is constant ($O(1)$-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium along a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
Submitted 26 September, 2025;
originally announced September 2025.
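The weighting idea in the abstract above can be illustrated with a minimal sketch. It assumes an entropy regularizer (so the update is a softmax), uses the most recently observed reward vector as the prediction, and ignores how the delay itself is modeled; none of these choices, nor the names `weighted_oftrl_step`, `eta`, and `n`, are confirmed by the abstract, so this is not the authors' exact update.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_oftrl_step(cumulative_reward, last_observed_reward, eta, n):
    """One entropy-regularized step (assumed form, hypothetical helper).

    The strategy is proportional to exp(eta * (cumulative + n * prediction)),
    where the prediction is taken to be the last observed reward vector.
    n = 0 gives plain FTRL, n = 1 gives standard OFTRL, and n > 1 puts
    extra weight on the prediction.
    """
    scores = [eta * (c + n * r)
              for c, r in zip(cumulative_reward, last_observed_reward)]
    return softmax(scores)

# Illustrative numbers only: two actions, the latest reward favors action 0.
cumulative = [1.0, 0.5]
last = [0.2, -0.1]
ftrl = weighted_oftrl_step(cumulative, last, eta=0.1, n=0)
oftrl = weighted_oftrl_step(cumulative, last, eta=0.1, n=1)
woftrl = weighted_oftrl_step(cumulative, last, eta=0.1, n=3)
```

A larger `n` shifts more probability toward the action favored by the latest observation, which is the sense in which the optimistic weight can compensate for rewards that have not yet arrived.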
-
Asymmetric Perturbation in Solving Bilinear Saddle-Point Optimization
Authors:
Kenshi Abe,
Mitsuki Sakamoto,
Kaito Ariu,
Atsushi Iwasaki
Abstract:
This paper proposes an asymmetric perturbation technique for solving saddle-point optimization problems, commonly arising in min-max problems, game theory, and constrained optimization. Perturbing payoffs or values is known to be effective in stabilizing learning dynamics and finding an exact solution or equilibrium. However, it requires careful adjustment of the perturbation magnitude; otherwise, learning dynamics converge to only an approximate equilibrium. We establish an impossibility result: the dynamics almost never reach an exact equilibrium as long as both players' payoff functions are perturbed. To overcome this, we introduce an asymmetric perturbation approach, where only one player's payoff function is perturbed. This ensures convergence to an equilibrium without requiring parameter adjustments, provided the perturbation strength parameter is sufficiently low. Furthermore, we empirically demonstrate fast convergence toward equilibria in both normal-form and extensive-form games.
Submitted 6 June, 2025;
originally announced June 2025.
-
Policy Testing in Markov Decision Processes
Authors:
Kaito Ariu,
Po-An Wang,
Alexandre Proutiere,
Kenshi Abe
Abstract:
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
Submitted 21 May, 2025;
originally announced May 2025.
-
Time-Varyingness in Auction Breaks Revenue Equivalence
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
Auctions are among the most representative buying-selling systems. A celebrated study shows that the seller's expected revenue is equal in equilibrium, regardless of the type of auction, typically first-price or second-price. Here, however, we hypothesize that when some auction environments vary with time, this revenue equivalence may not be maintained. In second-price auctions, the equilibrium strategy is robustly feasible. Conversely, in first-price auctions, the buyers must continue to adapt their strategies according to the environment of the auction. Surprisingly, we prove that revenue equivalence can be broken in both directions: first-price auctions yield larger or smaller revenue than second-price auctions, depending on how the value of an item varies. Our experiments also demonstrate revenue inequivalence in various scenarios, where the value varies periodically or randomly. This study uncovers a phenomenon, the breaking of revenue equivalence by the time-varyingness in auctions, that likely occurs in real-world auctions, and reveals its underlying mechanism.
Submitted 16 October, 2024;
originally announced October 2024.
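As background for the abstract above, the classical static revenue equivalence can be checked numerically. The sketch below covers only the textbook baseline, not the paper's time-varying setting: it assumes values drawn independently and uniformly from [0, 1], truthful bidding in the second-price auction, and the standard symmetric equilibrium bid (n-1)/n times the value in the first-price auction. The function name and parameters are hypothetical.

```python
import random

def simulate_revenues(num_bidders, num_auctions, seed=0):
    """Monte Carlo estimate of seller revenue in the static textbook case.

    Returns (first_price_revenue, second_price_revenue), averaged over
    num_auctions independent auctions with i.i.d. Uniform[0, 1] values.
    """
    rng = random.Random(seed)
    shade = (num_bidders - 1) / num_bidders  # equilibrium bid-shading factor
    first_total = 0.0
    second_total = 0.0
    for _ in range(num_auctions):
        values = sorted(rng.random() for _ in range(num_bidders))
        first_total += shade * values[-1]  # winner pays own shaded bid
        second_total += values[-2]         # winner pays second-highest value
    return first_total / num_auctions, second_total / num_auctions

first_price_revenue, second_price_revenue = simulate_revenues(3, 200_000)
```

With three bidders, both estimates should be close to the analytical value (n-1)/(n+1) = 0.5. The abstract's claim is that this equality can fail once the environment varies with time.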
-
Synchronization in Learning in Periodic Zero-Sum Games Triggers Divergence from Nash Equilibrium
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
Learning in zero-sum games studies a situation where multiple agents competitively learn their strategy. In such multi-agent learning, we often see that the strategies cycle around their optimum, i.e., Nash equilibrium. When a game periodically varies (called a "periodic" game), however, the Nash equilibrium moves generically. How learning dynamics behave in such periodic games is of interest but still unclear. Interestingly, we discover that the behavior is highly dependent on the relationship between two speeds: the speed at which the game changes and the speed at which players learn. We observe that when these two speeds synchronize, the learning dynamics diverge, and their time-average does not converge. Otherwise, the learning dynamics draw complicated cycles, but their time-average converges. Under some assumptions introduced for the dynamical systems analysis, we prove that this behavior occurs. Furthermore, our experiments observe this behavior even when these assumptions are removed. This study discovers a novel phenomenon, i.e., synchronization, and gains insight widely applicable to learning in periodic games.
Submitted 5 March, 2025; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Global Behavior of Learning Dynamics in Zero-Sum Games with Memory Asymmetry
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
This study examines the global behavior of dynamics in learning in games between two players, X and Y. We consider the simplest situation for memory asymmetry between two players: X memorizes Y's previous action and uses reactive strategies, while Y has no memory. Although this memory complicates their learning dynamics, we characterize the global behavior of such complex dynamics by discovering and analyzing two novel quantities. One is an extended Kullback-Leibler divergence from the Nash equilibrium, a well-known conserved quantity from previous studies. The other is a family of Lyapunov functions of X's reactive strategy. One of the global behaviors we capture is that if X exploits Y, then their strategies converge to the Nash equilibrium. Another is that if Y's strategy is out of equilibrium, then X becomes more exploitative with time. Consequently, we suggest global convergence to the Nash equilibrium both theoretically and experimentally. This study provides a novel characterization of the global behavior in learning in games through a couple of indicators.
Submitted 4 March, 2025; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Nash Equilibrium and Learning Dynamics in Three-Player Matching $m$-Action Games
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
Learning in games discusses the processes where multiple players learn their optimal strategies through the repetition of game plays. The dynamics of learning between two players in zero-sum games, such as Matching Pennies, where their benefits are competitive, have already been well analyzed. However, it is still unexplored and challenging to analyze the dynamics of learning among three players. In this study, we formulate a minimalistic game where three players compete to match their actions with one another. Although interaction among three players diversifies and complicates the Nash equilibria, we fully analyze the equilibria. We also discuss the dynamics of learning based on some famous algorithms categorized into Follow the Regularized Leader. From both theoretical and experimental aspects, we characterize the dynamics by categorizing three-player interactions into three forces to synchronize their actions, switch their actions rotationally, and seek competition.
Submitted 5 March, 2025; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
Learning in games considers how multiple agents maximize their own rewards through repeated games. Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because they exhibit complex phenomena like chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm in games with asymmetric memory capacities. To obtain theoretical insights into learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence across various initial strategies, action numbers, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, providing fundamental strides in learning in games and new insights into computing equilibria.
Submitted 16 February, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Learning in Multi-Memory Games Triggers Complex Dynamics Diverging from Nash Equilibrium
Authors:
Yuma Fujimoto,
Kaito Ariu,
Kenshi Abe
Abstract:
Repeated games consider a situation where multiple agents are motivated by their independent rewards throughout learning. In general, the dynamics of their learning become complex. Especially when their rewards compete with each other, as in zero-sum games, the dynamics often do not converge to their optimum, i.e., the Nash equilibrium. To tackle such complexity, many studies have understood various learning algorithms as dynamical systems and discovered qualitative insights among the algorithms. However, such studies have yet to handle multi-memory games (where agents can memorize actions they played in the past and choose their actions based on their memories), even though memorization plays a pivotal role in artificial intelligence and interpersonal relationships. This study extends two major learning algorithms in games, i.e., replicator dynamics and gradient ascent, into multi-memory games. Then, we prove their dynamics are identical. Furthermore, theoretically and experimentally, we clarify that the learning dynamics diverge from the Nash equilibrium in multi-memory zero-sum games and reach heteroclinic cycles (sojourn longer around the boundary of the strategy space), providing a fundamental advance in learning in games.
Submitted 22 May, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap
Authors:
Masahiro Kato,
Kaito Ariu,
Masaaki Imaizumi,
Masahiro Nomura,
Chao Qin
Abstract:
We consider fixed-budget best-arm identification in two-armed Gaussian bandit problems. One of the longstanding open questions is the existence of an optimal strategy under which the probability of misidentification matches a lower bound. We show that a strategy following the Neyman allocation rule (Neyman, 1934) is asymptotically optimal when the gap between the expected rewards is small. First, we review a lower bound derived by Kaufmann et al. (2016). Then, we propose the "Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)" strategy, which consists of the sampling rule using the Neyman allocation with an estimated standard deviation and the recommendation rule using an AIPW estimator. Our proposed strategy is optimal because the upper bound matches the lower bound when the budget goes to infinity and the gap goes to zero.
Submitted 28 December, 2022; v1 submitted 12 January, 2022;
originally announced January 2022.
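The Neyman allocation rule mentioned in the abstract above draws each arm in proportion to the standard deviation of its rewards. The following minimal sketch assumes the standard deviations are known, whereas the abstract's strategy uses estimated ones; the function names and the budget-splitting helper are hypothetical, and the AIPW recommendation rule is not shown.

```python
def neyman_allocation(std_a, std_b):
    """Target sampling proportions for two arms under Neyman allocation.

    Each arm's share is proportional to its reward standard deviation,
    so the noisier arm is sampled more often.
    """
    total = std_a + std_b
    if total <= 0:
        raise ValueError("at least one standard deviation must be positive")
    return std_a / total, std_b / total

def split_budget(budget, std_a, std_b):
    """Round the Neyman proportions to integer sample counts (hypothetical helper)."""
    share_a, _ = neyman_allocation(std_a, std_b)
    draws_a = round(budget * share_a)
    return draws_a, budget - draws_a

proportions = neyman_allocation(2.0, 1.0)
draws = split_budget(300, 2.0, 1.0)
```

With standard deviations 2 and 1, two thirds of the budget goes to the first arm. When the two arms are equally noisy, the rule reduces to uniform sampling.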
-
The Role of Contextual Information in Best Arm Identification
Authors:
Masahiro Kato,
Kaito Ariu
Abstract:
We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the "Track-and-Stop" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). We experimentally confirm that context information contributes to faster best-arm identification.
△ Less
Submitted 26 February, 2024; v1 submitted 26 June, 2021;
originally announced June 2021.