Showing 1–7 of 7 results for author: Nishimori, S

Searching in archive cs.

  1. arXiv:2507.08537  [pdf, ps, other]

    cs.LG math.CT

    Recursive Reward Aggregation

    Authors: Yuting Tang, Yivan Zhang, Johannes Ackermann, Yu-Jie Zhang, Soichiro Nishimori, Masashi Sugiyama

    Abstract: In reinforcement learning (RL), aligning agent behavior with specific objectives typically requires careful design of the reward function, which can be challenging when the desired objectives are complex. In this work, we propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function by selecting appropriate reward aggregation functions. By i…

    Submitted 11 July, 2025; originally announced July 2025.

    Comments: Reinforcement Learning Conference 2025

  2. arXiv:2505.24709  [pdf, ps, other]

    cs.LG cs.AI

    On Symmetric Losses for Robust Policy Optimization with Noisy Preferences

    Authors: Soichiro Nishimori, Yu-Jie Zhang, Thanawat Lodkaew, Masashi Sugiyama

    Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains…

    Submitted 30 May, 2025; originally announced May 2025.

  3. arXiv:2406.00424  [pdf, other]

    stat.ML cs.LG

    A Batch Sequential Halving Algorithm without Performance Degradation

    Authors: Sotetsu Koyamada, Soichiro Nishimori, Shin Ishii

    Abstract: In this paper, we investigate the problem of pure exploration in the context of multi-armed bandits, with a specific focus on scenarios where arms are pulled in fixed-size batches. Batching has been shown to enhance computational efficiency, but it can potentially lead to a degradation compared to the original sequential algorithm's performance due to delayed feedback and reduced adaptability. We…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Accepted to RLC 2024

  4. arXiv:2404.07465  [pdf, other]

    cs.LG

    Offline Reinforcement Learning with Domain-Unlabeled Data

    Authors: Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, Masashi Sugiyama

    Abstract: Offline reinforcement learning (RL) is vital in areas where active data collection is expensive or infeasible, such as robotics or healthcare. In the real world, offline datasets often involve multiple domains that share the same state and action spaces but have distinct dynamics, and only a small fraction of samples are clearly labeled as belonging to the target domain we are interested in. For e…

    Submitted 28 February, 2025; v1 submitted 11 April, 2024; originally announced April 2024.

  5. arXiv:2401.17780  [pdf, other]

    cs.LG

    A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

    Authors: Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, Yutaka Matsuo

    Abstract: We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with…

    Submitted 1 July, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  6. arXiv:2304.09769  [pdf, other]

    cs.AI

    End-to-End Policy Gradient Method for POMDPs and Explainable Agents

    Authors: Soichiro Nishimori, Sotetsu Koyamada, Shin Ishii

    Abstract: Real-world decision-making problems are often partially observable, and many can be formulated as a Partially Observable Markov Decision Process (POMDP). When we apply reinforcement learning (RL) algorithms to the POMDP, reasonable estimation of the hidden states can help solve the problems. Furthermore, explainable decision-making is preferable, considering their application to real-world tasks s…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: 10 pages, 6 figures

  7. arXiv:2303.17503  [pdf, other]

    cs.AI cs.LG

    Pgx: Hardware-Accelerated Parallel Game Simulators for Reinforcement Learning

    Authors: Sotetsu Koyamada, Shinri Okano, Soichiro Nishimori, Yu Murata, Keigo Habara, Haruka Kita, Shin Ishii

    Abstract: We propose Pgx, a suite of board game reinforcement learning (RL) environments written in JAX and optimized for GPU/TPU accelerators. By leveraging JAX's auto-vectorization and parallelization over accelerators, Pgx can efficiently scale to thousands of simultaneous simulations over accelerators. In our experiments on a DGX-A100 workstation, we discovered that Pgx can simulate RL environments 10-1…

    Submitted 15 January, 2024; v1 submitted 28 March, 2023; originally announced March 2023.