-
Estimation of subsidiary performance metrics under optimal policies
Authors:
Zhaoqi Li,
Houssam Nassif,
Alex Luedtke
Abstract:
In policy learning, the goal is typically to optimize a primary performance metric, but other subsidiary metrics often also warrant attention. This paper presents two strategies for evaluating these subsidiary metrics under a policy that is optimal for the primary one. The first relies on a novel margin condition that facilitates Wald-type inference. Under this and other regularity conditions, we show that the one-step corrected estimator is efficient. Despite the utility of this margin condition, it places strong restrictions on how the subsidiary metric behaves for nearly optimal policies, which may not hold in practice. We therefore introduce alternative, two-stage strategies that do not require a margin condition. The first stage constructs a set of candidate policies and the second builds a uniform confidence interval over this set. We provide numerical simulations to evaluate the performance of these methods in different scenarios.
Submitted 8 January, 2024;
originally announced January 2024.
-
Experimental Designs for Heteroskedastic Variance
Authors:
Justin Weltz,
Tanner Fiez,
Alexander Volfovsky,
Eric Laber,
Blake Mason,
Houssam Nassif,
Lalit Jain
Abstract:
Most linear experimental design problems assume homogeneous variance, although heteroskedastic noise is present in many realistic settings. Let a learner have access to a finite set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$ that can be probed to receive noisy linear responses of the form $y=x^{\top}θ^{\ast}+η$. Here $θ^{\ast}\in \mathbb{R}^d$ is an unknown parameter vector, and $η$ is independent mean-zero $σ_x^2$-sub-Gaussian noise defined by a flexible heteroskedastic variance model, $σ_x^2 = x^{\top}Σ^{\ast}x$. Assuming that $Σ^{\ast}\in \mathbb{R}^{d\times d}$ is an unknown matrix, we propose, analyze, and empirically evaluate a novel design for uniformly bounding the estimation error of the variance parameters, $σ_x^2$. We demonstrate the benefits of this method on two adaptive experimental design problems under heteroskedastic noise, fixed-confidence transductive best-arm identification and level-set identification, and prove the first instance-dependent lower bounds in these settings. Lastly, we construct near-optimal algorithms and empirically demonstrate the large improvements in sample complexity gained from accounting for heteroskedastic variance in these designs.
Submitted 6 October, 2023;
originally announced October 2023.
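The response model stated in this abstract, $y=x^{\top}θ^{\ast}+η$ with noise variance $σ_x^2 = x^{\top}Σ^{\ast}x$, can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the paper's design or algorithm: the noise is drawn as Gaussian (one special case of sub-Gaussian noise), and the values of `theta_star`, `Sigma_star`, and the measurement set `X`, as well as the helper names `noise_variance` and `probe`, are hypothetical and chosen purely for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def noise_variance(x, Sigma):
    """Heteroskedastic variance sigma_x^2 = x^T Sigma x."""
    return dot(x, [dot(row, x) for row in Sigma])


def probe(x, theta, Sigma, rng):
    """Return one noisy response y = x^T theta + eta.

    Assumption: eta is Gaussian with variance x^T Sigma x; the abstract
    only requires the noise to be sub-Gaussian.
    """
    sd = noise_variance(x, Sigma) ** 0.5
    return dot(x, theta) + rng.gauss(0.0, sd)


# Hypothetical problem instance (not taken from the paper).
theta_star = [1.0, -0.5]
Sigma_star = [[0.25, 0.0], [0.0, 4.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rng = random.Random(0)
for x in X:
    y = probe(x, theta_star, Sigma_star, rng)
    print(x, "variance:", noise_variance(x, Sigma_star), "response:", y)
```

In this instance the second measurement vector is sixteen times noisier than the first (variance 4.0 versus 0.25), which is the kind of imbalance that a design assuming homogeneous variance would ignore.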
-
Deep PQR: Solving Inverse Reinforcement Learning using Anchor Actions
Authors:
Sinong Geng,
Houssam Nassif,
Carlos A. Manzanares,
A. Max Reppen,
Ronnie Sircar
Abstract:
We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. We name our method PQR, as it sequentially estimates the Policy, the $Q$-function, and the Reward function by deep learning. PQR does not assume that the reward depends solely on the state; instead, it allows for a dependency on the choice of action. Moreover, PQR allows for stochastic state transitions. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, which yields no reward. We present both estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. With unknown transitions, we bound the estimation error of PQR. Finally, the performance of PQR is demonstrated on synthetic and real-world datasets.
Submitted 14 August, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.