-
Exposure measurement error correction in longitudinal studies with discrete outcomes
Authors:
Ce Yang,
Ning Zhang,
Jiaxuan Li,
Unnati V. Mehta,
Jaime E. Hart,
Donna Spiegelman,
Molin Wang
Abstract:
Environmental epidemiologists are often interested in estimating the effect of time-varying functions of the exposure history on health outcomes. However, the individual exposure measurements that constitute the history upon which an exposure history function is constructed are usually subject to measurement errors. To obtain unbiased estimates of the effects of such mismeasured functions in longitudinal studies with discrete outcomes, a method applicable to the main study/validation study design is developed. Various estimation procedures are explored. Simulation studies were conducted to assess its performance compared to standard analysis, and we found that the proposed method had good performance in terms of finite sample bias reduction and nominal coverage probability improvement. As an illustrative example, we applied the new method to a study of long-term exposure to PM2.5 in relation to the occurrence of anxiety disorders in the Nurses' Health Study II. Failing to correct the error-prone exposure can lead to an underestimation of the chronic exposure effect of PM2.5.
Submitted 22 May, 2025;
originally announced May 2025.
-
Sample Efficient Preference Alignment in LLMs via Active Exploration
Authors:
Viraj Mehta,
Syrine Belakaria,
Vikramjeet Das,
Ojash Neopane,
Yijia Dai,
Ilija Bogunovic,
Barbara Engelhardt,
Stefano Ermon,
Jeff Schneider,
Willie Neiswanger
Abstract:
Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of large language models. We provide two extensions, an online and an offline approach. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets including two new datasets that we contribute to the literature.
Submitted 20 March, 2025; v1 submitted 30 November, 2023;
originally announced December 2023.
-
Kernelized Offline Contextual Dueling Bandits
Authors:
Viraj Mehta,
Ojash Neopane,
Vikramjeet Das,
Sen Lin,
Jeff Schneider,
Willie Neiswanger
Abstract:
Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
Submitted 20 July, 2023;
originally announced July 2023.
-
Near-optimal Policy Identification in Active Reinforcement Learning
Authors:
Xiang Li,
Viraj Mehta,
Johannes Kirschner,
Ian Char,
Willie Neiswanger,
Jeff Schneider,
Andreas Krause,
Ilija Bogunovic
Abstract:
Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
Submitted 19 December, 2022;
originally announced December 2022.
-
Exploration via Planning for Information about the Optimal Trajectory
Authors:
Viraj Mehta,
Ian Char,
Joseph Abbate,
Rory Conlin,
Mark D. Boyer,
Stefano Ermon,
Jeff Schneider,
Willie Neiswanger
Abstract:
Many potential applications of reinforcement learning (RL) are stymied by the large numbers of samples required to learn an effective policy. This is especially true when applying RL to real-world control tasks, e.g. in the sciences or robotics, where executing a policy in the environment is costly. In popular RL algorithms, agents typically explore either by adding stochasticity to a reward-maximizing policy or by attempting to gather maximal information about environment dynamics without taking the given task into account. In this work, we develop a method that allows us to plan for exploration while taking both the task and the current knowledge about the dynamics into account. The key insight of our approach is to plan an action sequence that maximizes the expected information gain about the optimal trajectory for the task at hand. We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines and 200x fewer samples than model-free methods on a diverse set of low-to-medium dimensional control tasks in both the open-loop and closed-loop control settings.
Submitted 6 October, 2022;
originally announced October 2022.
-
Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias
Authors:
Frederic Koehler,
Viraj Mehta,
Chenghui Zhou,
Andrej Risteski
Abstract:
Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders, the conjecture is true: that is, VAE training does recover a generator with support equal to the ground truth manifold, and it does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold.
Submitted 17 May, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.
-
An Experimental Design Perspective on Model-Based Reinforcement Learning
Authors:
Viraj Mehta,
Biswajit Paria,
Jeff Schneider,
Stefano Ermon,
Willie Neiswanger
Abstract:
In many practical applications of RL, it is expensive to observe state transitions from the environment. For example, in the problem of plasma control for nuclear fusion, computing the next state for a given state-action pair requires querying an expensive transition function which can lead to many hours of computer simulation or dollars of scientific research. Such expensive data collection prohibits application of standard RL algorithms which usually require a large number of observations to learn. In this work, we address the problem of efficiently learning a policy while making a minimal number of state-action queries to the transition function. In particular, we leverage ideas from Bayesian optimal experimental design to guide the selection of state-action queries for efficient learning. We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process. At each iteration, our algorithm maximizes this acquisition function, to choose the most informative state-action pair to be queried, thus yielding a data-efficient RL approach. We experiment with a variety of simulated continuous control problems and show that our approach learns an optimal policy with up to $5$ -- $1,000\times$ less data than model-based RL baselines and $10^3$ -- $10^5\times$ less data than model-free RL baselines. We also provide several ablated comparisons which point to substantial improvements arising from the principled method of obtaining data.
Submitted 15 March, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Representational aspects of depth and conditioning in normalizing flows
Authors:
Frederic Koehler,
Viraj Mehta,
Andrej Risteski
Abstract:
Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep -- which comes with accompanying vanishing/exploding gradient problems. A very related problem is that they are often poorly conditioned: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data like images intuitively is lower-dimensional, the learned maps often have Jacobians that are close to being singular.
In our paper, we tackle representational aspects around depth and conditioning of normalizing flows: both for general invertible architectures, and for a particular common architecture, affine couplings. We prove that $\Theta(1)$ affine coupling layers suffice to exactly represent a permutation or $1 \times 1$ convolution, as used in GLOW, showing that representationally the choice of partition is not a bottleneck for depth. We also show that shallow affine coupling networks are universal approximators in Wasserstein distance if ill-conditioning is allowed, and experimentally investigate related phenomena involving padding. Finally, we show a depth lower bound for general flow architectures with few neurons per layer and bounded Lipschitz constant.
Submitted 25 June, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Neural Dynamical Systems: Balancing Structure and Flexibility in Physical Prediction
Authors:
Viraj Mehta,
Ian Char,
Willie Neiswanger,
Youngseog Chung,
Andrew Oakleigh Nelson,
Mark D Boyer,
Egemen Kolemen,
Jeff Schneider
Abstract:
We introduce Neural Dynamical Systems (NDS), a method of learning dynamical models in various gray-box settings which incorporates prior knowledge in the form of systems of ordinary differential equations. NDS uses neural networks to estimate free parameters of the system, predicts residual terms, and numerically integrates over time to predict future states. A key insight is that many real dynamical systems of interest are hard to model because the dynamics may vary across rollouts. We mitigate this problem by taking a trajectory of prior states as the input to NDS and training it to dynamically estimate system parameters using the preceding trajectory. We find that NDS learns dynamics with higher accuracy and fewer samples than a variety of deep learning methods that do not incorporate the prior knowledge and methods from the system identification literature which do. We demonstrate these advantages first on synthetic dynamical systems and then on real data captured from deuterium shots from a nuclear fusion reactor. Finally, we demonstrate that these benefits can be utilized for control in small-scale experiments.
Submitted 27 April, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Learning Task-Oriented Grasping for Tool Manipulation from Simulated Self-Supervision
Authors:
Kuan Fang,
Yuke Zhu,
Animesh Garg,
Andrey Kurenkov,
Viraj Mehta,
Li Fei-Fei,
Silvio Savarese
Abstract:
Tool manipulation is vital for enabling robots to complete challenging task goals. It requires reasoning about the desired effect of the task and thus properly grasping and manipulating the tool to achieve the task. Task-agnostic grasping optimizes for grasp robustness while ignoring crucial task-specific constraints. In this paper, we propose the Task-Oriented Grasping Network (TOG-Net) to jointly optimize both task-oriented grasping of a tool and the manipulation policy for that tool. The training process of the model is based on large-scale simulated self-supervision with procedurally generated tool objects. We perform both simulated and real-world experiments on two tool-based manipulation tasks: sweeping and hammering. Our model achieves an overall task success rate of 71.1% for sweeping and 80.0% for hammering. Supplementary material is available at: bit.ly/task-oriented-grasp
Submitted 24 June, 2018;
originally announced June 2018.