-
No Free Lunch: Non-Asymptotic Analysis of Prediction-Powered Inference
Authors:
Pranav Mani,
Peng Xu,
Zachary C. Lipton,
Michael Oberst
Abstract:
Prediction-Powered Inference (PPI) is a popular strategy for combining gold-standard and possibly noisy pseudo-labels to perform statistical estimation. Prior work has shown an asymptotic "free lunch" for PPI++, an adaptive form of PPI, showing that the *asymptotic* variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result h…
▽ More
Prediction-Powered Inference (PPI) is a popular strategy for combining gold-standard and possibly noisy pseudo-labels to perform statistical estimation. Prior work has shown an asymptotic "free lunch" for PPI++, an adaptive form of PPI, showing that the *asymptotic* variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result holds *regardless of the quality of the pseudo-labels*. In this work, we demystify this result by conducting an exact finite-sample analysis of the estimation error of PPI++ on the mean estimation problem. We give a "no free lunch" result, characterizing the settings (and sample sizes) where PPI++ has provably worse estimation error than using gold-standard labels alone. Specifically, PPI++ will outperform if and only if the correlation between pseudo- and gold-standard is above a certain level that depends on the number of labeled samples ($n$). In some cases our results simplify considerably: For Gaussian data, the correlation must be at least $1/\sqrt{n - 2}$ in order to see improvement, and a similar result holds for binary labels. In experiments, we illustrate that our theoretical findings hold on real-world datasets, and give insights into trade-offs between single-sample and sample-splitting variants of PPI++.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Just Trial Once: Ongoing Causal Validation of Machine Learning Models
Authors:
Jacob M. Chen,
Michael Oberst
Abstract:
Machine learning (ML) models are increasingly used as decision-support tools in high-risk domains. Evaluating the causal impact of deploying such models can be done with a randomized controlled trial (RCT) that randomizes users to ML vs. control groups and assesses the effect on relevant outcomes. However, ML models are inevitably updated over time, and we often lack evidence for the causal impact…
▽ More
Machine learning (ML) models are increasingly used as decision-support tools in high-risk domains. Evaluating the causal impact of deploying such models can be done with a randomized controlled trial (RCT) that randomizes users to ML vs. control groups and assesses the effect on relevant outcomes. However, ML models are inevitably updated over time, and we often lack evidence for the causal impact of these updates. While the causal effect could be repeatedly validated with ongoing RCTs, such experiments are expensive and time-consuming to run. In this work, we present an alternative solution: using only data from a prior RCT, we give conditions under which the causal impact of a new ML model can be precisely bounded or estimated, even if it was not included in the RCT. Our assumptions incorporate two realistic constraints: ML predictions are often deterministic, and their impacts depend on user trust in the model. Based on our analysis, we give recommendations for trial designs that maximize our ability to assess future versions of an ML model. Our hope is that our trial design recommendations will save practitioners time and resources while allowing for quicker deployments of updates to ML models.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
Auditing Fairness under Unobserved Confounding
Authors:
Yewon Byun,
Dylan Sam,
Michael Oberst,
Zachary C. Lipton,
Bryan Wilder
Abstract:
Many definitions of fairness or inequity involve unobservable causal quantities that cannot be directly estimated without strong assumptions. For instance, it is particularly difficult to estimate notions of fairness that rely on hard-to-measure concepts such as risk (e.g., quantifying whether patients at the same risk level have equal probability of treatment, regardless of group membership). Suc…
▽ More
Many definitions of fairness or inequity involve unobservable causal quantities that cannot be directly estimated without strong assumptions. For instance, it is particularly difficult to estimate notions of fairness that rely on hard-to-measure concepts such as risk (e.g., quantifying whether patients at the same risk level have equal probability of treatment, regardless of group membership). Such measurements of risk can be accurately obtained when no unobserved confounders have jointly influenced past decisions and outcomes. However, in the real world, this assumption rarely holds. In this paper, we show that, surprisingly, one can still compute meaningful bounds on treatment rates for high-risk individuals (i.e., conditional on their true, \textit{unobserved} negative outcome), even when entirely eliminating or relaxing the assumption that we observe all relevant risk factors used by decision makers. We use the fact that in many real-world settings (e.g., the release of a new treatment) we have data from prior to any allocation to derive unbiased estimates of risk. This result enables us to audit unfair outcomes of existing decision-making systems in a principled manner. We demonstrate the effectiveness of our framework with a real-world study of Paxlovid allocation, provably identifying that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates.
△ Less
Submitted 9 December, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Benchmarking Observational Studies with Experimental Data under Right-Censoring
Authors:
Ilker Demirel,
Edward De Brouwer,
Zeshan Hussain,
Michael Oberst,
Anthony Philippakis,
David Sontag
Abstract:
Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We…
▽ More
Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women's Health Initiative study.
△ Less
Submitted 23 February, 2024;
originally announced February 2024.
-
Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions
Authors:
Zeshan Hussain,
Ming-Chieh Shih,
Michael Oberst,
Ilker Demirel,
David Sontag
Abstract:
Randomized Controlled Trials (RCT)s are relied upon to assess new treatments, but suffer from limited power to guide personalized treatment decisions. On the other hand, observational (i.e., non-experimental) studies have large and diverse populations, but are prone to various biases (e.g. residual confounding). To safely leverage the strengths of observational studies, we focus on the problem of…
▽ More
Randomized Controlled Trials (RCT)s are relied upon to assess new treatments, but suffer from limited power to guide personalized treatment decisions. On the other hand, observational (i.e., non-experimental) studies have large and diverse populations, but are prone to various biases (e.g. residual confounding). To safely leverage the strengths of observational studies, we focus on the problem of falsification, whereby RCTs are used to validate causal effect estimates learned from observational data. In particular, we show that, given data from both an RCT and an observational study, assumptions on internal and external validity have an observable, testable implication in the form of a set of Conditional Moment Restrictions (CMRs). Further, we show that expressing these CMRs with respect to the causal effect, or "causal contrast", as opposed to individual counterfactual means, provides a more reliable falsification test. In addition to giving guarantees on the asymptotic properties of our test, we demonstrate superior power and type I error of our approach on semi-synthetic and real world datasets. Our approach is interpretable, allowing a practitioner to visualize which subgroups in the population lead to falsification of an observational study.
△ Less
Submitted 6 March, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Evaluating Robustness to Dataset Shift via Parametric Robustness Sets
Authors:
Nikolaj Thams,
Michael Oberst,
David Sontag
Abstract:
We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individ…
▽ More
We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
△ Less
Submitted 15 January, 2023; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Understanding the Risks and Rewards of Combining Unbiased and Possibly Biased Estimators, with Applications to Causal Inference
Authors:
Michael Oberst,
Alexander D'Amour,
Minmin Chen,
Yuyan Wang,
David Sontag,
Steve Yadlowsky
Abstract:
Several problems in statistics involve the combination of high-variance unbiased estimators with low-variance estimators that are only unbiased under strong assumptions. A notable example is the estimation of causal effects while combining small experimental datasets with larger observational datasets. There exist a series of recent proposals on how to perform such a combination, even when the bia…
▽ More
Several problems in statistics involve the combination of high-variance unbiased estimators with low-variance estimators that are only unbiased under strong assumptions. A notable example is the estimation of causal effects while combining small experimental datasets with larger observational datasets. There exist a series of recent proposals on how to perform such a combination, even when the bias of the low-variance estimator is unknown.
To build intuition for the differing trade-offs of competing approaches, we argue for examining the finite-sample estimation error of each approach as a function of the unknown bias. This includes understanding the bias threshold -- the largest bias for which a given approach improves over using the unbiased estimator alone. Though this lens, we review several recent proposals, and observe in simulation that different approaches exhibits qualitatively different behavior.
We also introduce a simple alternative approach, which compares favorably in simulation to recent alternatives, having a higher bias threshold and generally making a more conservative trade-off between best-case performance (when the bias is zero) and worst-case performance (when the bias is adversarially chosen). More broadly, we prove that for any amount of (unknown) bias, the MSE of this estimator can be bounded in a transparent way that depends on the variance / covariance of the underlying estimators that are being combined.
△ Less
Submitted 24 May, 2023; v1 submitted 20 May, 2022;
originally announced May 2022.
-
Regularizing towards Causal Invariance: Linear Models with Proxies
Authors:
Michael Oberst,
Nikolaj Thams,
Jonas Peters,
David Sontag
Abstract:
We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a s…
▽ More
We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a single proxy can be used to create estimators that are prediction optimal under interventions of bounded strength. This strength depends on the magnitude of the measurement noise in the proxy, which is, in general, not identifiable. In the case of two proxy variables, we propose a modified estimator that is prediction optimal under interventions up to a known strength. We further show how to extend these estimators to scenarios where additional information about the "test time" intervention is available during training. We evaluate our theoretical findings in synthetic experiments and using real data of hourly pollution levels across several cities in China.
△ Less
Submitted 27 June, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Treatment Policy Learning in Multiobjective Settings with Fully Observed Outcomes
Authors:
Soorajnath Boominathan,
Michael Oberst,
Helen Zhou,
Sanjat Kanjilal,
David Sontag
Abstract:
In several medical decision-making problems, such as antibiotic prescription, laboratory testing can provide precise indications for how a patient will respond to different treatment options. This enables us to "fully observe" all potential treatment outcomes, but while present in historical data, these results are infeasible to produce in real-time at the point of the initial treatment decision.…
▽ More
In several medical decision-making problems, such as antibiotic prescription, laboratory testing can provide precise indications for how a patient will respond to different treatment options. This enables us to "fully observe" all potential treatment outcomes, but while present in historical data, these results are infeasible to produce in real-time at the point of the initial treatment decision. Moreover, treatment policies in these settings often need to trade off between multiple competing objectives, such as effectiveness of treatment and harmful side effects. We present, compare, and evaluate three approaches for learning individualized treatment policies in this setting: First, we consider two indirect approaches, which use predictive models of treatment response to construct policies optimal for different trade-offs between objectives. Second, we consider a direct approach that constructs such a set of policies without intermediate models of outcomes. Using a medical dataset of Urinary Tract Infection (UTI) patients, we show that all approaches learn policies that achieve strictly better performance on all outcomes than clinicians, while also trading off between different objectives. We demonstrate additional benefits of the direct approach, including flexibly incorporating other goals such as deferral to physicians on simple cases.
△ Less
Submitted 12 August, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
ML4H Abstract Track 2019
Authors:
Matthew B. A. McDermott,
Emily Alsentzer,
Sam Finlayson,
Michael Oberst,
Fabian Falck,
Tristan Naumann,
Brett K. Beaulieu-Jones,
Adrian V. Dalca
Abstract:
A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2019. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.
A collection of the accepted abstracts for the Machine Learning for Health (ML4H) workshop at NeurIPS 2019. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.
△ Less
Submitted 4 February, 2020;
originally announced February 2020.
-
Characterization of Overlap in Observational Studies
Authors:
Michael Oberst,
Fredrik D. Johansson,
Dennis Wei,
Tian Gao,
Gabriel Brat,
David Sontag,
Kush R. Varshney
Abstract:
Overlap between treatment groups is required for non-parametric estimation of causal effects. If a subgroup of subjects always receives the same intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of causal conclusions for new subjects,…
▽ More
Overlap between treatment groups is required for non-parametric estimation of causal effects. If a subgroup of subjects always receives the same intervention, we cannot estimate the effect of intervention changes on that subgroup without further assumptions. When overlap does not hold globally, characterizing local regions of overlap can inform the relevance of causal conclusions for new subjects, and can help guide additional data collection. To have impact, these descriptions must be interpretable for downstream users who are not machine learning experts, such as policy makers. We formalize overlap estimation as a problem of finding minimum volume sets subject to coverage constraints and reduce this problem to binary classification with Boolean rule classifiers. We then generalize this method to estimate overlap in off-policy policy evaluation. In several real-world applications, we demonstrate that these rules have comparable accuracy to black-box estimators and provide intuitive and informative explanations that can inform policy making.
△ Less
Submitted 3 June, 2020; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models
Authors:
Michael Oberst,
David Sontag
Abstract:
We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see…
▽ More
We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
△ Less
Submitted 6 June, 2019; v1 submitted 14 May, 2019;
originally announced May 2019.