-
Nonparametric Instrumental Variable Inference with Many Weak Instruments
Authors:
Lars van der Laan,
Nathan Kallus,
Aurélien Bibaut
Abstract:
We study inference on linear functionals in the nonparametric instrumental variable (NPIV) problem with a discretely-valued instrument under a many-weak-instruments asymptotic regime, where the number of instrument values grows with the sample size. A key motivating example is estimating long-term causal effects in a new experiment with only short-term outcomes, using past experiments to instrument for the effect of short- on long-term outcomes. Here, the assignment to a past experiment serves as the instrument: we have many past experiments but only a limited number of units in each. Since the structural function is nonparametric but constrained by only finitely many moment restrictions, point identification typically fails. To address this, we consider linear functionals of the minimum-norm solution to the moment restrictions, which is always well-defined. As the number of instrument levels grows, these functionals define an approximating sequence to a target functional, replacing point identification with a weaker asymptotic notion suited to discrete instruments. Extending the Jackknife Instrumental Variable Estimator (JIVE) beyond the classical parametric setting, we propose npJIVE, a nonparametric estimator for solutions to linear inverse problems with many weak instruments. We construct automatic debiased machine learning estimators for linear functionals of both the structural function and its minimum-norm projection, and establish their efficiency in the many-weak-instruments regime.
Submitted 12 May, 2025;
originally announced May 2025.
-
Automatic Debiased Machine Learning for Smooth Functionals of Nonparametric M-Estimands
Authors:
Lars van der Laan,
Aurelien Bibaut,
Nathan Kallus,
Alex Luedtke
Abstract:
We propose a unified framework for automatic debiased machine learning (autoDML) to perform inference on smooth functionals of infinite-dimensional M-estimands, defined as population risk minimizers over Hilbert spaces. By automating debiased estimation and inference procedures in causal inference and semiparametric statistics, our framework enables practitioners to construct valid estimators for complex parameters without requiring specialized expertise. The framework supports Neyman-orthogonal loss functions with unknown nuisance parameters requiring data-driven estimation, as well as vector-valued M-estimands involving simultaneous loss minimization across multiple Hilbert space models. We formalize the class of parameters efficiently estimable by autoDML as a novel class of nonparametric projection parameters, defined via orthogonal minimum loss objectives. We introduce three autoDML estimators based on one-step estimation, targeted minimum loss-based estimation, and the method of sieves. For data-driven model selection, we derive a novel decomposition of model approximation error for smooth functionals of M-estimands and propose adaptive debiased machine learning estimators that are superefficient and adaptive to the functional form of the M-estimand. Finally, we illustrate the flexibility of our framework by constructing autoDML estimators for the long-term survival under a beta-geometric model.
Submitted 20 January, 2025;
originally announced January 2025.
-
Nonparametric Jackknife Instrumental Variable Estimation and Confounding Robust Surrogate Indices
Authors:
Aurélien Bibaut,
Nathan Kallus,
Apoorva Lal
Abstract:
Jackknife instrumental variable estimation (JIVE) is a classic method to leverage many weak instrumental variables (IVs) to estimate linear structural models, overcoming the bias of standard methods like two-stage least squares. In this paper, we extend the jackknife approach to nonparametric IV (NPIV) models with many weak IVs. Since NPIV characterizes the structural regression as having residuals projected onto the IV being zero, existing approaches minimize an estimate of the average squared projected residuals, but their estimates are biased under many weak IVs. We introduce an IV splitting device inspired by JIVE to remove this bias, and by carefully studying this split-IV empirical process we establish learning rates that depend on generic complexity measures of the nonparametric hypothesis class. We then turn to leveraging this for semiparametric inference on average treatment effects (ATEs) on unobserved long-term outcomes predicted from short-term surrogates, using historical experiments as IVs to learn this nonparametric predictive relationship even in the presence of confounding between short- and long-term observations. Using split-IV estimates of a debiasing nuisance, we develop asymptotically normal estimates for predicted ATEs, enabling inference.
Submitted 7 October, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Demystifying Inference after Adaptive Experiments
Authors:
Aurélien Bibaut,
Nathan Kallus
Abstract:
Adaptive experiments such as multi-arm bandits adapt the treatment-allocation policy and/or the decision to stop the experiment to the data observed so far. This has the potential to improve outcomes for study participants within the experiment, to improve the chance of identifying best treatments after the experiment, and to avoid wasting data. Seen as an experiment (rather than just a continually optimizing system) it is still desirable to draw statistical inferences with frequentist guarantees. The concentration inequalities and union bounds that generally underlie adaptive experimentation algorithms can yield overly conservative inferences, but at the same time the asymptotic normality we would usually appeal to in non-adaptive settings can be imperiled by adaptivity. In this article we aim to explain why, how, and when adaptivity is in fact an issue for inference and, when it is, understand the various ways to fix it: reweighting to stabilize variances and recover asymptotic normality, always-valid inference based on joint normality of an asymptotic limiting sequence, and characterizing and inverting the non-normal distributions induced by adaptivity.
Submitted 2 May, 2024;
originally announced May 2024.
-
Long-Term Causal Inference with Imperfect Surrogates using Many Weak Experiments, Proxies, and Cross-Fold Moments
Authors:
Aurélien Bibaut,
Nathan Kallus,
Simon Ejdemyr,
Michael Zhao
Abstract:
Inferring causal effects on long-term outcomes using short-term surrogates is crucial to rapid innovation. However, even when treatments are randomized and surrogates fully mediate their effect on outcomes, it's possible that we get the direction of causal effects wrong due to confounding between surrogates and outcomes -- a situation famously known as the surrogate paradox. The availability of many historical experiments offers the opportunity to instrument for the surrogate and bypass this confounding. However, even as the number of experiments grows, two-stage least squares has non-vanishing bias if each experiment has a bounded size, and this bias is exacerbated when most experiments barely move metrics, as occurs in practice. We show how to eliminate this bias using cross-fold procedures, JIVE being one example, and construct valid confidence intervals for the long-term effect in new experiments where the long-term outcome has not yet been observed. Our methodology further allows us to proxy for effects not perfectly mediated by the surrogates, allowing us to handle both confounding and effect leakage as violations of standard statistical surrogacy conditions.
Submitted 8 November, 2023;
originally announced November 2023.
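
The bias described in the abstract above can be seen in a small simulation. The following is only an illustrative sketch, not the authors' procedure: it assumes a linear structural model with no intercept, mean-zero Gaussian variables, experiment assignment as the only instrument, and hypothetical parameter values (4000 experiments of 5 units each, true effect 1.0). It compares two-stage least squares, whose first stage is the within-experiment mean of the surrogate, with a JIVE-style estimate whose first stage leaves out the unit's own observation.

```python
import random


def simulate(num_experiments=4000, units_per_experiment=5, beta=1.0, seed=0):
    """Return (2SLS, JIVE) estimates of beta on simulated confounded data.

    All distributions and parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = units_per_experiment
    tsls_num = tsls_den = jive_num = jive_den = 0.0
    for _ in range(num_experiments):
        # Each experiment shifts the surrogate only slightly (weak instrument).
        shift = rng.gauss(0.0, 0.5)
        s_vals, y_vals = [], []
        for _ in range(n):
            u = rng.gauss(0.0, 1.0)  # unobserved confounder
            s = shift + u + rng.gauss(0.0, 1.0)  # short-term surrogate
            y = beta * s + 2.0 * u + rng.gauss(0.0, 1.0)  # long-term outcome
            s_vals.append(s)
            y_vals.append(y)
        s_sum = sum(s_vals)
        for s, y in zip(s_vals, y_vals):
            s_hat = s_sum / n  # first stage using the unit's own data
            s_loo = (s_sum - s) / (n - 1)  # leave-one-out first stage
            tsls_num += s_hat * y
            tsls_den += s_hat * s
            jive_num += s_loo * y
            jive_den += s_loo * s
    return tsls_num / tsls_den, jive_num / jive_den


tsls, jive = simulate()
print(f"2SLS estimate: {tsls:.3f}  JIVE estimate: {jive:.3f}  truth: 1.000")
```

In this setup the 2SLS first stage reuses each unit's own confounded surrogate, so its bias does not vanish as experiments are added while each stays small; the leave-one-out first stage removes that own-observation term.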
-
Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations
Authors:
Aurelien Bibaut,
Nathan Kallus,
Michael Lindon
Abstract:
Sequential tests and their implied confidence sequences, which are valid at arbitrary stopping times, promise flexible statistical inference and on-the-fly decision making. However, strong guarantees are limited to parametric sequential tests that under-cover in practice or concentration-bound-based sequences that over-cover and have suboptimal rejection times. In this work, we consider classic delayed-start normal-mixture sequential probability ratio tests, and we provide the first asymptotic type-I-error and expected-rejection-time guarantees under general non-parametric data generating processes, where the asymptotics are indexed by the test's burn-in time. The type-I-error results primarily leverage a martingale strong invariance principle and establish that these tests (and their implied confidence sequences) have type-I error rates asymptotically equivalent to the desired (possibly varying) $α$-level. The expected-rejection-time results primarily leverage an identity inspired by Itô's lemma and imply that, in certain asymptotic regimes, the expected rejection time is asymptotically equivalent to the minimum possible among $α$-level tests. We show how to apply our results to sequential inference on parameters defined by estimating equations, such as average treatment effects. Together, our results establish these (ostensibly parametric) tests as general-purpose, non-parametric, and near-optimal. We illustrate this via numerical simulations and a real-data application to A/B testing at Netflix.
Submitted 11 March, 2024; v1 submitted 29 December, 2022;
originally announced December 2022.
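
As a rough illustration of the kind of test the abstract above refers to, here is a minimal sketch of a normal-mixture sequential probability ratio test for the null hypothesis that the mean is zero. It is a simplification and not the paper's procedure: it assumes a known standard deviation `sigma`, a Gaussian mixing distribution with standard deviation `tau`, and no delayed start (burn-in). The function and parameter names are hypothetical.

```python
import math


def mixture_likelihood_ratios(xs, sigma=1.0, tau=1.0):
    """Running normal-mixture likelihood ratios for H0: mean = 0.

    Mixes the N(theta, sigma^2) vs N(0, sigma^2) likelihood ratio over
    theta ~ N(0, tau^2), which has a closed form in the running sum.
    """
    ratios = []
    running_sum = 0.0
    for n, x in enumerate(xs, start=1):
        running_sum += x
        v = sigma**2 + n * tau**2
        ratio = math.sqrt(sigma**2 / v) * math.exp(
            tau**2 * running_sum**2 / (2.0 * sigma**2 * v)
        )
        ratios.append(ratio)
    return ratios


def first_rejection(xs, alpha=0.05, sigma=1.0, tau=1.0):
    """Return the first sample size at which the ratio reaches 1/alpha, else None."""
    for n, ratio in enumerate(mixture_likelihood_ratios(xs, sigma, tau), start=1):
        if ratio >= 1.0 / alpha:
            return n
    return None


print(first_rejection([0.0] * 100))  # data exactly at the null: never rejects
print(first_rejection([1.0] * 50))   # constant unit signal: rejects early
```

Rejecting as soon as the mixture ratio reaches 1/alpha is what makes the test valid at arbitrary stopping times in the exactly Gaussian case; the abstract's contribution concerns guarantees beyond that parametric case.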
-
One-step ahead sequential Super Learning from short time series of many slightly dependent data, and anticipating the cost of natural disasters
Authors:
Geoffrey Ecoto,
Aurélien Bibaut,
Antoine Chambaz
Abstract:
Suppose that we observe a short time series where each time-t-specific data-structure consists of many slightly dependent data indexed by a and that we want to estimate a feature of the law of the experiment that depends neither on t nor on a. We develop and study an algorithm to learn sequentially which base algorithm in a user-supplied collection best carries out the estimation task in terms of excess risk and oracle inequalities. The analysis, which uses a dependency graph to model the amount of conditional independence within each t-specific data-structure and a concentration inequality by Janson [2004], leverages a large ratio of the number of distinct a's to the degree of the dependency graph in the face of a small number of t-specific data-structures. The so-called one-step ahead Super Learner is applied to the motivating example where the challenge is to anticipate the cost of natural disasters in France.
Submitted 28 July, 2021;
originally announced July 2021.
-
Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning
Authors:
Aurélien Bibaut,
Antoine Chambaz,
Maria Dimakopoulou,
Nathan Kallus,
Mark van der Laan
Abstract:
Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data, such as the result of running a contextual bandit algorithm. We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that carefully leverages the importance sampling structure to obtain rates with the right dependence on the exploration rate in the data. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
Submitted 3 June, 2021;
originally announced June 2021.
-
Post-Contextual-Bandit Inference
Authors:
Aurélien Bibaut,
Antoine Chambaz,
Maria Dimakopoulou,
Nathan Kallus,
Mark van der Laan
Abstract:
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking because they can both improve outcomes for study participants and increase the chance of identifying good or even best policies. To support credible inference on novel interventions at the end of the study, nonetheless, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or value of new policies. The adaptive nature of the data collected by contextual bandit algorithms, however, makes this difficult: standard estimators are no longer asymptotically normally distributed and classic confidence intervals fail to provide correct coverage. While this has been addressed in non-contextual settings by using stabilized estimators, the contextual setting poses unique challenges that we tackle for the first time in this paper. We propose the Contextual Adaptive Doubly Robust (CADR) estimator, the first estimator for policy value that is asymptotically normal under contextual adaptive data collection. The main technical challenge in constructing CADR is designing adaptive and consistent conditional standard deviation estimators for stabilization. Extensive numerical experiments using 57 OpenML datasets demonstrate that confidence intervals based on CADR uniquely provide correct coverage.
Submitted 1 June, 2021;
originally announced June 2021.
-
Adaptive Sequential Design for a Single Time-Series
Authors:
Ivana Malenica,
Aurelien Bibaut,
Mark J. van der Laan
Abstract:
The current work is motivated by the need for robust statistical methods for precision medicine; as such, we address the need for statistical methods that provide actionable inference for a single unit at any point in time. We aim to learn an optimal, unknown choice of the controlled components of the design in order to optimize the expected outcome; with that, we adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time. Our results demonstrate that one can learn the optimal rule based on a single sample, and thereby adjust the design at any point t with valid inference for the mean target parameter. This work provides several contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional causal parameters defined by the current context for the single unit time-series data. We define a nonparametric model for the probability distribution of the time-series under few assumptions, and aim to fully utilize the sequential randomization in the estimation procedure via the doubly robust structure of the efficient influence curve of the proposed target parameter. We present multiple exploration-exploitation strategies for assigning treatment, and methods for estimating the optimal rule. Lastly, we present the study of the data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our target parameter is pathwise differentiable with an efficient influence function that is doubly robust, which makes it easier to estimate than previously proposed variations. We characterize the limit distribution of our estimator under a Donsker condition expressed in terms of a notion of bracketing entropy adapted to martingale settings.
Submitted 1 July, 2021; v1 submitted 29 January, 2021;
originally announced February 2021.
-
Sequential causal inference in a single world of connected units
Authors:
Aurelien Bibaut,
Maya Petersen,
Nikos Vlassis,
Maria Dimakopoulou,
Mark van der Laan
Abstract:
We consider adaptive designs for a trial involving N individuals that we follow along T time steps. We allow for the variables of one individual to depend on its past and on the past of other individuals. Our goal is to learn a mean outcome, averaged across the N individuals, that we would observe, if we started from some given initial state, and we carried out a given sequence of counterfactual interventions for $τ$ time steps.
We show how to identify a statistical parameter that equals this mean counterfactual outcome, and how to perform inference for this parameter, while adaptively learning an oracle design defined as a parameter of the true data generating distribution. Oracle designs of interest include the design that maximizes the efficiency for a statistical parameter of interest, or designs that mix the optimal treatment rule with a certain exploration distribution. We also show how to design adaptive stopping rules for sequential hypothesis testing.
This setting presents unique technical challenges. Unlike in usual statistical settings where the data consists of several independent observations, here, due to network and temporal dependence, the data reduces to one single observation with dependent components. In particular, this precludes the use of sample splitting techniques. We therefore had to develop a new equicontinuity result and guarantees for estimators fitted on dependent data.
We were motivated to work on this problem by the following two questions. (1) In the context of a sequential adaptive trial with K treatment arms, how can we design a procedure that identifies, in as few rounds as possible, the treatment arm with the best final outcome? (2) In the context of sequential randomized disease testing at the scale of a city, how can we estimate and infer the value of an optimal testing and isolation strategy?
Submitted 18 January, 2021;
originally announced January 2021.
-
Sufficient and insufficient conditions for the stochastic convergence of Cesàro means
Authors:
Aurélien F. Bibaut,
Alex Luedtke,
Mark J. van der Laan
Abstract:
We study the stochastic convergence of the Cesàro mean of a sequence of random variables. These arise naturally in statistical problems that have a sequential component, where the sequence of random variables is typically derived from a sequence of estimators computed on data. We show that establishing a rate of convergence in probability for a sequence is not sufficient in general to establish a rate in probability for its Cesàro mean. We also present several sets of conditions on the sequence of random variables that are sufficient to guarantee a rate of convergence for its Cesàro mean. We identify common settings in which these sets of conditions hold.
Submitted 13 September, 2020;
originally announced September 2020.
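
For readers unfamiliar with the object studied in the abstract above, the Cesàro mean of a sequence is its running average. The sketch below is a deterministic illustration only, not the paper's stochastic counterexample: for the sequence 1/n, the Cesàro mean equals the n-th harmonic number divided by n, so it decays like log(n)/n, slower than the sequence itself by a logarithmic factor. The function name is hypothetical.

```python
def cesaro_means(xs):
    """Return the running averages (Cesàro means) of a sequence."""
    means = []
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        means.append(total / n)
    return means


n_max = 1000
sequence = [1.0 / n for n in range(1, n_max + 1)]
means = cesaro_means(sequence)

# Ratio of the Cesàro mean to the sequence's own last term: this is the
# harmonic number H_n, which grows like log(n).
print(means[-1] / sequence[-1])
```

Even in this deterministic case the rate of a sequence does not transfer unchanged to its running average, which gives some intuition for why additional conditions are needed in the stochastic setting.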
-
Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm
Authors:
Aurélien F. Bibaut,
Mark J. van der Laan
Abstract:
Empirical risk minimization over classes of functions that are bounded for some version of the variation norm has a long history, starting with Total Variation Denoising (Rudin et al., 1992), and has been considered by several recent articles, in particular Fang et al., 2019 and van der Laan, 2015. In this article, we consider empirical risk minimization over the class $\mathcal{F}_d$ of càdlàg functions over $[0,1]^d$ with bounded sectional variation norm (also called Hardy-Krause variation).
We show how a certain representation of functions in $\mathcal{F}_d$ allows us to bound the bracketing entropy of sieves of $\mathcal{F}_d$, and therefore derive rates of convergence in nonparametric function estimation. Specifically, for sieves whose growth is controlled by some rate $a_n$, we show that the empirical risk minimizer has rate of convergence $O_P(n^{-1/3} (\log n)^{2(d-1)/3} a_n)$. Remarkably, the dimension only affects the rate in $n$ through the logarithmic factor, making this method especially appropriate for high-dimensional problems.
In particular, we show that in the case of nonparametric regression over sieves of càdlàg functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.
Submitted 23 August, 2019; v1 submitted 22 July, 2019;
originally announced July 2019.
-
Uniform Consistency of the Highly Adaptive Lasso Estimator of Infinite Dimensional Parameters
Authors:
Mark J. van der Laan,
Aurélien F. Bibaut
Abstract:
Consider the case that we observe $n$ independent and identically distributed copies of a random variable with a probability distribution known to be an element of a specified statistical model. We are interested in estimating an infinite dimensional target parameter that minimizes the expectation of a specified loss function. In \cite{generally_efficient_TMLE} we defined an estimator that minimizes the empirical risk over all multivariate real valued cadlag functions with variation norm bounded by some constant $M$ in the parameter space, and selects $M$ with cross-validation. We referred to this estimator as the Highly-Adaptive-Lasso estimator due to the fact that the constraint can be formulated as a bound $M$ on the sum of the coefficients of a linear combination of a very large number of basis functions. Specifically, in the case that the target parameter is a conditional mean, it can be implemented with the standard LASSO regression estimator. In \cite{generally_efficient_TMLE} we proved that the HAL-estimator is consistent w.r.t. the (quadratic) loss-based dissimilarity at a rate faster than $n^{-1/2}$ (i.e., faster than $n^{-1/4}$ w.r.t. a norm), even when the parameter space is completely nonparametric. The only assumption required for this rate is that the true parameter function has a finite variation norm. The loss-based dissimilarity is often equivalent to the square of an $L^2(P_0)$-type norm. In this article, we establish that under some weak continuity condition, the HAL-estimator is also uniformly consistent.
Submitted 19 September, 2017;
originally announced September 2017.
-
Data-adaptive smoothing for optimal-rate estimation of possibly non-regular parameters
Authors:
Aurelien F. Bibaut,
Mark J. van der Laan
Abstract:
We consider nonparametric inference of finite dimensional, potentially non-pathwise differentiable target parameters. In a nonparametric model, examples of target parameters that are never pathwise differentiable include probability density functions at a point, or regression functions at a point. In causal inference, under appropriate causal assumptions, mean counterfactual outcomes can be pathwise differentiable or not, depending on the degree to which the positivity assumption holds.
In this paper, given a potentially non-pathwise differentiable target parameter, we introduce a family of approximating parameters, that are pathwise differentiable. This family is indexed by a scalar. In kernel regression or density estimation for instance, a natural choice for such a family is obtained by kernel smoothing and is indexed by the smoothing level. For the counterfactual mean outcome, a possible approximating family is obtained through truncation of the propensity score, and the truncation level then plays the role of the index.
We propose a method to data-adaptively select the index in the family, so as to optimize mean squared error. We prove an asymptotic normality result, which allows us to derive confidence intervals. Under some conditions, our estimator achieves an optimal mean squared error convergence rate. Confidence intervals are data-adaptive and have almost optimal width.
A simulation study demonstrates the practical performance of our estimators for the inference of a causal dose-response curve at a given treatment dose.
Submitted 12 July, 2017; v1 submitted 22 June, 2017;
originally announced June 2017.