-
Online selective conformal inference: adaptive scores, convergence rate and optimality
Authors:
Pierre Humbert,
Ulysse Gazin,
Ruth Heller,
Etienne Roquain
Abstract:
In a supervised online setting, quantifying uncertainty has been proposed in the seminal work of \cite{gibbs2021adaptive}. For any given point-prediction algorithm, their method (ACI) produces a conformal prediction set with an average missed coverage getting close to a pre-specified level $α$ for a long time horizon. We introduce an extended version of this algorithm, called OnlineSCI, allowing the user to additionally select times where such an inference should be made. OnlineSCI encompasses several prominent online selective tasks, such as building prediction intervals for extreme outcomes, classification with abstention, and online testing. While OnlineSCI controls the average missed coverage on the selected times in an adversarial setting, our theoretical results also show that it controls the instantaneous error rate (IER) at the selected times, up to a non-asymptotic remainder term. Importantly, our theory covers the case where OnlineSCI updates the point-prediction algorithm at each time step, a property which we refer to as {\it adaptive} capability. We show that the adaptive versions of OnlineSCI can converge to an optimal solution and provide an explicit convergence rate in each of the aforementioned application cases, under specific mild conditions. Finally, the favorable behavior of OnlineSCI in practice is illustrated by numerical experiments.
Submitted 14 August, 2025;
originally announced August 2025.
-
Selecting informative conformal prediction sets with false coverage rate control
Authors:
Ulysse Gazin,
Ruth Heller,
Ariane Marandon,
Etienne Roquain
Abstract:
In supervised learning, including regression and classification, conformal methods provide prediction sets for the outcome/label with finite sample coverage for any machine learning predictor. We consider here the case where such prediction sets come after a selection process. The selection process requires that the selected prediction sets be `informative' in a well-defined sense. We consider both the classification and regression settings where the analyst may consider as informative only the sample with prediction sets small enough, excluding null values, or obeying other appropriate `monotone' constraints. We develop a unified framework for building such informative conformal prediction sets while controlling the false coverage rate (FCR) on the selected sample. While conformal prediction sets after selection have been the focus of much recent literature in the field, the newly introduced procedures, called InfoSP and InfoSCOP, are to our knowledge the first ones providing FCR control for informative prediction sets. We show the usefulness of our resulting procedures on real and simulated data.
Submitted 25 November, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
False discovery proportion envelopes with m-consistency
Authors:
Iqraa Meah,
Gilles Blanchard,
Etienne Roquain
Abstract:
We provide new non-asymptotic false discovery proportion (FDP) confidence envelopes in several multiple testing settings relevant for modern high dimensional-data methods. We revisit the multiple testing scenarios considered in the recent work of Katsevich and Ramdas (2020): top-$k$, preordered (including knockoffs), online. Our emphasis is on obtaining FDP confidence bounds that both have non-asymptotic coverage and are asymptotically accurate in a specific sense, as the number $m$ of tested hypotheses grows. Namely, we introduce and study the property (which we call $m$-consistency) that the confidence bound converges to or below the desired level $α$ when applied to a specific reference $α$-level false discovery rate (FDR) controlling procedure. In this perspective, we derive new bounds that provide improvements over existing ones, both theoretically and practically, and are suitable for situations where at least a moderate number of rejections is expected. These improvements are illustrated with numerical experiments and real data examples. In particular, the improvement is significant in the knockoffs setting, which shows the impact of the method for a practical use. As side results, we introduce a new confidence envelope for the empirical cumulative distribution function of i.i.d. uniform variables, and we provide new power results in sparse cases, both being of independent interest.
Submitted 17 September, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Adaptive novelty detection with false discovery rate guarantee
Authors:
Ariane Marandon,
Lihua Lei,
David Mary,
Etienne Roquain
Abstract:
This paper studies the semi-supervised novelty detection problem where a set of "typical" measurements is available to the researcher. Motivated by recent advances in multiple testing and conformal inference, we propose AdaDetect, a flexible method that is able to wrap around any probabilistic classification algorithm and control the false discovery rate (FDR) on detected novelties in finite samples without any distributional assumption other than exchangeability. In contrast to classical FDR-controlling procedures that are often committed to a pre-specified p-value function, AdaDetect learns the transformation in a data-adaptive manner to focus the power on the directions that distinguish between inliers and outliers. Inspired by the multiple testing literature, we further propose variants of AdaDetect that are adaptive to the proportion of nulls while maintaining the finite-sample FDR control. The methods are illustrated on synthetic datasets and real-world datasets, including an application in astrophysics.
Submitted 25 October, 2023; v1 submitted 13 August, 2022;
originally announced August 2022.
-
False membership rate control in mixture models
Authors:
Ariane Marandon,
Tabea Rebafka,
Etienne Roquain,
Nataliya Sokolovska
Abstract:
The clustering task consists in partitioning elements of a sample into homogeneous groups. Most datasets contain individuals that are ambiguous and intrinsically difficult to attribute to one or another cluster. However, in practical applications, misclassifying individuals is potentially disastrous and should be avoided. To keep the misclassification rate small, one can decide to classify only a part of the sample. In the supervised setting, this approach is well known and referred to as classification with an abstention option. In this paper the approach is revisited in an unsupervised mixture model framework and the purpose is to develop a method that comes with the guarantee that the false membership rate (FMR) does not exceed a pre-defined nominal level $α$. A plug-in procedure is proposed, for which a theoretical analysis is provided, by quantifying the FMR deviation with respect to the target level $α$ with explicit remainder terms. Bootstrap versions of the procedure are shown to improve the performance in numerical experiments.
Submitted 25 October, 2023; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Sharp multiple testing boundary for sparse sequences
Authors:
Kweku Abraham,
Ismael Castillo,
Etienne Roquain
Abstract:
This work investigates multiple testing by considering minimax separation rates in the sparse sequence model, when the testing risk is measured as the sum FDR+FNR (False Discovery Rate plus False Negative Rate). First using the popular beta-min separation condition, with all nonzero signals separated from $0$ by at least some amount, we determine the sharp minimax testing risk asymptotically and thereby explicitly describe the transition from "achievable multiple testing with vanishing risk" to "impossible multiple testing". Adaptive multiple testing procedures achieving the corresponding optimal boundary are provided: the Benjamini--Hochberg procedure with a properly tuned level, and an empirical Bayes $\ell$-value (`local FDR') procedure. We prove that the FDR and FNR make non-symmetric contributions to the testing risk for most optimal procedures, the FNR part being dominant at the boundary. The multiple testing hardness is then investigated for classes of arbitrary sparse signals. A number of extensions, including results for classification losses and convergence rates in the case of large signals, are also investigated.
Submitted 30 August, 2023; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Semi-supervised multiple testing
Authors:
David Mary,
Etienne Roquain
Abstract:
An important limitation of standard multiple testing procedures is that the null distribution should be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on the false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical $p$-values. These bounds match when $α(n+1)/m$ is an integer, where $n$ is the NTS sample size and $m$ is the number of tests. Second, we give a power analysis for that procedure suggesting that the price to pay for ignoring the null distribution is low when $n$ is sufficiently large relative to $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of ``detectable'' alternatives. Third, to complete the picture, we also present a negative result that evidences an intrinsic transition phase to the general semi-supervised multiple testing problem and shows that the empirical BH method is optimal in the sense that its performance boundary follows this transition phase. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of correct order without further tuning any constant. Finally, we demonstrate that our work provides a theoretical ground for standard practice in astronomical data analysis, and in particular for the procedure proposed in \cite{Origin2020} for galaxy detection.
Submitted 24 November, 2021; v1 submitted 25 June, 2021;
originally announced June 2021.
-
Empirical Bayes cumulative $\ell$-value multiple testing procedure for sparse sequences
Authors:
Kweku Abraham,
Ismael Castillo,
Etienne Roquain
Abstract:
In the sparse sequence model, we consider a popular Bayesian multiple testing procedure and investigate for the first time its behaviour from the frequentist point of view. Given a spike-and-slab prior on the high-dimensional sparse unknown parameter, one can easily compute posterior probabilities of coming from the spike, which correspond to the well known local-fdr values, also called $\ell$-values. The spike-and-slab weight parameter is calibrated in an empirical Bayes fashion, using marginal maximum likelihood. The multiple testing procedure under study, called here the cumulative $\ell$-value procedure, ranks coordinates according to their empirical $\ell$-values and thresholds so that the cumulative ranked sum does not exceed a user-specified level $t$.
We validate the use of this method from the multiple testing perspective: for alternatives of appropriately large signal strength, the false discovery rate (FDR) of the procedure is shown to converge to the target level $t$, while its false negative rate (FNR) goes to $0$. We complement this study by providing convergence rates for the method. Additionally, we prove that the $q$-value multiple testing procedure shares similar convergence rates in this model.
Submitted 28 March, 2022; v1 submitted 1 February, 2021;
originally announced February 2021.
-
False discovery rate control with unknown null distribution: is it possible to mimic the oracle?
Authors:
Etienne Roquain,
Nicolas Verzelen
Abstract:
Classical multiple testing theory prescribes the null distribution, which is often too stringent an assumption for today's large-scale experiments. This paper presents theoretical foundations to understand the limitations caused by ignoring the null distribution, and how it can be properly learned from the (same) data-set, when possible. We explore this issue in the case where the null distributions are Gaussian with unknown rescaling parameters (mean and variance) and the alternative distribution is left arbitrary. While an oracle procedure in that case is the Benjamini-Hochberg procedure applied with the true (unknown) null distribution, we pursue the aim of building a procedure that asymptotically mimics the performance of the oracle (AMO in short). Our main result states that an AMO procedure exists if and only if the sparsity parameter $k$ (number of false nulls) is of order less than $n/\log(n)$, where $n$ is the total number of tests. Further sparsity boundaries are derived for general location models where the shape of the null distribution is not necessarily Gaussian. Given our impossibility results, we also pursue a weaker objective, which is to find a confidence region for the oracle. To this end, we develop a distribution-dependent confidence region for the null distribution. As practical by-products, this provides a goodness of fit test for the null distribution, as well as a visual method assessing the reliability of empirical null multiple testing methods. Our results are illustrated with numerical experiments and a companion vignette \cite{RVvignette2020}.
Submitted 21 December, 2020; v1 submitted 6 December, 2019;
originally announced December 2019.
-
On agnostic post hoc approaches to false positive control
Authors:
Gilles Blanchard,
Pierre Neuvial,
Etienne Roquain
Abstract:
This document is a book chapter which gives a partial survey on post hoc approaches to false positive control.
Submitted 25 October, 2019;
originally announced October 2019.
-
Graph inference with clustering and false discovery rate control
Authors:
Tabea Rebafka,
Etienne Roquain,
Fanny Villers
Abstract:
In this paper, a noisy version of the stochastic block model (NSBM) is introduced and we investigate the three following statistical inferences in this model: estimation of the model parameters, clustering of the nodes and identification of the underlying graph. While the two first inferences are done by using a variational expectation-maximization (VEM) algorithm, the graph inference is done by controlling the false discovery rate (FDR), that is, the average proportion of errors among the edges declared significant, and by maximizing the true discovery rate (TDR), that is, the average proportion of edges declared significant among the true edges. Provided that the VEM algorithm provides reliable parameter estimates and clustering, we theoretically show that our procedure does control the FDR while satisfying an optimal TDR property, up to remainder terms that become small when the size of the graph grows. Numerical experiments show that our method outperforms the classical FDR controlling methods that ignore the underlying SBM topology. In addition, these simulations demonstrate that the FDR/TDR properties of our method are robust to model mis-specification, that is, are essentially maintained outside our model.
Submitted 23 July, 2019;
originally announced July 2019.
-
Estimating minimum effect with outlier selection
Authors:
Alexandra Carpentier,
Sylvain Delattre,
Etienne Roquain,
Nicolas Verzelen
Abstract:
We introduce one-sided versions of Huber's contamination model, in which corrupted samples tend to take larger values than uncorrupted ones. Two intertwined problems are addressed: estimation of the mean of uncorrupted samples (minimum effect) and selection of corrupted samples (outliers). Regarding the minimum effect estimation, we derive the minimax risks and introduce adaptive estimators to the unknown number of contaminations. Interestingly, the optimal convergence rate highly differs from that in classical Huber's contamination model. Also, our analysis uncovers the effect of particular structural assumptions on the distribution of the contaminated samples. As for the problem of selecting the outliers, we formulate the problem in a multiple testing framework for which the location/scaling of the null hypotheses are unknown. We rigorously prove how estimating the null hypothesis is possible while maintaining a theoretical guarantee on the amount of the falsely selected outliers, both through false discovery rate (FDR) or post hoc bounds. As a by-product, we address a long-standing open issue on FDR control under equi-correlation, which reinforces the interest of removing dependency when making multiple testing.
Submitted 21 September, 2018;
originally announced September 2018.
-
On spike and slab empirical Bayes multiple testing
Authors:
Ismael Castillo,
Etienne Roquain
Abstract:
This paper explores a connection between empirical Bayes posterior distributions and false discovery rate (FDR) control. In the Gaussian sequence model, this work shows that empirical Bayes-calibrated spike and slab posterior distributions allow a correct FDR control under sparsity. Doing so, it offers a frequentist theoretical validation of empirical Bayes methods in the context of multiple testing. Our theoretical results are illustrated with numerical experiments.
Submitted 15 June, 2019; v1 submitted 29 August, 2018;
originally announced August 2018.
-
Post hoc false positive control for spatially structured hypotheses
Authors:
Guillermo Durand,
Gilles Blanchard,
Pierre Neuvial,
Etienne Roquain
Abstract:
In a high dimensional multiple testing framework, we present new confidence bounds on the false positives contained in subsets S of selected null hypotheses. The coverage probability holds simultaneously over all subsets S, which means that the obtained confidence bounds are post hoc. Therefore, S can be chosen arbitrarily, possibly by using the data set several times. We focus in this paper specifically on the case where the null hypotheses are spatially structured. Our method is based on recent advances in post hoc inference and particularly on the general methodology of Blanchard et al. (2017); we build confidence bounds for some pre-specified forest-structured subsets $\{R_k, k \in K\}$, called the reference family, and then we deduce a bound for any subset S by interpolation. The proposed bounds are shown to improve substantially previous ones when the signal is locally structured. Our findings are supported both by theoretical results and numerical experiments. Moreover, we show that our bound can be obtained by a low-complexity algorithm, which makes our approach completely operational for a practical use. The proposed bounds are implemented in the open-source R package sansSouci.
Submitted 19 September, 2018; v1 submitted 4 July, 2018;
originally announced July 2018.
-
Improving the Benjamini-Hochberg Procedure for Discrete Tests
Authors:
Sebastian Döhler,
Guillermo Durand,
Etienne Roquain
Abstract:
To find interesting items in genome-wide association studies or next generation sequencing data, a crucial point is to design powerful false discovery rate (FDR) controlling procedures that suitably combine discrete tests (typically binomial or Fisher tests). In particular, recent research has been striving for appropriate modifications of the classical Benjamini-Hochberg (BH) step-up procedure that accommodate discreteness. However, despite an important number of attempts, these procedures did not come with theoretical guarantees. The present paper contributes to filling the gap: it presents new modifications of the BH procedure that incorporate the discrete structure of the data and provably control the FDR for any fixed number of null hypotheses (under independence). Markedly, our FDR controlling methodology makes it possible to incorporate simultaneously the discreteness and the quantity of signal of the data (corresponding therefore to a so-called $π_0$-adaptive procedure). The power advantage of the new methods is demonstrated in a numerical experiment and for some appropriate real data sets.
Submitted 15 September, 2017; v1 submitted 26 June, 2017;
originally announced June 2017.
-
Post hoc inference via joint family-wise error rate control
Authors:
Gilles Blanchard,
Pierre Neuvial,
Etienne Roquain
Abstract:
We introduce a general methodology for post hoc inference in a large-scale multiple testing framework. The approach is called "user-agnostic" in the sense that the statistical guarantee on the number of correct rejections holds for any set of candidate items selected by the user (after having seen the data). This task is investigated by defining a suitable criterion, named the joint-family-wise-error rate (JER for short). We propose several procedures for controlling the JER, with a special focus on incorporating dependencies while adapting to the unknown quantity of signal (via a step-down approach). We show that our proposed setting incorporates as particular cases a version of the higher criticism as well as the closed testing based approach of Goeman and Solari (2011). Our theoretical statements are supported by numerical experiments.
Submitted 8 January, 2018; v1 submitted 7 March, 2017;
originally announced March 2017.
-
New procedures controlling the false discovery proportion via Romano-Wolf's heuristic
Authors:
Sylvain Delattre,
Etienne Roquain
Abstract:
The false discovery proportion (FDP) is a convenient way to account for false positives when a large number $m$ of tests are performed simultaneously. Romano and Wolf [Ann. Statist. 35 (2007) 1378-1408] have proposed a general principle that builds FDP controlling procedures from $k$-family-wise error rate controlling procedures while incorporating dependencies in an appropriate manner; see Korn et al. [J. Statist. Plann. Inference 124 (2004) 379-398]; Romano and Wolf (2007). However, the theoretical validity of the latter is still largely unknown. This paper provides a careful study of this heuristic: first, we extend this approach by using a notion of "bounding device" that allows us to cover a wide range of critical values, including those that adapt to $m_0$, the number of true null hypotheses. Second, the theoretical validity of the latter is investigated both nonasymptotically and asymptotically. Third, we introduce suitable modifications of this heuristic that provide new methods, overcoming the existing procedures with a proven FDP control.
Submitted 5 June, 2015; v1 submitted 16 November, 2013;
originally announced November 2013.
-
On empirical distribution function of high-dimensional Gaussian vector components with an application to multiple testing
Authors:
Sylvain Delattre,
Etienne Roquain
Abstract:
This paper introduces a new framework to study the asymptotical behavior of the empirical distribution function (e.d.f.) of Gaussian vector components, whose correlation matrix $Γ^{(m)}$ is dimension-dependent. Hence, by contrast with the existing literature, the vector is not assumed to be stationary. Rather, we make a "vanishing second order" assumption ensuring that the covariance matrix $Γ^{(m)}$ is not too far from the identity matrix, while the behavior of the e.d.f. is affected by $Γ^{(m)}$ only through the sequence $γ_m=m^{-2} \sum_{i\neq j} Γ_{i,j}^{(m)}$, as $m$ grows to infinity. This result recovers some of the previous results for stationary long-range dependencies while it also applies to various, high-dimensional, non-stationary frameworks, for which the most correlated variables are not necessarily next to each other. Finally, we present an application of this work to the multiple testing problem, which was the initial statistical motivation for developing such a methodology.
Submitted 4 May, 2013; v1 submitted 9 October, 2012;
originally announced October 2012.
-
On the false discovery proportion convergence under Gaussian equi-correlation
Authors:
Sylvain Delattre,
Etienne Roquain
Abstract:
We study the convergence of the false discovery proportion (FDP) of the Benjamini-Hochberg procedure in the Gaussian equi-correlated model, when the correlation $ρ_m$ converges to zero as the hypothesis number $m$ grows to infinity. By contrast with the standard convergence rate $m^{1/2}$ holding under independence, this study shows that the FDP converges to the false discovery rate (FDR) at rate $\{\min(m,1/ρ_m)\}^{1/2}$ in this equi-correlated model.
Submitted 8 July, 2010;
originally announced July 2010.
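The setting of the entry above can be explored numerically. The following is a minimal sketch, not the authors' code: it implements the Benjamini-Hochberg step-up procedure, computes the FDP of a rejection set, and draws one-sided $p$-values from an equi-correlated Gaussian model. The function names (`bh_rejections`, `fdp`, `equicorrelated_pvalues`), the one-sided test, and the common alternative mean `mu` are illustrative assumptions.

```python
import numpy as np
from math import erfc, sqrt


def bh_rejections(pvalues, alpha):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: largest k such that p_(k) <= alpha * k / m.
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected


def fdp(rejected, is_null):
    """False discovery proportion: false rejections / rejections (0 if none)."""
    n_rejected = int(np.sum(rejected))
    if n_rejected == 0:
        return 0.0
    return float(np.sum(rejected & is_null)) / n_rejected


def equicorrelated_pvalues(m, rho, mu, is_null, rng):
    """One-sided p-values from X_i = sqrt(rho) W + sqrt(1 - rho) Z_i + mean_i.

    Assumption for illustration: true nulls have mean 0, alternatives mean mu.
    """
    w = rng.standard_normal()
    z = rng.standard_normal(m)
    x = np.sqrt(rho) * w + np.sqrt(1.0 - rho) * z + np.where(is_null, 0.0, mu)
    # 1 - Phi(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * np.array([erfc(v / sqrt(2.0)) for v in x])
```

Repeating the simulation for growing `m` with a correlation `rho` that shrinks with `m`, and recording the spread of the resulting FDP values, is one way to look at the kind of rate the abstract describes.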
-
Exact calculations for false discovery proportion with application to least favorable configurations
Authors:
Etienne Roquain,
Fanny Villers
Abstract:
In a context of multiple hypothesis testing, we provide several new exact calculations related to the false discovery proportion (FDP) of step-up and step-down procedures. For step-up procedures, we show that the number of erroneous rejections conditionally on the rejection number is simply a binomial variable, which leads to explicit computations of the c.d.f., the $s$-th moment and the mean of the FDP, the latter corresponding to the false discovery rate (FDR). For step-down procedures, we derive what is to our knowledge the first explicit formula for the FDR valid for any alternative c.d.f. of the $p$-values. We also derive explicit computations of the power for both step-up and step-down procedures. These formulas are "explicit" in the sense that they only involve the parameters of the model and the c.d.f. of the order statistics of i.i.d. uniform variables. The $p$-values are assumed either independent or coming from an equicorrelated multivariate normal model and an additional mixture model for the true/false hypotheses is used. This new approach is used to investigate new results which are of interest in their own right, related to least/most favorable configurations for the FDR and the variance of the FDP.
Submitted 26 May, 2010; v1 submitted 15 February, 2010;
originally announced February 2010.
-
Optimal weighting for false discovery rate control
Authors:
Etienne Roquain,
Mark Van De Wiel
Abstract:
How to weigh the Benjamini-Hochberg procedure? In the context of multiple hypothesis testing, we propose a new step-wise procedure that controls the false discovery rate (FDR) and we prove it to be more powerful than any weighted Benjamini-Hochberg procedure. Both finite-sample and asymptotic results are presented. Moreover, we illustrate good performance of our procedure in simulations and a genomics application. This work is particularly useful in the case of heterogeneous $p$-value distributions.
Submitted 13 July, 2009; v1 submitted 25 July, 2008;
originally announced July 2008.
-
Two simple sufficient conditions for FDR control
Authors:
Gilles Blanchard,
Etienne Roquain
Abstract:
We show that the control of the false discovery rate (FDR) for a multiple testing procedure is implied by two coupled simple sufficient conditions. The first one, which we call ``self-consistency condition'', concerns the algorithm itself, and the second, called ``dependency control condition'', is related to the dependency assumptions on the $p$-value family. Many standard multiple testing procedures are self-consistent (e.g. step-up, step-down or step-up-down procedures), and we prove that the dependency control condition can be fulfilled when choosing correspondingly appropriate rejection functions, in three classical types of dependency: independence, positive dependency (PRDS) and unspecified dependency. As a consequence, we recover earlier results through simple and unifying proofs while extending their scope in several regards: weighted FDR, $p$-value reweighting, new family of step-up procedures under unspecified $p$-value dependency and adaptive step-up procedures. We give additional examples of other possible applications. This framework also allows for defining and studying FDR control for multiple testing procedures over a continuous, uncountable space of hypotheses.
Submitted 21 October, 2008; v1 submitted 11 February, 2008;
originally announced February 2008.
-
Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests
Authors:
Sylvain Arlot,
Gilles Blanchard,
Etienne Roquain
Abstract:
We study generalized bootstrap confidence regions for the mean of a random vector whose coordinates have an unknown dependency structure. The random vector is supposed to be either Gaussian or to have a symmetric and bounded distribution. The dimensionality of the vector can possibly be much larger than the number of observations and we focus on a nonasymptotic control of the confidence level, following ideas inspired by recent results in learning theory. We consider two approaches, the first based on a concentration principle (valid for a large class of resampling weights) and the second on a resampled quantile, specifically using Rademacher weights. Several intermediate results established in the approach based on concentration principles are of interest in their own right. We also discuss the question of accuracy when using Monte Carlo approximations of the resampled quantities.
Submitted 11 January, 2010; v1 submitted 5 December, 2007;
originally announced December 2007.
-
Adaptive FDR control under independence and dependence
Authors:
Gilles Blanchard,
Etienne Roquain
Abstract:
In the context of multiple hypotheses testing, the proportion $π_0$ of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive. In this paper, we focus on the issue of False Discovery Rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR. First, in the context of assuming independent $p$-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation results comparing these procedures and testing their robustness when the independence assumption is violated. The new proposed procedures appear competitive with existing ones. The overall best, though, is reported to be Storey's estimator, but for a parameter setting that does not appear to have been considered before. Second, we propose adaptive versions of step-up procedures that have provably controlled FDR under positive dependences and unspecified dependences of the $p$-values, respectively. While simulations only show an improvement over non-adaptive procedures in limited situations, these are to our knowledge among the first theoretically founded adaptive multiple testing procedures that control the FDR when the $p$-values are not independent.
Submitted 17 February, 2009; v1 submitted 4 July, 2007;
originally announced July 2007.
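To make the notion of an adaptive procedure in the entry above concrete, here is a minimal sketch of a plug-in step-up procedure based on a Storey-type estimate of $π_0$. It is an illustration under stated assumptions, not the paper's procedures: the names `storey_pi0` and `adaptive_bh` are hypothetical, the estimator is the commonly used form $(\#\{p_i > λ\} + 1) / (m(1-λ))$, and the default `lam=0.5` is a placeholder rather than the parameter setting the abstract refers to.

```python
import numpy as np


def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls.

    Assumed form: (#{p_i > lam} + 1) / (m * (1 - lam)); lam is a tuning
    parameter chosen here only for illustration.
    """
    p = np.asarray(pvalues, dtype=float)
    return (np.sum(p > lam) + 1.0) / (p.size * (1.0 - lam))


def adaptive_bh(pvalues, alpha, lam=0.5):
    """Step-up procedure with thresholds alpha * k / (m * pi0_hat)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    pi0_hat = storey_pi0(p, lam)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0_hat)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Largest k whose ordered p-value falls under its threshold.
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```

When the estimate is below 1, the thresholds are larger than the non-adaptive ones, which is where the potential gain in power comes from; whether FDR control is retained depends on the estimator and on the dependence between the $p$-values.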
-
Resampling-based confidence regions and multiple tests for a correlated random vector
Authors:
Sylvain Arlot,
Gilles Blanchard,
Etienne Roquain
Abstract:
We derive non-asymptotic confidence regions for the mean of a random vector whose coordinates have an unknown dependence structure. The random vector is supposed to be either Gaussian or to have a symmetric bounded distribution, and we observe $n$ i.i.d. copies of it. The confidence regions are built using a data-dependent threshold based on a weighted bootstrap procedure. We consider two approaches, the first based on a concentration approach and the second on a direct bootstrapped quantile approach. The first one allows us to deal with a very large class of resampling weights while our results for the second are restricted to Rademacher weights. However, the second method seems more accurate in practice. Our results are motivated by multiple testing problems, and we show on simulations that our procedures are better than the Bonferroni procedure (union bound) as soon as the observed vector has sufficiently correlated coordinates.
Submitted 22 January, 2007;
originally announced January 2007.
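The bootstrapped quantile idea in the entry above can be illustrated with a simplified Monte Carlo sketch. This is not the authors' exact threshold, which comes with non-asymptotic correction terms not reproduced here: the function name `rademacher_quantile`, the choice of the sup-norm as the statistic, and the number of draws are all assumptions made for illustration.

```python
import numpy as np


def rademacher_quantile(y, alpha, n_draws, rng):
    """Monte Carlo (1 - alpha)-quantile of the sup-norm of the
    Rademacher-resampled, empirically centered mean.

    y has shape (n, K): n observations of a K-dimensional vector.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    centered = y - y.mean(axis=0)
    # Each row is one draw of n i.i.d. Rademacher signs.
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n))
    resampled_means = signs @ centered / n  # shape (n_draws, K)
    stats = np.abs(resampled_means).max(axis=1)
    return float(np.quantile(stats, 1.0 - alpha))
```

A region of the form "all $μ$ whose sup-norm distance to the empirical mean is at most this threshold" then adapts to the correlation between coordinates, unlike a Bonferroni-type bound that treats each coordinate separately.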