-
Adaptive novelty detection with false discovery rate guarantee
Authors:
Ariane Marandon,
Lihua Lei,
David Mary,
Etienne Roquain
Abstract:
This paper studies the semi-supervised novelty detection problem where a set of "typical" measurements is available to the researcher. Motivated by recent advances in multiple testing and conformal inference, we propose AdaDetect, a flexible method that is able to wrap around any probabilistic classification algorithm and control the false discovery rate (FDR) on detected novelties in finite samples without any distributional assumption other than exchangeability. In contrast to classical FDR-controlling procedures that are often committed to a pre-specified p-value function, AdaDetect learns the transformation in a data-adaptive manner to focus the power on the directions that distinguish between inliers and outliers. Inspired by the multiple testing literature, we further propose variants of AdaDetect that are adaptive to the proportion of nulls while maintaining the finite-sample FDR control. The methods are illustrated on synthetic datasets and real-world datasets, including an application in astrophysics.
Submitted 25 October, 2023; v1 submitted 13 August, 2022;
originally announced August 2022.
-
Semi-supervised standardized detection of extrasolar planets
Authors:
S. Sulis,
D. Mary,
L. Bigot,
M. Deleuil
Abstract:
The detection of small exoplanets with the radial velocity (RV) technique is limited by various poorly known noise sources of instrumental and stellar origin. As a consequence, current detection techniques often fail to provide reliable estimates of the significance levels of detection tests (p-values). We designed an RV detection procedure that provides reliable p-value estimates while accounting for the various noise sources. The method can incorporate ancillary information about the noise (e.g., stellar activity indicators) and specific data- or context-driven data (e.g., instrumental measurements, simulations of stellar variability). The detection part of the procedure uses a detection test that is applied to a standardized periodogram. Standardization allows an autocalibration of the noise sources with partially unknown statistics. The estimation of the p-value of the test output is based on dedicated Monte Carlo simulations that allow handling unknown parameters. The procedure is versatile in the sense that the specific pair (periodogram and test) is chosen by the user. We demonstrate by extensive numerical experiments on synthetic and real RV data from the Sun and α Cen B that the proposed method allows reliable estimation of the p-values. The method also provides a way to evaluate how the estimated p-values attributed to a reported detection depend on modeling errors. Evaluating this dependence is critical for RV planet detection at low signal-to-noise ratio. The Python algorithms are available on GitHub. Accurate estimation of p-values when unknown parameters are involved is an important but only recently addressed question in the field of RV detection. Although this work presents a method to do this, the statistical literature discussed in this paper may trigger the development of other strategies.
Submitted 12 July, 2022; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Semi-supervised multiple testing
Authors:
David Mary,
Etienne Roquain
Abstract:
An important limitation of standard multiple testing procedures is that the null distribution should be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical $p$-values. These bounds match when $α(n+1)/m$ is an integer, where $n$ is the NTS sample size and $m$ is the number of tests. Second, we give a power analysis for that procedure suggesting that the price to pay for ignoring the null distribution is low when $n$ is sufficiently large relative to $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of "detectable" alternatives. Third, to complete the picture, we also present a negative result that exhibits an intrinsic phase transition in the general semi-supervised multiple testing problem and shows that the empirical BH method is optimal in the sense that its performance boundary follows this phase transition. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of the correct order without further tuning of any constant. Finally, we demonstrate that our work provides a theoretical ground for standard practice in astronomical data analysis, and in particular for the procedure proposed in \cite{Origin2020} for galaxy detection.
Submitted 24 November, 2021; v1 submitted 25 June, 2021;
originally announced June 2021.
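The BH procedure based on empirical $p$-values, as described in the abstract above, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `empirical_p_values` and `benjamini_hochberg` are hypothetical, and the sketch assumes scalar test statistics for which larger values are more extreme.

```python
import numpy as np


def empirical_p_values(null_scores, test_scores):
    """Empirical p-values computed from a null training sample (NTS).

    Assumption: larger scores are more extreme. For a test score t,
    p = (1 + #{null scores >= t}) / (n + 1), with n the NTS size.
    """
    null_sorted = np.sort(np.asarray(null_scores, dtype=float))
    test_scores = np.asarray(test_scores, dtype=float)
    n = null_sorted.size
    # searchsorted(side="left") counts null scores strictly below t,
    # so n minus that count is the number of null scores >= t.
    n_ge = n - np.searchsorted(null_sorted, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(p_values, alpha):
    """Boolean rejection mask of the BH step-up procedure at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest index meeting its threshold.
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```

A typical use would be `benjamini_hochberg(empirical_p_values(nts, scores), alpha)`, where `nts` holds the $n$ null scores and `scores` the $m$ test statistics. The smallest attainable empirical $p$-value is $1/(n+1)$, which gives an intuition for why the NTS size $n$ must be large enough relative to $m$ for any rejection to be possible.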
-
A Bootstrap Method for Sinusoid Detection in Colored Noise and Uneven Sampling. Application to Exoplanet Detection
Authors:
Sophia Sulis,
David Mary,
Lionel Bigot
Abstract:
This study is motivated by the problem of evaluating reliable false alarm (FA) rates for sinusoid detection tests applied to unevenly sampled time series involving colored noise, when a (small) training data set of this noise is available. While analytical expressions for the FA rate are out of reach in this situation, we show that it is possible to combine specific periodogram standardization and bootstrap techniques to consistently estimate the FA rate. We also show that the procedure can be improved by using generalized extreme value distributions. The paper presents several numerical results including a case study in exoplanet detection from radial velocity data.
Submitted 20 June, 2017;
originally announced June 2017.
-
Using hydrodynamical simulations of stellar atmospheres for periodogram standardization : application to exoplanet detection
Authors:
Sophia Sulis,
David Mary,
Lionel Bigot
Abstract:
Our aim is to devise a detection method for exoplanet signatures (multiple sinusoids) that is both powerful and robust to partially unknown statistics under the null hypothesis. In the considered application, the noise is mostly created by the stellar atmosphere, with statistics depending on the complicated interplay of several parameters. Recent progress in hydrodynamic (HD) simulations shows, however, that realistic stellar noise realizations can be numerically produced off-line by astrophysicists. We propose a detection method that is calibrated by HD simulations and analyze its performance. A comparison of the theoretical results with simulations on synthetic and real data shows that the proposed method is powerful and robust.
Submitted 27 January, 2016;
originally announced January 2016.
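The idea of calibrating a periodogram with noise realizations, as described in the abstract above, can be illustrated with a simplified sketch. It is not the authors' implementation: it assumes regularly sampled series (so that the FFT applies) and divides the periodogram of the data by the average periodogram of a set of noise-only training series. The function names `periodogram` and `standardized_periodogram` are hypothetical.

```python
import numpy as np


def periodogram(x):
    """Classical periodogram |FFT|^2 / N of a regularly sampled series.

    The zero-frequency bin is dropped, since it only carries the mean.
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    return (np.abs(spectrum) ** 2 / x.size)[1:]


def standardized_periodogram(x, noise_training):
    """Periodogram of x divided, bin by bin, by the averaged periodogram
    of the noise training series (one series per row, same length as x)."""
    noise_training = np.atleast_2d(np.asarray(noise_training, dtype=float))
    averaged = np.mean([periodogram(row) for row in noise_training], axis=0)
    return periodogram(x) / averaged
```

A detection test can then be applied to the standardized ordinates, for example by taking their maximum. The division is meant to flatten the colored noise spectrum so that the test statistic depends less on the unknown noise statistics.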
-
Statistical characterization of polychromatic absolute and differential squared visibilities obtained from AMBER/VLTI instrument
Authors:
Antony Schutz,
Martin Vannier,
David Mary,
Andre Ferrari,
Florentin Millour,
Romain Petrov
Abstract:
In optical interferometry, the visibility squared moduli are generally assumed to follow a Gaussian distribution and to be independent of each other. A quantitative analysis of the relevance of such assumptions is important to help improve the exploitation of existing and upcoming multi-wavelength interferometric instruments. We analyze the statistical behaviour of both the absolute and the colour-differential squared visibilities: distribution laws, correlations and cross-correlations between different baselines. We use observations of stellar calibrators obtained with the AMBER instrument on the VLTI in different instrumental and observing configurations, from which we extract the frame-by-frame transfer function. Statistical hypothesis tests and diagnostics are then systematically applied. For both absolute and differential squared visibilities and under all instrumental and observing conditions, we find a better fit for the Student distribution than for the Gaussian, log-normal and Cauchy distributions. We find and analyze clear correlation effects caused by atmospheric perturbations. The differential squared visibilities make it possible to keep a larger fraction of data than the selected absolute squared visibilities and thus benefit from reduced temporal dispersion, while their distribution is more clearly characterized. Frame selection based on a fixed SNR criterion might result in either a biased sample of frames or an overly severe selection.
Submitted 14 May, 2014;
originally announced May 2014.
-
Blind and fully constrained unmixing of hyperspectral images
Authors:
Rita Ammanouil,
André Ferrari,
Cédric Richard,
David Mary
Abstract:
This paper addresses the problem of blind and fully constrained unmixing of hyperspectral images. Unmixing is performed without the use of any dictionary, and assumes that the number of constituent materials in the scene and their spectral signatures are unknown. The estimated abundances satisfy the desired sum-to-one and nonnegativity constraints. Two models with increasing complexity are developed to achieve this challenging task, depending on how noise interacts with hyperspectral data. The first one leads to a convex optimization problem, and is solved with the Alternating Direction Method of Multipliers. The second one accounts for signal-dependent noise, and is addressed with a Reweighted Least Squares algorithm. Experiments on synthetic and real data demonstrate the effectiveness of our approach.
Submitted 2 March, 2014;
originally announced March 2014.
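The sum-to-one and nonnegativity constraints on the abundances mentioned in the abstract above mean that each abundance vector lies on the probability simplex. As a hedged illustration only, the sketch below shows the standard sort-based Euclidean projection onto the simplex, a common building block in constrained solvers. It is not the ADMM or Reweighted Least Squares algorithm of the paper, and the function name `project_to_simplex` is hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}.

    Sort-based method: find the shift theta such that clipping
    v - theta at zero yields a vector summing to one.
    """
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u) - 1.0             # cumulative sums minus the target sum
    idx = np.arange(1, v.size + 1)
    cond = u - css / idx > 0             # entries that stay positive after the shift
    rho = idx[cond][-1]                  # number of strictly positive entries
    theta = css[rho - 1] / rho           # common shift applied to all entries
    return np.maximum(v - theta, 0.0)
```

For example, `project_to_simplex([2.0, 0.0])` returns `[1.0, 0.0]`: the mass is placed entirely on the first component, and the result is nonnegative and sums to one.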