-
Model uncertainty quantification using feature confidence sets for outcome excursions
Authors:
Junting Ren,
Armin Schwartzman
Abstract:
When implementing prediction models for high-stakes real-world applications such as medicine, finance, and autonomous systems, quantifying prediction uncertainty is critical for effective risk management. Traditional approaches to uncertainty quantification, such as confidence and prediction intervals, provide probability coverage guarantees for the expected outcomes $f(\boldsymbol{x})$ or the realized outcomes $f(\boldsymbol{x})+ε$. Instead, this paper introduces a novel, model-agnostic framework for quantifying uncertainty in continuous and binary outcomes using confidence sets for outcome excursions, where the goal is to identify a subset of the feature space where the expected or realized outcome exceeds a specific value. The proposed method constructs data-dependent inner and outer confidence sets that aim to contain the true feature subset for which the expected or realized outcomes of these features exceed a specified threshold. We establish theoretical guarantees for the probability that these confidence sets contain the true feature subset, both asymptotically and for finite sample sizes. The framework is validated through simulations and applied to real-world datasets, demonstrating its utility in contexts such as housing price prediction and time to sepsis diagnosis in healthcare. This approach provides a unified method for uncertainty quantification that is broadly applicable across various continuous and binary prediction models.
Submitted 28 April, 2025;
originally announced April 2025.
-
Spatial Confidence Regions for Excursion Sets with False Discovery Rate Control
Authors:
Howon Ryu,
Thomas Maullin-Sapey,
Armin Schwartzman,
Samuel Davenport
Abstract:
Identifying areas where the signal is prominent is an important task in image analysis, with particular applications in brain mapping. In this work, we develop confidence regions for spatial excursion sets above and below a given level. We achieve this by treating the confidence procedure as a testing problem at the given level, allowing control of the False Discovery Rate (FDR). Methods are developed to control the FDR, separately for positive and negative excursions, as well as jointly over both. Furthermore, power is increased by incorporating a two-stage adaptive procedure. Simulation results with various signals show that our confidence regions successfully control the FDR under the nominal level. We showcase our methods with an application to functional magnetic resonance imaging (fMRI) data from the Human Connectome Project illustrating the improvement in statistical power over existing approaches.
Submitted 17 April, 2025;
originally announced April 2025.
-
On the peak height distribution of non-stationary Gaussian random fields: 1D general covariance and scale space
Authors:
Yu Zhao,
Dan Cheng,
Samuel Davenport,
Armin Schwartzman
Abstract:
We study the peak height distribution of certain non-stationary Gaussian random fields. The explicit peak height distribution of smooth, non-stationary Gaussian processes in 1D with general covariance is derived. The formula is determined by two parameters, each of which has a clear statistical meaning. For multidimensional non-stationary Gaussian random fields, we generalize these results to the setting of scale space fields, which play an important role in peak detection by helping to handle peaks of different spatial extents. We demonstrate that these properties not only offer a better interpretation of the scale space field but also simplify the computation of the peak height distribution. Finally, two efficient numerical algorithms are proposed as a general solution for computing the peak height distribution of smooth multidimensional Gaussian random fields in applications.
Submitted 17 February, 2025;
originally announced February 2025.
-
Peak Inference for Gaussian Random Fields on a Lattice
Authors:
Tuo Lin,
Armin Schwartzman,
Samuel Davenport
Abstract:
In this work we develop a Monte Carlo method to compute the height distribution of local maxima of a stationary Gaussian or Gaussian-related random field that is observed on a regular lattice. We show that our method can be used to provide valid peak based inference in datasets with low levels of smoothness, where existing formulae derived for continuous domains are not accurate. We also extend the methods in Worsley (2005) and Taylor et al. (2007) to compute the peak height distribution and compare them with our approach. Lastly, we apply our method to a task fMRI dataset to show how it can be used in practice.
Submitted 22 January, 2025;
originally announced January 2025.
-
Robust FWER control in Neuroimaging using Random Field Theory: Riding the SuRF to Continuous Land Part 2
Authors:
Samuel Davenport,
Armin Schwartzman,
Thomas E. Nichols,
Fabian J. E. Telschow
Abstract:
Historically, applications of RFT in fMRI have relied on assumptions of smoothness, stationarity and Gaussianity. The first two assumptions have been addressed in Part 1 of this article series. Here we address the severe non-Gaussianity of (real) fMRI data to greatly improve the performance of voxelwise RFT in fMRI group analysis. In particular, we introduce a transformation which accelerates the convergence of the Central Limit Theorem, allowing us to rely on limiting Gaussianity of the test-statistic. We shall show that, when the GKF is combined with the Gaussianization transformation, we are able to accurately estimate the EEC of the excursion set of the transformed test-statistic even when the data is non-Gaussian. This allows us to drop the key assumptions of RFT inference and enables us to provide a fast approach which correctly controls the voxelwise false positive rate in fMRI. We employ a big-data validation in the style of Eklund et al. (2016), in which we process resting state data from 7000 subjects from the UK BioBank with fake task designs. We resample from this data to create realistic noise and use this to demonstrate that the error rate is correctly controlled.
Submitted 17 December, 2023;
originally announced December 2023.
-
Averaging symmetric positive-definite matrices on the space of eigen-decompositions
Authors:
Sungkyu Jung,
Brian Rooks,
David Groisser,
Armin Schwartzman
Abstract:
We study extensions of Fréchet means for random objects in the space ${\rm Sym}^+(p)$ of $p \times p$ symmetric positive-definite matrices using the scaling-rotation geometric framework introduced by Jung et al. [\textit{SIAM J. Matrix. Anal. Appl.} \textbf{36} (2015) 1180-1201]. The scaling-rotation framework is designed to enjoy a clearer interpretation of the changes in random ellipsoids in terms of scaling and rotation. In this work, we formally define the \emph{scaling-rotation (SR) mean set} to be the set of Fréchet means in ${\rm Sym}^+(p)$ with respect to the scaling-rotation distance. Since computing such means requires a difficult optimization, we also define the \emph{partial scaling-rotation (PSR) mean set} lying on the space of eigen-decompositions as a proxy for the SR mean set. The PSR mean set is easier to compute and its projection to ${\rm Sym}^+(p)$ often coincides with the SR mean set. Minimal conditions are required to ensure that the mean sets are non-empty. Because eigen-decompositions are never unique, neither are PSR means, but we give sufficient conditions for the sample PSR mean to be unique up to the action of a certain finite group. We also establish strong consistency of the sample PSR means as estimators of the population PSR mean set, and a central limit theorem. In an application to multivariate tensor-based morphometry, we demonstrate that a two-group test using the proposed PSR means can have greater power than the two-group test using the usual affine-invariant geometric framework for symmetric positive-definite matrices.
Submitted 21 June, 2023;
originally announced June 2023.
-
An approximation to peak detection power using Gaussian random field theory
Authors:
Yu Zhao,
Dan Cheng,
Armin Schwartzman
Abstract:
We study power approximation formulas for peak detection using Gaussian random field theory. The approximation, based on the expected number of local maxima above the threshold $u$, $\mathbb{E}[M_u]$, is proved to work well under three asymptotic scenarios: small domain, large threshold, and sharp signal. An adjusted version of $\mathbb{E}[M_u]$ is also proposed to improve accuracy when the expected number of local maxima $\mathbb{E}[M_{-\infty}]$ exceeds 1.
Cheng and Schwartzman (2018) developed explicit formulas for $\mathbb{E}[M_u]$ of smooth isotropic Gaussian random fields with zero mean. In this paper, these formulas are extended to allow for rotational symmetric mean functions, so that they are suitable for power calculations. We also apply our formulas to 2D and 3D simulated datasets, and the 3D data is induced by a group analysis of fMRI data from the Human Connectome Project to measure performance in a realistic setting.
Submitted 15 January, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Inverse set estimation and inversion of simultaneous confidence intervals
Authors:
Junting Ren,
Fabian J. E. Telschow,
Armin Schwartzman
Abstract:
Motivated by the questions of risk assessment in climatology (temperature change in North America) and medicine (impact of statin usage and COVID-19 on hospitalized patients), we address the problem of estimating the set in the domain of a function whose image equals a predefined subset. Existing methods that construct confidence sets require strict assumptions. We generalize the estimation of such sets to dense and non-dense domains with protection against "data peeking" by proving that confidence sets of multiple levels can be simultaneously constructed with the desired confidence non-asymptotically through inverting simultaneous confidence bands. A non-parametric bootstrap algorithm and code are provided.
Submitted 11 July, 2023; v1 submitted 8 October, 2022;
originally announced October 2022.
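The inversion of simultaneous confidence bands mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the band is already available as lower and upper bounds at a finite list of domain points; the function name `invert_band`, the use of non-strict inequalities, and the toy numbers are assumptions made for illustration, not the authors' code.

```python
def invert_band(lower, upper, c):
    """Invert a simultaneous confidence band at level c (illustrative sketch).

    lower, upper: band limits at each domain point (same length).
    Returns index lists (inner, outer) where
      inner = points whose lower limit is at least c,
      outer = points whose upper limit is at least c.
    If the band covers the true function everywhere, then
    inner is contained in {x : f(x) >= c}, which is contained in outer.
    """
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    inner = [i for i, lo in enumerate(lower) if lo >= c]
    outer = [i for i, up in enumerate(upper) if up >= c]
    return inner, outer


# Toy band at four domain points (hypothetical numbers).
lower = [0.0, 1.0, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]
inner, outer = invert_band(lower, upper, c=2.5)
print(inner, outer)
```

Because the same band is reused for every threshold `c`, calling the function with several values of `c` gives confidence sets for multiple levels from a single band, which is the sense in which the construction is simultaneous.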
-
Spatial Confidence Regions for Combinations of Excursion Sets in Image Analysis
Authors:
Thomas Maullin-Sapey,
Armin Schwartzman,
Thomas E. Nichols
Abstract:
The analysis of excursion sets in imaging data is essential to a wide range of scientific disciplines such as neuroimaging, climatology and cosmology. Despite growing literature, there is little published concerning the comparison of processes that have been sampled across the same spatial region but which reflect different study conditions. Given a set of asymptotically Gaussian random fields, each corresponding to a sample acquired for a different study condition, this work aims to provide confidence statements about the intersection, or union, of the excursion sets across all fields. Such spatial regions are of natural interest as they directly correspond to the questions "Where do all random fields exceed a predetermined threshold?", or "Where does at least one random field exceed a predetermined threshold?". To assess the degree of spatial variability present, we develop a method that provides, with a desired confidence, subsets and supersets of spatial regions defined by logical conjunctions (i.e. set intersections) or disjunctions (i.e. set unions), without any assumption on the dependence between the different fields. The method is verified by extensive simulations and demonstrated using a task-fMRI dataset to identify brain regions with activation common to four variants of a working memory task.
Submitted 7 January, 2022;
originally announced January 2022.
-
Simultaneous Confidence Bands for Functional Data Using the Gaussian Kinematic Formula
Authors:
Fabian J. E. Telschow,
Armin Schwartzman
Abstract:
This article constructs simultaneous confidence bands (SCBs) for functional parameters using the Gaussian Kinematic formula of $t$-processes (tGKF). Although the tGKF relies on Gaussianity, we show that a central limit theorem (CLT) for the parameter of interest is enough to obtain asymptotically precise covering rates even for non-Gaussian processes. As a proof of concept we study the functional signal-plus-noise model and derive a CLT for an estimator of the Lipschitz-Killing curvatures, the only data dependent quantities in the tGKF SCBs. Extensions to discrete sampling with additive observation noise are discussed using scale space ideas from regression analysis. Here we provide sufficient conditions on the processes and kernels to obtain convergence of the functional scale space surface.
The theoretical work is accompanied by a simulation study comparing different methods to construct SCBs for the population mean. We show that the tGKF works well even for small sample sizes and only a Rademacher multiplier-$t$ bootstrap performs similarly well. For larger sample sizes the tGKF often outperforms the bootstrap methods and is computationally faster. We apply the method to diffusion tensor imaging (DTI) fibers using a scale space approach for the difference of population means. R code is available in our R package SCBfda.
Submitted 18 January, 2019;
originally announced January 2019.
-
Machine Learning in High Energy Physics Community White Paper
Authors:
Kim Albertsson,
Piero Altoe,
Dustin Anderson,
John Anderson,
Michael Andrews,
Juan Pedro Araque Espinosa,
Adam Aurisano,
Laurent Basara,
Adrian Bevan,
Wahid Bhimji,
Daniele Bonacorsi,
Bjorn Burkle,
Paolo Calafiura,
Mario Campanelli,
Louis Capps,
Federico Carminati,
Stefano Carrazza,
Yi-fan Chen,
Taylor Childers,
Yann Coadou,
Elias Coniavitis,
Kyle Cranmer,
Claire David,
Douglas Davis,
Andrea De Simone
, et al. (103 additional authors not shown)
Abstract:
Machine learning has been applied to several problems in particle physics research, beginning with applications to high-level physics analysis in the 1990s and 2000s, followed by an explosion of applications in particle and event identification and reconstruction in the 2010s. In this document we discuss promising future research and development areas for machine learning in particle physics. We detail a roadmap for their implementation, software and hardware resource requirements, collaborative initiatives with the data science community, academia and industry, and training the particle physics community in data science. The main objective of the document is to connect and motivate these areas of research and development with the physics drivers of the High-Luminosity Large Hadron Collider and future neutrino experiments and identify the resource needs for their implementation. Additionally we identify areas where collaboration with external communities will be of great benefit.
Submitted 16 May, 2019; v1 submitted 8 July, 2018;
originally announced July 2018.
-
Standardization of multivariate Gaussian mixture models and background adjustment of PET images in brain oncology
Authors:
Meng Li,
Armin Schwartzman
Abstract:
In brain oncology, it is routine to evaluate the progress or remission of the disease based on the differences between a pre-treatment and a post-treatment Positron Emission Tomography (PET) scan. Background adjustment is necessary to reduce confounding by tissue-dependent changes not related to the disease. When modeling the voxel intensities for the two scans as a bivariate Gaussian mixture, background adjustment translates into standardizing the mixture at each voxel, while tumor lesions present themselves as outliers to be detected. In this paper, we address the question of how to standardize the mixture to a standard multivariate normal distribution, so that the outliers (i.e., tumor lesions) can be detected using a statistical test. We show theoretically and numerically that the tail distribution of the standardized scores is favorably close to standard normal in a wide range of scenarios while being conservative at the tails, validating voxelwise hypothesis testing based on standardized scores. To address standardization in spatially heterogeneous image data, we propose a spatial and robust multivariate expectation-maximization (EM) algorithm, where prior class membership probabilities are provided by transformation of spatial probability template maps and the estimates of the class means and covariances are robust to outliers. Simulations in both univariate and bivariate cases suggest that standardized scores with soft assignment have tail probabilities that are either very close to or more conservative than standard normal. The proposed methods are applied to a real data set from a PET phantom experiment, yet they are generic and can be used in other contexts.
Submitted 13 February, 2018; v1 submitted 23 October, 2017;
originally announced October 2017.
-
Weakly Supervised Classification in High Energy Physics
Authors:
Lucio Mwinmaarong Dery,
Benjamin Nachman,
Francesco Rubbo,
Ariel Schwartzman
Abstract:
As machine learning algorithms become increasingly sophisticated to exploit subtle features of the data, they often become more dependent on simulations. This paper presents a new approach called weakly supervised classification in which class proportions are the only input into the machine learning algorithm. Using one of the most challenging binary classification tasks in high energy physics - quark versus gluon tagging - we show that weakly supervised classification can match the performance of fully supervised algorithms. Furthermore, by design, the new algorithm is insensitive to any mis-modeling of discriminating features in the data by the simulation. Weakly supervised classification is a general procedure that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are not reliable or not available.
Submitted 2 July, 2017; v1 submitted 1 February, 2017;
originally announced February 2017.
-
Jet-Images -- Deep Learning Edition
Authors:
Luke de Oliveira,
Michael Kagan,
Lester Mackey,
Benjamin Nachman,
Ariel Schwartzman
Abstract:
Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can out-perform standard physically-motivated feature driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. This interplay between physically-motivated feature driven tools and supervised learning algorithms is general and can be used to significantly increase the sensitivity to discover new particles and new forces, and gain a deeper understanding of the physics within jets.
Submitted 22 January, 2017; v1 submitted 16 November, 2015;
originally announced November 2015.
-
Fuzzy Jets
Authors:
Lester Mackey,
Benjamin Nachman,
Ariel Schwartzman,
Conrad Stansbury
Abstract:
Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets. To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets, are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variables. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.
Submitted 7 September, 2015;
originally announced September 2015.
-
Confidence regions for excursion sets in asymptotically Gaussian random fields, with an application to climate
Authors:
Max Sommerfeld,
Stephen Sain,
Armin Schwartzman
Abstract:
The goal of this paper is to give confidence regions for the excursion set of a spatial function above a given threshold from repeated noisy observations on a fine grid of fixed locations. Given an asymptotically Gaussian estimator of the target function, a pair of data-dependent nested excursion sets are constructed that are sub- and super-sets of the true excursion set, respectively, with a desired confidence. Asymptotic coverage probabilities are determined via a multiplier bootstrap method, not requiring Gaussianity of the original data nor stationarity or smoothness of the limiting Gaussian field. The method is used to determine regions in North America where the mean summer and winter temperatures are expected to increase by mid 21st century by more than 2 degrees Celsius.
Submitted 28 January, 2015;
originally announced January 2015.
-
Lognormal Distributions and Geometric Averages of Positive Definite Matrices
Authors:
Armin Schwartzman
Abstract:
This article gives a formal definition of a lognormal family of probability distributions on the set of symmetric positive definite (PD) matrices, seen as a matrix-variate extension of the univariate lognormal family of distributions. Two forms of this distribution are obtained as the large sample limiting distribution via the central limit theorem of two types of geometric averages of i.i.d. PD matrices: the log-Euclidean average and the canonical geometric average. These averages correspond to two different geometries imposed on the set of PD matrices. The limiting distributions of these averages are used to provide large-sample confidence regions for the corresponding population means. The methods are illustrated on a voxelwise analysis of diffusion tensor imaging data, permitting a comparison between the various average types from the point of view of their sampling variability.
Submitted 28 July, 2014; v1 submitted 23 July, 2014;
originally announced July 2014.
-
Multiple Testing of Local Maxima for Detection of Peaks in Random Fields
Authors:
Dan Cheng,
Armin Schwartzman
Abstract:
A topological multiple testing scheme is presented for detecting peaks in images under stationary ergodic Gaussian noise, where tests are performed at local maxima of the smoothed observed signals. The procedure generalizes the one-dimensional scheme of Schwartzman et al. (2011) to Euclidean domains of arbitrary dimension. Two methods are developed according to two different ways of computing p-values: (i) using the exact distribution of the height of local maxima (Cheng and Schwartzman, 2014), available explicitly when the noise field is isotropic; (ii) using an approximation to the overshoot distribution of local maxima above a pre-threshold (Cheng and Schwartzman, 2014), applicable when the exact distribution is unknown, such as when the stationary noise field is non-isotropic. The algorithms, combined with the Benjamini-Hochberg procedure for thresholding p-values, provide asymptotic strong control of the False Discovery Rate (FDR) and power consistency, with specific rates, as the search space and signal strength get large. The optimal smoothing bandwidth and optimal pre-threshold are obtained to achieve maximum power. Simulations show that FDR levels are maintained in non-asymptotic conditions. The methods are illustrated in a nanoscopy image analysis problem of detecting fluorescent molecules against the image background.
Submitted 6 May, 2014; v1 submitted 6 May, 2014;
originally announced May 2014.
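The abstract above relies on the Benjamini-Hochberg procedure to threshold the p-values computed at local maxima. As a rough illustration of that final step only, here is a minimal pure-Python sketch of the standard step-up rule; the function name `benjamini_hochberg` and the example p-values are hypothetical, and the computation of the peak p-values themselves is not shown.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up rule (illustrative sketch).

    Returns a list of booleans, True where the hypothesis is rejected.
    With m p-values sorted as p_(1) <= ... <= p_(m), find the largest k
    such that p_(k) <= k * alpha / m and reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten candidate peaks.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))
```

Note the step-up behavior: a p-value that fails its own rank threshold is still rejected if a larger p-value passes a later threshold, which is why the loop tracks the largest passing rank rather than stopping at the first failure.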
-
Multiple testing of local maxima for detection of peaks in ChIP-Seq data
Authors:
Armin Schwartzman,
Andrew Jaffe,
Yulia Gavrilov,
Clifford A. Meyer
Abstract:
A topological multiple testing approach to peak detection is proposed for the problem of detecting transcription factor binding sites in ChIP-Seq data. After kernel smoothing of the tag counts over the genome, the presence of a peak is tested at each observed local maximum, followed by multiple testing correction at the desired false discovery rate level. Valid p-values for candidate peaks are computed via Monte Carlo simulations of smoothed Poisson sequences, whose background Poisson rates are obtained via linear regression from a Control sample at two different scales. The proposed method identifies nearby binding sites that other methods do not.
Submitted 28 May, 2013;
originally announced May 2013.
-
Peak Detection as Multiple Testing
Authors:
Armin Schwartzman,
Yulia Gavrilov,
Robert J. Adler
Abstract:
This paper considers the problem of detecting equal-shaped non-overlapping unimodal peaks in the presence of Gaussian ergodic stationary noise, where the number, location and heights of the peaks are unknown. A multiple testing approach is proposed in which, after kernel smoothing, the presence of a peak is tested at each observed local maximum. The procedure provides strong control of the family wise error rate and the false discovery rate asymptotically as both the signal-to-noise ratio (SNR) and the search space get large, where the search space may grow exponentially as a function of SNR. Simulations assuming a Gaussian peak shape and a Gaussian autocorrelation function show that desired error levels are achieved for relatively low SNR and are robust to partial peak overlap. Simulations also show that detection power is maximized when the smoothing bandwidth is close to the bandwidth of the signal peaks, akin to the well-known matched filter theorem in signal processing. The procedure is illustrated in an analysis of electrical recordings of neuronal cell activity.
Submitted 11 August, 2010;
originally announced August 2010.
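The pipeline this abstract describes (kernel smoothing, a test at each observed local maximum, then multiple testing correction) can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation. Several pieces are assumptions made for illustration: the noise is taken to be white Gaussian with known unit variance, the p-values come from a Monte Carlo null of simulated noise-only local maxima rather than the theoretical height distribution used in the paper, and the function names (`gaussian_kernel`, `local_maxima`, `benjamini_hochberg`, `detect_peaks`) are hypothetical.

```python
import numpy as np

def gaussian_kernel(bandwidth):
    """Truncated Gaussian kernel with unit L2 norm, so that smoothed
    unit-variance white noise keeps unit variance away from the edges."""
    half = int(np.ceil(4 * bandwidth))
    t = np.arange(-half, half + 1)
    w = np.exp(-0.5 * (t / bandwidth) ** 2)
    return w / np.sqrt(np.sum(w ** 2))

def local_maxima(y):
    """Interior indices whose value is strictly above both neighbors."""
    return np.flatnonzero((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])) + 1

def benjamini_hochberg(pvals, alpha):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    if below.any():
        k = np.max(np.flatnonzero(below))  # largest index passing the rule
        reject[order[:k + 1]] = True
    return reject

def detect_peaks(y, bandwidth, alpha, rng, n_null=200):
    """Smooth y, test each local maximum, and apply BH at level alpha.

    Assumption: the null distribution of local-maximum heights is
    approximated by simulating smoothed unit-variance white noise.
    """
    w = gaussian_kernel(bandwidth)
    smooth = np.convolve(y, w, mode="same")
    candidates = local_maxima(smooth)

    # Monte Carlo null: heights of local maxima of smoothed pure noise.
    null_heights = []
    for _ in range(n_null):
        z = np.convolve(rng.standard_normal(len(y)), w, mode="same")
        null_heights.append(z[local_maxima(z)])
    null_heights = np.sort(np.concatenate(null_heights))

    # p-value: fraction of null heights at least as large as the observed one
    # (with the usual +1 correction so that no p-value is exactly zero).
    n_ge = len(null_heights) - np.searchsorted(
        null_heights, smooth[candidates], side="left")
    pvals = (1 + n_ge) / (1 + len(null_heights))

    keep = benjamini_hochberg(pvals, alpha)
    return candidates[keep], pvals[keep]
```

Calling `detect_peaks(y, bandwidth=5.0, alpha=0.05, rng=np.random.default_rng(0))` on a noisy sequence returns the indices of the local maxima declared significant, together with their p-values. Choosing `bandwidth` close to the width of the expected peaks follows the matched-filter observation in the abstract.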
-
Empirical null and false discovery rate inference for exponential families
Authors:
Armin Schwartzman
Abstract:
In large scale multiple testing, the use of an empirical null distribution rather than the theoretical null distribution can be critical for correct inference. This paper proposes a ``mode matching'' method for fitting an empirical null when the theoretical null belongs to any exponential family. Based on the central matching method for $z$-scores, mode matching estimates the null density by fitting an appropriate exponential family to the histogram of the test statistics by Poisson regression in a region surrounding the mode. The empirical null estimate is then used to estimate local and tail false discovery rate (FDR) for inference. Delta-method covariance formulas and approximate asymptotic bias formulas are provided, as well as simulation studies of the effect of the tuning parameters of the procedure on the bias-variance trade-off. The standard FDR estimates are found to be biased down at the far tails. Correlation between test statistics is taken into account in the covariance estimates, providing a generalization of Efron's ``wing function'' for exponential families. Applications with $χ^2$ statistics are shown in a family-based genome-wide association study from the Framingham Heart Study and an anatomical brain imaging study of dyslexia in children.
Submitted 26 January, 2009;
originally announced January 2009.
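To make the idea of fitting an empirical null by Poisson regression on histogram counts concrete, the sketch below covers only the normal ($z$-score) special case, that is, the central matching idea the abstract says mode matching builds on. It is an illustration under stated assumptions, not the paper's general exponential-family procedure: the fitting region is centered on the sample median as a rough stand-in for the mode, the half-width and bin width are arbitrary tuning choices, and the function name `fit_empirical_null` is hypothetical. The Poisson regression is fit with a hand-written iteratively reweighted least squares loop.

```python
import numpy as np

def fit_empirical_null(z, half_width=1.5, bin_width=0.1, n_iter=25):
    """Estimate the mean and standard deviation of a normal empirical null.

    Fits log(expected bin count) = b0 + b1*z + b2*z^2 by Poisson regression
    on histogram counts in a window around the center of the data.
    Assumption: the window is dominated by null cases, so b2 < 0.
    """
    center = np.median(z)  # rough proxy for the mode (assumption)
    edges = np.arange(center - half_width,
                      center + half_width + bin_width / 2, bin_width)
    counts, edges = np.histogram(z, bins=edges)
    mids = 0.5 * (edges[:-1] + edges[1:])
    X = np.column_stack([np.ones_like(mids), mids, mids ** 2])

    # Start from least squares on log counts, then refine with IRLS
    # (Newton steps for the Poisson log-linear likelihood).
    beta = np.linalg.lstsq(X, np.log(counts + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        working = eta + (counts - mu) / mu   # working response
        XtW = X.T * mu                       # weights equal the fitted means
        beta = np.linalg.solve(XtW @ X, XtW @ working)

    # For a N(delta0, sigma0^2) density, the log-density is quadratic in z
    # with b2 = -1 / (2 sigma0^2) and b1 = delta0 / sigma0^2.
    sigma0 = np.sqrt(-1.0 / (2.0 * beta[2]))
    delta0 = beta[1] * sigma0 ** 2
    return delta0, sigma0
```

The window half-width plays the role of the tuning parameter whose bias-variance trade-off the abstract mentions: a wider window lowers variance but admits more non-null cases into the fit.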
-
False discovery rate analysis of brain diffusion direction maps
Authors:
Armin Schwartzman,
Robert F. Dougherty,
Jonathan E. Taylor
Abstract:
Diffusion tensor imaging (DTI) is a novel modality of magnetic resonance imaging that allows noninvasive mapping of the brain's white matter. A particular map derived from DTI measurements is a map of water principal diffusion directions, which are proxies for neural fiber directions. We consider a study in which diffusion direction maps were acquired for two groups of subjects. The objective of the analysis is to find regions of the brain in which the corresponding diffusion directions differ between the groups. This is attained by first computing a test statistic for the difference in direction at every brain location using a Watson model for directional data. Interesting locations are subsequently selected with control of the false discovery rate. More accurate modeling of the null distribution is obtained using an empirical null density based on the empirical distribution of the test statistics across the brain. Further, substantial improvements in power are achieved by local spatial averaging of the test statistic map. Although the focus is on one particular study and imaging technology, the proposed inference methods can be applied to other large scale simultaneous hypothesis testing problems with a continuous underlying spatial structure.
Submitted 26 March, 2008;
originally announced March 2008.