-
Expanding the scope of statistical computing: Training statisticians to be software engineers
Authors:
Alex Reinhart,
Christopher R. Genovese
Abstract:
Traditionally, statistical computing courses have taught the syntax of a particular programming language or specific statistical computation methods. Since the publication of Nolan and Temple Lang (2010), we have seen a greater emphasis on data wrangling, reproducible research, and visualization. This shift better prepares students for careers working with complex datasets and producing analyses for multiple audiences. But, we argue, statisticians are now often called upon to develop statistical software, not just analyses, such as R packages implementing new analysis methods or machine learning systems integrated into commercial products. This demands different skills.
We describe a graduate course that we developed to meet this need by focusing on four themes: programming practices; software design; important algorithms and data structures; and essential tools and methods. Through code review and revision, and a semester-long software project, students practice all the skills of software engineering. The course allows students to expand their understanding of computing as applied to statistical problems while building expertise in the kind of software development that is increasingly the province of the working statistician. We see this as a model for the future evolution of the computing curriculum in statistics and data science.
Submitted 28 October, 2020; v1 submitted 30 December, 2019;
originally announced December 2019.
-
Nonparametric Clustering of Functional Data Using Pseudo-Densities
Authors:
Mattia Ciollaro,
Christopher R. Genovese,
Daren Wang
Abstract:
We study nonparametric clustering of smooth random curves on the basis of the L2 gradient flow associated to a pseudo-density functional and we show that the clustering is well-defined both at the population and at the sample level. We provide an algorithm to mark significant local modes, which are associated to informative sample clusters, and we derive its consistency properties. Our theory is developed under weak assumptions, which essentially reduce to the integrability of the random curves, and does not require projecting the random curves onto a finite-dimensional subspace. However, if the underlying probability distribution is supported on a finite-dimensional subspace, we show that the pseudo-density and the expectation of a kernel density estimator induce the same gradient flow, and therefore the same clustering. Although our theory is developed for smooth curves that belong to an infinite-dimensional functional space, we also provide consistent procedures that can be used with real data (discretized and noisy observations).
Submitted 28 January, 2016;
originally announced January 2016.
-
Cosmic Web Reconstruction through Density Ridges: Catalogue
Authors:
Yen-Chi Chen,
Shirley Ho,
Jon Brinkmann,
Peter E. Freeman,
Christopher R. Genovese,
Donald P. Schneider,
Larry Wasserman
Abstract:
We construct a catalogue for filaments using a novel approach called SCMS (subspace constrained mean shift; Ozertem & Erdogmus 2011; Chen et al. 2015). SCMS is a gradient-based method that detects filaments through density ridges (smooth curves tracing high-density regions). A great advantage of SCMS is its uncertainty measure, which allows an evaluation of the errors for the detected filaments. To detect filaments, we use data from the Sloan Digital Sky Survey, which consist of three galaxy samples: the NYU main galaxy sample (MGS), the LOWZ sample and the CMASS sample. Each of the three datasets covers a different redshift region so that the combined sample allows detection of filaments up to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional filament maps at different redshifts that provide several useful statistics on the evolution of the cosmic web. To construct the maps, we select spectroscopically confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 bins. For each bin, we ignore the redshift, treating the galaxy observations as 2-D data, and detect filaments using SCMS. The filament catalogue consists of 130 individual 2-D filament maps, and each map comprises points on the detected filaments that describe the filamentary structures at a particular redshift. We also apply our filament catalogue to investigate galaxy luminosity and its relation with distance to filaments. Using a volume-limited sample, we find strong evidence (6.1$σ$ - 12.3$σ$) that galaxies close to filaments are generally brighter than those at significant distance from filaments.
Submitted 21 September, 2015;
originally announced September 2015.
-
Detecting Effects of Filaments on Galaxy Properties in the Sloan Digital Sky Survey III
Authors:
Yen-Chi Chen,
Shirley Ho,
Rachel Mandelbaum,
Neta A. Bahcall,
Joel R. Brownstein,
Peter E. Freeman,
Christopher R. Genovese,
Donald P. Schneider,
Larry Wasserman
Abstract:
We study the effects of filaments on galaxy properties in the Sloan Digital Sky Survey (SDSS) Data Release 12 using filaments from the `Cosmic Web Reconstruction' catalogue (Chen et al. 2016), a publicly available filament catalogue for SDSS. Since filaments are tracers of medium-to-high density regions, we expect that galaxy properties associated with the environment are dependent on the distance to the nearest filament. Our analysis demonstrates that red or high-mass galaxies tend to reside closer to filaments than blue or low-mass galaxies. After adjusting for the effect of stellar mass, on average, early-forming galaxies or large galaxies have a shorter distance to filaments than late-forming galaxies or small galaxies. For the Main galaxy sample (MGS), all signals are very significant ($>6σ$). For the LOWZ and CMASS samples, the stellar mass and size are significant ($>2 σ$). The filament effects we observe persist until $z = 0.7$ (the edge of the CMASS sample). Comparing our results to those using the galaxy distances from redMaPPer galaxy clusters as a reference, we find a similar result between filaments and clusters. Moreover, we find that the effect of clusters on the stellar mass of nearby galaxies depends on the galaxy's filamentary environment. Our findings illustrate the strong correlation of galaxy properties with proximity to density ridges, strongly supporting the claim that density ridges are good tracers of filaments.
Submitted 12 January, 2017; v1 submitted 21 September, 2015;
originally announced September 2015.
-
Investigating Galaxy-Filament Alignments in Hydrodynamic Simulations using Density Ridges
Authors:
Yen-Chi Chen,
Shirley Ho,
Ananth Tenneti,
Rachel Mandelbaum,
Rupert Croft,
Tiziana DiMatteo,
Peter E. Freeman,
Christopher R. Genovese,
Larry Wasserman
Abstract:
In this paper, we study the filamentary structures and the galaxy alignment along filaments at redshift $z=0.06$ in the MassiveBlack-II simulation, a state-of-the-art, high-resolution hydrodynamical cosmological simulation which includes stellar and AGN feedback in a volume of (100 Mpc$/h$)$^3$. The filaments are constructed using the subspace constrained mean shift (SCMS; Ozertem & Erdogmus (2011) and Chen et al. (2015a)). First, we show that reconstructed filaments using galaxies and reconstructed filaments using dark matter particles are similar to each other; over $50\%$ of the points on the galaxy filaments have a corresponding point on the dark matter filaments within distance $0.13$ Mpc$/h$ (and vice versa) and this distance is even smaller in high-density regions. Second, we observe the alignment of the major principal axis of a galaxy with respect to the orientation of its nearest filament and detect a $2.5$ Mpc$/h$ critical radius for the filament's influence on the alignment when the subhalo mass of this galaxy is between $10^9M_\odot/h$ and $10^{12}M_\odot/h$. Moreover, we find the alignment signal to increase significantly with the subhalo mass. Third, when a galaxy is close to filaments (less than $0.25$ Mpc$/h$), the galaxy alignment toward the nearest galaxy group depends on the galaxy subhalo mass. Finally, we find that galaxies close to filaments or groups tend to be rounder than those away from filaments or groups.
Submitted 17 August, 2015;
originally announced August 2015.
-
Statistical Inference using the Morse-Smale Complex
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The Morse-Smale complex of a function $f$ decomposes the sample space into cells where $f$ is increasing or decreasing. When applied to nonparametric density estimation and regression, it provides a way to represent, visualize, and compare multivariate functions. In this paper, we present some statistical results on estimating Morse-Smale complexes. This allows us to derive new results for two existing methods: mode clustering and Morse-Smale regression. We also develop two new methods based on the Morse-Smale complex: a visualization technique for multivariate functions and a two-sample, multivariate hypothesis test.
Submitted 3 April, 2017; v1 submitted 29 June, 2015;
originally announced June 2015.
-
Optimal Ridge Detection using Coverage Risk
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Shirley Ho,
Larry Wasserman
Abstract:
We introduce the concept of coverage risk as an error measure for density ridge estimation. The coverage risk generalizes the mean integrated square error to set estimation. We propose two risk estimators for the coverage risk and we show that we can select tuning parameters by minimizing the estimated risk. We study the rate of convergence for coverage risk and prove consistency of the risk estimators. We apply our method to three simulated datasets and to cosmology data. In all the examples, the proposed method successfully recovers the underlying density structure.
Submitted 7 June, 2015;
originally announced June 2015.
-
Density Level Sets: Asymptotics, Inference, and Visualization
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
We derive asymptotic theory for the plug-in estimate for density level sets under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap confidence regions for level sets. The confidence regions can be used to perform tests for anomaly detection and clustering. We also introduce a technique to visualize high-dimensional density level sets by combining mode clustering and multidimensional scaling.
Submitted 5 September, 2016; v1 submitted 21 April, 2015;
originally announced April 2015.
-
Cosmic Web Reconstruction through Density Ridges: Method and Algorithm
Authors:
Yen-Chi Chen,
Shirley Ho,
Peter E. Freeman,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The detection and characterization of filamentary structures in the cosmic web allow cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS) algorithm (Ozertem and Erdogmus (2011); Genovese et al. (2012)) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We apply the SCMS method to datasets sampled from the P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalog, and find that redMaPPer clusters are significantly closer (with p-values $< 10^{-9}$) to SCMS-detected filaments than to randomly selected galaxies.
Submitted 27 August, 2015; v1 submitted 21 January, 2015;
originally announced January 2015.
-
Nonparametric modal regression
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Ryan J. Tibshirani,
Larry Wasserman
Abstract:
Modal regression estimates the local modes of the distribution of $Y$ given $X=x$, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of $Y$ and $X$. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.
Submitted 30 March, 2016; v1 submitted 4 December, 2014;
originally announced December 2014.
-
Asymptotic theory for density ridges
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The large sample theory of estimators for density modes is well understood. In this paper we consider density ridges, which are a higher-dimensional extension of modes. Modes correspond to zero-dimensional, local high-density regions in point clouds. Density ridges correspond to $s$-dimensional, local high-density regions in point clouds. We establish three main results. First we show that under appropriate regularity conditions, the local variation of the estimated ridge can be approximated by an empirical process. Second, we show that the distribution of the estimated ridge converges to a Gaussian process. Third, we establish that the bootstrap leads to valid confidence sets for density ridges.
Submitted 13 October, 2015; v1 submitted 21 June, 2014;
originally announced June 2014.
-
Generalized Mode and Ridge Estimation
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
The generalized density is a product of a density function and a weight function. For example, the average local brightness of an astronomical image is the probability of finding a galaxy times the mean brightness of the galaxy. We propose a method for studying the geometric structure of generalized densities. In particular, we show how to find the modes and ridges of a generalized density function using a modification of the mean shift algorithm and its variant, subspace constrained mean shift. Our method can be used to perform clustering and to calculate a measure of connectivity between clusters. We establish consistency and rates of convergence for our estimator and apply the methods to data from two astronomical problems.
Submitted 6 June, 2014;
originally announced June 2014.
-
A Comprehensive Approach to Mode Clustering
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
Mode clustering is a nonparametric method for clustering that defines clusters using the basins of attraction of a density estimator's modes. We provide several enhancements to mode clustering: (i) a soft variant of cluster assignment, (ii) a measure of connectivity between clusters, (iii) a technique for choosing the bandwidth, (iv) a method for denoising small clusters, and (v) an approach to visualizing the clusters. Combining all these enhancements gives us a complete procedure for clustering in multivariate problems. We also compare mode clustering to other clustering methods in several examples.
Submitted 22 December, 2015; v1 submitted 6 June, 2014;
originally announced June 2014.
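To illustrate the basic idea the abstract above builds on, here is a minimal sketch of plain mode clustering in one dimension. It is not the authors' implementation and covers none of the enhancements (i)-(v). It assumes a Gaussian kernel with a fixed, hand-picked bandwidth `h`. The function names (`mean_shift_step`, `find_mode`, `mode_cluster`) and the toy data are hypothetical.

```python
import math

def mean_shift_step(x, data, h):
    """One mean-shift update: Gaussian-kernel-weighted average of the data around x."""
    weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data]
    return sum(w * xi for w, xi in zip(weights, data)) / sum(weights)

def find_mode(x, data, h, tol=1e-8, max_iter=1000):
    """Iterate mean shift from x until the update is smaller than tol."""
    for _ in range(max_iter):
        x_new = mean_shift_step(x, data, h)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def mode_cluster(data, h, merge_tol=1e-3):
    """Label each point by the mode its mean-shift path converges to
    (an assumed, simplified stand-in for the basin of attraction)."""
    modes, labels = [], []
    for x in data:
        m = find_mode(x, data, h)
        for k, existing in enumerate(modes):
            if abs(m - existing) < merge_tol:
                labels.append(k)  # converged to an already-seen mode
                break
        else:
            modes.append(m)       # first point reaching this mode
            labels.append(len(modes) - 1)
    return modes, labels

# Toy data (made up for illustration): two well-separated groups.
data = [0.0, 0.1, 0.2, -0.1, 5.0, 5.1, 4.9, 5.2]
modes, labels = mode_cluster(data, h=0.5)
print(modes, labels)
```

Points whose mean-shift paths end at the same mode share a label, so the number of clusters is determined by the bandwidth rather than specified in advance.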
-
Nonparametric 3D map of the IGM using the Lyman-alpha forest
Authors:
Jessi Cisewski,
Rupert A. C. Croft,
Peter E. Freeman,
Christopher R. Genovese,
Nishikanta Khandai,
Melih Ozbek,
Larry Wasserman
Abstract:
Visualizing the high-redshift Universe is difficult due to the dearth of available data; however, the Lyman-alpha forest provides a means to map the intergalactic medium at redshifts not accessible to large galaxy surveys. Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS), have collected quasar (QSO) spectra that enable the reconstruction of HI density fluctuations. The data fall on a collection of lines defined by the lines-of-sight (LOS) of the QSO, and a major issue with producing a 3D reconstruction is determining how to model the regions between the LOS. We present a method that produces a 3D map of this relatively uncharted portion of the Universe by employing local polynomial smoothing, a nonparametric methodology. The performance of the method is analyzed on simulated data that mimic the varying number of LOS expected in real data, and the method is then applied to a sample region selected from BOSS. The reconstruction is evaluated by considering various features of the predicted 3D maps including visual comparison of slices, PDFs, counts of local minima and maxima, and standardized correlation functions. This 3D reconstruction allows for an initial investigation of the topology of this portion of the Universe using persistent homology.
Submitted 8 January, 2014;
originally announced January 2014.
-
Uncertainty Measures and Limiting Distributions for Filament Estimation
Authors:
Yen-Chi Chen,
Christopher R. Genovese,
Larry Wasserman
Abstract:
A filament is a high density, connected region in a point cloud. There are several methods for estimating filaments but these methods do not provide any measure of uncertainty. We give a definition for the uncertainty of estimated filaments and we study statistical properties of the estimated filaments. We show how to estimate the uncertainty measures and we construct confidence sets based on a bootstrapping technique. We apply our methods to astronomy data and earthquake data.
Submitted 7 December, 2013;
originally announced December 2013.
-
Nonparametric ridge estimation
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We study the problem of estimating the ridges of a density function. Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density. It can also be used to find hidden structure in point cloud data. We show that, under mild regularity conditions, the ridges of the kernel density estimator consistently estimate the ridges of the true density. When the data are noisy measurements of a manifold, we show that the ridges are close and topologically similar to the hidden manifold. To find the estimated ridges in practice, we adapt the modified mean-shift algorithm proposed by Ozertem and Erdogmus [J. Mach. Learn. Res. 12 (2011) 1249-1286]. Some numerical experiments verify that the algorithm is accurate.
Submitted 28 August, 2014; v1 submitted 20 December, 2012;
originally announced December 2012.
-
Efficient Estimators for Sequential and Resolution-Limited Inverse Problems
Authors:
Darren Homrighausen,
Christopher R. Genovese
Abstract:
A common problem in the sciences is that a signal of interest is observed only indirectly, through smooth functionals of the signal whose values are then obscured by noise. In such inverse problems, the functionals dampen or entirely eliminate some of the signal's interesting features. This makes it difficult or even impossible to fully reconstruct the signal, even without noise. In this paper, we develop methods for handling sequences of related inverse problems, with the problems varying either systematically or randomly over time. Such sequences often arise with automated data collection systems, like the data pipelines of large astronomical instruments such as the Large Synoptic Survey Telescope (LSST). The LSST will observe each patch of the sky many times over its lifetime under varying conditions. A possible additional complication in these problems is that the observational resolution is limited by the instrument, so that even with many repeated observations, only an approximation of the underlying signal can be reconstructed. We propose an efficient estimator for reconstructing a signal of interest given a sequence of related, resolution-limited inverse problems. We demonstrate our method's effectiveness in some representative examples and provide theoretical support for its adoption.
Submitted 2 July, 2012;
originally announced July 2012.
-
Regularization Techniques for PSF-Matching Kernels. I. Choice of Kernel Basis
Authors:
A. C. Becker,
D. Homrighausen,
A. J. Connolly,
C. R. Genovese,
R. Owen,
S. J. Bickerton,
R. H. Lupton
Abstract:
We review current methods for building PSF-matching kernels for the purposes of image subtraction or coaddition. Such methods use a linear decomposition of the kernel on a series of basis functions. The correct choice of these basis functions is fundamental to the efficiency and effectiveness of the matching - the chosen bases should represent the underlying signal using a reasonably small number of shapes, and/or have a minimum number of user-adjustable tuning parameters. We examine methods whose bases comprise multiple Gauss-Hermite polynomials, as well as a form free basis composed of delta-functions. Kernels derived from delta-functions are unsurprisingly shown to be more expressive; they are able to take more general shapes and perform better in situations where sum-of-Gaussian methods are known to fail. However, due to its many degrees of freedom (the maximum number allowed by the kernel size) this basis tends to overfit the problem, and yields noisy kernels having large variance. We introduce a new technique to regularize these delta-function kernel solutions, which bridges the gap between the generality of delta-function kernels, and the compactness of sum-of-Gaussian kernels. Through this regularization we are able to create general kernel solutions that represent the intrinsic shape of the PSF-matching kernel with only one degree of freedom, the strength of the regularization lambda. The role of lambda is effectively to exchange variance in the resulting difference image with variance in the kernel itself. We examine considerations in choosing the value of lambda, including statistical risk estimators and the ability of the solution to predict solutions for adjacent areas. Both of these suggest moderate strengths of lambda between 0.1 and 1.0, although this optimization is likely dataset dependent.
Submitted 13 February, 2012;
originally announced February 2012.
-
Manifold estimation and singular deconvolution under Hausdorff loss
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We find lower and upper bounds for the risk of estimating a manifold in Hausdorff distance under several models. We also show that there are close connections between manifold estimation and the problem of deconvolving a singular measure.
Submitted 5 June, 2012; v1 submitted 21 September, 2011;
originally announced September 2011.
-
Discussion of: Brownian distance covariance
Authors:
Christopher R. Genovese
Abstract:
Discussion on "Brownian distance covariance" by Gábor J. Székely and Maria L. Rizzo [arXiv:1010.0297]
Submitted 5 October, 2010;
originally announced October 2010.
-
The Geometry of Nonparametric Filament Estimation
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We consider the problem of estimating filamentary structure from planar point process data. We make some connections with computational geometry and we develop nonparametric methods for estimating the filaments. We show that, under weak conditions, the filaments have a simple geometric representation as the medial axis of the data distribution's support. Our methods convert an estimator of the support's boundary into an estimator of the filaments. We also find the rates of convergence of our estimators.
Submitted 12 December, 2010; v1 submitted 25 March, 2010;
originally announced March 2010.
-
Straight to the Source: Detecting Aggregate Objects in Astronomical Images with Proper Error Control
Authors:
David A. Friedenberg,
Christopher R. Genovese
Abstract:
The next generation of telescopes will acquire terabytes of image data on a nightly basis. Collectively, these large images will contain billions of interesting objects, which astronomers call sources. The astronomers' task is to construct a catalog detailing the coordinates and other properties of the sources. The source catalog is the primary data product for most telescopes and is an important input for testing new astrophysical theories, but to construct the catalog one must first detect the sources. Existing algorithms for catalog creation are effective at detecting sources, but do not have rigorous statistical error control. At the same time, there are several multiple testing procedures that provide rigorous error control, but they are not designed to detect sources that are aggregated over several pixels. In this paper, we propose a technique that does both, by providing rigorous statistical error control on the aggregate objects themselves rather than the pixels. We demonstrate the effectiveness of this approach on data from the Chandra X-ray Observatory Satellite. Our technique effectively controls the rate of false sources, yet still detects almost all of the sources detected by procedures that do not have such rigorous error control and have the advantage of additional data in the form of follow up observations, which will not be available for upcoming large telescopes. In fact, we even detect a new source that was missed by previous studies. The statistical methods developed in this paper can be extended to problems beyond Astronomy, as we will illustrate with an example from Neuroimaging.
Submitted 28 October, 2009;
originally announced October 2009.
-
Revealing components of the galaxy population through nonparametric techniques
Authors:
Steven P. Bamford,
Alex L. Rojas,
Robert C. Nichol,
Christopher J. Miller,
Larry Wasserman,
Christopher R. Genovese,
Peter E. Freeman
Abstract:
The distributions of galaxy properties vary with environment, and are often multimodal, suggesting that the galaxy population may be a combination of multiple components. The behaviour of these components versus environment holds details about the processes of galaxy development. To release this information we apply a novel, nonparametric statistical technique, identifying four components present in the distribution of galaxy H$\alpha$ emission-line equivalent-widths. We interpret these components as passive, star-forming, and two varieties of active galactic nuclei. Independent of this interpretation, the properties of each component are remarkably constant as a function of environment. Only their relative proportions display substantial variation. The galaxy population thus appears to comprise distinct components which are individually independent of environment, with galaxies rapidly transitioning between components as they move into denser environments.
Submitted 16 September, 2008;
originally announced September 2008.
-
On the path density of a gradient field
Authors:
Christopher R. Genovese,
Marco Perone-Pacifico,
Isabella Verdinelli,
Larry Wasserman
Abstract:
We consider the problem of reliably finding filaments in point clouds. Realistic data sets often have numerous filaments of various sizes and shapes. Statistical techniques exist for finding one (or a few) filaments but these methods do not handle noisy data sets with many filaments. Other methods can be found in the astronomy literature but they do not have rigorous statistical guarantees. We propose the following method. Starting at each data point we construct the steepest ascent path along a kernel density estimator. We locate filaments by finding regions where these paths are highly concentrated. Formally, we define the density of these paths and we construct a consistent estimator of this path density.
Submitted 11 September, 2009; v1 submitted 27 May, 2008;
originally announced May 2008.
-
Adaptive Confidence Bands
Authors:
Christopher R. Genovese,
Larry Wasserman
Abstract:
We show that there do not exist adaptive confidence bands for curve estimation except under very restrictive assumptions. We propose instead to construct adaptive bands that cover a surrogate function f^\star which is close to, but simpler than, f. The surrogate captures the significant features in f. We establish lower bounds on the width for any confidence band for f^\star and construct a procedure that comes within a small constant factor of attaining the lower bound for finite-samples.
Submitted 18 January, 2007;
originally announced January 2007.
-
Examining the Effect of the Map-Making Algorithm on Observed Power Asymmetry in WMAP Data
Authors:
P. E. Freeman,
C. R. Genovese,
C. J. Miller,
R. C. Nichol,
L. Wasserman
Abstract:
We analyze first-year data of WMAP to determine the significance of asymmetry in summed power between arbitrarily defined opposite hemispheres, using maps that we create ourselves with software developed independently of the WMAP team. We find that over the multipole range l=[2,64], the significance of asymmetry is ~ 10^-4, a value insensitive to both frequency and power spectrum. We determine the smallest multipole ranges exhibiting significant asymmetry, and find twelve, including l=[2,3] and [6,7], for which the significance -> 0. In these ranges there is an improbable association between the direction of maximum significance and the ecliptic plane (p ~ 0.01). Also, contours of least significance follow great circles inclined relative to the ecliptic at the largest scales. The great circle for l=[2,3] passes over previously reported preferred axes and is insensitive to frequency, while the great circle for l=[6,7] is aligned with the ecliptic poles. We examine how changing map-making parameters affects asymmetry, and find that at large scales, it is rendered insignificant if the magnitude of the WMAP dipole vector is increased by approximately 1-3 sigma (or 2-6 km/s). While confirmation of this result would require data recalibration, such a systematic change would be consistent with observations of frequency-independent asymmetry. We conclude that the use of an incorrect dipole vector, in combination with a systematic or foreground process associated with the ecliptic, may help to explain the observed asymmetry.
Submitted 13 October, 2005;
originally announced October 2005.
-
Confidence sets for nonparametric wavelet regression
Authors:
Christopher R. Genovese,
Larry Wasserman
Abstract:
We construct nonparametric confidence sets for regression functions using wavelets that are uniform over Besov balls. We consider both thresholding and modulation estimators for the wavelet coefficients. The confidence set is obtained by showing that a pivot process, constructed from the loss function, converges uniformly to a mean zero Gaussian process. Inverting this pivot yields a confidence set for the wavelet coefficients, and from this we obtain confidence sets on functionals of the regression curve.
Submitted 30 May, 2005;
originally announced May 2005.
-
Nonparametric Inference for the Cosmic Microwave Background
Authors:
Christopher R. Genovese,
Christopher J. Miller,
Robert C. Nichol,
Mihir Arjunwadkar,
Larry Wasserman
Abstract:
The Cosmic Microwave Background (CMB), which permeates the entire Universe, is the radiation left over from just 380,000 years after the Big Bang. On very large scales, the CMB radiation field is smooth and isotropic, but the existence of structure in the Universe - stars, galaxies, clusters of galaxies - suggests that the field should fluctuate on smaller scales. Recent observations, from the Cosmic Microwave Background Explorer to the Wilkinson Microwave Anisotropy Project, have strikingly confirmed this prediction. CMB fluctuations provide clues to the Universe's structure and composition shortly after the Big Bang that are critical for testing cosmological models. For example, CMB data can be used to determine what portion of the Universe is composed of ordinary matter versus the mysterious dark matter and dark energy. To this end, cosmologists usually summarize the fluctuations by the power spectrum, which gives the variance as a function of angular frequency. The spectrum's shape, and in particular the location and height of its peaks, relates directly to the parameters in the cosmological models. Thus, a critical statistical question is how accurately can these peaks be estimated. We use recently developed techniques to construct a nonparametric confidence set for the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions, closely matches the model-based estimates used by cosmologists, but we can make a wide range of additional inferences. We apply these techniques to test various models and to extract confidence intervals on cosmological parameters of interest. Our analysis shows that, even without parametric assumptions, the first peak is resolved accurately with current data but that the second and third peaks are not.
Submitted 6 October, 2004;
originally announced October 2004.