-
Bayesian analysis of regression discontinuity designs with heterogeneous treatment effects
Authors:
Kevin Tao,
Y. Samuel Wang,
David Ruppert
Abstract:
Regression Discontinuity Design (RDD) is a popular framework for estimating a causal effect in settings where treatment is assigned if an observed covariate exceeds a fixed threshold. We consider estimation and inference in the common setting where the sample consists of multiple known sub-populations with potentially heterogeneous treatment effects. In the applied literature, it is common to account for heterogeneity by either fitting a parametric model or considering each sub-population separately. In contrast, we develop a Bayesian hierarchical model using Gaussian process regression which allows for non-parametric regression while borrowing information across sub-populations. We derive the posterior distribution, prove posterior consistency, and develop a Metropolis-Hastings within Gibbs sampling algorithm. In extensive simulations, we show that the proposed procedure outperforms existing methods in both estimation and inferential tasks. Finally, we apply our procedure to U.S. Senate election data and discover an incumbent party advantage which is heterogeneous over different time periods.
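For orientation, the "consider each sub-population separately" baseline that the abstract contrasts with can be sketched as below. This is not the authors' Bayesian hierarchical Gaussian process model. It is a minimal illustration that assumes a sharp design, a linear fit on each side of the cutoff, and noise-free toy data; the names `fit_line`, `rdd_jump`, and `jumps_by_group` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_jump(xs, ys, cutoff=0.0, bandwidth=1.0):
    """Difference of the two fitted intercepts at the cutoff (sharp design)."""
    left = [(x - cutoff, y) for x, y in zip(xs, ys)
            if cutoff - bandwidth <= x < cutoff]
    right = [(x - cutoff, y) for x, y in zip(xs, ys)
             if cutoff <= x <= cutoff + bandwidth]
    a_left, _ = fit_line([p[0] for p in left], [p[1] for p in left])
    a_right, _ = fit_line([p[0] for p in right], [p[1] for p in right])
    return a_right - a_left


def jumps_by_group(records, cutoff=0.0, bandwidth=1.0):
    """records: iterable of (group, x, y); returns {group: estimated jump}."""
    by_group = {}
    for g, x, y in records:
        xs, ys = by_group.setdefault(g, ([], []))
        xs.append(x)
        ys.append(y)
    return {g: rdd_jump(xs, ys, cutoff, bandwidth)
            for g, (xs, ys) in by_group.items()}


# Noise-free toy data (an assumption): two groups with true jumps 0.5 and 1.5.
true_jump = {"A": 0.5, "B": 1.5}
records = [
    (g, i / 10, 1.0 + 2.0 * (i / 10) + (tau if i >= 0 else 0.0))
    for g, tau in true_jump.items()
    for i in range(-10, 11)
]
estimates = jumps_by_group(records)
```

Because each group is fitted in isolation, no information is shared across groups, which is the limitation the hierarchical model in the abstract is meant to address.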
Submitted 14 April, 2025;
originally announced April 2025.
-
Bayesian functional data analysis in astronomy
Authors:
Thomas Loredo,
Tamas Budavari,
David Kent,
David Ruppert
Abstract:
Cosmic demographics -- the statistical study of populations of astrophysical objects -- has long relied on *multivariate statistics*, providing methods for analyzing data comprising fixed-length vectors of properties of objects, as might be compiled in a tabular astronomical catalog (say, with sky coordinates, and brightness measurements in a fixed number of spectral passbands). But beginning with the emergence of automated digital sky surveys, ca. 2000, astronomers began producing large collections of data with more complex structure: light curves (brightness time series) and spectra (brightness vs. wavelength). These comprise what statisticians call *functional data* -- measurements of populations of functions. Upcoming automated sky surveys will soon provide astronomers with a flood of functional data. New methods are needed to accurately and optimally analyze large ensembles of light curves and spectra, accumulating information both along and across measured functions. Functional data analysis (FDA) provides tools for statistical modeling of functional data. Astronomical data present several challenges for FDA methodology, e.g., sparse, irregular, and asynchronous sampling, and heteroscedastic measurement error. Bayesian FDA uses hierarchical Bayesian models for function populations, and is well suited to addressing these challenges. We provide an overview of astronomical functional data, and of some key Bayesian FDA modeling approaches, including functional mixed effects models, and stochastic process models. We briefly describe a Bayesian FDA framework combining FDA and machine learning methods to build low-dimensional parametric models for galaxy spectra.
Submitted 26 August, 2024;
originally announced August 2024.
-
Splines 'n Lines: Rest-frame galaxy spectral energy distributions via Bayesian functional data analysis
Authors:
David Kent,
Tamás Budavári,
Thomas J. Loredo,
David Ruppert
Abstract:
Survey-based measurements of the spectral energy distributions (SEDs) of galaxies have flux density estimates on badly misaligned grids in rest-frame wavelength. The shift to rest-frame wavelength also causes estimated SEDs to have differing support. For many galaxies, there are sizeable wavelength regions with missing data. Finally, dim galaxies dominate typical samples and have noisy SED measurements, many near the limiting signal-to-noise level of the survey. These limitations of SED measurements shifted to the rest frame complicate downstream analysis tasks, particularly tasks requiring computation of functionals (e.g., weighted integrals) of the SEDs, such as synthetic photometry, quantifying SED similarity, and using SED measurements for photometric redshift estimation. We describe a hierarchical Bayesian framework, drawing on tools from functional data analysis, that models SEDs as a random superposition of smooth continuum basis functions (B-splines) and line features, comprising a finite-rank, nonstationary Gaussian process, measured with additive Gaussian noise. We apply this *Splines 'n Lines* (SnL) model to a collection of 678,239 galaxy SED measurements comprising the Main Galaxy Sample from the Sloan Digital Sky Survey, Data Release 17, demonstrating capability to provide continuous estimated SEDs that reliably denoise, interpolate, and extrapolate, with quantified uncertainty, including the ability to predict line features where there is missing data by leveraging correlations between line features and the entire continuum.
Submitted 30 October, 2023;
originally announced October 2023.
-
Adaptive Ridge-Penalized Functional Local Linear Regression
Authors:
Wentian Huang,
David Ruppert
Abstract:
We introduce an original method of multidimensional ridge penalization in functional local linear regressions. The nonparametric regression of functional data is extended from its multivariate counterpart, and is known to be sensitive to the choice of $J$, where $J$ is the dimension of the projection subspace of the data. Under the multivariate setting, a roughness penalty is helpful for variance reduction. However, among the limited works covering roughness penalties under the functional setting, most use only a single scalar for tuning. Our new approach proposes a class of data-adaptive ridge penalties, meaning that the model automatically adjusts the structure of the penalty according to the data set. This structure has $J$ free parameters and enables a quadratic programming search for optimal tuning parameters that minimize the estimated mean squared error (MSE) of prediction, and is capable of applying a different roughness penalty level to each of the $J$ basis functions.
The strength of the method in prediction accuracy and variance reduction with finite data is demonstrated through multiple simulation scenarios and two real-data examples. Its asymptotic performance is proved and compared to the unpenalized functional local linear regressions.
Submitted 16 September, 2021;
originally announced September 2021.
-
Measurement Errors in Semiparametric Generalized Regression Models
Authors:
Mohammad W. Hattab,
David Ruppert
Abstract:
Regression models that ignore measurement error in predictors may produce highly biased estimates leading to erroneous inferences. It is well known that it is extremely difficult to take measurement error into account in Gaussian nonparametric regression. This problem becomes tremendously more difficult when considering other families such as logistic regression, Poisson and negative-binomial. For the first time, we present a method aiming to correct for measurement error when estimating regression functions flexibly covering virtually all distributions and link functions regularly considered in generalized linear models. This approach depends on approximating the first and the second moment of the response after integrating out the true unobserved predictors in a semiparametric generalized linear model. Unlike previous methods, this method is not restricted to truncated splines and can utilize various basis functions. Through extensive simulation studies, we study the performance of our method under many scenarios.
Submitted 31 January, 2023; v1 submitted 9 April, 2021;
originally announced April 2021.
-
Bayesian Functional Principal Components Analysis via Variational Message Passing
Authors:
Tui H. Nolan,
Jeff Goldsmith,
David Ruppert
Abstract:
Functional principal components analysis is a popular tool for inference on functional data. Standard approaches rely on an eigendecomposition of a smoothed covariance surface in order to extract the orthonormal functions representing the major modes of variation. This approach can be a computationally intensive procedure, especially in the presence of large datasets with irregular observations. In this article, we develop a Bayesian approach, which aims to determine the Karhunen-Loève decomposition directly without the need to smooth and estimate a covariance surface. More specifically, we develop a variational Bayesian algorithm via message passing over a factor graph, which is more commonly referred to as variational message passing. Message passing algorithms are a powerful tool for compartmentalizing the algebra and coding required for inference in hierarchical statistical models. Recently, there has been much focus on formulating variational inference algorithms in the message passing framework because it removes the need for rederiving approximate posterior density functions if there is a change to the model. Instead, model changes are handled by changing specific computational units, known as fragments, within the factor graph. We extend the notion of variational message passing to functional principal components analysis. Indeed, this is the first article to address a functional data model via variational message passing. Our approach introduces two new fragments that are necessary for Bayesian functional principal components analysis. We present the computational details, a set of simulations for assessing accuracy and speed and an application to United States temperature data.
Submitted 1 April, 2021;
originally announced April 2021.
-
Bootstrap inference for quantile-based modal regression
Authors:
Tao Zhang,
Kengo Kato,
David Ruppert
Abstract:
In this paper, we develop uniform inference methods for the conditional mode based on quantile regression. Specifically, we propose to estimate the conditional mode by minimizing the derivative of the estimated conditional quantile function defined by smoothing the linear quantile regression estimator, and develop two bootstrap methods, a novel pivotal bootstrap and the nonparametric bootstrap, for our conditional mode estimator. Building on high-dimensional Gaussian approximation techniques, we establish the validity of simultaneous confidence rectangles constructed from the two bootstrap methods for the conditional mode. We also extend the preceding analysis to the case where the dimension of the covariate vector is increasing with the sample size. Finally, we conduct simulation experiments and a real data analysis using U.S. wage data to demonstrate the finite sample performance of our inference method.
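The reason that minimizing the derivative of the quantile function locates the mode can be written out as follows. The symbols $Q$, $f$, $\hat{Q}$, $\hat{\tau}_x$, and $\hat{m}$ are notation introduced here for illustration and are not taken from the paper; the identity assumes a continuous, strictly positive conditional density $f(\cdot \mid x)$.

$$
\frac{\partial}{\partial \tau} Q(\tau \mid x) = \frac{1}{f\bigl(Q(\tau \mid x) \mid x\bigr)},
\qquad
\hat{m}(x) = \hat{Q}(\hat{\tau}_x \mid x), \quad
\hat{\tau}_x \in \arg\min_{\tau} \frac{\partial}{\partial \tau} \hat{Q}(\tau \mid x).
$$

The first identity follows from differentiating $F(Q(\tau \mid x) \mid x) = \tau$ with respect to $\tau$. The derivative of the quantile function is therefore smallest exactly where the conditional density is largest, so the quantile at the minimizing level is the conditional mode.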
Submitted 12 April, 2021; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Finite Sample Hypothesis Tests for Stacked Estimating Equations
Authors:
Eli S. Kravitz,
Raymond J. Carroll,
David Ruppert
Abstract:
Suppose there are two unknown parameters, each parameter is the solution to an estimating equation, and the estimating equation of one parameter depends on the other parameter. The parameters can be jointly estimated by "stacking" their estimating equations and solving for both parameters simultaneously. Asymptotic confidence intervals are readily available for stacked estimating equations. We introduce a bootstrap-based hypothesis test for stacked estimating equations which does not rely on asymptotic approximations. Test statistics are constructed by splitting the sample in two, estimating the first parameter on a portion of the sample, then plugging the result into the second estimating equation to solve for the next parameter using the remaining sample. To reduce simulation variability from a single split, we repeatedly split the sample and take the sample mean of all the estimates. For parametric models, we derive the limiting distribution of the sample-splitting estimators and show they are equivalent to stacked estimating equations.
Submitted 11 August, 2019;
originally announced August 2019.
-
Sample Splitting as an M-Estimator with Application to Physical Activity Scoring
Authors:
Eli S. Kravitz,
Raymond J. Carroll,
David Ruppert
Abstract:
Sample splitting is widely used in statistical applications, including classically in classification and more recently for inference post model selection. Motivated by problems in the study of diet, physical activity, and health, we consider a new application of sample splitting. Physical activity researchers wanted to create a scoring system to quickly assess physical activity levels. A score is created using a large cohort study. Then, using the same data, this score serves as a covariate in a model for the risk of disease or mortality. Since the data are used twice in this way, standard errors and confidence intervals from fitting the second model are not valid. To allow for proper inference, sample splitting can be used. One builds the score with a random half of the data and then uses the score when fitting a model to the other half of the data. We derive the limiting distribution of the estimators. An obvious question is what happens if multiple sample splits are performed. We show that as the number of sample splits increases, the combination of multiple sample splits is effectively equivalent to solving a set of estimating equations.
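The split-then-fit recipe in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' scoring system: the score is taken to be the fitted linear predictor of an intermediate measurement `m` from a raw activity measure `x`, the second-stage model is a simple linear regression of the outcome `y` on the score, and all data are simulated. The names `make_data`, `split_once`, and `repeated_split` are hypothetical.

```python
import random


def ols(xs, ys):
    """Least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def make_data(n, rng):
    """Simulated records (x, m, y): m = 2x + noise and y = 0.5 * (2x) + noise."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        m = 2.0 * x + rng.gauss(0.0, 1.0)
        y = 0.5 * (2.0 * x) + rng.gauss(0.0, 1.0)
        data.append((x, m, y))
    return data


def split_once(data, rng):
    """Build the score on one random half, fit the outcome model on the other."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    half = len(idx) // 2
    build = [data[i] for i in idx[:half]]
    fit = [data[i] for i in idx[half:]]
    # Step 1: score coefficients estimated from the first half only.
    a, b = ols([r[0] for r in build], [r[1] for r in build])
    # Step 2: the score is a covariate in the model fitted to the second half.
    scores = [a + b * r[0] for r in fit]
    _, beta = ols(scores, [r[2] for r in fit])
    return beta


def repeated_split(data, n_splits, seed=0):
    """Average the second-stage coefficient over several random splits."""
    rng = random.Random(seed)
    estimates = [split_once(data, rng) for _ in range(n_splits)]
    return sum(estimates) / len(estimates)
```

Calling `repeated_split(make_data(2000, random.Random(1)), n_splits=20)` should give a coefficient close to the simulated value of 0.5. Averaging over splits mirrors the multiple-split combination whose limit the abstract relates to a set of estimating equations.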
Submitted 11 August, 2019;
originally announced August 2019.
-
Optimal Sampling for Generalized Linear Models under Measurement Constraints
Authors:
Tao Zhang,
Yang Ning,
David Ruppert
Abstract:
Under "measurement constraints," responses are expensive to measure and initially unavailable on most of the records in the dataset, but the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset where the expensive responses will be measured and the resultant sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend only on a very small set of the responses. A sampling procedure that uses responses at most only on a small pilot sample will be called "response-free." We propose a response-free sampling procedure (OSUMC) for generalized linear models (GLMs). Using the A-optimality criterion, i.e., the trace of the asymptotic variance, the resultant estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling. Finally, we conduct an intensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods in both statistical and computational perspectives.
Submitted 25 March, 2020; v1 submitted 16 July, 2019;
originally announced July 2019.
-
Density Estimation on a Network
Authors:
Yang Liu,
David Ruppert
Abstract:
This paper develops a novel approach to density estimation on a network. We formulate nonparametric density estimation on a network as a nonparametric regression problem by binning. Nonparametric regression using local polynomial kernel-weighted least squares has been studied rigorously, and its asymptotic properties make it superior to kernel estimators such as the Nadaraya-Watson estimator. When applied to a network, the best estimator near a vertex depends on the amount of smoothness at the vertex. Often, there are no compelling reasons to assume that a density will be continuous or discontinuous at a vertex, hence a data driven approach is proposed. To estimate the density in a neighborhood of a vertex, we propose a two-step procedure. The first step of this pretest estimator fits a separate local polynomial regression on each edge using data only on that edge, and then tests for equality of the estimates at the vertex. If the null hypothesis is not rejected, then the second step re-estimates the regression function in a small neighborhood of the vertex, subject to a joint equality constraint. Since the derivative of the density may be discontinuous at the vertex, we propose a piecewise polynomial local regression estimate to model the change in slope. We study in detail the special case of local piecewise linear regression and derive the leading bias and variance terms using weighted least squares theory. We show that the proposed approach will remove the bias near a vertex that has been noted for existing methods, which typically do not allow for discontinuity at vertices. For a fixed network, the proposed method scales sub-linearly with sample size and it can be extended to regression and varying coefficient models on a network. We demonstrate the workings of the proposed model by simulation studies and apply it to a dendrite network data set.
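The "density estimation as regression by binning" idea can be illustrated on a single edge, treated here as an interval. This sketch covers only that first ingredient: bin counts are converted to histogram heights and then smoothed by local linear kernel-weighted least squares. The vertex pretest and the constrained second step from the abstract are not implemented, and the Epanechnikov kernel, the bin count, the bandwidth, and the function names are all assumptions made for illustration.

```python
def bin_density(samples, lo, hi, n_bins):
    """Return bin centers and histogram heights (count / (n * bin width))."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s <= hi:
            k = min(int((s - lo) / width), n_bins - 1)
            counts[k] += 1
    n = len(samples)
    centers = [lo + (k + 0.5) * width for k in range(n_bins)]
    heights = [c / (n * width) for c in counts]
    return centers, heights


def local_linear(x0, xs, ys, bandwidth):
    """Local linear fit at x0 with Epanechnikov weights; returns the intercept."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        u = (x - x0) / bandwidth
        if abs(u) >= 1.0:
            continue
        w = 0.75 * (1.0 - u * u)
        d = x - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    # Solve the 2x2 weighted normal equations for the intercept.
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det


# Evenly spread points on [0, 1] stand in for a uniform sample (an assumption).
samples = [(i + 0.5) / 1000 for i in range(1000)]
centers, heights = bin_density(samples, 0.0, 1.0, 20)
density_mid = local_linear(0.5, centers, heights, bandwidth=0.2)
density_edge = local_linear(0.0, centers, heights, bandwidth=0.2)
```

For this flat toy density both the interior and the boundary estimates equal 1 up to rounding, which reflects the boundary behavior that makes local linear fits attractive near the end of an edge.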
Submitted 4 August, 2020; v1 submitted 22 June, 2019;
originally announced June 2019.
-
Copula-based functional Bayes classification with principal components and partial least squares
Authors:
Wentian Huang,
David Ruppert
Abstract:
We present a new functional Bayes classifier that uses principal component (PC) or partial least squares (PLS) scores from the common covariance function, that is, the covariance function marginalized over groups. When the groups have different covariance functions, the PC or PLS scores need not be independent or even uncorrelated. We use copulas to model the dependence. Our method is semiparametric; the marginal densities are estimated nonparametrically by kernel smoothing and the copula is modeled parametrically. We focus on Gaussian and t-copulas, but other copulas could be used. The strong performance of our methodology is demonstrated through simulation, real data examples, and asymptotic properties.
Submitted 16 September, 2021; v1 submitted 2 June, 2019;
originally announced June 2019.
-
Density Deconvolution with Additive Measurement Errors using Quadratic Programming
Authors:
Ran Yang,
Daniel Apley,
Jeremy Staum,
David Ruppert
Abstract:
Distribution estimation for noisy data via density deconvolution is a notoriously difficult problem for typical noise distributions like Gaussian. We develop a density deconvolution estimator based on quadratic programming (QP) that can achieve better estimation than kernel density deconvolution methods. The QP approach appears to have a more favorable regularization tradeoff between oversmoothing vs. oscillation, especially at the tails of the distribution. An additional advantage is that it is straightforward to incorporate a number of common density constraints such as nonnegativity, integration-to-one, unimodality, tail convexity, tail monotonicity, and support constraints. We demonstrate that the QP approach has outstanding estimation performance relative to existing methods. Its performance is superior when only the universally applicable nonnegativity and integration-to-one constraints are incorporated, and incorporating additional common constraints when applicable (e.g., nonnegative support, unimodality, tail monotonicity or convexity, etc.) can further substantially improve the estimation.
Submitted 4 December, 2018;
originally announced December 2018.
-
Dynamic Shrinkage Processes
Authors:
Daniel R. Kowal,
David S. Matteson,
David Ruppert
Abstract:
We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building upon a global-local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence among the local scale parameters. The resulting processes inherit the desirable shrinkage behavior of popular global-local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modeling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya-Gamma scale mixture representation of the proposed process. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than competing methods, and apply the model for irregular curve-fitting of minute-by-minute Twitter CPU usage data. In addition, we develop an adaptive time-varying parameter regression model to assess the efficacy of the Fama-French five-factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that with the exception of the market risk, no other risk factors are significant except for brief periods.
Submitted 23 February, 2018; v1 submitted 3 July, 2017;
originally announced July 2017.
-
Profile Estimation for Partial Functional Partially Linear Single-Index Model
Authors:
Qingguo Tang,
Linglong Kong,
David Ruppert,
Rohana J. Karunamuni
Abstract:
This paper studies a *partial functional partially linear single-index model* that consists of a functional linear component as well as a linear single-index component. This model generalizes many well-known existing models and is suitable for more complicated data structures. However, its estimation inherits the difficulties and complexities from both components and makes it a challenging problem, which calls for new methodology. We propose a novel profile B-spline method to estimate the parameters by approximating the unknown nonparametric link function in the single-index component part with B-splines, while the linear slope function in the functional component part is estimated by the functional principal component basis. The consistency and asymptotic normality of the parametric estimators are derived, and the global convergence of the proposed estimator of the linear slope function is also established. More excitingly, the latter convergence is optimal in the minimax sense. A two-stage procedure is implemented to estimate the nonparametric link function, and the resulting estimator possesses the optimal global rate of convergence. Furthermore, the convergence rate of the mean squared prediction error for a predictor is also obtained. Empirical properties of the proposed procedures are studied through Monte Carlo simulations. A real data example is also analyzed to illustrate the power and flexibility of the proposed methodology.
Submitted 8 March, 2017;
originally announced March 2017.
-
Additive Function-on-Function Regression
Authors:
Janet S. Kim,
Ana-Maria Staicu,
Arnab Maity,
Raymond J. Carroll,
David Ruppert
Abstract:
We study additive function-on-function regression where the mean response at a particular time point depends on the time point itself as well as the entire covariate trajectory. We develop a computationally efficient estimation methodology based on a novel combination of spline bases with an eigenbasis to represent the trivariate kernel function. We discuss prediction of a new response trajectory, propose an inference procedure that accounts for total variability in the predicted response curves, and construct pointwise prediction intervals. The estimation/inferential procedure accommodates realistic scenarios such as correlated error structure as well as sparse and/or irregular designs. We investigate our methodology in finite sample size through simulations and two real data applications.
Submitted 14 December, 2016; v1 submitted 12 June, 2016;
originally announced June 2016.
-
Functional Autoregression for Sparsely Sampled Data
Authors:
Daniel R. Kowal,
David S. Matteson,
David Ruppert
Abstract:
We develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error. The latent process is dynamically modeled as a functional autoregression (FAR) with Gaussian process innovations. We propose a fully nonparametric dynamic functional factor model for the dynamic innovation process, with broader applicability and improved computational efficiency over standard Gaussian process models. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. An efficient Gibbs sampling algorithm is developed for estimation, inference, and forecasting, with extensions for FAR(p) models with model averaging over the lag p. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs. We apply the proposed methods to forecast nominal and real yield curves using daily U.S. data. Real yields are observed more sparsely than nominal yields, yet the proposed methods are highly competitive in both settings.
Submitted 19 October, 2016; v1 submitted 9 March, 2016;
originally announced March 2016.
-
Linear Non-Gaussian Component Analysis via Maximum Likelihood
Authors:
Benjamin B. Risk,
David S. Matteson,
David Ruppert
Abstract:
Independent component analysis (ICA) is popular in many applications, including cognitive neuroscience and signal processing. Due to computational constraints, principal component analysis is used for dimension reduction prior to ICA (PCA+ICA), which could remove important information. The problem is that interesting independent components (ICs) could be mixed in several principal components that are discarded and then these ICs cannot be recovered. We formulate a linear non-Gaussian component model with Gaussian noise components. To estimate this model, we propose likelihood component analysis (LCA), in which dimension reduction and latent variable estimation are achieved simultaneously. Our method orders components by their marginal likelihood rather than ordering components by variance as in PCA. We present a parametric LCA using the logistic density and a semi-parametric LCA using tilted Gaussians with cubic B-splines. Our algorithm is scalable to datasets common in applications (e.g., hundreds of thousands of observations across hundreds of variables with dozens of latent components). In simulations, latent components are recovered that are discarded by PCA+ICA methods. We apply our method to multivariate data and demonstrate that LCA is a useful data visualization and dimension reduction tool that reveals features not apparent from PCA or PCA+ICA. We also apply our method to an fMRI experiment from the Human Connectome Project and identify artifacts missed by PCA+ICA. We present theoretical results on identifiability of the linear non-Gaussian component model and consistency of LCA.
Submitted 1 October, 2017; v1 submitted 5 November, 2015;
originally announced November 2015.
-
A Bayesian Multivariate Functional Dynamic Linear Model
Authors:
Daniel R. Kowal,
David S. Matteson,
David Ruppert
Abstract:
We present a Bayesian approach for modeling multivariate, dependent functional data. To account for the three dominant structural features in the data--functional, time dependent, and multivariate components--we extend hierarchical dynamic linear models for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework. The proposed methods identify a time-invariant functional basis for the functional observations, which is smooth and interpretable, and can be made common across multivariate observations for additional information sharing. The Bayesian framework permits joint estimation of the model parameters, provides exact inference (up to MCMC error) on specific parameters, and allows generalized dependence structures. Sampling from the posterior distribution is accomplished with an efficient Gibbs sampling algorithm. We illustrate the proposed framework with two applications: (1) multi-economy yield curve data from the recent global recession, and (2) local field potential brain signals in rats, for which we develop a multivariate functional time series approach for multivariate time-frequency analysis. Supplementary materials, including R code and the multi-economy yield curve data, are available online.
Submitted 5 August, 2015; v1 submitted 3 November, 2014;
originally announced November 2014.
-
RAPTT: An Exact Two-Sample Test in High Dimensions Using Random Projections
Authors:
Radhendushka Srivastava,
Ping Li,
David Ruppert
Abstract:
In high dimensions, the classical Hotelling's $T^2$ test tends to have low power or becomes undefined due to singularity of the sample covariance matrix. In this paper, this problem is overcome by projecting the data matrix onto lower dimensional subspaces through multiplication by random matrices. We propose RAPTT (RAndom Projection T-Test), an exact test for equality of means of two normal populations based on projected lower dimensional data. RAPTT does not require any constraints on the dimension of the data or the sample size. A simulation study indicates that in high dimensions the power of this test is often greater than that of competing tests. The advantage of RAPTT is illustrated on high-dimensional gene expression data involving the discrimination of tumor and normal colon tissues.
Submitted 7 May, 2014;
originally announced May 2014.
-
Restricted Likelihood Ratio Tests for Linearity in Scalar-on-Function Regression
Authors:
Mathew W. McLean,
Giles Hooker,
David Ruppert
Abstract:
We propose a procedure for testing the linearity of a scalar-on-function regression relationship. To do so, we use the functional generalized additive model (FGAM), a recently developed extension of the functional linear model. For a functional covariate X(t), the FGAM models the mean response as the integral with respect to t of F{X(t),t} where F is an unknown bivariate function. The FGAM can be viewed as the natural functional extension of generalized additive models. We show how the functional linear model can be represented as a simple mixed model nested within the FGAM. Using this representation, we then consider restricted likelihood ratio tests for zero variance components in mixed models to test the null hypothesis that the functional linear model holds. The methods are general and can also be applied to testing for interactions in a multivariate additive model or for testing for no effect in the functional linear model. The performance of the proposed tests is assessed on simulated data and in an application to measuring diesel truck emissions, where strong evidence of nonlinearities in the relationship between the functional predictor and the response is found.
Submitted 22 October, 2013;
originally announced October 2013.
-
Fast Covariance Estimation for High-dimensional Functional Data
Authors:
Luo Xiao,
David Ruppert,
Vadim Zipunnikov,
Ciprian Crainiceanu
Abstract:
For smoothing covariance functions, we propose two fast algorithms that scale linearly with the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension $J \times J$ with $J>500$; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions such as $J \ge 10,000$. Covariance matrices of order $J=10,000$, and even $J=100,000$, are becoming increasingly common, e.g., in 2- and 3-dimensional medical imaging and high-density wearable sensor data. We introduce two new algorithms that can handle very large covariance matrices: 1) FACE: a fast implementation of the sandwich smoother and 2) SVDS: a two-step procedure that first applies singular value decomposition to the data matrix and then smooths the eigenvectors. Compared to existing techniques, these new algorithms are at least an order of magnitude faster in high dimensions and drastically reduce memory requirements. The new algorithms provide instantaneous (few seconds) smoothing for matrices of dimension $J=10,000$ and very fast ($<$ 10 minutes) smoothing for $J=100,000$. Although SVDS is simpler than FACE, we provide ready-to-use, scalable R software for FACE. When incorporated into the R package refund, FACE improves the speed of penalized functional regression by an order of magnitude, even for data of normal size ($J <500$). We recommend that FACE be used in practice for the analysis of noisy and high-dimensional functional data.
Submitted 26 February, 2014; v1 submitted 24 June, 2013;
originally announced June 2013.
-
Bayesian Functional Generalized Additive Models with Sparsely Observed Covariates
Authors:
Mathew W. McLean,
Fabian Scheipl,
Giles Hooker,
Sonja Greven,
David Ruppert
Abstract:
The functional generalized additive model (FGAM) was recently proposed in McLean et al. (2013) as a more flexible alternative to the common functional linear model (FLM) for regressing a scalar on functional covariates. In this paper, we develop a Bayesian version of FGAM for the case of Gaussian errors with identity link function. Our approach allows the functional covariates to be sparsely observed and measured with error, whereas the estimation procedure of McLean et al. (2013) required that they be noiselessly observed on a regular grid. We consider both Monte Carlo and variational Bayes methods for fitting the FGAM with sparsely observed covariates. Due to the complicated form of the model posterior distribution and full conditional distributions, standard Monte Carlo and variational Bayes algorithms cannot be used. The strategies we use to handle the updating of parameters without closed-form full conditionals should be of independent interest to applied Bayesian statisticians working with nonconjugate models. Our numerical studies demonstrate the benefits of our algorithms over a two-step approach of first recovering the complete trajectories using standard techniques and then fitting a functional regression model. In a real data analysis, our methods are applied to forecasting closing price for items up for auction on the online auction website eBay.
Submitted 26 May, 2017; v1 submitted 15 May, 2013;
originally announced May 2013.
-
Optimal Prediction in an Additive Functional Model
Authors:
Xiao Wang,
David Ruppert
Abstract:
The functional generalized additive model (FGAM) provides a more flexible nonlinear functional regression model than the well-studied functional linear regression model. This paper restricts attention to the FGAM with identity link and additive errors, which we will call the additive functional model, a generalization of the functional linear model. This paper studies the minimax rate of convergence of predictions from the additive functional model in the framework of reproducing kernel Hilbert space. It is shown that the optimal rate is determined by the decay rate of the eigenvalues of a specific kernel function, which in turn is determined by the reproducing kernel and the joint distribution of any two points in the random predictor function. For the special case of the functional linear model, this kernel function is jointly determined by the covariance function of the predictor function and the reproducing kernel. The easily implementable roughness-regularized predictor is shown to achieve the optimal rate of convergence. Numerical studies are carried out to illustrate the merits of the predictor. Our simulations and real data examples demonstrate a competitive performance against the existing approach.
Submitted 21 January, 2013;
originally announced January 2013.
-
Multilevel Bayesian framework for modeling the production, propagation and detection of ultra-high energy cosmic rays
Authors:
Kunlaya Soiaporn,
David Chernoff,
Thomas Loredo,
David Ruppert,
Ira Wasserman
Abstract:
Ultra-high energy cosmic rays (UHECRs) are atomic nuclei with energies over ten million times energies accessible to human-made particle accelerators. Evidence suggests that they originate from relatively nearby extragalactic sources, but the nature of the sources is unknown. We develop a multilevel Bayesian framework for assessing association of UHECRs and candidate source populations, and Markov chain Monte Carlo algorithms for estimating model parameters and comparing models by computing, via Chib's method, marginal likelihoods and Bayes factors. We demonstrate the framework by analyzing measurements of 69 UHECRs observed by the Pierre Auger Observatory (PAO) from 2004-2009, using a volume-complete catalog of 17 local active galactic nuclei (AGN) out to 15 megaparsecs as candidate sources. An early portion of the data ("period 1," with 14 events) was used by PAO to set an energy cut maximizing the anisotropy in period 1; the 69 measurements include this "tuned" subset, and subsequent "untuned" events with energies above the same cutoff. Also, measurement errors are approximately summarized. These factors are problematic for independent analyses of PAO data. Within the context of "standard candle" source models (i.e., with a common isotropic emission rate), and considering only the 55 untuned events, there is no significant evidence favoring association of UHECRs with local AGN vs. an isotropic background. The highest-probability associations are with the two nearest, adjacent AGN, Centaurus A and NGC 4945. If the association model is adopted, the fraction of UHECRs that may be associated is likely nonzero but is well below 50%. Our framework enables estimation of the angular scale for deflection of cosmic rays by cosmic magnetic fields; relatively modest scales of $\approx\!3^{\circ}$ to $30^{\circ}$ are favored. 
Models that assign a large fraction of UHECRs to a single nearby source (e.g., Centaurus A) are ruled out unless very large deflection scales are specified a priori, and even then they are disfavored. However, including the period 1 data alters the conclusions significantly, and a simulation study supports the idea that the period 1 data are anomalous, presumably due to the tuning. Accurate and optimal analysis of future data will likely require more complete disclosure of the data.
Submitted 28 November, 2013; v1 submitted 20 June, 2012;
originally announced June 2012.
-
Guilt by Association: Finding Cosmic Ray Sources Using Hierarchical Bayesian Clustering
Authors:
Kunlaya Soiaporn,
David Chernoff,
Thomas Loredo,
David Ruppert,
Ira Wasserman
Abstract:
The Earth is continuously showered by charged cosmic ray particles, naturally produced atomic nuclei moving with velocity close to the speed of light. Among these are ultra high energy cosmic ray particles with energy exceeding 5x10^19 eV, which is ten million times more energetic than the most energetic particles produced at the Large Hadron Collider. Astrophysical questions include: what phenomenon accelerates particles to such high energies, and what sort of nuclei are energized? Also, the magnetic deflection of the trajectories of the cosmic rays makes them potential probes of galactic and intergalactic magnetic fields. We develop a Bayesian hierarchical model that can be used to compare different association models between the cosmic rays and source population, using Bayes factors. A measurement model with directional uncertainties and accounting for non-uniform sky exposure is incorporated into the model. The methodology allows us to learn about astrophysical parameters, such as those governing the source luminosity function and the cosmic magnetic field.
Submitted 15 June, 2012;
originally announced June 2012.
-
Fast Bivariate Penalized Splines: the Sandwich Smoother
Authors:
Luo Xiao,
Yingxing Li,
David Ruppert
Abstract:
We propose a fast penalized spline method for bivariate smoothing. Univariate P-spline smoothers (Eilers and Marx, 1996) are applied simultaneously along both coordinates. The new smoother has a sandwich form which suggested the name "sandwich smoother" to a referee. The sandwich smoother has a tensor product structure that simplifies an asymptotic analysis and it can be computed quickly. We derive a local central limit theorem for the sandwich smoother, with simple expressions for the asymptotic bias and variance, by showing that the sandwich smoother is asymptotically equivalent to a bivariate kernel regression estimator with a product kernel. As far as we are aware, this is the first central limit theorem for a bivariate spline estimator of any type. Our simulation study shows that the sandwich smoother is orders of magnitude faster to compute than other bivariate spline smoothers, even when the latter are computed using a fast GLAM (Generalized Linear Array Model) algorithm, and comparable to them in terms of mean squared integrated errors. We extend the sandwich smoother to array data of higher dimensions, where a GLAM algorithm improves the computational speed of the sandwich smoother. One important application of the sandwich smoother is to estimate covariance functions in functional data analysis. In this application, our numerical results show that the sandwich smoother is orders of magnitude faster than local linear regression. The speed of the sandwich formula is important because functional data sets are becoming quite large.
Submitted 13 July, 2012; v1 submitted 22 November, 2010;
originally announced November 2010.