-
Bootstrap Nonparametric Inference under Data Integration
Authors:
Zuofeng Shang,
Peijun Sang,
Chong Jin
Abstract:
We propose multiplier bootstrap procedures for nonparametric inference and uncertainty quantification of the target mean function, based on a novel framework of integrating target and source data. We begin with the relatively easier covariate shift scenario with equal target and source mean functions and propose estimation and inferential procedures through a straightforward combination of all target and source datasets. We next consider the more general and flexible distribution shift scenario with arbitrary target and source mean functions, and propose a two-step inferential procedure. First, we estimate the target-to-source differences based on separate portions of the target and source data. Second, the remaining source data are adjusted by these differences and combined with the remaining target data to perform the multiplier bootstrap procedure. Our method enables local and global inference on the target mean function without using asymptotic distributions. To justify our approach, we derive an optimal convergence rate for the nonparametric estimator and establish bootstrap consistency to estimate the asymptotic distribution of the nonparametric estimator. The proof of global bootstrap consistency involves a central limit theorem for quadratic forms with dependent variables under a conditional probability measure. Our method applies to arbitrary source and target datasets, provided that the data sizes meet a specific quantitative relationship. Simulation studies and real data analysis are provided to examine the performance of our approach.
Submitted 2 January, 2025;
originally announced January 2025.
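To make the multiplier bootstrap idea in the abstract above concrete, here is a minimal sketch for a pointwise interval on a mean function. It is not the authors' procedure: the Nadaraya-Watson estimator, the Gaussian kernel, the Exp(1) multiplier weights, the bandwidth `h`, the simulated data, and the function names `nw_estimate` and `multiplier_bootstrap_ci` are all assumptions chosen for illustration, and no target/source data integration is attempted.

```python
import numpy as np

def nw_estimate(x0, x, y, h, w=None):
    """Nadaraya-Watson estimate of the mean function at x0 (illustrative choice of estimator)."""
    if w is None:
        w = np.ones_like(y)
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * k * y) / np.sum(w * k)

def multiplier_bootstrap_ci(x0, x, y, h, n_boot=500, alpha=0.05, seed=0):
    """Pointwise interval from reweighting observations with i.i.d. multipliers."""
    rng = np.random.default_rng(seed)
    point = nw_estimate(x0, x, y, h)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        # Exp(1) multipliers have mean 1 and variance 1 (an assumed, common choice)
        w = rng.exponential(1.0, size=len(y))
        draws[b] = nw_estimate(x0, x, y, h, w)
    # Basic bootstrap interval: reflect quantiles of (draws - point) around the point estimate
    q_lo, q_hi = np.quantile(draws - point, [alpha / 2, 1 - alpha / 2])
    return point, point - q_hi, point - q_lo

# Simulated data (hypothetical): y = sin(2*pi*x) + noise, so the true mean at x = 0.25 is 1
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=400)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=400)
point, lo, hi = multiplier_bootstrap_ci(0.25, x, y, h=0.05)
print(f"estimate {point:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The interval is built from resampled weights only, so no asymptotic variance formula is needed, which is the practical appeal the abstract points to.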
-
Variational Nonparametric Inference in Functional Stochastic Block Model
Authors:
Zuofeng Shang,
Peijun Sang,
Yang Feng,
Chong Jin
Abstract:
We propose a functional stochastic block model whose vertices involve functional data information. This new model extends the classic stochastic block model with vector-valued nodal information, and finds applications in real-world networks whose nodal information could be functional curves. Examples include international trade data in which a network vertex (country) is associated with the annual or quarterly GDP over a certain time period, and MyFitnessPal data in which a network vertex (MyFitnessPal user) is associated with daily calorie information measured over a certain time period. Two statistical tasks will be jointly executed. First, we will detect community structures of the network vertices assisted by the functional nodal information. Second, we propose a computationally efficient variational test to examine the significance of the functional nodal information. We show that the community detection algorithms achieve weak and strong consistency, and the variational test is asymptotically chi-square with diverging degrees of freedom. As a byproduct, we propose pointwise confidence intervals for the slope function of the functional nodal information. Our methods are examined through both simulated and real datasets.
Submitted 29 June, 2024;
originally announced July 2024.
-
Empirical likelihood test for community structure in networks
Authors:
Mingao Yuan,
Sharmin Hossain,
Zuofeng Shang
Abstract:
Network data, characterized by interconnected nodes and edges, is pervasive in various domains and has gained significant popularity in recent years. In network data analysis, testing the presence of community structure in a network is one of the important research tasks. Existing tests are mainly developed for unweighted networks. In this paper, we study the problem of testing the existence of community structure in general (either weighted or unweighted) networks. We propose two new tests: the Weighted Signed-Triangle (WST) test and the empirical likelihood (EL) test. Both tests can be applied to weighted or unweighted networks and outperform existing tests for small networks. The EL test may outperform the WST test for small networks.
Submitted 25 July, 2023;
originally announced July 2023.
-
Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data
Authors:
Hu Sun,
Zuofeng Shang,
Yang Chen
Abstract:
We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducing Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC.
Submitted 17 May, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Statistical Inference with Stochastic Gradient Methods under $φ$-mixing Data
Authors:
Ruiqi Liu,
Xi Chen,
Zuofeng Shang
Abstract:
Stochastic gradient descent (SGD) is a scalable and memory-efficient optimization algorithm for large datasets and stream data, which has drawn a great deal of attention and popularity. The applications of SGD-based estimators to statistical inference such as interval estimation have also achieved great success. However, most of the related works are based on i.i.d. observations or Markov chains. When the observations come from a mixing time series, how to conduct valid statistical inference remains unexplored. As a matter of fact, the general correlation among observations imposes a challenge on interval estimation. Most existing methods may ignore this correlation and lead to invalid confidence intervals. In this paper, we propose a mini-batch SGD estimator for statistical inference when the data is $φ$-mixing. The confidence intervals are constructed using an associated mini-batch bootstrap SGD procedure. Using the "independent block" trick from \cite{yu1994rates}, we show that the proposed estimator is asymptotically normal, and its limiting distribution can be effectively approximated by the bootstrap procedure. The proposed method is memory-efficient and easy to implement in practice. Simulation studies on synthetic data and an application to a real-world dataset confirm our theory.
Submitted 28 March, 2023; v1 submitted 24 February, 2023;
originally announced February 2023.
-
Scalable inference in functional linear regression with streaming data
Authors:
Jinhan Xie,
Enze Shi,
Peijun Sang,
Zuofeng Shang,
Bei Jiang,
Linglong Kong
Abstract:
Traditional static functional data analysis is facing new challenges due to streaming data, where data constantly flow in. A major challenge is that storing such an ever-increasing amount of data in memory is nearly impossible. In addition, existing inferential tools in online learning are mainly developed for finite-dimensional problems, while inference methods for functional data are focused on the batch learning setting. In this paper, we tackle these issues by developing functional stochastic gradient descent algorithms and proposing an online bootstrap resampling procedure to systematically study the inference problem for functional linear regression. In particular, the proposed estimation and inference procedures use only one pass over the data; thus they are easy to implement and suitable to the situation where data arrive in a streaming manner. Furthermore, we establish the convergence rate as well as the asymptotic distribution of the proposed estimator. Meanwhile, the proposed perturbed estimator from the bootstrap procedure is shown to enjoy the same theoretical properties, which provide the theoretical justification for our online inference tool. As far as we know, this is the first inference result on the functional linear regression model with streaming data. Simulation studies are conducted to investigate the finite-sample performance of the proposed procedure. An application is illustrated with the Beijing multi-site air-quality data.
Submitted 10 October, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Multiscale topology classifies and quantifies cell types in subcellular spatial transcriptomics
Authors:
Katherine Benjamin,
Aneesha Bhandari,
Zhouchun Shang,
Yanan Xing,
Yanru An,
Nannan Zhang,
Yong Hou,
Ulrike Tillmann,
Katherine R. Bull,
Heather A. Harrington
Abstract:
Spatial transcriptomics has the potential to transform our understanding of RNA expression in tissues. Classical array-based technologies produce multiple-cell-scale measurements requiring deconvolution to recover single cell information. However, rapid advances in subcellular measurement of RNA expression at whole-transcriptome depth necessitate a fundamentally different approach. To integrate single-cell RNA-seq data with nanoscale spatial transcriptomics, we present a topological method for automatic cell type identification (TopACT). Unlike popular decomposition approaches to multicellular resolution data, TopACT is able to pinpoint the spatial locations of individual sparsely dispersed cells without prior knowledge of cell boundaries. Pairing TopACT with multiparameter persistent homology landscapes predicts immune cells forming a peripheral ring structure within kidney glomeruli in a murine model of lupus nephritis, which we experimentally validate with immunofluorescent imaging. The proposed topological data analysis unifies multiple biological scales, from subcellular gene expression to multicellular tissue organization.
Submitted 13 December, 2022;
originally announced December 2022.
-
Solar Flare Index Prediction Using SDO/HMI Vector Magnetic Data Products with Statistical and Machine Learning Methods
Authors:
Hewei Zhang,
Qin Li,
Yanxing Yang,
Ju Jing,
Jason T. L. Wang,
Haimin Wang,
Zuofeng Shang
Abstract:
Solar flares, especially the M- and X-class flares, are often associated with coronal mass ejections (CMEs). They are the most important sources of space weather effects, which can severely impact the near-Earth environment. Thus it is essential to forecast flares (especially the M- and X-class ones) to mitigate their destructive and hazardous consequences. Here, we introduce several statistical and machine learning approaches to the prediction of an active region's (AR's) Flare Index (FI), which quantifies the flare productivity of an AR by taking into account the numbers of different class flares within a certain time interval. Specifically, our sample includes 563 ARs that appeared on the solar disk from May 2010 to Dec 2017. The 25 magnetic parameters, provided by the Space-weather HMI Active Region Patches (SHARP) from the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO), characterize coronal magnetic energy stored in ARs by proxy and are used as the predictors. We investigate the relationship between these SHARP parameters and the FI of ARs with a machine-learning algorithm (spline regression) and a resampling method (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise, SMOGN for short). Based on the established relationship, we are able to predict the value of FIs for a given AR within the next 1-day period. Compared with 4 other popular machine learning algorithms, our methods improve the accuracy of FI prediction, especially for large FI. In addition, we rank the importance of SHARP parameters by the Borda Count method, calculated from the ranks that are rendered by 9 different machine learning methods.
Submitted 1 December, 2022; v1 submitted 27 September, 2022;
originally announced September 2022.
-
Partial-Mastery Cognitive Diagnosis Models
Authors:
Zhuoran Shang,
Elena A. Erosheva,
Gongjun Xu
Abstract:
Cognitive diagnosis models (CDMs) are a family of discrete latent attribute models that serve as a statistical basis in educational and psychological cognitive diagnosis assessments. CDMs aim to achieve fine-grained inference on individuals' latent attributes, based on their observed responses to a set of designed diagnostic items. In the literature, CDMs usually assume that items require mastery of specific latent attributes and that each attribute is either fully mastered or not mastered by a given subject. We propose a new class of models, partial mastery CDMs (PM-CDMs), that generalizes CDMs by allowing for partial mastery levels for each attribute of interest. We demonstrate that PM-CDMs can be represented as restricted latent class models. Relying on the latent class representation, we propose a Bayesian approach for estimation. We present simulation studies to demonstrate parameter recovery, to investigate the impact of model misspecification with respect to partial mastery, and to develop diagnostic tools that could be used by practitioners to decide between CDMs and PM-CDMs. We use two examples of real test data -- the fraction subtraction and the English tests -- to demonstrate that employing PM-CDMs not only improves model fit, compared to CDMs, but also can make a substantial difference in conclusions about attribute mastery. We conclude that PM-CDMs can lead to more effective remediation programs by providing detailed individual-level information about skills learned and skills that need further study.
Submitted 5 August, 2022;
originally announced August 2022.
-
Deep Neural Network Classifier for Multi-dimensional Functional Data
Authors:
Shuoyang Wang,
Guanqun Cao,
Zuofeng Shang
Abstract:
We propose a new approach, called the functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data and is then used to predict the class label of a future data function. Unlike the popular functional discriminant analysis approaches, which rely on a Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets.
Submitted 17 May, 2022;
originally announced May 2022.
-
Information-theoretic Limits for Testing Community Structures in Weighted Networks
Authors:
Mingao Yuan,
Zuofeng Shang
Abstract:
Community detection refers to the problem of clustering the nodes of a network into groups. Existing inferential methods for community structure mainly focus on unweighted (binary) networks. Many real-world networks are nonetheless weighted, and a common practice is to dichotomize a weighted network to an unweighted one, which is known to result in information loss. Literature on hypothesis testing in the latter situation is still missing. In this paper, we study the problem of testing the existence of community structure in weighted networks. Our contributions are threefold: (a) We use the (possibly infinite-dimensional) exponential family to model the weights and derive the sharp information-theoretic limit for the existence of a consistent test. Within the limit, any test is inconsistent; and beyond the limit, we propose a useful consistent test. (b) Based on the information-theoretic limits, we provide the first formal way to quantify the loss of information incurred by dichotomizing weighted graphs into unweighted graphs in the context of hypothesis testing. (c) We propose several new and practically useful test statistics. Simulation studies show that the proposed tests have good performance. Finally, we apply the proposed tests to an animal social network.
Submitted 19 April, 2022;
originally announced April 2022.
-
Statistical Inference for Functional Linear Quantile Regression
Authors:
Peijun Sang,
Zuofeng Shang,
Pang Du
Abstract:
We propose inferential tools for functional linear quantile regression where the conditional quantile of a scalar response is assumed to be a linear functional of a functional covariate. In contrast to conventional approaches, we employ kernel convolution to smooth the original loss function. The coefficient function is estimated under a reproducing kernel Hilbert space framework. A gradient descent algorithm is designed to minimize the smoothed loss function with a roughness penalty. With the aid of the Banach fixed-point theorem, we show the existence and uniqueness of our proposed estimator as the minimizer of the regularized loss function in an appropriate Hilbert space. Furthermore, we establish the convergence rate as well as the weak convergence of our estimator. As far as we know, this is the first weak convergence result for a functional quantile regression model. Pointwise confidence intervals and a simultaneous confidence band for the true coefficient function are then developed based on these theoretical properties. Numerical studies including both simulations and a data application are conducted to investigate the performance of our estimator and inference tools in finite samples.
Submitted 23 February, 2022;
originally announced February 2022.
-
Statistical Limits for Testing Correlation of Hypergraphs
Authors:
Mingao Yuan,
Zuofeng Shang
Abstract:
In this paper, we consider the hypothesis testing of correlation between two $m$-uniform hypergraphs on $n$ unlabelled nodes. Under the null hypothesis, the hypergraphs are independent, while under the alternative hypothesis, the hyperedges have the same marginal distributions as in the null hypothesis but are correlated after some unknown node permutation. We focus on two scenarios: the hypergraphs are generated from the Gaussian-Wigner model and the dense Erdős-Rényi model. We derive the sharp information-theoretic testing threshold. Above the threshold, there exists a powerful test to distinguish the alternative hypothesis from the null hypothesis. Below the threshold, the alternative hypothesis and the null hypothesis are not distinguishable. The threshold involves $m$ and decreases as $m$ gets larger. This indicates that testing correlation of hypergraphs ($m\geq3$) is easier than testing correlation of graphs ($m=2$).
Submitted 11 February, 2022;
originally announced February 2022.
-
An Approach of Bayesian Variable Selection for Ultrahigh Dimensional Multivariate Regression
Authors:
Xiaotian Dai,
Guifang Fu,
Randall Reese,
Shaofei Zhao,
Zuofeng Shang
Abstract:
In many applications, scientists are particularly interested in detecting which of the predictors are truly associated with a multivariate response. It is more accurate to model multiple responses as one vector rather than separating each component one by one. This is particularly true for complex traits having multiple correlated components. A Bayesian multivariate variable selection (BMVS) approach is proposed to select important predictors influencing the multivariate response from a candidate pool with an ultrahigh dimension. By applying sample-size-dependent spike and slab priors, the BMVS approach satisfies the strong selection consistency property under certain conditions, which represents an advantage of BMVS over other existing Bayesian multivariate regression-based approaches. The proposed approach considers the covariance structure of multiple responses without assuming independence and integrates the estimation of covariance-related parameters together with all regression parameters into one framework through a fast updating MCMC procedure. It is demonstrated through simulations that the BMVS approach outperforms some other relevant frequentist and Bayesian approaches. The proposed BMVS approach is flexible enough for a wide range of applications, including genome-wide association studies with multiple correlated phenotypes and a large number of genetic variants and/or environmental variables, as demonstrated in the real data analyses section. The computer code and test data of the proposed method are available as an R package.
Submitted 15 November, 2021;
originally announced November 2021.
-
Calibrating multi-dimensional complex ODE from noisy data via deep neural networks
Authors:
Kexuan Li,
Fangfang Wang,
Ruiqi Liu,
Fan Yang,
Zuofeng Shang
Abstract:
Ordinary differential equations (ODEs) are widely used to model complex dynamics that arise in biology, chemistry, engineering, finance, physics, etc. Calibration of a complicated ODE system using noisy data is generally very difficult. In this work, we propose a two-stage nonparametric approach to address this problem. We first extract the de-noised data and their higher order derivatives using a boundary kernel method, and then feed them into a sparsely connected deep neural network with a ReLU activation function. Our method is able to recover the ODE system without being subject to the curse of dimensionality or to a complicated ODE structure. When the ODE possesses a general modular structure, with each modular component involving only a few input variables, and the network architecture is properly chosen, our method is proven to be consistent. Theoretical properties are corroborated by an extensive simulation study that demonstrates the validity and effectiveness of the proposed method. Finally, we use our method to simultaneously characterize the growth rate of Covid-19 infection cases from 50 states of the USA.
Submitted 18 September, 2023; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Detection of a rank-one signal with limited training data
Authors:
Weijian Liu,
Zhaojian Zhang,
Jun Liu,
Zheran Shang,
Yong-Liang Wang
Abstract:
In this paper, we reconsider the problem of detecting a matrix-valued rank-one signal in unknown Gaussian noise, which was previously addressed for the case of sufficient training data. We relax this assumption to the case of limited training data. We re-derive the corresponding generalized likelihood ratio test (GLRT) and two-step GLRT (2S-GLRT) based on a certain unitary transformation of the test data. It is shown that the re-derived detectors can work with low sample support. Moreover, in sample-abundant environments the re-derived GLRT is the same as the previously proposed GLRT and the re-derived 2S-GLRT has better detection performance than the previously proposed 2S-GLRT. Numerical examples are provided to demonstrate the effectiveness of the re-derived detectors.
Submitted 13 April, 2021;
originally announced May 2021.
-
Online Statistical Inference for Parameters Estimation with Linear-Equality Constraints
Authors:
Ruiqi Liu,
Mingao Yuan,
Zuofeng Shang
Abstract:
Stochastic gradient descent (SGD) and projected stochastic gradient descent (PSGD) are scalable algorithms to compute model parameters in unconstrained and constrained optimization problems. In comparison with SGD, PSGD forces its iterative values into the constrained parameter space via projection. From a statistical point of view, this paper studies the limiting distribution of the PSGD-based estimate when the true parameters satisfy some linear-equality constraints. Our theoretical findings reveal the role that projection plays in the uncertainty of the PSGD-based estimate. As a byproduct, we propose an online hypothesis testing procedure to test the linear-equality constraints. Simulation studies on synthetic data and an application to a real-world dataset confirm our theory.
Submitted 22 March, 2022; v1 submitted 21 May, 2021;
originally announced May 2021.
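The projection step that distinguishes PSGD from SGD in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the paper's estimator or its inference procedure: the least-squares loss, the step-size schedule `lr0 * t ** (-decay)`, the averaging of iterates, the simulated data, and the names `make_projector` and `psgd_least_squares` are all hypothetical. The only fixed ingredient is the standard Euclidean projection onto the affine set {θ : Aθ = b}, which for full-row-rank A is θ − Aᵀ(AAᵀ)⁻¹(Aθ − b).

```python
import numpy as np

def make_projector(A, b):
    """Euclidean projection onto {theta : A @ theta = b}; assumes A has full row rank."""
    AAt_inv = np.linalg.inv(A @ A.T)
    def project(theta):
        return theta - A.T @ (AAt_inv @ (A @ theta - b))
    return project

def psgd_least_squares(X, y, A, b, lr0=0.2, decay=0.51, seed=0):
    """One pass of projected SGD for least squares, returning the averaged iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    project = make_projector(A, b)
    theta = project(np.zeros(d))          # start from a feasible point
    avg = np.zeros(d)
    for t, i in enumerate(rng.permutation(n), start=1):
        grad = (X[i] @ theta - y[i]) * X[i]                   # gradient of 0.5 * (x'theta - y)^2
        theta = project(theta - lr0 * t ** (-decay) * grad)   # gradient step, then projection
        avg += (theta - avg) / t          # running average of feasible iterates (assumed choice)
    return avg

# Simulated data (hypothetical): coefficients constrained to sum to one
rng = np.random.default_rng(1)
theta_true = np.array([0.2, 0.3, 0.5])
X = rng.normal(size=(5000, 3))
y = X @ theta_true + 0.1 * rng.normal(size=5000)
A, b = np.array([[1.0, 1.0, 1.0]]), np.array([1.0])
est = psgd_least_squares(X, y, A, b)
print(est, A @ est)
```

Because the constraint set is affine, every iterate and their average satisfy the constraint up to floating-point error, so the estimate has no variability in the directions spanned by the rows of `A`.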
-
Distributed Adaptive Nearest Neighbor Classifier: Algorithm and Theory
Authors:
Ruiqi Liu,
Ganggang Xu,
Zuofeng Shang
Abstract:
When data is of an extraordinarily large size or physically stored in different locations, the distributed nearest neighbor (NN) classifier is an attractive tool for classification. We propose a novel distributed adaptive NN classifier for which the number of nearest neighbors is a tuning parameter stochastically chosen by a data-driven criterion. An early stopping rule is proposed when searching for the optimal tuning parameter, which not only speeds up the computation but also improves the finite sample performance of the proposed algorithm. The convergence rate of the excess risk of the distributed adaptive NN classifier is investigated under various sub-sample size compositions. In particular, we show that when the sub-sample sizes are sufficiently large, the proposed classifier achieves the nearly optimal convergence rate. The effectiveness of the proposed approach is demonstrated through simulation studies as well as an empirical application to a real-world dataset.
Submitted 3 June, 2023; v1 submitted 20 May, 2021;
originally announced May 2021.
-
Information Limits for Detecting a Subhypergraph
Authors:
Mingao Yuan,
Zuofeng Shang
Abstract:
We consider the problem of recovering a subhypergraph based on an observed adjacency tensor corresponding to a uniform hypergraph. The uniform hypergraph is assumed to contain a subset of vertices called a subhypergraph. The edges restricted to the subhypergraph are assumed to follow a different probability distribution than other edges. We consider both weak recovery and exact recovery of the subhypergraph, and establish information-theoretic limits in each case. Specifically, we establish sharp conditions for the possibility of weakly or exactly recovering the subhypergraph from an information-theoretic point of view. These conditions are fundamentally different from their counterparts derived in the hypothesis testing literature.
Submitted 5 May, 2021;
originally announced May 2021.
-
Heterogeneous Dense Subhypergraph Detection
Authors:
Mingao Yuan,
Zuofeng Shang
Abstract:
We study the problem of testing the existence of a heterogeneous dense subhypergraph. The null hypothesis corresponds to a heterogeneous Erdős-Rényi uniform random hypergraph and the alternative hypothesis corresponds to a heterogeneous uniform random hypergraph that contains a dense subhypergraph. We establish detection boundaries when the edge probabilities are known and construct an asymptotically powerful test for distinguishing the hypotheses. We also construct an adaptive test which does not involve edge probabilities and hence is more practically useful.
Submitted 8 April, 2021;
originally announced April 2021.
-
Optimal Classification for Functional Data
Authors:
Shuoyang Wang,
Zuofeng Shang,
Guanqun Cao,
Jun Liu
Abstract:
A central topic in functional data analysis is how to design an optimal decision rule, based on training samples, to classify a data function. We exploit the optimal classification problem when data functions are Gaussian processes. Sharp nonasymptotic convergence rates for minimax excess mis-classification risk are derived in both settings that data functions are fully observed and discretely observed. We explore two easily implementable classifiers based on discriminant analysis and deep neural network, respectively, which are both proven to achieve optimality in Gaussian setting. Our deep neural network classifier is new in literature which demonstrates outstanding performance even when data functions are non-Gaussian. In case of discretely observed data, we discover a novel critical sampling frequency that governs the sharp convergence rates. The proposed classifiers perform favorably in finite-sample applications, as we demonstrate through comparisons with other functional classifiers in simulations and one real data application.
Submitted 10 September, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Sharp detection boundaries on testing dense subhypergraph
Authors:
Mingao Yuan,
Zuofeng Shang
Abstract:
We study the problem of testing the existence of a dense subhypergraph. The null hypothesis is an Erdős-Rényi uniform random hypergraph and the alternative hypothesis is a uniform random hypergraph that contains a dense subhypergraph. We establish sharp detection boundaries in both scenarios: (1) the edge probabilities are known; (2) the edge probabilities are unknown. In both scenarios, sharp detectable boundaries are characterized by the appropriate model parameters. Asymptotically powerful tests are provided when the model parameters fall in the detectable regions. Our results indicate that the detectable regions for general hypergraph models are dramatically different from their graph counterparts.
Submitted 12 January, 2021;
originally announced January 2021.
-
Estimation of the Mean Function of Functional Data via Deep Neural Networks
Authors:
Shuoyang Wang,
Guanqun Cao,
Zuofeng Shang
Abstract:
In this work, we propose a deep neural network method to perform nonparametric regression for functional data. The proposed estimators are based on sparsely connected deep neural networks with ReLU activation function. By properly choosing network architecture, our estimator achieves the optimal nonparametric convergence rate in empirical norm. Under certain circumstances such as trigonometric polynomial kernel and a sufficiently large sampling frequency, the convergence rate is even faster than root-$n$ rate. Through Monte Carlo simulation studies we examine the finite-sample performance of the proposed method. Finally, the proposed method is applied to analyze positron emission tomography images of patients with Alzheimer disease obtained from the Alzheimer Disease Neuroimaging Initiative database.
Submitted 8 December, 2020;
originally announced December 2020.
-
A Computationally Efficient Classification Algorithm in Posterior Drift Model: Phase Transition and Minimax Adaptivity
Authors:
Ruiqi Liu,
Kexuan Li,
Zuofeng Shang
Abstract:
In massive data analysis, training and testing data often come from very different sources, and their probability distributions are not necessarily identical. A notable example is nonparametric classification in the posterior drift model, where the conditional distributions of the label given the covariates are possibly different. In this paper, we derive the minimax rate of the excess risk for nonparametric classification in the posterior drift model in the setting where both training and testing data have smooth distributions, extending a recent work by Cai and Wei (2019), who only impose a smoothness condition on the distribution of testing data. The minimax rate demonstrates a phase transition characterized by the mutual relationship between the smoothness orders of the training and testing data distributions. We also propose a computationally efficient and data-driven nearest neighbor classifier which achieves the minimax excess risk (up to a logarithmic factor). Simulation studies and a real-world application are conducted to demonstrate our approach.
Submitted 8 November, 2020;
originally announced November 2020.
-
On Deep Instrumental Variables Estimate
Authors:
Ruiqi Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
The endogeneity issue is fundamentally important, as many empirical applications may suffer from the omission of explanatory variables, measurement error, or simultaneous causality. Recently, Hartford et al. (2017) proposed a "Deep Instrumental Variable (IV)" framework based on deep neural networks to address endogeneity, demonstrating performance superior to existing approaches. The aim of this paper is to theoretically understand the empirical success of the Deep IV. Specifically, we consider a two-stage estimator using deep neural networks in the linear instrumental variables model. By imposing a latent structural assumption on the reduced form equation between endogenous variables and instrumental variables, the first-stage estimator can automatically capture this latent structure and converge to the optimal instruments at the minimax optimal rate, which is free of the dimension of instrumental variables and thus mitigates the curse of dimensionality. Additionally, in comparison with classical methods, due to the faster convergence rate of the first-stage estimator, the second-stage estimator has a smaller (second order) estimation error and requires a weaker condition on the smoothness of the optimal instruments. Given that the depth and width of the employed deep neural network are well chosen, we further show that the second-stage estimator achieves the semiparametric efficiency bound. Simulation studies on synthetic data and an application to automobile market data confirm our theory.
Submitted 30 April, 2020;
originally announced April 2020.
-
Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
Authors:
Tianyang Hu,
Zuofeng Shang,
Guang Cheng
Abstract:
Classifiers built with neural networks handle large-scale high dimensional data, such as facial images from computer vision, extremely well while traditional statistical methods often fail miserably. In this paper, we attempt to understand this empirical success in high dimensional classification by deriving the convergence rates of excess risk. In particular, a teacher-student framework is proposed that assumes the Bayes classifier to be expressed as ReLU neural networks. In this setup, we obtain a sharp rate of convergence, i.e., $\tilde{O}_d(n^{-2/3})$, for classifiers trained using either 0-1 loss or hinge loss. This rate can be further improved to $\tilde{O}_d(n^{-1})$ when the data distribution is separable. Here, $n$ denotes the sample size. An interesting observation is that the data dimension only contributes to the $\log(n)$ term in the above rates. This may provide one theoretical explanation for the empirical successes of deep neural networks in high dimensional classification, particularly for structured data.
Submitted 31 January, 2020; v1 submitted 19 January, 2020;
originally announced January 2020.
-
Statistical Inference on Partially Linear Panel Model under Unobserved Linearity
Authors:
Ruiqi Liu,
Ben Boukai,
Zuofeng Shang
Abstract:
A new statistical procedure, based on a modified spline basis, is proposed to identify the linear components in the panel data model with fixed effects. Under some mild assumptions, the proposed procedure is shown to consistently estimate the underlying regression function, correctly select the linear components, and effectively conduct the statistical inference. When compared to existing methods for detection of linearity in the panel model, our approach is demonstrated to be theoretically justified as well as practically convenient. We provide a computational algorithm that implements the proposed procedure along with a path-based solution method for linearity detection, which avoids the burden of selecting the tuning parameter for the penalty term. Monte Carlo simulations are conducted to examine the finite sample performance of our proposed procedure with detailed findings that confirm our theoretical results in the paper. Applications to Aggregate Production and Environmental Kuznets Curve data also illustrate the necessity for detecting linearity in the partially linear panel model.
Submitted 20 November, 2019;
originally announced November 2019.
-
Minimax Nonparametric Two-sample Test under Smoothing
Authors:
Xin Xing,
Zuofeng Shang,
Pang Du,
Ping Ma,
Wenxuan Zhong,
Jun S. Liu
Abstract:
We consider the problem of comparing probability densities between two groups. A new probabilistic tensor product smoothing spline framework is developed to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions. We propose a penalized likelihood ratio test for such interaction testing and show that the test statistic is asymptotically chi-square distributed under the null hypothesis. Furthermore, we derive a sharp minimax testing rate based on the Bernstein width for nonparametric two-sample tests and show that our proposed test statistic is minimax optimal. In addition, a data-adaptive tuning criterion is developed to choose the penalty parameter. Simulations and real applications demonstrate that the proposed test outperforms the conventional approaches under various scenarios.
Submitted 11 January, 2021; v1 submitted 5 November, 2019;
originally announced November 2019.
-
Optimal Nonparametric Inference via Deep Neural Network
Authors:
Ruiqi Liu,
Ben Boukai,
Zuofeng Shang
Abstract:
The deep neural network is a state-of-the-art method in modern science and technology. Much statistical literature has been devoted to understanding its performance in nonparametric estimation, whereas the results are suboptimal due to a redundant logarithmic sacrifice. In this paper, we show that such log-factors are not necessary. We derive upper bounds for the $L^2$ minimax risk in nonparametric estimation. Sufficient conditions on network architectures are provided such that the upper bounds become optimal (without log-sacrifice). Our proof relies on an explicitly constructed network estimator based on tensor product B-splines. We also derive asymptotic distributions for the constructed network and a related hypothesis testing procedure. The testing procedure is further proven to be minimax optimal under suitable network architectures.
Submitted 16 August, 2021; v1 submitted 5 February, 2019;
originally announced February 2019.
-
Nonparametric Inference under B-bits Quantization
Authors:
Kexuan Li,
Ruiqi Liu,
Ganggang Xu,
Zuofeng Shang
Abstract:
Statistical inference based on lossy or incomplete samples is often needed in research areas such as signal/image processing, medical image storage, remote sensing, and signal transmission. In this paper, we propose a nonparametric testing procedure based on samples quantized to $B$ bits through a computationally efficient algorithm. Under mild technical conditions, we establish the asymptotic properties of the proposed test statistic and investigate how the testing power changes as $B$ increases. In particular, we show that if $B$ exceeds a certain threshold, the proposed nonparametric testing procedure achieves the classical minimax rate of testing (Shang and Cheng, 2015) for spline models. We further extend our theoretical investigations to a nonparametric linearity test and an adaptive nonparametric test, expanding the applicability of the proposed methods. Extensive simulation studies together with a real-data analysis are used to demonstrate the validity and effectiveness of the proposed tests.
Submitted 11 August, 2023; v1 submitted 24 January, 2019;
originally announced January 2019.
-
A likelihood-ratio type test for stochastic block models with bounded degrees
Authors:
Mingao Yuan,
Yang Feng,
Zuofeng Shang
Abstract:
A fundamental problem in network data analysis is to test the Erdős-Rényi model $\mathcal{G}\left(n,\frac{a+b}{2n}\right)$ versus a bisection stochastic block model $\mathcal{G}\left(n,\frac{a}{n},\frac{b}{n}\right)$, where $a,b>0$ are constants that represent the expected degrees of the graphs and $n$ denotes the number of nodes. This problem serves as the foundation of many other problems such as testing-based methods for determining the number of communities (\cite{BS16,L16}) and community detection (\cite{MS16}). Existing work has been focusing on growing-degree regime $a,b\to\infty$ (\cite{BS16,L16,MS16,BM17,B18,GL17a,GL17b}) while leaving the bounded-degree regime untreated. In this paper, we propose a likelihood-ratio (LR) type procedure based on regularization to test stochastic block models with bounded degrees. We derive the limit distributions as power Poisson laws under both null and alternative hypotheses, based on which the limit power of the test is carefully analyzed. We also examine a Monte-Carlo method that partly resolves the computational cost issue. The proposed procedures are examined by both simulated and real-world data. The proof depends on a contiguity theory developed by Janson \cite{J95}.
Submitted 22 November, 2018; v1 submitted 12 July, 2018;
originally announced July 2018.
-
How Many Machines Can We Use in Parallel Computing for Kernel Ridge Regression?
Authors:
Meimei Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
This paper aims to solve a basic problem in distributed statistical inference: how many machines can we use in parallel computing? In kernel ridge regression, we address this question in two important settings: nonparametric estimation and hypothesis testing. Specifically, we find a range for the number of machines under which optimal estimation/testing is achievable. The employed empirical processes method provides a unified framework that allows us to handle various regression problems (such as thin-plate splines and nonparametric additive regression) under different settings (such as univariate, multivariate and diverging-dimensional designs). It is worth noting that the upper bounds on the number of machines are proven to be unimprovable (up to a logarithmic factor) in two important cases: smoothing spline regression and Gaussian RKHS regression. Our theoretical findings are backed by thorough numerical studies.
Submitted 23 February, 2019; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Nonparametric Testing under Random Projection
Authors:
Meimei Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. An adaptive testing procedure is further established without prior knowledge of regularity. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues. Simulations and real data analysis are conducted to support our theory.
Submitted 17 February, 2018;
originally announced February 2018.
-
Distributed Generalized Cross-Validation for Divide-and-Conquer Kernel Ridge Regression and its Asymptotic Optimality
Authors:
Ganggang Xu,
Zuofeng Shang,
Guang Cheng
Abstract:
Tuning parameter selection is of critical importance for kernel ridge regression. To date, a data-driven tuning method for divide-and-conquer kernel ridge regression (d-KRR) has been lacking in the literature, which limits the applicability of d-KRR for large data sets. In this paper, by modifying the Generalized Cross-Validation (GCV, Wahba, 1990) score, we propose a distributed Generalized Cross-Validation (dGCV) as a data-driven tool for selecting the tuning parameters in d-KRR. Not only is the proposed dGCV computationally scalable for massive data sets, it is also shown, under mild conditions, to be asymptotically optimal in the sense that minimizing the dGCV score is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator, extending the existing optimality results of GCV to the divide-and-conquer framework.
Submitted 18 February, 2019; v1 submitted 18 December, 2016;
originally announced December 2016.
-
Sparse and Efficient Estimation for Partial Spline Models with Increasing Dimension
Authors:
Guang Cheng,
Hao Helen Zhang,
Zuofeng Shang
Abstract:
We consider model selection and estimation for partial spline models and propose a new regularization method in the context of smoothing splines. The regularization method has a simple yet elegant form, consisting of roughness penalty on the nonparametric component and shrinkage penalty on the parametric components, which can achieve function smoothing and sparse estimation simultaneously. We establish the convergence rate and oracle properties of the estimator under weak regularity conditions. Remarkably, the estimated parametric components are sparse and efficient, and the nonparametric component can be estimated with the optimal rate. The procedure also has attractive computational properties. Using the representer theory of smoothing splines, we reformulate the objective function as a LASSO-type problem, enabling us to use the LARS algorithm to compute the solution path. We then extend the procedure to situations when the number of predictors increases with the sample size and investigate its asymptotic properties in that context. Finite-sample performance is illustrated by simulations.
Submitted 21 November, 2013; v1 submitted 31 October, 2013;
originally announced October 2013.
-
High-Dimensional Bayesian Inference in Nonparametric Additive Models
Authors:
Zuofeng Shang,
Ping Li
Abstract:
A fully Bayesian approach is proposed for ultrahigh-dimensional nonparametric additive models in which the number of additive components may be larger than the sample size, though ideally the true model is believed to include only a small number of components. Bayesian approaches can conduct stochastic model search and fulfill flexible parameter estimation by stochastic draws. The theory shows that the proposed model selection method has satisfactory properties. For instance, when the hyperparameter associated with the model prior is correctly specified, the true model has posterior probability approaching one as the sample size goes to infinity; when this hyperparameter is incorrectly specified, the selected model is still acceptable since asymptotically it is proved to be nested in the true model. To enhance model flexibility, two new $g$-priors are proposed and their theoretical performance is examined. We also propose an efficient MCMC algorithm to handle the computational issues. Several simulation examples are provided to demonstrate the computational advantages of our method.
Submitted 23 September, 2013; v1 submitted 28 June, 2013;
originally announced July 2013.
-
Bayesian Ultrahigh-Dimensional Screening Via MCMC
Authors:
Zuofeng Shang,
Ping Li
Abstract:
We explore the theoretical and numerical properties of a fully Bayesian model selection method in sparse ultrahigh-dimensional settings, i.e., $p\gg n$, where $p$ is the number of covariates and $n$ is the sample size. Our method consists of (1) a hierarchical Bayesian model with a novel prior placed over the model space which includes a hyperparameter $t_n$ controlling the model size, and (2) an efficient MCMC algorithm for automatic and stochastic search of the models. Our theory shows that, when $t_n$ is specified correctly, the proposed method yields selection consistency, i.e., the posterior probability of the true model asymptotically approaches one; when $t_n$ is misspecified, the selected model is still asymptotically nested in the true model. The theory also reveals insensitivity of the selection result with respect to the choice of $t_n$. In implementations, a reasonable prior is further assumed on $t_n$ which allows us to draw its samples stochastically. Our approach conducts selection, estimation and even inference in a unified framework. No additional prescreening or dimension reduction step is needed. Two novel $g$-priors are proposed to make our approach more flexible. A simulation study is given to display the numerical advantage of our method.
Submitted 12 March, 2013; v1 submitted 5 February, 2013;
originally announced February 2013.
-
Local and global asymptotic inference in smoothing spline models
Authors:
Zuofeng Shang,
Guang Cheng
Abstract:
This article studies local and global inference for smoothing spline estimation in a unified asymptotic framework. We first introduce a new technical tool called functional Bahadur representation, which significantly generalizes the traditional Bahadur representation in parametric models, that is, Bahadur [Ann. Inst. Statist. Math. 37 (1966) 577-580]. Equipped with this tool, we develop four interconnected procedures for inference: (i) pointwise confidence interval; (ii) local likelihood ratio testing; (iii) simultaneous confidence band; (iv) global likelihood ratio testing. In particular, our confidence intervals are proved to be asymptotically valid at any point in the support, and they are shorter on average than the Bayesian confidence intervals proposed by Wahba [J. R. Stat. Soc. Ser. B Stat. Methodol. 45 (1983) 133-150] and Nychka [J. Amer. Statist. Assoc. 83 (1988) 1134-1143]. We also discuss a version of the Wilks phenomenon arising from local/global likelihood ratio testing. It is also worth noting that our simultaneous confidence bands are the first ones applicable to general quasi-likelihood models. Furthermore, issues relating to optimality and efficiency are carefully addressed. As a by-product, we discover a surprising relationship between periodic and nonperiodic smoothing splines in terms of inference.
Submitted 26 November, 2013; v1 submitted 30 December, 2012;
originally announced December 2012.
-
An Application of Bayesian Variable Selection to Spatial Concurrent Linear Models
Authors:
Zuofeng Shang,
Murray K. Clayton
Abstract:
Spatial concurrent linear models, in which the model coefficients are spatial processes varying at a local level, are flexible and useful tools for analyzing spatial data. One approach places stationary Gaussian process priors on the spatial processes, but in applications the data may display strong nonstationary patterns. In this article, we propose a Bayesian variable selection approach based on wavelet tools to address this problem. The proposed approach does not involve any stationarity assumptions on the priors, and instead we impose a mixture prior directly on each wavelet coefficient. We introduce an option to control the priors such that high resolution coefficients are more likely to be zero. Computationally efficient MCMC procedures are provided to address posterior sampling, and uncertainty in the estimation is assessed through posterior means and standard deviations. Examples based on simulated data demonstrate the estimation accuracy and advantages of the proposed method. We also illustrate the performance of the proposed method for real data obtained through remote sensing.
Submitted 2 February, 2012;
originally announced February 2012.