-
Minimax Optimal Fair Classification with Bounded Demographic Disparity
Authors:
Xianli Zeng,
Guang Cheng,
Edgar Dobriban
Abstract:
Mitigating the disparate impact of statistical machine learning methods is crucial for ensuring fairness. While extensive research aims to reduce disparity, the effect of using a \emph{finite dataset} -- as opposed to the entire population -- remains unclear. This paper explores the statistical foundations of fair binary classification with two protected groups, focusing on controlling demographic disparity, defined as the difference in acceptance rates between the groups. Although fairness may come at the cost of accuracy even with infinite data, we show that using a finite sample incurs additional costs due to the need to estimate group-specific acceptance thresholds. We study the minimax optimal classification error while constraining demographic disparity to a user-specified threshold. To quantify the impact of fairness constraints, we introduce a novel measure called \emph{fairness-aware excess risk} and derive a minimax lower bound on this measure that all classifiers must satisfy. Furthermore, we propose FairBayes-DDP+, a group-wise thresholding method with an offset that we show attains the minimax lower bound. Our lower bound proofs involve several innovations. Experiments support that FairBayes-DDP+ controls disparity at the user-specified level, while being faster and having a more favorable fairness-accuracy tradeoff than several baselines.
Submitted 26 March, 2024;
originally announced March 2024.
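The abstract above defines demographic disparity as the difference in acceptance rates between two groups and refers to group-specific acceptance thresholds. The following is a minimal sketch of that definition only, not of FairBayes-DDP+ itself; the function names, scores, and thresholds are hypothetical and chosen for illustration.

```python
def acceptance_rate(scores, threshold):
    """Fraction of individuals whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def demographic_disparity(scores_a, scores_b, threshold_a, threshold_b):
    """Absolute difference in acceptance rates between two groups."""
    rate_a = acceptance_rate(scores_a, threshold_a)
    rate_b = acceptance_rate(scores_b, threshold_b)
    return abs(rate_a - rate_b)


# Hypothetical classifier scores for two protected groups (illustration only).
scores_a = [0.9, 0.8, 0.7, 0.4, 0.2]
scores_b = [0.6, 0.5, 0.3, 0.2, 0.1]

# One shared threshold: group A is accepted more often than group B.
shared = demographic_disparity(scores_a, scores_b, 0.5, 0.5)

# Group-specific thresholds: lowering group B's threshold equalizes the rates.
groupwise = demographic_disparity(scores_a, scores_b, 0.5, 0.25)

print(shared, groupwise)
```

With a finite sample, the thresholds themselves have to be estimated from data, which is the source of the additional cost the abstract describes.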
-
Approximation of RKHS Functionals by Neural Networks
Authors:
Tian-Yi Zhou,
Namjoon Suh,
Guang Cheng,
Xiaoming Huo
Abstract:
Motivated by the abundance of functional data such as time series and images, there has been a growing interest in integrating such data into neural networks and learning maps from function spaces to R (i.e., functionals). In this paper, we study the approximation of functionals on reproducing kernel Hilbert spaces (RKHS's) using neural networks. We establish the universality of the approximation of functionals on the RKHS's. Specifically, we derive explicit error bounds for those induced by inverse multiquadric, Gaussian, and Sobolev kernels. Moreover, we apply our findings to functional regression, proving that neural networks can accurately approximate the regression maps in generalized functional linear models. Existing works on functional learning require integration-type basis function expansions with a set of pre-specified basis functions. By leveraging the interpolating orthogonal projections in RKHS's, our proposed network is much simpler in that we use point evaluations to replace basis function expansions.
Submitted 18 March, 2024;
originally announced March 2024.
-
A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models
Authors:
Namjoon Suh,
Guang Cheng
Abstract:
In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics and generative models. In the first part, results on excess risks for neural networks are reviewed in the nonparametric framework of regression (and classification in Appendix B). These results rely on explicit constructions of neural networks, leading to fast convergence rates of excess risks. Nonetheless, their underlying analysis only applies to the global minimizer in the highly non-convex landscape of deep neural networks. This motivates us to review the training dynamics of neural networks in the second part. Specifically, we review papers that attempt to answer ``how the neural network trained via gradient-based methods finds the solution that can generalize well on unseen data.'' In particular, two well-known paradigms are reviewed: the Neural Tangent Kernel (NTK) paradigm, and the Mean-Field (MF) paradigm. Last but not least, we review the most recent theoretical advancements in generative models, including Generative Adversarial Networks (GANs), diffusion models, and in-context learning (ICL) in Large Language Models (LLMs), from the two perspectives reviewed previously, i.e., approximation and training dynamics.
Submitted 16 September, 2024; v1 submitted 13 January, 2024;
originally announced January 2024.
-
Legendre-Moment Transform for Linear Ensemble Control and Computation
Authors:
Xin Ning,
Gong Cheng,
Wei Zhang,
Jr-Shin Li
Abstract:
Ensemble systems, pervasive in diverse scientific and engineering domains, pose challenges to existing control methods due to their massive scale and underactuated nature. This paper presents a dynamic moment approach to addressing theoretical and computational challenges in systems-theoretic analysis and control design for linear ensemble systems. We introduce the Legendre-moments and the Legendre-moment transform, which maps an ensemble system defined on the $L^2$-space to a Legendre-moment system defined on the $\ell^2$-space. We show that this pair of systems is in one-to-one correspondence and shares the same controllability property. This equivalence allows an ensemble system to be controlled through the corresponding Legendre-moment system and inspires a unified control design scheme for linear ensemble systems using structured truncated moment systems. In particular, we develop a sampling-free ensemble control design algorithm, then conduct error analysis for control design using truncated moment systems and derive error bounds with respect to the truncation orders, which are illustrated with numerical examples.
Submitted 3 January, 2024;
originally announced January 2024.
-
Statistical Inference for Ultrahigh Dimensional Location Parameter Based on Spatial Median
Authors:
Guanghui Cheng,
Liuhua Peng,
Changliang Zou
Abstract:
Motivated by the widely used geometric median-of-means estimator in machine learning, this paper studies statistical inference for an ultrahigh-dimensional location parameter based on the sample spatial median under a general multivariate model, including the construction of simultaneous confidence intervals, global tests, and multiple testing with false discovery rate control. To achieve these goals, we derive a novel Bahadur representation of the sample spatial median with a maximum-norm bound on the remainder term, and establish a Gaussian approximation for the sample spatial median over the class of hyperrectangles. In addition, a multiplier bootstrap algorithm is proposed to approximate the distribution of the sample spatial median. The approximations are valid when the dimension diverges at an exponential rate in the sample size, which facilitates the application of the spatial median in the ultrahigh-dimensional regime. The proposed approaches are further illustrated by simulations and analysis of a genomic dataset from a microarray study.
Submitted 8 January, 2023;
originally announced January 2023.
-
Ranking Differential Privacy
Authors:
Shirong Xu,
Will Wei Sun,
Guang Cheng
Abstract:
Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly develop privacy protection on a single ranking within a set of rankings or on pairwise comparisons of a ranking under $ε$-differential privacy. This paper proposes a novel notion called $ε$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $ε$-ranking differential privacy. This allows us to develop a multistage ranking algorithm to generate synthetic rankings while satisfying the developed $ε$-ranking differential privacy. Theoretical results regarding the utility of synthetic rankings in the downstream tasks, including the inference attack and the personalized ranking tasks, are established. For the inference attack, we quantify how $ε$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.
Submitted 2 January, 2023;
originally announced January 2023.
-
Private optimization in the interpolation regime: faster rates and hardness results
Authors:
Hilal Asi,
Karan Chadha,
Gary Cheng,
John Duchi
Abstract:
In non-private stochastic convex optimization, stochastic gradient methods converge much faster on interpolation problems -- problems where there exists a solution that simultaneously minimizes all of the sample losses -- than on non-interpolating ones; we show that generally similar improvements are impossible in the private setting. However, when the functions exhibit quadratic growth around the optimum, we show (near) exponential improvements in the private sample complexity. In particular, we propose an adaptive algorithm that improves the sample complexity to achieve expected error $α$ from $\frac{d}{\varepsilon \sqrt{α}}$ to $\frac{1}{α^ρ} + \frac{d}{\varepsilon} \log\left(\frac{1}{α}\right)$ for any fixed $ρ>0$, while retaining the standard minimax-optimal sample complexity for non-interpolation problems. We prove a lower bound that shows the dimension-dependent term is tight. Furthermore, we provide a superefficiency result which demonstrates the necessity of the polynomial term for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve the minimax-optimal rates for the family of non-interpolation problems.
Submitted 31 October, 2022;
originally announced October 2022.
-
Minimax Optimal Deep Neural Network Classifiers Under Smooth Decision Boundary
Authors:
Tianyang Hu,
Ruiqi Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
Deep learning has achieved great empirical success in large-scale classification problems. In contrast, there is a lack of statistical understanding of deep learning methods, particularly from the minimax optimality perspective. For instance, in the classical smooth decision boundary setting, existing deep neural network (DNN) approaches are rate-suboptimal, and it remains elusive how to construct minimax optimal DNN classifiers. Moreover, it is interesting to explore whether DNN classifiers can circumvent the curse of dimensionality in handling high-dimensional data. The contributions of this paper are two-fold. First, based on a localized margin framework, we discover the source of suboptimality of existing DNN approaches. Motivated by this, we propose a new deep learning classifier using a divide-and-conquer technique: DNN classifiers are constructed on each local region and then aggregated into a global one. We further propose a localized version of the classical Tsybakov noise condition, under which statistical optimality of our new classifier is established. Second, we show that DNN classifiers can adapt to low-dimensional data structures and circumvent the curse of dimensionality in the sense that the minimax rate only depends on the effective dimension, potentially much smaller than the actual data dimension. Numerical experiments are conducted on simulated data to corroborate our theoretical results.
Submitted 4 July, 2022;
originally announced July 2022.
-
Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions
Authors:
Zhiying Fang,
Guang Cheng
Abstract:
Convolutional neural networks have shown impressive abilities in many applications, especially those related to classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal minimax rates (up to a log factor). The input dimension only appears in the constant of the convergence rates. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high-dimensional input.
Submitted 20 January, 2023; v1 submitted 24 February, 2022;
originally announced February 2022.
-
Decentralized Sparse Linear Regression via Gradient-Tracking: Linear Convergence and Statistical Guarantees
Authors:
Marie Maros,
Gesualdo Scutari,
Ying Sun,
Guang Cheng
Abstract:
We study sparse linear regression over a network of agents, modeled as an undirected graph with no server node. The estimation of the $s$-sparse parameter is formulated as a constrained LASSO problem wherein each agent owns a subset of the $N$ total observations. We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm under high-dimensional scaling, allowing the ambient dimension $d$ to grow with (and possibly exceed) the sample size $N$. Our theory shows that, under standard notions of restricted strong convexity and smoothness of the loss functions, and suitable conditions on the network connectivity and algorithm tuning, the distributed algorithm converges globally at a {\it linear} rate to an estimate that is within the centralized {\it statistical precision} of the model, $O(s\log d/N)$. When $s\log d/N=o(1)$, a condition necessary for statistical consistency, an $\varepsilon$-optimal solution is attained after $O(κ\log (1/\varepsilon))$ gradient computations and $O(κ/(1-ρ) \log (1/\varepsilon))$ communication rounds, where $κ$ is the restricted condition number of the loss function and $ρ$ measures the network connectivity. The computation cost matches that of the centralized projected gradient algorithm despite the data being distributed, whereas the number of communication rounds decreases as the network connectivity improves. Overall, our study reveals interesting connections between statistical efficiency, network connectivity and topology, and convergence rate in high dimensions.
Submitted 26 December, 2024; v1 submitted 20 January, 2022;
originally announced January 2022.
-
On Uniform Ensemble Controllability of Diagonalizable Linear Ensemble Systems
Authors:
Wei Miao,
Gong Cheng,
Jr-Shin Li
Abstract:
In this paper, we study uniform ensemble controllability (UEC) of linear ensemble systems defined in an infinite-dimensional space through finite-dimensional settings. Specifically, with the help of the Stone-Weierstrass theorem for modules, we provide an algebraic framework for examining UEC of linear ensemble systems with diagonalizable drift vector fields through checking the controllability of finite-dimensional subsystems in the ensemble. The new framework yields a novel concept, the ensemble controllability matrix, whose rank condition serves as a necessary and sufficient condition for UEC of linear ensembles. We provide several examples demonstrating that the proposed approach encompasses existing results and can analyze UEC of linear ensembles not addressed in the literature.
Submitted 28 December, 2021;
originally announced December 2021.
-
Federated Asymptotics: a model to compare federated learning algorithms
Authors:
Gary Cheng,
Karan Chadha,
John Duchi
Abstract:
We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforms Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.
Submitted 18 February, 2022; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning
Authors:
Pratik Ramprasad,
Yuantong Li,
Zhuoran Yang,
Zhaoran Wang,
Will Wei Sun,
Guang Cheng
Abstract:
The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.
Submitted 28 June, 2022; v1 submitted 8 August, 2021;
originally announced August 2021.
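The abstract above treats temporal difference (TD) learning as an instance of linear stochastic approximation. The following is a minimal sketch of the tabular TD(0) update only, run on a hypothetical deterministic two-state chain; it does not implement the online bootstrap, and the chain, discount factor, and step size are assumptions made for illustration.

```python
def td0_policy_evaluation(transitions, gamma, step_size, num_updates, start_state=0):
    """Tabular TD(0) on a deterministic chain.

    transitions maps each state to a (next_state, reward) pair.
    """
    values = {state: 0.0 for state in transitions}
    state = start_state
    for _ in range(num_updates):
        next_state, reward = transitions[state]
        # TD error: one-step bootstrapped target minus the current estimate.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += step_size * td_error
        state = next_state
    return values


# Hypothetical chain: state 0 moves to state 1 with reward 1,
# state 1 moves back to state 0 with reward 0.
chain = {0: (1, 1.0), 1: (0, 0.0)}
values = td0_policy_evaluation(chain, gamma=0.5, step_size=0.1, num_updates=5000)

# The Bellman equations V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0)
# give V(0) = 4/3 and V(1) = 2/3, which the estimates approach.
print(values)
```

Because this chain is deterministic, the iterates converge without noise; in the Markov-noise setting described in the abstract the estimates fluctuate, which is what motivates an inference procedure.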
-
Distributed Bootstrap for Simultaneous Inference Under High Dimensionality
Authors:
Yang Yu,
Shih-Kang Chao,
Guang Cheng
Abstract:
We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $τ_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $τ_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while being nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies and by a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available at GitHub: https://github.com/skchao74/Distributed-bootstrap.
Submitted 14 June, 2022; v1 submitted 19 February, 2021;
originally announced February 2021.
-
Accelerated, Optimal, and Parallel: Some Results on Model-Based Stochastic Optimization
Authors:
Karan Chadha,
Gary Cheng,
John C. Duchi
Abstract:
We extend the Approximate-Proximal Point (aProx) family of model-based methods for solving stochastic convex optimization problems, including stochastic subgradient, proximal point, and bundle methods, to the minibatch and accelerated setting. To do so, we propose specific model-based algorithms and an acceleration scheme for which we provide non-asymptotic convergence guarantees, which are order-optimal in all problem-dependent constants and provide linear speedup in minibatch size, while maintaining the desirable robustness traits (e.g. to stepsize) of the aProx family. Additionally, we show improved convergence rates and matching lower bounds identifying new fundamental constants for "interpolation" problems, whose importance in statistical machine learning is growing; this, for example, gives a parallelization strategy for alternating projections. We corroborate our theoretical results with empirical testing to demonstrate the gains accurate modeling, acceleration, and minibatching provide.
Submitted 7 January, 2021;
originally announced January 2021.
-
Variance Reduction on General Adaptive Stochastic Mirror Descent
Authors:
Wenjie Li,
Zhanyu Wang,
Yichen Zhang,
Guang Cheng
Abstract:
In this work, we investigate the idea of variance reduction by studying its properties with general adaptive mirror descent algorithms in nonsmooth nonconvex finite-sum optimization problems. We propose a simple yet generalized framework for variance reduced adaptive mirror descent algorithms named SVRAMD and provide its convergence analysis in both the nonsmooth nonconvex problem and the P-L conditioned problem. We prove that variance reduction reduces the SFO complexity of adaptive mirror descent algorithms and thus accelerates their convergence. In particular, our general theory implies that variance reduction can be applied to algorithms using time-varying step sizes and self-adaptive algorithms such as AdaGrad and RMSProp. Moreover, the convergence rates of SVRAMD recover the best existing rates of non-adaptive variance reduced mirror descent algorithms without complicated algorithmic components. Extensive experiments in deep learning validate our theoretical findings.
Submitted 29 August, 2021; v1 submitted 26 December, 2020;
originally announced December 2020.
-
Power Iteration for Tensor PCA
Authors:
Jiaoyang Huang,
Daniel Z. Huang,
Qing Yang,
Guang Cheng
Abstract:
In this paper, we study the power iteration algorithm for the spiked tensor model, as introduced in [44]. We give necessary and sufficient conditions for the convergence of the power iteration algorithm. When the power iteration algorithm converges, for the rank-one spiked tensor model, we show that the estimators for the spike strength and linear functionals of the signal are asymptotically Gaussian; for the multi-rank spiked tensor model, we show that the estimators are asymptotically mixtures of Gaussians. This new phenomenon is different from the spiked matrix model. Using these asymptotic results for our estimators, we construct valid and efficient confidence intervals for spike strengths and linear functionals of the signals.
Submitted 25 December, 2020;
originally announced December 2020.
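The abstract above concerns power iteration for the rank-one spiked tensor model. The following is a minimal sketch of the basic iteration for an order-3 tensor, assuming the model T = strength * v⊗v⊗v + noise with i.i.d. Gaussian noise entries; the helper names, the noise convention, and the example values are hypothetical, and the sketch does not reproduce the paper's estimators or confidence intervals.

```python
import math
import random


def normalize(x):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in x]


def spiked_tensor(v, strength, noise_scale=0.0, seed=0):
    """Order-3 tensor strength * v⊗v⊗v plus i.i.d. Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    n = len(v)
    return [[[strength * v[i] * v[j] * v[k] + noise_scale * rng.gauss(0.0, 1.0)
              for k in range(n)] for j in range(n)] for i in range(n)]


def power_iteration(T, u0, steps=20):
    """Repeat u <- T(I, u, u) / ||T(I, u, u)||, then estimate the spike strength."""
    n = len(u0)
    u = normalize(u0)
    for _ in range(steps):
        w = [sum(T[i][j][k] * u[j] * u[k] for j in range(n) for k in range(n))
             for i in range(n)]
        u = normalize(w)
    # Strength estimate: the tensor contracted with u along all three modes.
    strength_hat = sum(T[i][j][k] * u[i] * u[j] * u[k]
                       for i in range(n) for j in range(n) for k in range(n))
    return u, strength_hat


# Noiseless example: the iteration recovers the planted unit vector and strength.
v = [0.6, 0.8, 0.0]
T = spiked_tensor(v, strength=5.0, noise_scale=0.0)
u, strength_hat = power_iteration(T, u0=[1.0, 1.0, 1.0])
print(u, strength_hat)
```

Setting `noise_scale` above zero gives a noisy instance; whether the iteration then converges depends on the signal strength and the initialization, which is the question the abstract addresses.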
-
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee
Authors:
Jincheng Bai,
Qifan Song,
Guang Cheng
Abstract:
Sparse deep learning aims to address the challenge of huge storage consumption by deep neural networks, and to recover the sparse structure of target functions. Although tremendous empirical successes have been achieved, most sparse deep learning algorithms lack theoretical support. On the other hand, another line of work has proposed theoretical frameworks that are computationally infeasible. In this paper, we train sparse deep neural networks with a fully Bayesian treatment under spike-and-slab priors, and develop a set of computationally efficient variational inferences via continuous relaxation of the Bernoulli distribution. The variational posterior contraction rate is provided, which justifies the consistency of the proposed variational Bayes method. Notably, our empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of accomplishing consistent variable selection by training a sparse multi-layer neural network.
Submitted 14 November, 2020;
originally announced November 2020.
-
Nearly Optimal Variational Inference for High Dimensional Regression with Shrinkage Priors
Authors:
Jincheng Bai,
Qifan Song,
Guang Cheng
Abstract:
We propose a variational Bayesian (VB) procedure for high-dimensional linear model inferences with heavy-tailed shrinkage priors, such as the Student-t prior. Theoretically, we establish the consistency of the proposed VB method and prove that under the proper choice of prior specifications, the contraction rate of the VB posterior is nearly optimal. This justifies the validity of VB inference as an alternative to Markov Chain Monte Carlo (MCMC) sampling. Meanwhile, compared to conventional MCMC methods, the VB procedure achieves much higher computational efficiency, which greatly alleviates the computing burden for modern machine learning applications such as massive data analysis. Through numerical studies, we demonstrate that the proposed VB method leads to shorter computing time, higher estimation accuracy, and lower variable selection error than competitive sparse Bayesian methods.
Submitted 24 October, 2020;
originally announced October 2020.
-
Combinatorics-Based Approaches to Controllability Characterization for Bilinear Systems
Authors:
Gong Cheng,
Wei Zhang,
Jr-Shin Li
Abstract:
The control of bilinear systems has attracted considerable attention in the field of systems and control for decades, owing to their prevalence in diverse applications across science and engineering disciplines. Although much work has been conducted on analyzing controllability properties, the most commonly used tool remains the Lie algebra rank condition. In this paper, we develop alternative approaches based on theory and techniques in combinatorics to study controllability of bilinear systems. The core idea of our methodology is to represent vector fields of a bilinear system by permutations or graphs, so that Lie brackets are represented by permutation multiplications or graph operations, respectively. Following these representations, we derive a combinatorial characterization of controllability for bilinear systems, which consequently provides novel applications of symmetric group and graph theory to control theory. Moreover, the developed combinatorial approaches are compatible with Lie algebra decompositions, including the Cartan and non-intertwining decompositions. This compatibility enables the exploitation of representation theory for analyzing controllability, which allows us to characterize controllability properties of bilinear systems governed by semisimple and reductive Lie algebras.
Submitted 7 September, 2020;
originally announced September 2020.
-
Sparse Confidence Sets for Normal Mean Models
Authors:
Yang Ning,
Guang Cheng
Abstract:
In this paper, we propose a new framework to construct confidence sets for a $d$-dimensional unknown sparse parameter $θ$ under the normal mean model $X\sim N(θ,σ^2I)$. A key feature of the proposed confidence set is its capability to account for the sparsity of $θ$, hence the name {\em sparse} confidence set. This is in sharp contrast with the classical methods, such as Bonferroni confidence intervals and other resampling-based procedures, where the sparsity of $θ$ is often ignored. Specifically, we require the desired sparse confidence set to satisfy the following two conditions: (i) uniformly over the parameter space, the coverage probability for $θ$ is above a pre-specified level; (ii) there exists a random subset $S$ of $\{1,...,d\}$ such that $S$ guarantees the pre-specified true negative rate (TNR) for detecting nonzero $θ_j$'s. To exploit the sparsity of $θ$, we define the confidence interval for $θ_j$ to degenerate to the single point 0 for any $j\notin S$. Under this new framework, we first consider whether there exist sparse confidence sets that satisfy the above two conditions. To address this question, we establish a non-asymptotic minimax lower bound for the non-coverage probability over a suitable class of sparse confidence sets. The lower bound deciphers the role of sparsity and minimum signal-to-noise ratio (SNR) in the construction of sparse confidence sets. Furthermore, under suitable conditions on the SNR, a two-stage procedure is proposed to construct a sparse confidence set. To evaluate the optimality, the proposed sparse confidence set is shown to attain a minimax lower bound of some properly defined risk function up to a constant factor. Finally, we develop a procedure that adapts to the unknown sparsity and SNR. Numerical studies are conducted to verify the theoretical results.
Submitted 17 August, 2020;
originally announced August 2020.
-
A Gaussian version of Littlewood's theorem on random power series
Authors:
Guozheng Cheng,
Xiang Fang,
Kunyu Guo,
Chao Liu
Abstract:
We prove a Littlewood-type theorem on random analytic functions for not necessarily independent Gaussian processes. We show that if we randomize a function in the Hardy space $H^2(\mathbb{D})$ by a Gaussian process whose covariance matrix $K$ induces a bounded operator on $l^2$, then the resulting random function is almost surely in $H^p(\mathbb{D})$ for any $p>0$. The case $K=\text{Id}$, the identity operator, recovers Littlewood's theorem. A new ingredient in our proof is to recast the membership problem as the boundedness of an operator. This reformulation enables us to use tools in functional analysis and is applicable to other situations. The sharpness of the new condition and several ramifications are discussed.
Submitted 26 March, 2021; v1 submitted 13 July, 2020;
originally announced July 2020.
-
On Deep Instrumental Variables Estimate
Authors:
Ruiqi Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
The endogeneity issue is fundamentally important as many empirical applications may suffer from the omission of explanatory variables, measurement error, or simultaneous causality. Recently, Hartford et al. (2017) proposed a "Deep Instrumental Variable (IV)" framework based on deep neural networks to address endogeneity, demonstrating performance superior to existing approaches. The aim of this paper is to theoretically understand the empirical success of the Deep IV. Specifically, we consider a two-stage estimator using deep neural networks in the linear instrumental variables model. By imposing a latent structural assumption on the reduced form equation between endogenous variables and instrumental variables, the first-stage estimator can automatically capture this latent structure and converge to the optimal instruments at the minimax optimal rate, which is free of the dimension of instrumental variables and thus mitigates the curse of dimensionality. Additionally, in comparison with classical methods, due to the faster convergence rate of the first-stage estimator, the second-stage estimator has a smaller (second order) estimation error and requires a weaker condition on the smoothness of the optimal instruments. Given that the depth and width of the employed deep neural network are well chosen, we further show that the second-stage estimator achieves the semiparametric efficiency bound. Simulation studies on synthetic data and application to automobile market data confirm our theory.
Submitted 30 April, 2020;
originally announced April 2020.
-
Low rank tensor completion with sparse regularization in a transformed domain
Authors:
Ping-Ping Wang,
Liang Li,
Guang-Hui Cheng
Abstract:
Tensor completion is a challenging problem with various applications. Many related models based on the low-rank prior of the tensor have been proposed. However, the low-rank prior may not be enough to recover the original tensor from the observed incomplete tensor. In this paper, we propose a tensor completion method that exploits both the low-rank and sparse priors of the tensor. Specifically, the tensor completion task can be formulated as a low-rank minimization problem with a sparse regularizer. The low-rank property is captured by the tensor truncated nuclear norm based on tensor singular value decomposition (T-SVD), which is a better approximation of the tensor tubal rank than the tensor nuclear norm. The sparse regularizer is imposed by an $\ell_{1}$-norm in a discrete cosine transformation (DCT) domain, which can better exploit the local sparse property of the completed data. To solve the optimization problem, we employ an alternating direction method of multipliers (ADMM) in which we only need to solve several subproblems that have closed-form solutions. Substantial experiments on real-world images and videos show that the proposed method performs better than the existing state-of-the-art methods.
Submitted 18 November, 2019;
originally announced November 2019.
-
Adaptive Variational Bayesian Inference for Sparse Deep Neural Network
Authors:
Jincheng Bai,
Qifan Song,
Guang Cheng
Abstract:
In this work, we focus on variational Bayesian inference on the sparse Deep Neural Network (DNN) modeled under a class of spike-and-slab priors. Given a pre-specified sparse DNN structure, the corresponding variational posterior contraction rate is characterized, revealing a trade-off between the variational error and the approximation error, which are both determined by the network structural complexity (i.e., depth, width and sparsity). However, the optimal network structure, which strikes the balance of the aforementioned trade-off and yields the best rate, is generally unknown in reality. Therefore, our work further develops an {\em adaptive} variational inference procedure that can automatically select a reasonably good (data-dependent) network structure that achieves the best contraction rate, without knowing the optimal network structure. In particular, when the true function is Hölder smooth, the adaptive variational inference is capable of attaining the (near-)optimal rate without knowledge of the smoothness level. The above rate still suffers from the curse of dimensionality, and thus motivates the teacher-student setup, i.e., the true function is a sparse DNN model, under which the rate depends only logarithmically on the input dimension.
Submitted 2 August, 2020; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Rates of Convergence for Large-scale Nearest Neighbor Classification
Authors:
Xingye Qiao,
Jiexin Duan,
Guang Cheng
Abstract:
Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. We name this method the big Nearest Neighbor (bigNN) classifier, and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability, which are proven to be the same rates as the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time that is required for achieving the optimal rate, we also consider the pre-training acceleration technique applied to the bigNN method, with proven convergence rate. We find that in the distributed setting, the optimal choice of the neighbor $k$ should scale with both the total sample size and the number of partitions, and there is a theoretical upper limit for the latter. Numerical studies have verified the theoretical findings.
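The divide-and-conquer scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `knn_predict` and `bignn_predict`, the random near-equal partitioning, the binary labels in {0, 1}, and the tie-breaking toward class 0 are all assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Plain k-nearest-neighbor vote on one subsample (labels assumed in {0, 1})."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for each test point.
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    # Majority vote among the k neighbors; ties go to class 0 (an assumption).
    return (y_train[nn_idx].mean(axis=1) > 0.5).astype(int)

def bignn_predict(X, y, X_test, n_partitions, k, seed=0):
    """Hypothetical sketch: split the data, run k-NN per subsample, then majority-vote."""
    rng = np.random.default_rng(seed)
    # Randomly divide the sample indices into near-equal subsamples.
    parts = np.array_split(rng.permutation(len(X)), n_partitions)
    # One row of local predictions per subsample, shape (n_partitions, n_test).
    local = np.stack([knn_predict(X[idx], y[idx], X_test, k) for idx in parts])
    # Aggregate the local predictions by majority voting; ties go to class 0.
    return (local.mean(axis=0) > 0.5).astype(int)
```

Using an odd `k` and an odd `n_partitions` avoids ties in both voting stages. In a genuinely distributed setting each call to `knn_predict` would run on the machine holding that subsample, and only the local predictions would be communicated.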
Submitted 30 October, 2019; v1 submitted 3 September, 2019;
originally announced September 2019.
-
Enhancing Multi-model Inference with Natural Selection
Authors:
Ching-Wei Cheng,
Guang Cheng
Abstract:
Multi-model inference covers a wide range of modern statistical applications such as variable selection, model confidence sets, model averaging and variable importance. The performance of multi-model inference depends on the availability of candidate models, whose quality has rarely been studied in the literature. In this paper, we study the genetic algorithm (GA) as a means of obtaining high-quality candidate models. Inspired by the process of natural selection, GA iteratively performs genetic operations such as selection, crossover and mutation to update a collection of potential solutions (models) until convergence. The convergence properties are studied based on Markov chain theory and used to design an adaptive termination criterion that vastly reduces the computational cost. In addition, a new schema theory is established to characterize how the current model set is improved through the evolutionary process. Extensive numerical experiments are carried out to verify our theory and demonstrate the empirical power of GA, and new findings are obtained for two real data examples.
Submitted 5 June, 2019;
originally announced June 2019.
-
Optimal False Discovery Control of Minimax Estimator
Authors:
Qifan Song,
Guang Cheng
Abstract:
Two major research tasks lie at the heart of high dimensional data analysis: accurate parameter estimation and correct support recovery. The existing literature mostly aims for either the best parameter estimation or the best model selection result; however, little has been done to understand the potential interaction between the estimation precision and the selection behavior. In this work, our minimax result shows that an estimator's type I error control is directly linked to its $L_2$ estimation error rate, and reveals a trade-off between the rate of convergence and the false discovery control: to achieve better accuracy, one risks yielding more false discoveries. In particular, we characterize the false discovery control behavior of rate optimal and rate suboptimal estimators under different sparsity regimes, and discover a rigid dichotomy between these two estimators under near-linear and linear sparsity settings. In addition, this work provides a rigorous explanation of the incompatibility between selection consistency and rate minimaxity which has been frequently observed in the high dimensional literature.
Submitted 23 June, 2022; v1 submitted 24 December, 2018;
originally announced December 2018.
-
Distributed Nearest Neighbor Classification
Authors:
Jiexin Duan,
Xingye Qiao,
Guang Cheng
Abstract:
Nearest neighbor is a popular nonparametric method for classification and regression with many appealing properties. In the big data era, the sheer volume and spatial/temporal disparity of big data may prohibit centrally processing and storing the data. This imposes a considerable hurdle for nearest neighbor predictions since the entire training data must be memorized. One effective way to overcome this issue is the distributed learning framework. Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of both the regret and instability, up to a multiplicative constant that depends solely on the data dimension. The multiplicative difference can be eliminated by replacing majority voting with the weighted voting scheme. In addition, we provide sharp theoretical upper bounds on the number of subsamples in order for the distributed nearest neighbor classifier to reach the optimal convergence rate. It is interesting to note that the weighted voting scheme allows a larger number of subsamples than the majority voting one. Our findings are supported by numerical studies using both simulated and real data sets.
Submitted 12 December, 2018;
originally announced December 2018.
-
Finite Time Analysis of Vector Autoregressive Models under Linear Restrictions
Authors:
Yao Zheng,
Guang Cheng
Abstract:
This paper develops a unified finite-time theory for the ordinary least squares estimation of possibly unstable and even slightly explosive vector autoregressive models under linear restrictions, with the applicable region $ρ(A)\leq 1+c/n$, where $ρ(A)$ is the spectral radius of the transition matrix $A$ in the VAR(1) representation, $n$ is the time horizon and $c>0$ is a universal constant. The linear restriction framework encompasses various existing models such as banded/network vector autoregressive models. We show that the restrictions reduce the error bounds via not only the reduced dimensionality but also a scale factor resembling the asymptotic covariance matrix of the estimator in the fixed-dimensional setup: as long as the model is correctly specified, this scale factor is decreasing in the number of restrictions. It is revealed that the phase transition from slow to fast error rate regimes is determined by the smallest singular value of $A$, a measure of the least excitable mode of the system. The minimax lower bounds are derived across different regimes. The developed non-asymptotic theory not only bridges the theoretical gap between stable and unstable regimes but precisely characterizes the effect of restrictions and its interplay with model parameters. Simulations support our theoretical results.
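As a rough illustration of the applicable region $ρ(A)\leq 1+c/n$, the sketch below computes the spectral radius of a transition matrix and checks the inequality. The abstract does not state the value of the universal constant $c$, so the default `c=1.0` is only a placeholder, and the helper names `spectral_radius` and `in_applicable_region` are hypothetical.

```python
import numpy as np

def spectral_radius(A):
    """Largest modulus among the eigenvalues of the square matrix A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))

def in_applicable_region(A, n, c=1.0):
    """Check rho(A) <= 1 + c/n; c=1.0 is a placeholder, not a value from the paper."""
    return spectral_radius(A) <= 1.0 + c / n

# A unit-root (unstable) example and a clearly explosive one, with horizon n = 100.
A_unit_root = np.diag([1.0, 0.5])
A_explosive = np.diag([1.2, 0.1])
print(in_applicable_region(A_unit_root, n=100))
print(in_applicable_region(A_explosive, n=100))
```

The region shrinks toward the unit circle as the time horizon $n$ grows, which is why only "slightly explosive" models are covered.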
Submitted 18 May, 2020; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Random weighted shifts
Authors:
Guozheng Cheng,
Xiang Fang,
Sen Zhu
Abstract:
In this paper we initiate the study of a fundamental yet untapped random model of non-selfadjoint, bounded linear operators acting on a separable complex Hilbert space. We replace the weights $w_n=1$ in the classical unilateral shift $T$, defined as $Te_n=w_ne_{n+1}$, where $\{e_n\}_{n=1}^\infty$ form an orthonormal basis of a complex Hilbert space, by a sequence of i.i.d. random variables $\{X_n\}_{n=1}^{\infty}$; that is, $w_n=X_n$. This paper answers basic questions concerning such a model. We propose that this model can be studied in comparison with the classical Hardy/Bergman/Dirichlet spaces in function-theoretic operator theory.
We calculate the spectra and determine their fine structures (Section 3). We classify the samples up to four equivalence relationships (Section 4). We introduce a family of random Hardy spaces and determine the growth rate of the coefficients of analytic functions in these spaces (Section 5). We compare them with three types of classical operators (Section 6); this is achieved in the form of generalized von Neumann inequalities. The invariant subspaces are shown to admit arbitrarily large indices and their semi-invariant subspaces model arbitrary contractions almost surely. We discuss a Beurling-type theorem (Section 7). We determine various non-selfadjoint algebras generated by $T$ (Section 8). Their dynamical properties are clarified (Section 9). Their iterated Aluthge transforms are shown to converge (Section 10).
In summary, they provide a new random model from the viewpoint of probability theory, and they provide a new class of analytic functional Hilbert spaces from the viewpoint of operator theory. The technical novelty in this paper is that the methodology used draws from three (largely separate) sources: probability theory, functional Hilbert spaces, and the approximation theory of bounded operators.
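The definition $Te_n=w_ne_{n+1}$ with $w_n=X_n$ can be made concrete with a finite truncation. The sketch below is only an illustration: the operator in the abstract acts on an infinite-dimensional Hilbert space, so the matrix here is a finite section (its last basis vector is sent to zero), and the uniform distribution on $[0,1]$ for the weights and the helper name `truncated_weighted_shift` are assumptions.

```python
import numpy as np

def truncated_weighted_shift(weights):
    """Finite section of a weighted shift.

    For weights w_1..w_N, returns the (N+1) x (N+1) matrix with the weights
    on the subdiagonal, so that column n maps e_n to w_n * e_{n+1}.
    """
    return np.diag(np.asarray(weights, dtype=float), k=-1)

rng = np.random.default_rng(0)
# Hypothetical choice: bounded i.i.d. weights X_n drawn uniformly from [0, 1].
X = rng.uniform(0.0, 1.0, size=5)
T = truncated_weighted_shift(X)

e1 = np.zeros(6)
e1[0] = 1.0
# T e_1 has X_1 in its second coordinate and zeros elsewhere.
print(T @ e1)
```

Setting all weights to 1 recovers a finite section of the classical unilateral shift.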
Submitted 14 November, 2018;
originally announced November 2018.
-
High Dimensional Robust Inference for Cox Regression Models
Authors:
Shengchun Kong,
Zhuqing Yu,
Xianyang Zhang,
Guang Cheng
Abstract:
We consider high-dimensional inference for potentially misspecified Cox proportional hazard models based on low dimensional results by Lin and Wei [1989]. A de-sparsified Lasso estimator is proposed based on the log partial likelihood function and shown to converge to a pseudo-true parameter vector. Interestingly, the sparsity of the true parameter can be inferred from that of the above limiting parameter. Moreover, each component of the above (non-sparse) estimator is shown to be asymptotically normal with a variance that can be consistently estimated even under model misspecifications. In some cases, this asymptotic distribution leads to valid statistical inference procedures, whose empirical performances are illustrated through numerical examples.
Submitted 1 November, 2018;
originally announced November 2018.
-
Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares
Authors:
Xiao Guo,
Guang Cheng
Abstract:
Statistical inferences for quadratic functionals of linear regression parameter have found wide applications including signal detection, global testing, inferences of error variance and fraction of variance explained. Classical theory based on ordinary least squares estimator works perfectly in the low-dimensional regime, but fails when the parameter dimension $p_n$ grows proportionally to the sample size $n$. In some cases, its performance is not satisfactory even when $n\ge 5p_n$.
The main contribution of this paper is to develop {\em dimension-adaptive} inferences for quadratic functionals when $\lim_{n\to \infty} p_n/n=τ\in[0,1)$. We propose a bias-and-variance-corrected test statistic and demonstrate that its theoretical validity (such as consistency and asymptotic normality) is adaptive to both low dimension with $τ= 0$ and moderate dimension with $τ\in(0, 1)$. Our general theory holds, in particular, without Gaussian design/error or structural parameter assumption, and applies to a broad class of quadratic functionals covering all aforementioned applications. As a by-product, we find that the classical fixed-dimensional results continue to hold {\em if and only if} the signal-to-noise ratio is large enough, say when $p_n$ diverges but slower than $n$. Extensive numerical results demonstrate the satisfactory performance of the proposed methodology even when $p_n\ge 0.9n$ in some extreme cases. The mathematical arguments are based on the random matrix theory and leave-one-observation-out method.
Submitted 15 June, 2020; v1 submitted 2 October, 2018;
originally announced October 2018.
-
Statistically and Computationally Efficient Variance Estimator for Kernel Ridge Regression
Authors:
Meimei Liu,
Jean Honorio,
Guang Cheng
Abstract:
In this paper, we propose a random projection approach to estimate variance in kernel ridge regression. Our approach leads to a consistent estimator of the true variance, while being computationally more efficient. Our variance estimator is optimal for a large family of kernels, including cubic splines and Gaussian kernels. Simulation analysis is conducted to support our theory.
Submitted 17 September, 2018;
originally announced September 2018.
-
Quadratic Discriminant Analysis under Moderate Dimension
Authors:
Qing Yang,
Guang Cheng
Abstract:
Quadratic discriminant analysis (QDA) is a simple method to classify a subject into two populations, and was proven to perform as well as the Bayes rule when the data dimension p is fixed. The main purpose of this paper is to examine the empirical and theoretical behaviors of QDA where p grows proportionally to the sample sizes without imposing any structural assumption on the parameters. The first finding in this moderate dimension regime is that QDA can perform as poorly as random guessing even when the two populations deviate significantly. This motivates a generalized version of QDA that automatically adapts to dimensionality. Under a finite fourth moment condition, we derive misclassification rates for both the generalized QDA and the optimal one. A direct comparison reveals one "easy" case where the difference between two rates converges to zero and one "hard" case where that converges to some strictly positive constant. For the latter, a divide-and-conquer approach over dimension (rather than sample) followed by a screening procedure is proposed to narrow the gap. Various numerical studies are conducted to back up the proposed methodology.
Submitted 29 August, 2018;
originally announced August 2018.
-
Early Stopping for Nonparametric Testing
Authors:
Meimei Liu,
Guang Cheng
Abstract:
Early stopping of iterative algorithms is an algorithmic regularization method to avoid over-fitting in estimation and classification. In this paper, we show that early stopping can also be applied to obtain the minimax optimal testing in a general non-parametric setup. Specifically, a Wald-type test statistic is obtained based on an iterated estimate produced by functional gradient descent algorithms in a reproducing kernel Hilbert space. A notable contribution is to establish a "sharp" stopping rule: when the number of iterations achieves an optimal order, testing optimality is achievable; otherwise, testing optimality becomes impossible. As a by-product, a similar sharpness result is also derived for minimax optimal estimation under early stopping studied in [11] and [19]. All obtained results hold for various kernel classes, including Sobolev smoothness classes and Gaussian kernel classes.
Submitted 17 September, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
-
How Many Machines Can We Use in Parallel Computing for Kernel Ridge Regression?
Authors:
Meimei Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
This paper aims to solve a basic problem in distributed statistical inference: how many machines can we use in parallel computing? In kernel ridge regression, we address this question in two important settings: nonparametric estimation and hypothesis testing. Specifically, we find a range for the number of machines under which optimal estimation/testing is achievable. The employed empirical processes method provides a unified framework that allows us to handle various regression problems (such as thin-plate splines and nonparametric additive regression) under different settings (such as univariate, multivariate and diverging-dimensional designs). It is worth noting that the upper bounds on the number of machines are proven to be unimprovable (up to a logarithmic factor) in two important cases: smoothing spline regression and Gaussian RKHS regression. Our theoretical findings are backed by thorough numerical studies.
Submitted 23 February, 2019; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Three Remarks on Carleson Measures for Dirichlet Space
Authors:
Guozheng Cheng,
Xiang Fang,
Zipeng Wang,
Jiayang Yu
Abstract:
In this paper, we prove that all doubling measures on the unit disk $\mathbb{D}$ are Carleson measures for the standard Dirichlet space $\mathcal{D}$. The proof has three ingredients. The first one is a characterization of Carleson measures which holds true for general reproducing kernel Hilbert spaces. The second one is another new equivalent condition for Carleson measures, which holds true only for the standard Dirichlet space. The third one is an application of the dyadic method to our settings.
Submitted 22 April, 2018;
originally announced April 2018.
-
Nonparametric Testing under Random Projection
Authors:
Meimei Liu,
Zuofeng Shang,
Guang Cheng
Abstract:
A common challenge in nonparametric inference is its high computational complexity when data volume is large. In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. In the specific kernel ridge regression setup, a simple distance-based test statistic is proposed. Notably, we derive the minimum number of random projections that is sufficient for achieving testing optimality in terms of the minimax rate. An adaptive testing procedure is further established without prior knowledge of regularity. One technical contribution is to establish upper bounds for a range of tail sums of empirical kernel eigenvalues. Simulations and real data analysis are conducted to support our theory.
Submitted 17 February, 2018;
originally announced February 2018.
-
Sparse and Low-rank Tensor Estimation via Cubic Sketchings
Authors:
Botao Hao,
Anru Zhang,
Guang Cheng
Abstract:
In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.
Submitted 14 March, 2020; v1 submitted 28 January, 2018;
originally announced January 2018.
-
Skew-symmetric Nitsche's formulation in isogeometric analysis: Dirichlet and symmetry conditions, patch coupling and frictionless contact
Authors:
Qingyuan Hu,
Franz Chouly,
Ping Hu,
Gengdong Cheng,
Stéphane Pierre Alain Bordas
Abstract:
A simple skew-symmetric Nitsche's formulation is introduced into the framework of isogeometric analysis (IGA) to deal with various problems in small strain elasticity: essential boundary conditions, symmetry conditions for Kirchhoff plates, patch coupling in statics and in modal analysis as well as Signorini contact conditions. For linear boundary or interface conditions, the skew-symmetric formulation is parameter-free. For contact conditions, it remains stable and accurate for a wide range of the stabilization parameter. Several numerical tests are performed to illustrate its accuracy, stability and convergence performance. We investigate particularly the effects introduced by Nitsche's coupling, including the convergence performance and condition numbers in statics as well as the extra "outlier" frequencies and corresponding eigenmodes in structural dynamics. We present the Hertz test, the block test, and a 3D self-contact example showing that the skew-symmetric Nitsche's formulation is a suitable approach to simulate contact problems in IGA.
Submitted 27 April, 2018; v1 submitted 28 November, 2017;
originally announced November 2017.
-
Anisotropic Radial Basis Function Methods for Continental Size Ice Sheet Simulations
Authors:
Gong Cheng,
Victor Shcherbakov
Abstract:
In this paper we develop and implement anisotropic radial basis function methods for simulating the dynamics of ice sheets and glaciers. We test the methods on two problems: the well-known benchmark ISMIP-HOM B that corresponds to a glacier size ice and a synthetic ice sheet whose geometry is inspired by the EISMINT benchmark that corresponds to a continental size ice sheet. We illustrate the advantages of the radial basis function methods over a standard finite element method. We also show how the use of anisotropic radial basis functions allows for accurate simulation of the velocities on a large ice sheet, which was not possible with standard isotropic radial basis function methods due to a large aspect ratio between the ice length and the ice thickness. Additionally, we implement a partition of unity method in order to improve the computational efficiency of the radial basis function methods.
Submitted 27 November, 2017;
originally announced November 2017.
-
High Dimensional Inference in Partially Linear Models
Authors:
Ying Zhu,
Zhuqing Yu,
Guang Cheng
Abstract:
We propose two semiparametric versions of the debiased Lasso procedure for the model $Y_i = X_iβ_0 + g_0(Z_i) + ε_i$, where $β_0$ is high dimensional but sparse (exactly or approximately). Both versions are shown to have the same asymptotic normal distribution and do not require the minimal signal condition for statistical inference of any component in $β_0$. Our method also works when $Z_i$ is high dimensional provided that the function classes to which the $E(X_{ij} |Z_i)$'s and $E(Y_i|Z_i)$ belong exhibit certain sparsity features, e.g., a sparse additive decomposition structure. We further develop a simultaneous hypothesis testing procedure based on the multiplier bootstrap. Our testing method automatically takes into account the dependence structure within the debiased estimates, and allows the number of tested components to be exponentially high.
Submitted 8 August, 2017;
originally announced August 2017.
-
Solving General Joint Block Diagonalization Problem via Linearly Independent Eigenvectors of a Matrix Polynomial
Authors:
Yunfeng Cai,
Guanghui Cheng,
Decai Shi
Abstract:
In this paper, we consider the exact/approximate general joint block diagonalization (GJBD) problem of a matrix set $\{A_i\}_{i=0}^p$ ($p\ge 1$), where a nonsingular matrix $W$ (often referred to as the diagonalizer) needs to be found such that the matrices $W^{H}A_iW$ are all exactly/approximately block diagonal with as many diagonal blocks as possible. We show that the diagonalizer of the exact GJBD problem can be given by $W=[x_1, x_2, \dots, x_n]Π$, where $Π$ is a permutation matrix and the $x_i$'s are eigenvectors of the matrix polynomial $P(λ)=\sum_{i=0}^pλ^i A_i$ such that $[x_1, x_2, \dots, x_n]$ is nonsingular and the geometric multiplicity of each $λ_i$ corresponding to $x_i$ equals one. The equivalence of all solutions to the exact GJBD problem is also established. Moreover, a theoretical proof is given to show why the approximate GJBD problem can be solved similarly to the exact GJBD problem. Based on the theoretical results, a three-stage method is proposed, and numerical results show the merits of the method.
Submitted 19 April, 2017;
originally announced April 2017.
-
Nonparametric Inference via Bootstrapping the Debiased Estimator
Authors:
Gang Cheng,
Yen-Chi Chen
Abstract:
In this paper, we propose to construct confidence bands by bootstrapping the debiased kernel density estimator (for density estimation) and the debiased local polynomial regression estimator (for regression analysis). The idea of using a debiased estimator was recently employed by Calonico et al. (2018b) to construct a confidence interval of the density function (and regression function) at a given point by explicitly estimating stochastic variations. We extend their ideas of using the debiased estimator and further propose a bootstrap approach for constructing simultaneous confidence bands. This modified method has the advantage that we can easily choose the smoothing bandwidth from conventional bandwidth selectors and the confidence band will be asymptotically valid. We prove the validity of the bootstrap confidence band and generalize it to density level sets and inverse regression problems. Simulation studies confirm the validity of the proposed confidence bands/sets. We apply our approach to an astronomy dataset to show its applicability.
Submitted 4 June, 2019; v1 submitted 22 February, 2017;
originally announced February 2017.
-
Non-asymptotic theory for nonparametric testing
Authors:
Yun Yang,
Zuofeng Shang,
Guang Cheng
Abstract:
We consider nonparametric testing in a non-asymptotic framework. Our statistical guarantees are exact in the sense that Type I and II errors are controlled for any finite sample size. Meanwhile, one proposed test is shown to achieve minimax optimality in the asymptotic sense. An important consequence of this non-asymptotic theory is a new and practically useful formula for selecting the optimal smoothing parameter in nonparametric testing. The leading example in this paper is smoothing spline models under Gaussian errors. The results obtained therein can be further generalized to the kernel ridge regression framework under possibly non-Gaussian errors. Simulations demonstrate that our proposed test improves over the conventional asymptotic test when sample size is small to moderate.
Submitted 4 February, 2017;
originally announced February 2017.
-
Distributed inference for quantile regression processes
Authors:
Stanislav Volgushev,
Shih-Kang Chao,
Guang Cheng
Abstract:
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure: (i) estimate conditional quantile functions at different levels in a parallel computing environment; (ii) construct a conditional quantile regression process through projection based on these estimated quantile curves. Our general quantile regression framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. As an important application, the statistical inference on conditional distribution functions is considered. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. Those approaches directly utilize the availability of estimators from sub-samples and can be carried out at almost no additional computational cost. Simulations confirm our statistical inferential theory.
Submitted 10 April, 2018; v1 submitted 21 January, 2017;
originally announced January 2017.
-
Minimax Optimal Estimation in Partially Linear Additive Models under High Dimension
Authors:
Zhuqing Yu,
Michael Levine,
Guang Cheng
Abstract:
In this paper, we derive minimax rates for estimating both parametric and nonparametric components in partially linear additive models with high dimensional sparse vectors and smooth functional components. The minimax lower bound for Euclidean components is the typical sparse estimation rate that is independent of nonparametric smoothness indices. However, the minimax lower bound for each component function exhibits an interplay between the dimensionality and sparsity of the parametric component and the smoothness of the relevant nonparametric component. Indeed, the minimax risk for smooth nonparametric estimation can be slowed down to the sparse estimation rate whenever the smoothness of the nonparametric component or dimensionality of the parametric component is sufficiently large. In the above setting, we demonstrate that penalized least squares estimators can nearly achieve minimax lower bounds.
Submitted 13 January, 2018; v1 submitted 18 December, 2016;
originally announced December 2016.
-
Simultaneous Clustering and Estimation of Heterogeneous Graphical Models
Authors:
Botao Hao,
Will Wei Sun,
Yufeng Liu,
Guang Cheng
Abstract:
We consider joint estimation of multiple graphical models arising from heterogeneous and high-dimensional observations. Unlike most previous approaches, which assume that the cluster structure is given in advance, an appealing feature of our method is that it learns the cluster structure while estimating heterogeneous graphical models. This is achieved via a high dimensional version of the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993). A joint graphical lasso penalty is imposed on the conditional maximization step to extract both homogeneity and heterogeneity components across all clusters. Our algorithm is computationally efficient due to fast sparse learning routines and can be implemented without unsupervised learning knowledge. The superior performance of our method is demonstrated by extensive experiments, and its application to a Glioblastoma cancer dataset reveals some new insights into Glioblastoma cancer. In theory, a non-asymptotic error bound is established for the output directly from our high dimensional ECM algorithm, and it consists of two quantities: statistical error (statistical accuracy) and optimization error (computational complexity). Such a result gives a theoretical guideline for terminating our ECM iterations.
Submitted 12 January, 2018; v1 submitted 28 November, 2016;
originally announced November 2016.
-
Embracing the Blessing of Dimensionality in Factor Models
Authors:
Quefeng Li,
Guang Cheng,
Jianqing Fan,
Yuyan Wang
Abstract:
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data is often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this paper, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of utilizing data as much as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data.
Submitted 24 October, 2016;
originally announced October 2016.