-
Valid Post-selection Inference in Assumption-lean Linear Regression
Authors:
Arun Kumar Kuchibhotla,
Lawrence D. Brown,
Andreas Buja,
Edward I. George,
Linda Zhao
Abstract:
Construction of valid statistical inference for estimators based on data-driven selection has received a lot of attention in recent times. Berk et al. (2013) is possibly the first work to provide valid inference for Gaussian homoscedastic linear regression with fixed covariates under arbitrary covariate/variable selection. The setting is unrealistic and is extended by Bachoc et al. (2016) by relaxing the distributional assumptions. A major drawback of the aforementioned works is that the construction of valid confidence regions is computationally intensive. In this paper, we first prove that post-selection inference is equivalent to simultaneous inference and then construct valid post-selection confidence regions which are computationally simple. Our construction is based on deterministic inequalities and applies to independent as well as dependent random variables without the requirement of correct distributional assumptions. Finally, we compare the volume of our confidence regions with the existing ones and show that under non-stochastic covariates, our regions are much smaller.
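The equivalence between post-selection and simultaneous inference mentioned in the abstract can be written schematically as below. This is a sketch in notation assumed here rather than taken from the paper: $\mathcal{M}$ is a finite collection of candidate submodels, $\theta_M$ is the target for submodel $M$, $\hat{R}_M$ is a confidence region for it, and $\hat{M}$ ranges over all data-driven selections with values in $\mathcal{M}$.

```latex
\inf_{\hat{M}} \mathbb{P}\bigl(\theta_{\hat{M}} \in \hat{R}_{\hat{M}}\bigr) \ge 1 - \alpha
\quad\Longleftrightarrow\quad
\mathbb{P}\Bigl(\bigcap_{M \in \mathcal{M}} \bigl\{\theta_M \in \hat{R}_M\bigr\}\Bigr) \ge 1 - \alpha .
```

The right-to-left direction is immediate. For the converse, take $\hat{M}$ to be a submodel whose region misses its target whenever such a submodel exists; its coverage probability then equals the simultaneous coverage probability.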
Submitted 11 June, 2018;
originally announced June 2018.
-
Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression
Authors:
Arun Kumar Kuchibhotla,
Abhishek Chakrabortty
Abstract:
Concentration inequalities form an essential toolkit in the study of high dimensional (HD) statistical methods. Most of the relevant statistics literature in this regard is based on sub-Gaussian or sub-exponential tail assumptions. In this paper, we first bring together various probabilistic inequalities for sums of independent random variables under much more general exponential type (namely sub-Weibull) tail assumptions. These results extract a partly sub-Gaussian tail behavior in finite samples, matching the asymptotics governed by the central limit theorem, and are compactly represented in terms of a new Orlicz quasi-norm - the Generalized Bernstein-Orlicz norm - that typifies such tail behaviors.
We illustrate the usefulness of these inequalities through the analysis of four fundamental problems in HD statistics. In the first two problems, we study the rate of convergence of the sample covariance matrix in terms of the maximum elementwise norm and the maximum k-sub-matrix operator norm which are key quantities of interest in bootstrap, HD covariance matrix estimation and HD inference. The third example concerns the restricted eigenvalue condition, required in HD linear regression, which we verify for all sub-Weibull random vectors through a unified analysis, and also prove a more general result related to restricted strong convexity in the process. In the final example, we consider the Lasso estimator for linear regression and establish its rate of convergence under much weaker than usual tail assumptions (on the errors as well as the covariates), while also allowing for misspecified models and both fixed and random design. To our knowledge, these are the first such results for Lasso obtained in this generality. The common feature in all our results over all the examples is that the convergence rates under most exponential tails match the usual ones under sub-Gaussian assumptions.
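For orientation, the sub-Weibull tail assumption is commonly formalized through an Orlicz (quasi-)norm. The display below is a sketch of that standard definition; the symbols $\psi_\alpha$ and $\eta$ are notation assumed here, not quoted from the abstract.

```latex
\psi_\alpha(x) = \exp(x^\alpha) - 1, \qquad
\|X\|_{\psi_\alpha} = \inf\bigl\{\eta > 0 : \mathbb{E}\,\psi_\alpha\bigl(|X|/\eta\bigr) \le 1\bigr\},
\qquad
\mathbb{P}\bigl(|X| \ge t\bigr) \le 2\exp\bigl(-t^\alpha / \|X\|_{\psi_\alpha}^\alpha\bigr)
\quad \text{for all } t \ge 0 .
```

A random variable is called sub-Weibull of order $\alpha > 0$ when $\|X\|_{\psi_\alpha} < \infty$. The case $\alpha = 2$ recovers sub-Gaussian and $\alpha = 1$ sub-exponential variables, while $\alpha < 1$ allows heavier tails, for which $\|\cdot\|_{\psi_\alpha}$ is only a quasi-norm.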
Submitted 9 May, 2022; v1 submitted 7 April, 2018;
originally announced April 2018.
-
Uniform-in-Submodel Bounds for Linear Regression in a Model Free Framework
Authors:
Arun Kumar Kuchibhotla,
Lawrence D. Brown,
Andreas Buja,
Edward I. George,
Linda Zhao
Abstract:
For the last two decades, high-dimensional data and methods have proliferated throughout the literature. Yet, the classical technique of linear regression has not lost its usefulness in applications. In fact, many high-dimensional estimation techniques can be seen as variable selection that leads to a smaller set of variables (a ``sub-model'') where classical linear regression applies. We analyze linear regression estimators resulting from model-selection by proving estimation error and linear representation bounds uniformly over sets of submodels. Based on deterministic inequalities, our results provide ``good'' rates when applied to both independent and dependent data. These results are useful in meaningfully interpreting the linear regression estimator obtained after exploring and reducing the variables and also in justifying post model-selection inference. All results are derived under no model assumptions and are non-asymptotic in nature.
Submitted 17 May, 2021; v1 submitted 15 February, 2018;
originally announced February 2018.
-
Semiparametric Efficiency in Convexity Constrained Single Index Model
Authors:
Arun K. Kuchibhotla,
Rohit K. Patra,
Bodhisattva Sen
Abstract:
We consider estimation and inference in a single index regression model with an unknown convex link function. We introduce a convex and Lipschitz constrained least squares estimator (CLSE) for both the parametric and the nonparametric components given independent and identically distributed observations. We prove the consistency and find the rates of convergence of the CLSE when the errors are assumed to have only $q \ge 2$ moments and are allowed to depend on the covariates. When $q\ge 5$, we establish $n^{-1/2}$-rate of convergence and asymptotic normality of the estimator of the parametric component. Moreover, the CLSE is proved to be semiparametrically efficient if the errors happen to be homoscedastic. We develop and implement a numerically stable and computationally fast algorithm to compute our proposed estimator in the R package~\texttt{simest}. We illustrate our methodology through extensive simulations and data analysis. Finally, our proof of efficiency is geometric and provides a general framework that can be used to prove efficiency of estimators in a wide variety of semiparametric models even when they do not satisfy the efficient score equation directly.
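The model and estimator described above can be sketched as follows. The notation is assumed here for illustration: $m_0$ is the convex link, $\theta_0$ the index parameter, $\mathcal{M}_L$ the class of convex $L$-Lipschitz functions, and the unit-norm constraint is one common identifiability normalization that may differ from the paper's.

```latex
Y = m_0(\theta_0^\top X) + \epsilon, \qquad \mathbb{E}[\epsilon \mid X] = 0,
\qquad
(\hat{m}, \hat{\theta}) \in \operatorname*{arg\,min}_{m \in \mathcal{M}_L,\ \|\theta\| = 1}
\frac{1}{n} \sum_{i=1}^{n} \bigl(Y_i - m(\theta^\top X_i)\bigr)^2 .
```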
Submitted 13 January, 2021; v1 submitted 31 July, 2017;
originally announced August 2017.
-
Statistical Inference based on Bridge Divergences
Authors:
Arun Kumar Kuchibhotla,
Somabha Mukherjee,
Ayanendranath Basu
Abstract:
M-estimators offer simple robust alternatives to the maximum likelihood estimator. Much of the robustness literature, however, has focused on the problems of location, location-scale and regression estimation rather than on estimation of general parameters. The density power divergence (DPD) and the logarithmic density power divergence (LDPD) measures provide two classes of competitive M-estimators (obtained from divergences) in general parametric models which contain the MLE as a special case. In each of these families, the robustness of the estimator is achieved through a density power down-weighting of outlying observations. Both the families have proved to be very useful tools in the area of robust inference. However, the relation and hierarchy between the minimum distance estimators of the two families are yet to be comprehensively studied or fully established. Given a particular set of real data, how does one choose an optimal member from the union of these two classes of divergences? In this paper, we present a generalized family of divergences incorporating the above two classes; this family provides a smooth bridge between the DPD and the LDPD measures. This family helps to clarify and settle several longstanding issues in the relation between the important families of DPD and LDPD, apart from being an important tool in different areas of statistical inference in its own right.
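For reference, the two divergence families named above are usually written as below for a data density $g$, a model density $f$, and a tuning parameter $\alpha > 0$. These are the standard forms from the robustness literature, given here as a reminder rather than reproduced from the paper; the bridge family itself is not restated.

```latex
d_\alpha^{\mathrm{DPD}}(g, f) = \int f^{1+\alpha}
  - \Bigl(1 + \frac{1}{\alpha}\Bigr) \int f^{\alpha} g
  + \frac{1}{\alpha} \int g^{1+\alpha},
\qquad
d_\alpha^{\mathrm{LDPD}}(g, f) = \log \int f^{1+\alpha}
  - \Bigl(1 + \frac{1}{\alpha}\Bigr) \log \int f^{\alpha} g
  + \frac{1}{\alpha} \log \int g^{1+\alpha} .
```

The LDPD applies a logarithm to each integral of the DPD, and both reduce to the Kullback-Leibler divergence, and hence to maximum likelihood, as $\alpha \to 0$.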
Submitted 18 June, 2017;
originally announced June 2017.
-
Models as Approximations II: A Model-Free Theory of Parametric Regression
Authors:
Andreas Buja,
Lawrence Brown,
Arun Kumar Kuchibhotla,
Richard Berk,
Ed George,
Linda Zhao
Abstract:
We develop a model-free theory of general types of parametric regression for iid observations. The theory replaces the parameters of parametric models with statistical functionals, to be called ``regression functionals'', defined on large non-parametric classes of joint $\xy$ distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary $\xy$ distributions, without assuming a linear model (see Part~I). More generally, regression functionals can be defined by minimizing objective functions or solving estimating equations at joint $\xy$ distributions. In this framework it is possible to achieve the following: (1)~define a notion of well-specification for regression functionals that replaces the notion of correct specification of models, (2)~propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3)~decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order $N^{-1/2}$, (4)~exhibit plug-in/sandwich estimators of standard error as limit cases of $\xy$ bootstrap estimators, and (5)~provide theoretical heuristics to indicate that $\xy$ bootstrap standard errors may generally be more stable than sandwich estimators.
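To make the comparison in points (4) and (5) concrete, here is a minimal sketch that computes plug-in sandwich standard errors and x-y (pairs) bootstrap standard errors for OLS slopes when the linear model is misspecified. It is an illustration under assumptions, not the authors' code: the function names, the HC0 form of the sandwich, the simulated quadratic data, and the number of bootstrap replicates are all choices made here.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def sandwich_se(X, y):
    """Plug-in (HC0) sandwich standard errors for the OLS coefficients."""
    beta = ols(X, y)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    # "Meat": sum_i resid_i^2 * x_i x_i^T
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = bread @ meat @ bread
    return np.sqrt(np.diag(cov))

def pairs_bootstrap_se(X, y, n_boot=2000, seed=0):
    """x-y bootstrap: resample (x_i, y_i) pairs jointly and refit OLS."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        betas[b] = ols(X[idx], y[idx])
    return betas.std(axis=0, ddof=1)

# Hypothetical data: the true regression function is quadratic,
# but a straight line is fitted, so the linear model is misspecified.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1, 2, size=n)
y = x ** 2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

se_sand = sandwich_se(X, y)
se_boot = pairs_bootstrap_se(X, y)
print("sandwich SE: ", se_sand)
print("bootstrap SE:", se_boot)
```

Both estimators target the same sampling variability of the fitted intercept and slope without assuming the linear model is correct, so at this sample size the two sets of standard errors should be close.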
Submitted 6 July, 2019; v1 submitted 10 December, 2016;
originally announced December 2016.
-
Efficient Estimation in Single Index Models through Smoothing splines
Authors:
Arun Kumar Kuchibhotla,
Rohit Kumar Patra
Abstract:
We consider estimation and inference in a single index regression model with an unknown but smooth link function. In contrast to the standard approach of using kernels or regression splines, we use smoothing splines to estimate the smooth link function. We develop a method to compute the penalized least squares estimators (PLSEs) of the parametric and the nonparametric components given independent and identically distributed (i.i.d.)~data. We prove the consistency and find the rates of convergence of the estimators. We establish asymptotic normality under mild assumptions and prove asymptotic efficiency of the parametric component under homoscedastic errors. A finite sample simulation corroborates our asymptotic theory. We also analyze a car mileage data set and an ozone concentration data set. The identifiability and existence of the PLSEs are also investigated.
Submitted 25 May, 2019; v1 submitted 30 November, 2016;
originally announced December 2016.