Search | arXiv e-print repository

Valid Post-selection Inference in Assumption-lean Linear Regression

Authors: Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao

Abstract: Construction of valid statistical inference for estimators based on data-driven selection has received a lot of attention in recent times. Berk et al. (2013) is possibly the first work to provide valid inference for Gaussian homoscedastic linear regression with fixed covariates under arbitrary covariate/variable selection. The setting is unrealistic and is extended by Bachoc et al. (2016) by relaxing the distributional assumptions. A major drawback of the aforementioned works is that the construction of valid confidence regions is computationally intensive. In this paper, we first prove that post-selection inference is equivalent to simultaneous inference and then construct valid post-selection confidence regions which are computationally simple. Our construction is based on deterministic inequalities and applies to independent as well as dependent random variables without the requirement of correct distributional assumptions. Finally, we compare the volume of our confidence regions with the existing ones and show that under non-stochastic covariates, our regions are much smaller.

Submitted 11 June, 2018; originally announced June 2018.

Comments: 49 pages

arXiv:1804.02605

doi 10.1093/imaiai/iaac012

Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression

Authors: Arun Kumar Kuchibhotla, Abhishek Chakrabortty

Abstract: Concentration inequalities form an essential toolkit in the study of high dimensional (HD) statistical methods. Most of the relevant statistics literature in this regard is based on sub-Gaussian or sub-exponential tail assumptions. In this paper, we first bring together various probabilistic inequalities for sums of independent random variables under much more general exponential type (namely sub-Weibull) tail assumptions. These results extract a part sub-Gaussian tail behavior in finite samples, matching the asymptotics governed by the central limit theorem, and are compactly represented in terms of a new Orlicz quasi-norm - the Generalized Bernstein-Orlicz norm - that typifies such tail behaviors. We illustrate the usefulness of these inequalities through the analysis of four fundamental problems in HD statistics. In the first two problems, we study the rate of convergence of the sample covariance matrix in terms of the maximum elementwise norm and the maximum k-sub-matrix operator norm which are key quantities of interest in bootstrap, HD covariance matrix estimation and HD inference. The third example concerns the restricted eigenvalue condition, required in HD linear regression, which we verify for all sub-Weibull random vectors through a unified analysis, and also prove a more general result related to restricted strong convexity in the process. In the final example, we consider the Lasso estimator for linear regression and establish its rate of convergence under much weaker than usual tail assumptions (on the errors as well as the covariates), while also allowing for misspecified models and both fixed and random design. To our knowledge, these are the first such results for Lasso obtained in this generality. The common feature in all our results over all the examples is that the convergence rates under most exponential tails match the usual ones under sub-Gaussian assumptions.

Submitted 9 May, 2022; v1 submitted 7 April, 2018; originally announced April 2018.

Comments: 68 pages; Revised version; To appear in Information and Inference: A Journal of the IMA

MSC Class: 60G50; 62J05; 60B20; 62J07; 62E17; 60F05; 60E15

Journal ref: Information and Inference: A Journal of the IMA (2022), Vol. 11, No. 4, 1389-1456

arXiv:1802.05801

Uniform-in-Submodel Bounds for Linear Regression in a Model Free Framework

Authors: Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Edward I. George, Linda Zhao

Abstract: For the last two decades, high-dimensional data and methods have proliferated throughout the literature. Yet, the classical technique of linear regression has not lost its usefulness in applications. In fact, many high-dimensional estimation techniques can be seen as variable selection that leads to a smaller set of variables (a "sub-model") where classical linear regression applies. We analyze linear regression estimators resulting from model-selection by proving estimation error and linear representation bounds uniformly over sets of submodels. Based on deterministic inequalities, our results provide "good" rates when applied to both independent and dependent data. These results are useful in meaningfully interpreting the linear regression estimator obtained after exploring and reducing the variables and also in justifying post model-selection inference. All results are derived under no model assumptions and are non-asymptotic in nature.

Submitted 17 May, 2021; v1 submitted 15 February, 2018; originally announced February 2018.

Comments: Forthcoming at Econometric Theory

arXiv:1708.00145

Semiparametric Efficiency in Convexity Constrained Single Index Model

Authors: Arun K. Kuchibhotla, Rohit K. Patra, Bodhisattva Sen

Abstract: We consider estimation and inference in a single index regression model with an unknown convex link function. We introduce a convex and Lipschitz constrained least squares estimator (CLSE) for both the parametric and the nonparametric components given independent and identically distributed observations. We prove the consistency and find the rates of convergence of the CLSE when the errors are assumed to have only $q \ge 2$ moments and are allowed to depend on the covariates. When $q \ge 5$, we establish $n^{-1/2}$-rate of convergence and asymptotic normality of the estimator of the parametric component. Moreover, the CLSE is proved to be semiparametrically efficient if the errors happen to be homoscedastic. We develop and implement a numerically stable and computationally fast algorithm to compute our proposed estimator in the R package simest. We illustrate our methodology through extensive simulations and data analysis. Finally, our proof of efficiency is geometric and provides a general framework that can be used to prove efficiency of estimators in a wide variety of semiparametric models even when they do not satisfy the efficient score equation directly.

Submitted 13 January, 2021; v1 submitted 31 July, 2017; originally announced August 2017.

Comments: Removed the density bounded away from zero assumption in assumption (A5). Weakened assumption (B2)

arXiv:1706.05745

Statistical Inference based on Bridge Divergences

Authors: Arun Kumar Kuchibhotla, Somabha Mukherjee, Ayanendranath Basu

Abstract: M-estimators offer simple robust alternatives to the maximum likelihood estimator. Much of the robustness literature, however, has focused on the problems of location, location-scale and regression estimation rather than on estimation of general parameters. The density power divergence (DPD) and the logarithmic density power divergence (LDPD) measures provide two classes of competitive M-estimators (obtained from divergences) in general parametric models which contain the MLE as a special case. In each of these families, the robustness of the estimator is achieved through a density power down-weighting of outlying observations. Both the families have proved to be very useful tools in the area of robust inference. However, the relation and hierarchy between the minimum distance estimators of the two families are yet to be comprehensively studied or fully established. Given a particular set of real data, how does one choose an optimal member from the union of these two classes of divergences? In this paper, we present a generalized family of divergences incorporating the above two classes; this family provides a smooth bridge between the DPD and the LDPD measures. This family helps to clarify and settle several longstanding issues in the relation between the important families of DPD and LDPD, apart from being an important tool in different areas of statistical inference in its own right.

Submitted 18 June, 2017; originally announced June 2017.

Comments: 45 pages

arXiv:1612.03257

Models as Approximations II: A Model-Free Theory of Parametric Regression

Authors: Andreas Buja, Lawrence Brown, Arun Kumar Kuchibhotla, Richard Berk, Ed George, Linda Zhao

Abstract: We develop a model-free theory of general types of parametric regression for iid observations. The theory replaces the parameters of parametric models with statistical functionals, to be called "regression functionals", defined on large non-parametric classes of joint $x$-$y$ distributions, without assuming a correct model. Parametric models are reduced to heuristics to suggest plausible objective functions. An example of a regression functional is the vector of slopes of linear equations fitted by OLS to largely arbitrary $x$-$y$ distributions, without assuming a linear model (see Part I). More generally, regression functionals can be defined by minimizing objective functions or solving estimating equations at joint $x$-$y$ distributions. In this framework it is possible to achieve the following: (1) define a notion of well-specification for regression functionals that replaces the notion of correct specification of models, (2) propose a well-specification diagnostic for regression functionals based on reweighting distributions and data, (3) decompose sampling variability of regression functionals into two sources, one due to the conditional response distribution and another due to the regressor distribution interacting with misspecification, both of order $N^{-1/2}$, (4) exhibit plug-in/sandwich estimators of standard error as limit cases of $x$-$y$ bootstrap estimators, and (5) provide theoretical heuristics to indicate that $x$-$y$ bootstrap standard errors may generally be more stable than sandwich estimators.

Submitted 6 July, 2019; v1 submitted 10 December, 2016; originally announced December 2016.

Comments: Submitted

MSC Class: 62A01

arXiv:1612.00068

Efficient Estimation in Single Index Models through Smoothing splines

Authors: Arun Kumar Kuchibhotla, Rohit Kumar Patra

Abstract: We consider estimation and inference in a single index regression model with an unknown but smooth link function. In contrast to the standard approach of using kernels or regression splines, we use smoothing splines to estimate the smooth link function. We develop a method to compute the penalized least squares estimators (PLSEs) of the parametric and the nonparametric components given independent and identically distributed (i.i.d.) data. We prove the consistency and find the rates of convergence of the estimators. We establish asymptotic normality under mild assumptions and prove asymptotic efficiency of the parametric component under homoscedastic errors. A finite sample simulation corroborates our asymptotic theory. We also analyze a car mileage data set and an ozone concentration data set. The identifiability and existence of the PLSEs are also investigated.

Submitted 25 May, 2019; v1 submitted 30 November, 2016; originally announced December 2016.

Comments: 50 pages, 3 figures, and 2 tables

Showing 51–57 of 57 results for author: Kuchibhotla, A K