-
RSVP-graphs: Fast High-dimensional Covariance Matrix Estimation under Latent Confounding
Authors:
Rajen D. Shah,
Benjamin Frot,
Gian-Andrea Thanei,
Nicolai Meinshausen
Abstract:
In this work we consider the problem of estimating a high-dimensional $p \times p$ covariance matrix $Σ$, given $n$ observations of confounded data with covariance $Σ+ ΓΓ^T$, where $Γ$ is an unknown $p \times q$ matrix of latent factor loadings. We propose a simple and scalable estimator based on the projection onto the right singular vectors of the observed data matrix, which we call RSVP. Our theoretical analysis of this method reveals that, in contrast to PCA-based approaches, RSVP is able to cope well with settings where the smallest eigenvalue of $Γ^T Γ$ is close to the largest eigenvalue of $Σ$, as well as settings where the eigenvalues of $Γ^T Γ$ are diverging fast. It is also able to handle data that may have heavy tails and only requires that the data have an elliptical distribution. RSVP does not require knowledge or estimation of the number of latent factors $q$, but only recovers $Σ$ up to an unknown positive scale factor. We argue this suffices in many applications, for example if an estimate of the correlation matrix is desired. We also show that by using subsampling, we can further improve the performance of the method. We demonstrate the favourable performance of RSVP through simulation experiments and an analysis of gene expression datasets collated by the GTEx consortium.
Submitted 29 November, 2019; v1 submitted 2 November, 2018;
originally announced November 2018.
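The abstract describes RSVP only as a projection onto the right singular vectors of the data matrix, so the following is a minimal sketch of one reading of that sentence, not the authors' implementation. The function name `rsvp_projection`, the rank tolerance `tol`, and the simulated confounded data are all assumptions made for illustration; the subsampling variant mentioned in the abstract is not shown.

```python
import numpy as np

def rsvp_projection(X, tol=1e-10):
    """Projection onto the span of the right singular vectors of X.

    Hypothetical helper: returns V V^T, where X = U D V^T is the thin SVD
    and only singular vectors with non-negligible singular values are kept.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s.max()   # drop numerically zero directions
    V = Vt[keep].T             # p x r matrix of right singular vectors
    return V @ V.T             # p x p projection matrix

# Simulated confounded data (assumed setup): rows have covariance I + Gamma Gamma^T.
rng = np.random.default_rng(0)
n, p, q = 50, 200, 3
Gamma = 3.0 * rng.normal(size=(p, q))   # latent factor loadings
H = rng.normal(size=(n, q))             # latent factors
E = rng.normal(size=(n, p))             # unconfounded part, Sigma = I here
X = E + H @ Gamma.T

P = rsvp_projection(X)
print(P.shape, round(float(np.trace(P))))
```

Because the singular values are discarded, the result does not depend on the overall scale of `X`, which is in keeping with the abstract's statement that $Σ$ is only recovered up to a positive scale factor.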
-
Random Projections For Large-Scale Regression
Authors:
Gian-Andrea Thanei,
Christina Heinze,
Nicolai Meinshausen
Abstract:
Fitting linear regression models can be computationally very expensive in large-scale data analysis tasks if the sample size and the number of variables are very large. Random projections are extensively used as a dimension reduction tool in machine learning and statistics. We discuss the applications of random projections in linear regression problems, developed to decrease computational costs, and give an overview of the theoretical guarantees on the generalization error. It can be shown that the combination of random projections with least squares regression leads to recovery similar to that of ridge regression and principal component regression. We also discuss possible improvements when averaging over multiple random projections, an approach that lends itself easily to parallel implementation.
Submitted 19 January, 2017;
originally announced January 2017.
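As a rough illustration of combining random projections with least squares and averaging over several projections, here is a minimal sketch. It assumes a Gaussian projection of the columns of the design matrix; the abstract does not specify the projection type, and the names `compressed_ls` and `averaged_compressed_ls`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def compressed_ls(X, y, d, rng):
    """Least squares on a random d-dimensional compression of the columns of X."""
    p = X.shape[1]
    Phi = rng.normal(size=(p, d)) / np.sqrt(d)        # assumed Gaussian projection
    gamma, *_ = np.linalg.lstsq(X @ Phi, y, rcond=None)
    return Phi @ gamma                                # coefficients in the original p-dim space

def averaged_compressed_ls(X, y, d, n_proj, seed=0):
    """Average the compressed estimator over n_proj independent projections."""
    rng = np.random.default_rng(seed)
    estimates = [compressed_ls(X, y, d, rng) for _ in range(n_proj)]
    return np.mean(estimates, axis=0)

# Simulated regression problem (assumed setup).
rng = np.random.default_rng(1)
n, p, d = 200, 50, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

beta_avg = averaged_compressed_ls(X, y, d=d, n_proj=20)
print(beta_avg.shape)
```

Each call to `compressed_ls` is independent of the others, so the loop in `averaged_compressed_ls` could be distributed across workers, which is the parallelism the abstract alludes to.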
-
Linear regression estimation in non-linear single index models
Authors:
Fadoua Balabdaoui,
Gian-Andrea Thanei
Abstract:
In this article, we consider the problem of estimating the index parameter $α_0$ in the single index model $E[Y |X] = f_0(α_0^T X)$ with $f_0$ the unknown ridge function defined on $\mathbb{R}$, $X$ a $d$-dimensional covariate and $Y$ the response. We show that when $X$ is Gaussian, $α_0$ can be consistently estimated by regressing the observed responses $Y_i$, $i = 1, \ldots, n$, on the covariates $X_1, \ldots, X_n$ after centering and rescaling. The method works without any additional smoothness assumptions on $f_0$ and only requires that $cov(f_0(α_0^T X),α_0^TX) \neq 0$, which is always satisfied by monotone and non-constant functions $f_0$. We show that our estimator is asymptotically normal and give an expression for its asymptotic variance. The approach is illustrated through a simulation study.
Submitted 17 December, 2017; v1 submitted 22 December, 2016;
originally announced December 2016.
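The idea of recovering the index direction by ordinary linear regression can be sketched as follows. This is a minimal simulation under assumed choices, not the paper's procedure: the ridge function $f_0(t) = t^3$, the standard Gaussian design, the noise level, and the unit-norm rescaling of the estimate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 5

# Assumed ground truth: a unit-norm index vector and a monotone ridge function.
alpha0 = rng.normal(size=d)
alpha0 /= np.linalg.norm(alpha0)
f0 = lambda t: t ** 3

X = rng.normal(size=(n, d))                        # Gaussian covariates
Y = f0(X @ alpha0) + 0.5 * rng.normal(size=n)      # single index model plus noise

# Ordinary least squares of the centered responses on the centered covariates.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)

# Rescale to unit norm (assumed normalisation; the direction is what matters).
alpha_hat = coef / np.linalg.norm(coef)

print("cosine similarity with alpha0:", float(alpha_hat @ alpha0))
```

The linear fit is badly misspecified for a cubic $f_0$, yet the fitted coefficient vector still points along $α_0$ because the covariates are Gaussian and $cov(f_0(α_0^T X), α_0^T X) \neq 0$, which is the condition stated in the abstract.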