-
Consistent Estimation of Low-Dimensional Latent Structure in High-Dimensional Data
Authors:
Xiongzhi Chen,
John D. Storey
Abstract:
We consider the problem of extracting a low-dimensional, linear latent variable structure from high-dimensional random variables. Specifically, we show that under mild conditions and when this structure manifests itself as a linear space that spans the conditional means, it is possible to consistently recover the structure using only information up to the second moments of these random variables. This finding, specialized to one-parameter exponential families whose variance function is quadratic in their means, allows for the derivation of an explicit estimator of such latent structure. This approach serves as a latent variable model estimator and as a tool for dimension reduction for a high-dimensional matrix of data composed of many related variables. Our theoretical results are verified by simulation studies and an application to genomic data.
Submitted 12 October, 2015;
originally announced October 2015.
-
Probabilistic models of genetic variation in structured populations applied to global human studies
Authors:
Wei Hao,
Minsun Song,
John D. Storey
Abstract:
Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important, unsolved problem is how to formulate and estimate probabilistic models of observed genotypes that allow for complex population structure. We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis (PCA) can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly mixed-membership model as a special case. Noting some drawbacks of this approach, we introduce a new "logistic factor analysis" (LFA) framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the Human Genome Diversity Panel and the 1000 Genomes Project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions.
Submitted 3 March, 2015; v1 submitted 6 December, 2013;
originally announced December 2013.
-
Statistical significance of variables driving systematic variation
Authors:
Neo Christopher Chung,
John D. Storey
Abstract:
There are a number of well-established methods such as principal components analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of principal components (PCs). The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be utilized to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify statistically significant genes that are cell-cycle regulated. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly-driven phenotype. We find a greater enrichment for inflammatory-related gene sets compared to using a clinically defined phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
Submitted 27 August, 2013;
originally announced August 2013.
-
Gene set bagging for estimating replicability of gene set analyses
Authors:
Andrew E. Jaffe,
John D. Storey,
Hongkai Ji,
Jeffrey T. Leek
Abstract:
Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping gene-set analysis and can be used to determine which are the most reproducible gene sets. Results: Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions: We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can improve biological inference of gene set analysis.
Submitted 17 January, 2013; v1 submitted 16 January, 2013;
originally announced January 2013.
-
Identifying and Mapping Cell-type Specific Chromatin Programming of Gene Expression
Authors:
Troels T. Marstrand,
John D. Storey
Abstract:
A problem of substantial interest is to systematically map variation in chromatin structure to gene expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show that ~25% of genes show cell-type-specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., +/- 200 kb) capture more genes with this relationship than local regions (e.g., +/- 2.5 kb), yet the local regions show a more pronounced effect. By exploiting variation across cell types, we were able to pinpoint the most likely hypersensitive sites related to cell-type-specific expression, which we show have a range of contextual usages. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene expression variation.
Submitted 11 October, 2012;
originally announced October 2012.