Search | arXiv e-print repository

doi 10.1214/17-AOAS1110

A Unified Statistical Framework for Single Cell and Bulk RNA Sequencing Data

Authors: Lingxue Zhu, Jing Lei, Bernie Devlin, Kathryn Roeder

Abstract: Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the "dropout" events. A "dropout" happens when the RNA for a gene fails to be amplified prior to sequencing, producing a "false" zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.

Submitted 19 October, 2017; v1 submitted 26 September, 2016; originally announced September 2016.

Journal ref: Ann. Appl. Stat., Volume 12, Number 1 (2018), 609-632
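
Illustration: the "dropout" mechanism described in the abstract above can be pictured with a toy simulation. This is a minimal sketch and not the URSM model itself: it assumes Poisson-distributed true counts and a single constant dropout probability, whereas the paper uses a hierarchical model. The function name `simulate_dropouts`, the matrix size, and all numeric settings are hypothetical choices made for this example.

```python
import numpy as np

def simulate_dropouts(true_counts, dropout_prob, rng):
    """Zero out each entry independently with probability dropout_prob.

    Assumption: a constant dropout rate, which is a simplification and
    not the hierarchical dropout model used by URSM.
    """
    # True where the entry survives amplification, False where it drops out.
    observed_mask = rng.random(true_counts.shape) >= dropout_prob
    return true_counts * observed_mask, observed_mask

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 cells x 50 genes, Poisson counts with mean 5.
true_counts = rng.poisson(lam=5.0, size=(200, 50))
observed, mask = simulate_dropouts(true_counts, dropout_prob=0.3, rng=rng)

# Dropouts add "false" zeros on top of the genuine zeros in the data.
true_zero_frac = np.mean(true_counts == 0)
observed_zero_frac = np.mean(observed == 0)
print(f"zeros before dropout: {true_zero_frac:.3f}")
print(f"zeros after dropout:  {observed_zero_frac:.3f}")
```

The gap between the two printed fractions is the share of false zeros, which is the quantity an imputation method would need to recover.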

arXiv:1606.00252 [pdf, other]

doi 10.1214/17-AOAS1062

Testing High Dimensional Covariance Matrices, with Application to Detecting Schizophrenia Risk Genes

Authors: Lingxue Zhu, Jing Lei, Bernie Devlin, Kathryn Roeder

Abstract: Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene "relationship" matrices that are of practical interest, such as the weighted adjacency matrices.

Submitted 7 December, 2016; v1 submitted 1 June, 2016; originally announced June 2016.

Comments: 25 pages, 5 figures, 3 tables

Journal ref: Ann. Appl. Stat. 11 (2017), no. 3, 1810--1831
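
Illustration: the idea of testing two covariance matrices through the spectrum of their difference can be sketched as follows. This is a simplified stand-in and not the sLED procedure: it uses the ordinary (non-sparse) largest absolute eigenvalue of the differential matrix, and it assumes a permutation scheme for calibration, which the abstract does not specify. The function names and the simulated data are hypothetical.

```python
import numpy as np

def leading_eig_stat(x, y):
    """Largest absolute eigenvalue of D = cov(y) - cov(x).

    Assumption: plain eigenvalue, without the sparsity constraint
    that the sLED test imposes.
    """
    d = np.cov(y, rowvar=False) - np.cov(x, rowvar=False)
    # D is symmetric, so eigvalsh applies and returns real eigenvalues.
    return np.max(np.abs(np.linalg.eigvalsh(d)))

def permutation_pvalue(x, y, n_perm, rng):
    """Permutation p-value obtained by shuffling group labels."""
    observed = leading_eig_stat(x, y)
    pooled = np.vstack([x, y])
    n_x = x.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        stat = leading_eig_stat(pooled[idx[:n_x]], pooled[idx[n_x:]])
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
n_genes = 10
controls = rng.normal(size=(100, n_genes))
cases = rng.normal(size=(100, n_genes))
cases[:, 0] *= 3.0  # inflate the variance of one gene in the cases

p_value = permutation_pvalue(controls, cases, n_perm=200, rng=rng)
print(f"permutation p-value: {p_value:.4f}")
```

In the high-dimensional, sparse-signal setting the abstract targets, the plain eigenvalue would be dominated by noise, which is the motivation for the sparse version.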

arXiv:1208.2253 [pdf, ps, other]

doi 10.1214/12-AOAS598

Refining genetically inferred relationships using treelet covariance smoothing

Authors: Andrew Crossett, Ann B. Lee, Lambertus Klei, Bernie Devlin, Kathryn Roeder

Abstract: Recent technological advances coupled with large sample sets have uncovered many factors underlying the genetic basis of traits and the predisposition to complex disease, but much is left to discover. A common thread to most genetic investigations is familial relationships. Close relatives can be identified from family records, and more distant relatives can be inferred from large panels of genetic markers. Unfortunately these empirical estimates can be noisy, especially regarding distant relatives. We propose a new method for denoising genetically inferred relationship matrices by exploiting the underlying structure due to hierarchical groupings of correlated individuals. The approach, which we call Treelet Covariance Smoothing, employs a multiscale decomposition of covariance matrices to improve estimates of pairwise relationships. On both simulated and real data, we show that smoothing leads to better estimates of the relatedness amongst distantly related individuals. We illustrate our method with a large genome-wide association study and estimate the "heritability" of body mass index quite accurately. Traditionally heritability, defined as the fraction of the total trait variance attributable to additive genetic effects, is estimated from samples of closely related individuals using random effects models. We show that by using smoothed relationship matrices we can estimate heritability using population-based samples. Finally, while our methods have been developed for refining genetic relationship matrices and improving estimates of heritability, they have much broader potential application in statistics. Most notably, for error-in-variables random effects models and settings that require regularization of matrices with block or hierarchical structure.

Submitted 10 December, 2013; v1 submitted 10 August, 2012; originally announced August 2012.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS598 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS598

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 2, 669-690

arXiv:1104.1162 [pdf, other]

GemTools: A fast and efficient approach to estimating genetic ancestry

Authors: Lambertus Klei, Brian P. Kent, Nadine Melhem, Bernie Devlin, Kathryn Roeder

Abstract: To uncover the genetic basis of complex disease, individuals are often measured at a large number of genetic variants (usually SNPs) across the genome. GemTools provides computationally efficient tools for modeling genetic ancestry based on SNP genotypes. The main algorithm creates an eigenmap based on genetic similarities, and then clusters subjects based on their map position. This process is continued iteratively until each cluster is relatively homogeneous. For genetic association studies, GemTools matches cases and controls based on genetic similarity.

Submitted 6 April, 2011; originally announced April 2011.

Comments: 5 pages, 1 figure
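
Illustration: the "eigenmap, then cluster" step described in the abstract above might look roughly like the sketch below. This is not the GemTools implementation or its API: it assumes a standardized genotype similarity matrix and performs a single two-way split on the sign of the leading eigenvector, whereas the abstract describes an iterative procedure. The function name `eigenmap_split` and the simulated two-population data are hypothetical.

```python
import numpy as np

def eigenmap_split(genotypes):
    """Split subjects into two clusters using the leading eigenvector.

    Assumption: similarity is the inner product of column-standardized
    genotypes; GemTools may define similarity differently.
    """
    centered = genotypes - genotypes.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # guard against monomorphic SNPs
    z = centered / scale
    similarity = z @ z.T / z.shape[1]
    # eigh returns eigenvalues in ascending order; take the last vector.
    _, eigvecs = np.linalg.eigh(similarity)
    leading = eigvecs[:, -1]
    return (leading > 0).astype(int), leading

rng = np.random.default_rng(2)
n_per_pop, n_snps = 50, 500

# Two hypothetical populations whose allele frequencies have drifted apart.
freq_a = rng.uniform(0.1, 0.9, size=n_snps)
freq_b = np.clip(freq_a + rng.normal(scale=0.2, size=n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b]).astype(float)

labels, coords = eigenmap_split(genotypes)
truth = np.repeat([0, 1], n_per_pop)
# Cluster labels are arbitrary, so score agreement up to a label swap.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with simulated ancestry: {agreement:.2f}")
```

Repeating the split within each resulting cluster until no clear structure remains would approximate the iterative refinement the abstract mentions.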

Showing 1–4 of 4 results for author: Devlin, B