-
OPENMENDEL: A Cooperative Programming Project for Statistical Genetics
Authors:
Hua Zhou,
Janet S. Sinsheimer,
Christopher A. German,
Sarah S. Ji,
Douglas M. Bates,
Benjamin B. Chu,
Kevin L. Keys,
Juhyun Kim,
Seyoon Ko,
Gordon D. Mosher,
Jeanette C. Papp,
Eric M. Sobel,
Jing Zhai,
Jin J. Zhou,
Kenneth Lange
Abstract:
Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet…
▽ More
Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDELproject (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
△ Less
Submitted 13 February, 2019;
originally announced February 2019.
-
BioSimulator.jl: Stochastic simulation in Julia
Authors:
Alfonso Landeros,
Timothy Stutz,
Kevin L. Keys,
Alexander Alekseyenko,
Janet S. Sinsheimer,
Kenneth Lange,
Mary Sehl
Abstract:
Biological systems with intertwined feedback loops pose a challenge to mathematical modeling efforts. Moreover, rare events, such as mutation and extinction, complicate system dynamics. Stochastic simulation algorithms are useful in generating time-evolution trajectories for these systems because they can adequately capture the influence of random fluctuations and quantify rare events. We present…
▽ More
Biological systems with intertwined feedback loops pose a challenge to mathematical modeling efforts. Moreover, rare events, such as mutation and extinction, complicate system dynamics. Stochastic simulation algorithms are useful in generating time-evolution trajectories for these systems because they can adequately capture the influence of random fluctuations and quantify rare events. We present a simple and flexible package, BioSimulator.jl, for implementing the Gillespie algorithm, $τ$-leaping, and related stochastic simulation algorithms. The objective of this work is to provide scientists across domains with fast, user-friendly simulation tools. We used the high-performance programming language Julia because of its emphasis on scientific computing. Our software package implements a suite of stochastic simulation algorithms based on Markov chain theory. We provide the ability to (a) diagram Petri Nets describing interactions, (b) plot average trajectories and attached standard deviations of each participating species over time, and (c) generate frequency distributions of each species at a specified time. BioSimulator.jl's interface allows users to build models programmatically within Julia. A model is then passed to the simulate routine to generate simulation data. The built-in tools allow one to visualize results and compute summary statistics. Our examples highlight the broad applicability of our software to systems of varying complexity from ecology, systems biology, chemistry, and genetics. The user-friendly nature of BioSimulator.jl encourages the use of stochastic simulation, minimizes tedious programming efforts, and reduces errors during model specification.
△ Less
Submitted 29 November, 2018;
originally announced November 2018.
-
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Authors:
Gary K. Chen,
Eric Chi,
John Ranola,
Kenneth Lange
Abstract:
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical cluster…
▽ More
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The current paper exploits the proximal distance principle to construct a novel algorithm for solving the convex clustering problem. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. Our convex clustering software separates parameters, accommodates missing data, and supports prior information on relationships. The software is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems.
△ Less
Submitted 6 September, 2014;
originally announced September 2014.
-
Fast Genome-Wide QTL Analysis Using Mendel
Authors:
Hua Zhou,
Jin Zhou,
Tao Hu,
Eric M Sobel,
Kenneth Lange
Abstract:
Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and x-linked loci, (…
▽ More
Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and x-linked loci, (e) correctly deals with missing data in traits, covariates, and genotypes, (f) allows for covariate adjustment and constraints among parameters, (g) uses either theoretical or SNP-based empirical kinship matrix for additive polygenic effects, (h) allows extra variance components such as dominant polygenic effects and household effects, (i) detects and reports outlier individuals and pedigrees, and (j) allows for robust estimation via the $t$-distribution. The current paper assesses these capabilities on the genetics analysis workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at \url{http://www.genetics.ucla.edu/software}.
△ Less
Submitted 30 July, 2014;
originally announced July 2014.
-
Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data
Authors:
Hua Zhou,
John Blangero,
Thomas D. Dyer,
Kei-hang K. Chan,
Kenneth Lange,
Eric M. Sobel
Abstract:
Since most analysis software for genome-wide association studies (GWAS) currently exploit only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even data sets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered an…
▽ More
Since most analysis software for genome-wide association studies (GWAS) currently exploit only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even data sets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper we re-examine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (6 CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1357 individuals in 124 pedigrees) takes less than 2 minutes and 1.5 GB of memory. Complete multivariate QTL analysis of the three time-points of the longitudinal HDL multivariate trait takes less than 5 minutes and 1.5 GB of memory.
△ Less
Submitted 19 December, 2014; v1 submitted 30 July, 2014;
originally announced July 2014.
-
Reconstructing DNA copy number by penalized estimation and imputation
Authors:
Zhongyang Zhang,
Kenneth Lange,
Roel Ophoff,
Chiara Sabatti
Abstract:
Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and…
▽ More
Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18--29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization--minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
△ Less
Submitted 10 January, 2011; v1 submitted 11 June, 2009;
originally announced June 2009.