Search | arXiv e-print repository

Infinite Mixture Models for Improved Modeling of Across-Site Evolutionary Variation

Authors: Mandev S. Gill, Guy Baele, Marc A. Suchard, Philippe Lemey

Abstract: Scientific studies in many areas of biology routinely employ evolutionary analyses based on the probabilistic inference of phylogenetic trees from molecular sequence data. Evolutionary processes that act at the molecular level are highly variable, and properly accounting for heterogeneity in evolutionary processes is crucial for more accurate phylogenetic inference. Nucleotide substitution rates and patterns are known to vary among sites in multiple sequence alignments, and such variation can be modeled by partitioning alignments into categories corresponding to different substitution models. Determining $\textit{a priori}$ appropriate partitions can be difficult, however, and better model fit can be achieved through flexible Bayesian infinite mixture models that simultaneously infer the number of partitions, the partition that each site belongs to, and the evolutionary parameters corresponding to each partition. Here, we consider several different types of infinite mixture models, including classic Dirichlet process mixtures, as well as novel approaches for modeling across-site evolutionary variation: hierarchical models for data with a natural group structure, and infinite hidden Markov models that account for spatial patterns in alignments. In analyses of several viral data sets, we find that different types of infinite mixture models emerge as the best choices in different scenarios. To enable these models to scale efficiently to large data sets, we adapt efficient Markov chain Monte Carlo algorithms and exploit opportunities for parallel computing. We implement this infinite mixture modeling framework in BEAST X, a widely-used software package for Bayesian phylogenetic inference.

Submitted 8 December, 2024; originally announced December 2024.

arXiv:2002.00245

Online Bayesian phylodynamic inference in BEAST with application to epidemic reconstruction

Authors: Mandev S. Gill, Philippe Lemey, Marc A. Suchard, Andrew Rambaut, Guy Baele

Abstract: Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an 'online' fashion. Widely-used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data -- in terms of alignment changes, sequence addition or removal -- present common scenarios that can benefit from online inference.

Submitted 1 February, 2020; originally announced February 2020.

Comments: 20 pages, 3 figures

arXiv:1906.05136

Markov-modulated continuous-time Markov chains to identify site- and branch-specific evolutionary variation

Authors: Guy Baele, Mandev S. Gill, Philippe Lemey, Marc A. Suchard

Abstract: Markov models of character substitution on phylogenies form the foundation of phylogenetic inference frameworks. Early models made the simplifying assumption that the substitution process is homogeneous over time and across sites in the molecular sequence alignment. While standard practice adopts extensions that accommodate heterogeneity of substitution rates across sites, heterogeneity in the process over time in a site-specific manner remains frequently overlooked. This is problematic, as evolutionary processes that act at the molecular level are highly variable, subjecting different sites to different selective constraints over time, impacting their substitution behaviour. We propose incorporating time variability through Markov-modulated models (MMMs) that allow the substitution process (including relative character exchange rates as well as the overall substitution rate) that models the evolution at an individual site to vary across lineages. We implement a general MMM framework in BEAST, a popular Bayesian phylogenetic inference software package, allowing researchers to compose a wide range of MMMs through flexible XML specification. Using examples from bacterial, viral and plastid genome evolution, we show that MMMs impact phylogenetic tree estimation and can substantially improve model fit compared to standard substitution models. Through simulations, we show that marginal likelihood estimation accurately identifies the generative model and does not systematically prefer the more parameter-rich MMMs. In order to mitigate the increased computational demands associated with MMMs, our implementation exploits recently developed updates to BEAGLE, a high-performance computational library for phylogenetic inference.

Submitted 12 June, 2019; originally announced June 2019.

Comments: 30 pages, 8 figures

arXiv:1601.05078

Understanding Past Population Dynamics: Bayesian Coalescent-Based Modeling with Covariates

Authors: Mandev S. Gill, Philippe Lemey, Shannon N. Bennett, Roman Biek, Marc A. Suchard

Abstract: Effective population size characterizes the genetic variability in a population and is a parameter of paramount importance in population genetics. Kingman's coalescent process enables inference of past population dynamics directly from molecular sequence data, and researchers have developed a number of flexible coalescent-based models for Bayesian nonparametric estimation of the effective population size as a function of time. A major goal of demographic reconstruction is understanding the association between the effective population size and potential explanatory factors. Building upon Bayesian nonparametric coalescent-based approaches, we introduce a flexible framework that incorporates time-varying covariates through Gaussian Markov random fields. To approximate the posterior distribution, we adapt efficient Markov chain Monte Carlo algorithms designed for highly structured Gaussian models. Incorporating covariates into the demographic inference framework enables the modeling of associations between the effective population size and covariates while accounting for uncertainty in population histories. Furthermore, it can lead to more precise estimates of population dynamics. We apply our model to four examples. We reconstruct the demographic history of raccoon rabies in North America and find a significant association with the spatiotemporal spread of the outbreak. Next, we examine the effective population size trajectory of the DENV-4 virus in Puerto Rico along with viral isolate count data and find similar cyclic patterns. We compare the population history of the HIV-1 CRF02_AG clade in Cameroon with HIV incidence and prevalence data and find that the effective population size is more reflective of incidence rate. Finally, we explore the hypothesis that the population dynamics of musk ox during the Late Quaternary period were related to climate change.

Submitted 19 January, 2016; originally announced January 2016.

Comments: 31 pages, 6 figures

arXiv:1512.07948

A Relaxed Drift Diffusion Model for Phylogenetic Trait Evolution

Authors: Mandev S. Gill, Lam Si Tung Ho, Guy Baele, Philippe Lemey, Marc A. Suchard

Abstract: Understanding the processes that give rise to quantitative measurements associated with molecular sequence data remains an important issue in statistical phylogenetics. Examples of such measurements include geographic coordinates in the context of phylogeography and phenotypic traits in the context of comparative studies. A popular approach is to model the evolution of continuously varying traits as a Brownian diffusion process. However, standard Brownian diffusion is quite restrictive and may not accurately characterize certain trait evolutionary processes. Here, we relax one of the major restrictions of standard Brownian diffusion by incorporating a nontrivial estimable drift into the process. We introduce a relaxed drift diffusion model for the evolution of multivariate continuously varying traits along a phylogenetic tree via Brownian diffusion with drift. Notably, the relaxed drift model accommodates branch-specific variation of drift rates while preserving model identifiability. We implement the relaxed drift model in a Bayesian inference framework to simultaneously reconstruct the evolutionary histories of molecular sequence data and associated multivariate continuous trait data, and provide tools to visualize evolutionary reconstructions. We illustrate our approach in three viral examples. In the first two, we examine the spatiotemporal spread of HIV-1 in central Africa and West Nile virus in North America and show that a relaxed drift approach uncovers a clearer, more detailed picture of the dynamics of viral dispersal than standard Brownian diffusion. Finally, we study antigenic evolution in the context of HIV-1 resistance to three broadly neutralizing antibodies. Our analysis reveals evidence of a continuous drift at the HIV-1 population level towards enhanced resistance to neutralization by the VRC01 monoclonal antibody over the course of the epidemic.

Submitted 29 December, 2015; v1 submitted 24 December, 2015; originally announced December 2015.

Comments: 35 pages, 3 figures, 5 tables. Changed from double-spaced to single-spaced

arXiv:1009.0779

doi 10.1214/11-AOAS484

Gravitational Lensing Accuracy Testing 2010 (GREAT10) Challenge Handbook

Authors: Thomas Kitching, Sreekumar Balan, Gary Bernstein, Matthias Bethge, Sarah Bridle, Frederic Courbin, Marc Gentile, Alan Heavens, Michael Hirsch, Reshad Hosseini, Alina Kiessling, Adam Amara, Donnacha Kirk, Konrad Kuijken, Rachel Mandelbaum, Baback Moghaddam, Guldariya Nurbaeva, Stephane Paulin-Henriksson, Anais Rassat, Jason Rhodes, Bernhard Schölkopf, John Shawe-Taylor, Mandeep Gill, Marina Shmakova, Andy Taylor, et al. (10 additional authors not shown)

Abstract: GRavitational lEnsing Accuracy Testing 2010 (GREAT10) is a public image analysis challenge aimed at the development of algorithms to analyze astronomical images. Specifically, the challenge is to measure varying image distortions in the presence of a variable convolution kernel, pixelization and noise. This is the second in a series of challenges set to the astronomy, computer science and statistics communities, providing a structured environment in which methods can be improved and tested in preparation for planned astronomical surveys. GREAT10 extends upon previous work by introducing variable fields into the challenge. The "Galaxy Challenge" involves the precise measurement of galaxy shape distortions, quantified locally by two parameters called shear, in the presence of a known convolution kernel. Crucially, the convolution kernel and the simulated gravitational lensing shape distortion both now vary as a function of position within the images, as is the case for real data. In addition, we introduce the "Star Challenge" that concerns the reconstruction of a variable convolution kernel, similar to that in a typical astronomical observation. This document details the GREAT10 Challenge for potential participants. Continually updated information is also available from http://www.greatchallenges.info.

Submitted 30 November, 2011; v1 submitted 3 September, 2010; originally announced September 2010.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/11-AOAS484

Report number: IMS-AOAS-AOAS484

Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 3, 2231-2263

arXiv:0802.1214

doi 10.1214/08-AOAS222

Handbook for the GREAT08 Challenge: An image analysis competition for cosmological lensing

Authors: Sarah Bridle, John Shawe-Taylor, Adam Amara, Douglas Applegate, Sreekumar T. Balan, Joel Berge, Gary Bernstein, Hakon Dahle, Thomas Erben, Mandeep Gill, Alan Heavens, Catherine Heymans, F. William High, Henk Hoekstra, Mike Jarvis, Donnacha Kirk, Thomas Kitching, Jean-Paul Kneib, Konrad Kuijken, David Lagatutta, Rachel Mandelbaum, Richard Massey, Yannick Mellier, Baback Moghaddam, Yassir Moudden, et al. (13 additional authors not shown)

Abstract: The GRavitational lEnsing Accuracy Testing 2008 (GREAT08) Challenge focuses on a problem that is of crucial importance for future observations in cosmology. The shapes of distant galaxies can be used to determine the properties of dark energy and the nature of gravity, because light from those galaxies is bent by gravity from the intervening dark matter. The observed galaxy images appear distorted, although only slightly, and their shapes must be precisely disentangled from the effects of pixelisation, convolution and noise. The worldwide gravitational lensing community has made significant progress in techniques to measure these distortions via the Shear TEsting Program (STEP). Via STEP, we have run challenges within our own community, and come to recognise that this particular image analysis problem is ideally matched to experts in statistical inference, inverse problems and computational learning. Thus, in order to continue the progress seen in recent years, we are seeking an infusion of new ideas from these communities. This document details the GREAT08 Challenge for potential participants. Please visit http://www.great08challenge.info for the latest information.

Submitted 15 June, 2009; v1 submitted 11 February, 2008; originally announced February 2008.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-AOAS222

Report number: IMS-AOAS-AOAS222

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 1, 6-37

Showing 1–7 of 7 results for author: Gill, M