-
Torchtree: flexible phylogenetic model development and inference using PyTorch
Authors:
Mathieu Fourment,
Matthew Macaulay,
Christiaan J Swanepoel,
Xiang Ji,
Marc A Suchard,
Frederick A Matsen IV
Abstract:
Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Py…
▽ More
Bayesian inference has predominantly relied on the Markov chain Monte Carlo (MCMC) algorithm for many years. However, MCMC is computationally laborious, especially for complex phylogenetic models of time trees. This bottleneck has led to the search for alternatives, such as variational Bayes, which can scale better to large datasets. In this paper, we introduce torchtree, a framework written in Python that allows developers to easily implement rich phylogenetic models and algorithms using a fixed tree topology. One can either use automatic differentiation, or leverage torchtree's plug-in system to compute gradients analytically for model components for which automatic differentiation is slow. We demonstrate that the torchtree variational inference framework performs similarly to BEAST in terms of speed and approximation accuracy. Furthermore, we explore the use of the forward KL divergence as an optimizing criterion for variational inference, which can handle discontinuous and non-differentiable models. Our experiments show that inference using the forward KL divergence tends to be faster per iteration compared to the evidence lower bound (ELBO) criterion, although the ELBO-based inference may converge faster in some cases. Overall, torchtree provides a flexible and efficient framework for phylogenetic model development and inference using PyTorch.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Differentiable Phylogenetics via Hyperbolic Embeddings with Dodonaphy
Authors:
Matthew Macaulay,
Mathieu Fourment
Abstract:
Motivation: Navigating the high dimensional space of discrete trees for phylogenetics presents a challenging problem for tree optimisation. To address this, hyperbolic embeddings of trees offer a promising approach to encoding trees efficiently in continuous spaces. However, they require a differentiable tree decoder to optimise the phylogenetic likelihood. We present soft-NJ, a differentiable ver…
▽ More
Motivation: Navigating the high dimensional space of discrete trees for phylogenetics presents a challenging problem for tree optimisation. To address this, hyperbolic embeddings of trees offer a promising approach to encoding trees efficiently in continuous spaces. However, they require a differentiable tree decoder to optimise the phylogenetic likelihood. We present soft-NJ, a differentiable version of neighbour-joining that enables gradient-based optimisation over the space of trees.
Results: We illustrate the potential for differentiable optimisation over tree space for maximum likelihood inference. We then perform variational Bayesian phylogenetics by optimising embedding distributions in hyperbolic space. We compare the performance of this approximation technique on eight benchmark datasets to state-of-art methods. However, geometric frustrations of the embedding locations produce local optima that pose a challenge for optimisation.
Availability: Dodonaphy is freely available on the web at www.https://github.com/mattapow/dodonaphy. It includes an implementation of soft-NJ.
△ Less
Submitted 20 September, 2023;
originally announced September 2023.
-
Many-core algorithms for high-dimensional gradients on phylogenetic trees
Authors:
Karthik Gangavarapu,
Xiang Ji,
Guy Baele,
Mathieu Fourment,
Philippe Lemey,
Frederick A. Matsen IV,
Marc A. Suchard
Abstract:
The rapid growth in genomic pathogen data spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models where the dimensions of the parameters increase with the number of sequences $N$. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-…
▽ More
The rapid growth in genomic pathogen data spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models where the dimensions of the parameters increase with the number of sequences $N$. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters that traditionally takes $\mathcal{O}(N^2)$ operations using the standard pruning algorithm. A recent study proposes an approach to calculate this gradient in $\mathcal{O}(N)$, enabling researchers to take advantage of gradient-based samplers such as HMC. The CPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as codon models. Here, we describe novel massively parallel algorithms to calculate the gradient of the log-likelihood wrt all BLS parameters that take advantage of graphics processing units (GPUs) and result in many fold higher speedups over previous CPU implementations. We benchmark these GPU algorithms on three computing systems using three evolutionary inference examples: carnivores, dengue and yeast, and observe a greater than 128-fold speedup over the CPU implementation for codon-based models and greater than 8-fold speedup for nucleotide-based models. As a practical demonstration, we also estimate the timing of the first introduction of West Nile virus into the continental Unites States under a codon model with a relaxed molecular clock from 104 full viral genomes, an inference task previously intractable. We provide an implementation of our GPU algorithms in BEAGLE v4.0.0, an open source library for statistical phylogenetics that enables parallel calculations on multi-core CPUs and GPUs.
△ Less
Submitted 8 March, 2023;
originally announced March 2023.
-
TreeFlow: probabilistic programming and automatic differentiation for phylogenetics
Authors:
Christiaan Swanepoel,
Mathieu Fourment,
Xiang Ji,
Hassan Nasif,
Marc A Suchard,
Frederick A Matsen IV,
Alexei Drummond
Abstract:
Probabilistic programming frameworks are powerful tools for statistical modelling and inference. They are not immediately generalisable to phylogenetic problems due to the particular computational properties of the phylogenetic tree object. TreeFlow is a software library for probabilistic programming and automatic differentiation with phylogenetic trees. It implements inference algorithms for phyl…
▽ More
Probabilistic programming frameworks are powerful tools for statistical modelling and inference. They are not immediately generalisable to phylogenetic problems due to the particular computational properties of the phylogenetic tree object. TreeFlow is a software library for probabilistic programming and automatic differentiation with phylogenetic trees. It implements inference algorithms for phylogenetic tree times and model parameters given a tree topology. We demonstrate how TreeFlow can be used to quickly implement and assess new models. We also show that it provides reasonable performance for gradient-based inference algorithms compared to specialized computational libraries for phylogenetics.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Automatic differentiation is no panacea for phylogenetic gradient computation
Authors:
Mathieu Fourment,
Christiaan J. Swanepoel,
Jared G. Galloway,
Xiang Ji,
Karthik Gangavarapu,
Marc A. Suchard,
Frederick A. Matsen IV
Abstract:
Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via automatic differentiation implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if th…
▽ More
Gradients of probabilistic model likelihoods with respect to their parameters are essential for modern computational statistics and machine learning. These calculations are readily available for arbitrary models via automatic differentiation implemented in general-purpose machine-learning libraries such as TensorFlow and PyTorch. Although these libraries are highly optimized, it is not clear if their general-purpose nature will limit their algorithmic complexity or implementation speed for the phylogenetic case compared to phylogenetics-specific code. In this paper, we compare six gradient implementations of the phylogenetic likelihood functions, in isolation and also as part of a variational inference procedure. We find that although automatic differentiation can scale approximately linearly in tree size, it is much slower than the carefully-implemented gradient calculation for tree likelihood and ratio transformation operations. We conclude that a mixed approach combining phylogenetic libraries with machine learning libraries will provide the optimal combination of speed and model flexibility moving forward.
△ Less
Submitted 4 June, 2023; v1 submitted 3 November, 2022;
originally announced November 2022.
-
Fidelity of Hyperbolic Space for Bayesian Phylogenetic Inference
Authors:
Matthew Macaulay,
Aaron E. Darling,
Mathieu Fourment
Abstract:
Bayesian inference for phylogenetics is a gold standard for computing distributions of phylogenies. It faces the challenging problem of. moving throughout the high-dimensional space of trees. However, hyperbolic space offers a low dimensional representation of tree-like data. In this paper, we embed genomic sequences into hyperbolic space and perform hyperbolic Markov Chain Monte Carlo for Bayesia…
▽ More
Bayesian inference for phylogenetics is a gold standard for computing distributions of phylogenies. It faces the challenging problem of. moving throughout the high-dimensional space of trees. However, hyperbolic space offers a low dimensional representation of tree-like data. In this paper, we embed genomic sequences into hyperbolic space and perform hyperbolic Markov Chain Monte Carlo for Bayesian inference. The posterior probability is computed by decoding a neighbour joining tree from proposed embedding locations. We empirically demonstrate the fidelity of this method on eight data sets. The sampled posterior distribution recovers the splits and branch lengths to a high degree. We investigated the effects of curvature and embedding dimension on the Markov Chain's performance. Finally, we discuss the prospects for adapting this method to navigate tree space with gradients.
△ Less
Submitted 16 June, 2022;
originally announced June 2022.
-
19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology
Authors:
Mathieu Fourment,
Andrew F. Magee,
Chris Whidden,
Arman Bilge,
Frederick A. Matsen IV,
Vladimir N. Minin
Abstract:
The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension o…
▽ More
The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real datasets. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.
△ Less
Submitted 28 November, 2018;
originally announced November 2018.
-
Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies
Authors:
Chris Whidden,
Brian C. Claywell,
Thayer Fisher,
Andrew F. Magee,
Mathieu Fourment,
Frederick A. Matsen IV
Abstract:
Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this paper, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show…
▽ More
Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this paper, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies. Here `likelihood' of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method `phylogenetic topographer' (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a non-blocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.
△ Less
Submitted 27 November, 2018;
originally announced November 2018.