-
Brownian motion, bridges and Bayesian inference in phylogenetic tree space
Authors:
William M. Woodman,
Tom M. W. Nye
Abstract:
Billera-Holmes-Vogtmann (BHV) tree space is a geodesic metric space of edge-weighted phylogenetic trees with a fixed leaf set. Constructing parametric distributions on this space is challenging due to its non-Euclidean geometry and the intractability of normalizing constants. We address this by fitting Brownian motion transition kernels to tree-valued data via a non-Euclidean bridge construction. Each kernel is determined by a source tree $x_0$ (the Brownian motion's starting point) and a dispersion parameter $t_0$ (its duration). Observed trees are modelled as independent draws from the transition kernel defined by $(x_0, t_0)$, analogous to a Gaussian model in Euclidean space. Brownian motion is approximated by an $m$-step random walk, with the parameter space augmented to include full sample paths. We develop a bridge algorithm to sample paths conditional on their endpoints, and introduce methods for sampling a Bayesian posterior for $(x_0, t_0)$ and for marginal likelihood evaluation. This enables hypothesis testing for alternative source trees. The approach is validated on simulated data and applied to an experimental data set of yeast gene trees. These methods provide a foundation for future development of a wider class of probabilistic models of tree-valued data.
Submitted 27 June, 2025;
originally announced June 2025.
-
Statistics for Phylogenetic Trees in the Presence of Stickiness
Authors:
Lars Lammers,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
Samples of phylogenetic trees arise in a variety of evolutionary and biomedical applications, and the Fréchet mean in Billera-Holmes-Vogtmann tree space is a summary tree shown to have advantages over other mean or consensus trees. However, use of the Fréchet mean raises computational and statistical issues which we explore in this paper. The Fréchet sample mean is known often to contain fewer internal edges than the trees in the sample, and in this circumstance calculating the mean by iterative schemes can be problematic due to slow convergence. We present new methods for identifying edges which must lie in the Fréchet sample mean and apply these to a data set of gene trees relating organisms from the apicomplexa which cause a variety of parasitic infections. When a sample of trees contains a significant level of heterogeneity in the branching patterns, or topologies, displayed by the trees, then the Fréchet mean is often a star tree, lacking any internal edges. In this situation and others, the population Fréchet mean is affected by a non-Euclidean phenomenon called stickiness, which impacts upon asymptotics, and we examine two data sets for which the mean tree is a star tree. The first consists of trees representing the physical shape of artery structures in a sample of medical images of human brains in which the branching patterns are very diverse. The second consists of gene trees from a population of baboons in which there is evidence of substantial hybridization. We develop hypothesis tests which work in the presence of stickiness. The first is a test for the presence of a given edge in the Fréchet population mean; the second is a two-sample test for differences in two distributions which share the same sticky population mean.
Submitted 4 July, 2024;
originally announced July 2024.
-
Manifold-valued models for analysis of EEG time series data
Authors:
Tao Ding,
Tom M. W. Nye,
Yujiang Wang
Abstract:
We propose a model for time series taking values on a Riemannian manifold and fit it to time series of covariance matrices derived from EEG data for patients suffering from epilepsy. The aim of the study is two-fold: to develop a model with interpretable parameters for different possible modes of EEG dynamics, and to explore the extent to which modelling results are affected by the choice of manifold and its associated geometry. The model specifies a distribution for the tangent direction vector at any time point, combining an autoregressive term, a mean reverting term and a form of Gaussian noise. Parameter inference is carried out by maximum likelihood estimation, and we compare modelling results obtained using the standard Euclidean geometry on covariance matrices and the affine invariant geometry. Results distinguish between epileptic seizures and interictal periods between seizures in patients: between seizures the dynamics have a strong mean reverting component and the autoregressive component is missing, while for the majority of seizures there is a significant autoregressive component and the mean reverting effect is weak. The fitted models are also used to compare seizures within and between patients. The affine invariant geometry is advantageous and it provides a better fit to the data.
Submitted 12 February, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
Types of Stickiness in BHV Phylogenetic Tree Spaces and Their Degree
Authors:
Lars Lammers,
Do Tran Van,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
It has been observed that the sample mean of certain probability distributions in Billera-Holmes-Vogtmann (BHV) phylogenetic spaces is confined to a lower-dimensional subspace for large enough sample size. This non-standard behavior has been called stickiness and poses difficulties in statistical applications when comparing samples of sticky distributions. We extend previous results on stickiness to show the equivalence of this sampling behavior to topological conditions in the special case of BHV spaces. Furthermore, we propose to facilitate statistical comparison of sticky distributions by including the directional derivatives of the Fréchet function: the degree of stickiness.
Submitted 11 April, 2023;
originally announced April 2023.
-
Foundations of the Wald Space for Phylogenetic Trees
Authors:
Jonas Lueg,
Maryam K. Garba,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
Evolutionary relationships between species are represented by phylogenetic trees, but these relationships are subject to uncertainty due to the random nature of evolution. A geometry for the space of phylogenetic trees is necessary in order to properly quantify this uncertainty during the statistical analysis of collections of possible evolutionary trees inferred from biological data. Recently, the wald space has been introduced: a length space for trees which is a certain subset of the manifold of symmetric positive definite matrices. In this work, the wald space is introduced formally and its topology and structure are studied in detail. In particular, we show that wald space has the topology of a disjoint union of open cubes, it is contractible, and by careful characterization of cube boundaries, we demonstrate that wald space is a Whitney stratified space of type (A). Imposing the metric induced by the affine invariant metric on symmetric positive definite matrices, we prove that wald space is a geodesic Riemann stratified space. A new numerical method is proposed and investigated for construction of geodesics, computation of Fréchet means and calculation of curvature in wald space. This work is intended to serve as a mathematical foundation for further geometric and statistical research on this space.
Submitted 12 September, 2022;
originally announced September 2022.
-
A sparse Bayesian hierarchical vector autoregressive model for microbial dynamics in a wastewater treatment plant
Authors:
Naomi E. Hannaford,
Sarah E. Heaps,
Tom M. W. Nye,
Thomas P. Curtis,
Ben Allen,
Andrew Golightly,
Darren J. Wilkinson
Abstract:
Proper function of a wastewater treatment plant (WWTP) relies on maintaining a delicate balance between a multitude of competing microorganisms. Gaining a detailed understanding of the complex network of interactions therein is essential not only to maximising current operational efficiencies, but also to the effective design of new treatment technologies. Metagenomics offers an insight into these dynamic systems through the analysis of the microbial DNA sequences present. Unique taxa are inferred through sequence clustering to form operational taxonomic units (OTUs), with per-taxa abundance estimates obtained from corresponding sequence counts. The data in this study comprise weekly OTU counts from an activated sludge (AS) tank of a WWTP. To model the OTU dynamics, we develop a Bayesian hierarchical vector autoregressive model, which is a linear approximation to the commonly used generalised Lotka-Volterra (gLV) model. To tackle the high dimensionality and sparsity of the data, they are first clustered into 12 "bins" using a seasonal phase-based approach. The autoregressive coefficient matrix is assumed to be sparse, so we explore different shrinkage priors by analysing simulated data sets before selecting the regularised horseshoe prior for the biological application. We find that ammonia and chemical oxygen demand have a positive relationship with several bins and pH has a positive relationship with one bin. These results are supported by findings in the biological literature. We identify several negative interactions, which suggests OTUs in different bins may be competing for resources and that these relationships are complex. We also identify two positive interactions. Although simpler than a gLV model, our vector autoregression offers valuable insight into the microbial dynamics of the WWTP.
Submitted 1 July, 2021;
originally announced July 2021.
-
Incorporating compositional heterogeneity into Lie Markov models for phylogenetic inference
Authors:
Naomi E. Hannaford,
Sarah E. Heaps,
Tom M. W. Nye,
Tom A. Williams,
T. Martin Embley
Abstract:
Phylogenetics uses alignments of molecular sequence data to learn about evolutionary trees. Substitutions in sequences are modelled through a continuous-time Markov process, characterised by an instantaneous rate matrix, which standard models assume is time-reversible and stationary. These assumptions are biologically questionable and induce a likelihood function which is invariant to a tree's root position. This hampers inference because a tree's biological interpretation depends critically on where it is rooted. Relaxing both assumptions, we introduce a model whose likelihood can distinguish between rooted trees. The model is non-stationary, with step changes in the instantaneous rate matrix at each speciation event. Exploiting recent theoretical work, each rate matrix belongs to a non-reversible family of Lie Markov models. These models are closed under matrix multiplication, so our extension offers the conceptually appealing property that a tree and all its sub-trees could have arisen from the same family of non-stationary models.
We adopt a Bayesian approach, describe an MCMC algorithm for posterior inference and provide software. The biological insight that our model can provide is illustrated through an analysis in which non-reversible but stationary, and non-stationary but reversible models cannot identify a plausible root.
Submitted 17 July, 2020; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Information geometry for phylogenetic trees
Authors:
Maryam K. Garba,
Tom M. W. Nye,
Jonas Lueg,
Stephan F. Huckemann
Abstract:
We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera-Holmes-Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback-Leibler divergence, or equivalently, as we show, to any $f$-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.
Submitted 17 September, 2020; v1 submitted 29 March, 2020;
originally announced March 2020.
-
Generalising rate heterogeneity across sites in statistical phylogenetics
Authors:
Sarah E. Heaps,
Tom M. W. Nye,
Richard J. Boys,
Tom A. Williams,
Svetlana Cherlin,
T. Martin Embley
Abstract:
Phylogenetics uses alignments of molecular sequence data to learn about evolutionary trees relating species. Along branches, sequence evolution is modelled using a continuous-time Markov process characterised by an instantaneous rate matrix. Early models assumed the same rate matrix governed substitutions at all sites of the alignment, ignoring variation in evolutionary pressures. Substantial improvements in phylogenetic inference and model fit were achieved by augmenting these models with multiplicative random effects that describe the result of variation in selective constraints and allow sites to evolve at different rates which linearly scale a baseline rate matrix. Motivated by this pioneering work, we consider an extension using a quadratic, rather than linear, transformation. The resulting models allow for variation in the selective coefficients of different types of point mutation at a site in addition to variation in selective constraints.
We derive properties of the extended models. For certain non-stationary processes, the extension gives a model that allows variation in sequence composition both across sites and taxa. We adopt a Bayesian approach, describe an MCMC algorithm for posterior inference and provide software. Our quadratic models are applied to alignments spanning the tree of life and compared with site-homogeneous and linear models.
Submitted 2 May, 2019; v1 submitted 20 February, 2017;
originally announced February 2017.
-
Principal component analysis and the locus of the Frechet mean in the space of phylogenetic trees
Authors:
Tom M. W. Nye,
Xiaoxian Tang,
Grady Weyenberg,
Ruriko Yoshida
Abstract:
Most biological data are multidimensional, posing a major challenge to human comprehension and computational analysis. Principal component analysis is the most popular approach to rendering two- or three-dimensional representations of the major trends in such multidimensional data. The problem of multidimensionality is acute in the rapidly growing area of phylogenomics. Evolutionary relationships are represented by phylogenetic trees, and very typically a phylogenomic analysis results in a collection of such trees, one for each gene in the analysis. Principal component analysis offers a means of quantifying variation and summarizing a collection of phylogenies by dimensional reduction. However, the space of all possible phylogenies on a fixed set of species does not form a Euclidean vector space, so principal component analysis must be reformulated in the geometry of tree-space, which is a CAT(0) geodesic metric space. Previous work has focused on construction of the first principal component, or principal geodesic. Here we propose a geometric object which represents a $k$-th order principal component: the locus of the weighted Fréchet mean of $k+1$ points in tree-space, where the weights vary over the standard $k$-dimensional simplex. We establish basic properties of these objects, in particular that locally they generically have dimension $k$, and we propose an efficient algorithm for projection onto these surfaces. Combined with a stochastic optimization algorithm, this projection algorithm gives a procedure for constructing a principal component of arbitrary order in tree-space. Simulation studies confirm these algorithms perform well, and they are applied to data sets of Apicomplexa gene trees and the African coelacanth genome. The results enable visualizations of slices of tree-space, revealing structure within these complex data sets.
△ Less
Submitted 10 September, 2016;
originally announced September 2016.
-
Convergence of random walks to Brownian motion on cubical complexes
Authors:
Tom M. W. Nye
Abstract:
Cubical complexes are metric spaces constructed by gluing together unit cubes in an analogous way to the construction of simplicial complexes. We construct Brownian motion on such spaces, define random walks, and prove that the transition kernels of the random walks converge to that for Brownian motion. The proof involves pulling back onto the complex the distribution of Brownian sample paths on the standard cube, and combining this with a distribution on walks between cubes in the complex. The main application lies in analysing sets of evolutionary trees: several tree spaces are cubical complexes and we briefly describe our results and some applications in this context. Our results extend readily to a class of polyhedral complex in which every cell of maximal dimension is isometric to a given fixed polyhedron.
Submitted 22 May, 2019; v1 submitted 12 August, 2015;
originally announced August 2015.
-
The effect of non-reversibility on inferring rooted phylogenies
Authors:
S. Cherlin,
T. M. W. Nye,
S. E. Heaps,
R. J. Boys,
T. A. Williams,
T. M. Embley
Abstract:
Most phylogenetic models assume that the evolutionary process is stationary and reversible. As a result, the root of the tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet defining the root of a phylogenetic tree is a key component of phylogenetic inference because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree. In this paper we investigate the effect of relaxing the reversibility assumption and allowing the position of the root to be another unknown in the model. We propose two hierarchical models that are centred on a reversible model but perturbed to allow non-reversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods. We illustrate the performance of the two non-reversible models in analyses of simulated data sets using two types of topological priors. We then apply the models to a real biological data set, the radiation of polyploid yeasts, for which there is a robust biological opinion about the root position. Finally we apply the models to a second biological data set for which the rooted tree is controversial: the ribosomal tree of life. We compare the two non-reversible models and conclude that both are useful in inferring the position of the root from real biological data sets.
Submitted 20 February, 2017; v1 submitted 29 May, 2015;
originally announced May 2015.
-
An algorithm for constructing principal geodesics in phylogenetic treespace
Authors:
Tom M. W. Nye
Abstract:
Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic or line through treespace which is analogous to the first principal component in standard Principal Components Analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimises the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualising principal geodesics is freely available from www.ncl.ac.uk/~ntmwn/geophytter.
Submitted 2 September, 2014;
originally announced September 2014.
-
Principal components analysis in the space of phylogenetic trees
Authors:
Tom M. W. Nye
Abstract:
Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify the main sources of variation in the original collection of trees in terms of both topology and branch lengths. The approach is illustrated by application to simulated sets of trees and to a set of gene trees from metazoan (animal) species.
Submitted 23 February, 2012;
originally announced February 2012.
-
The Geometry of Calorons
Authors:
Tom M. W. Nye
Abstract:
Calorons (periodic instantons) are anti-self-dual (ASD) connections on $S^1 \times \mathbb{R}^3$ and form an intermediate case between instantons and monopoles. The ADHM and Nahm constructions of instantons and monopoles can be regarded as generalizations of a correspondence between ASD connections on the 4-torus, often referred to as the Nahm transform. This thesis describes how the Nahm transform can be extended to the case of calorons. It is shown how calorons can be constructed from Nahm data similar to that for monopoles, but defined over the circle. The inverse transformation, from the caloron to the Nahm data, is also described.
Submitted 22 November, 2003;
originally announced November 2003.
-
An L^2-Index Theorem for Dirac Operators on S^1 * R^3
Authors:
Tom M. W. Nye,
Michael A. Singer
Abstract:
An expression is found for the $L^2$-index of a Dirac operator coupled to a connection on a $U_n$ vector bundle over $S^1\times{\mathbb R}^3$. Boundary conditions for the connection are given which ensure the coupled Dirac operator is Fredholm. Callias' index theorem is used to calculate the index when the connection is independent of the coordinate on $S^1$. An excision theorem due to Gromov, Lawson, and Anghel reduces the index theorem to this special case. The index formula can be expressed using the adiabatic limit of the $η$-invariant of a Dirac operator canonically associated to the boundary. An application of the theorem is to count the zero modes of the Dirac operator in the background of a caloron (periodic instanton).
Submitted 14 September, 2000;
originally announced September 2000.