-
Statistics for Phylogenetic Trees in the Presence of Stickiness
Authors:
Lars Lammers,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
Samples of phylogenetic trees arise in a variety of evolutionary and biomedical applications, and the Fréchet mean in Billera-Holmes-Vogtmann tree space is a summary tree shown to have advantages over other mean or consensus trees. However, use of the Fréchet mean raises computational and statistical issues which we explore in this paper. The Fréchet sample mean is known often to contain fewer internal edges than the trees in the sample, and in this circumstance calculating the mean by iterative schemes can be problematic due to slow convergence. We present new methods for identifying edges which must lie in the Fréchet sample mean and apply these to a data set of gene trees relating organisms from the apicomplexa which cause a variety of parasitic infections. When a sample of trees contains a significant level of heterogeneity in the branching patterns, or topologies, displayed by the trees then the Fréchet mean is often a star tree, lacking any internal edges. Not only in this situation, the population Fréchet mean is affected by a non-Euclidean phenomenon called stickiness which impacts upon asymptotics, and we examine two data sets for which the mean tree is a star tree. The first consists of trees representing the physical shape of artery structures in a sample of medical images of human brains in which the branching patterns are very diverse. The second consists of gene trees from a population of baboons in which there is evidence of substantial hybridization. We develop hypothesis tests which work in the presence of stickiness. The first is a test for the presence of a given edge in the Fréchet population mean; the second is a two-sample test for differences in two distributions which share the same sticky population mean.
Submitted 4 July, 2024;
originally announced July 2024.
-
Types of Stickiness in BHV Phylogenetic Tree Spaces and Their Degree
Authors:
Lars Lammers,
Do Tran Van,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
It has been observed that the sample mean of certain probability distributions in Billera-Holmes-Vogtmann (BHV) phylogenetic spaces is confined to a lower-dimensional subspace for large enough sample size. This non-standard behavior has been called stickiness and poses difficulties in statistical applications when comparing samples of sticky distributions. We extend previous results on stickiness to show the equivalence of this sampling behavior to topological conditions in the special case of BHV spaces. Furthermore, we propose to alleviate statistical comparison of sticky distributions by including the directional derivatives of the Fréchet function: the degree of stickiness.
Submitted 11 April, 2023;
originally announced April 2023.
-
Foundations of the Wald Space for Phylogenetic Trees
Authors:
Jonas Lueg,
Maryam K. Garba,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
Evolutionary relationships between species are represented by phylogenetic trees, but these relationships are subject to uncertainty due to the random nature of evolution. A geometry for the space of phylogenetic trees is necessary in order to properly quantify this uncertainty during the statistical analysis of collections of possible evolutionary trees inferred from biological data. Recently, the wald space has been introduced: a length space for trees which is a certain subset of the manifold of symmetric positive definite matrices. In this work, the wald space is introduced formally and its topology and structure are studied in detail. In particular, we show that wald space has the topology of a disjoint union of open cubes, it is contractible, and by careful characterization of cube boundaries, we demonstrate that wald space is a Whitney stratified space of type (A). Imposing the metric induced by the affine invariant metric on symmetric positive definite matrices, we prove that wald space is a geodesic Riemann stratified space. A new numerical method is proposed and investigated for construction of geodesics, computation of Fréchet means and calculation of curvature in wald space. This work is intended to serve as a mathematical foundation for further geometric and statistical research on this space.
Submitted 12 September, 2022;
originally announced September 2022.
-
Information geometry for phylogenetic trees
Authors:
Maryam K. Garba,
Tom M. W. Nye,
Jonas Lueg,
Stephan F. Huckemann
Abstract:
We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera-Holmes-Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback-Leibler divergence, or equivalently, as we show, to any f-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.
Submitted 17 September, 2020; v1 submitted 29 March, 2020;
originally announced March 2020.
-
Convergence of random walks to Brownian motion on cubical complexes
Authors:
Tom M. W. Nye
Abstract:
Cubical complexes are metric spaces constructed by gluing together unit cubes in an analogous way to the construction of simplicial complexes. We construct Brownian motion on such spaces, define random walks, and prove that the transition kernels of the random walks converge to that for Brownian motion. The proof involves pulling back onto the complex the distribution of Brownian sample paths on the standard cube, and combining this with a distribution on walks between cubes in the complex. The main application lies in analysing sets of evolutionary trees: several tree spaces are cubical complexes and we briefly describe our results and some applications in this context. Our results extend readily to a class of polyhedral complex in which every cell of maximal dimension is isometric to a given fixed polyhedron.
Submitted 22 May, 2019; v1 submitted 12 August, 2015;
originally announced August 2015.
-
Principal components analysis in the space of phylogenetic trees
Authors:
Tom M. W. Nye
Abstract:
Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify the main sources of variation in the original collection of trees in terms of both topology and branch lengths. The approach is illustrated by application to simulated sets of trees and to a set of gene trees from metazoan (animal) species.
Submitted 23 February, 2012;
originally announced February 2012.
-
An $L^2$-Index Theorem for Dirac Operators on $S^1\times{\mathbb R}^3$
Authors:
Tom M. W. Nye,
Michael A. Singer
Abstract:
An expression is found for the $L^2$-index of a Dirac operator coupled to a connection on a $U_n$ vector bundle over $S^1\times{\mathbb R}^3$. Boundary conditions for the connection are given which ensure the coupled Dirac operator is Fredholm. Callias' index theorem is used to calculate the index when the connection is independent of the coordinate on $S^1$. An excision theorem due to Gromov, Lawson, and Anghel reduces the index theorem to this special case. The index formula can be expressed using the adiabatic limit of the $\eta$-invariant of a Dirac operator canonically associated to the boundary. An application of the theorem is to count the zero modes of the Dirac operator in the background of a caloron (periodic instanton).
Submitted 14 September, 2000;
originally announced September 2000.