-
Constrained Shape Analysis with Applications to RNA Structure
Authors:
Kanti V. Mardia,
Benjamin Eltzner,
Stephan F. Huckemann
Abstract:
In many applications of shape analysis, lengths between some landmarks are constrained. For instance, biomolecules often have some bond lengths and some bond angles constrained, and variation occurs only along unconstrained bonds and constrained bonds' torsions where the latter are conveniently modelled by dihedral angles. Our work has been motivated by low resolution biomolecular chain RNA where only some prominent atomic bonds can be well identified. Here, we propose a new modelling strategy for such constrained shape analysis starting with a product of polar coordinates (polypolars), where, due to constraints, for example, some radial coordinates should be omitted, leaving products of spheres (polyspheres). We give insight into these coordinates for particular cases such as five landmarks which are motivated by a practical RNA application. We also discuss distributions for polypolar coordinates and give a specific methodology with illustration when the constrained size-and-shape variables are concentrated. There are applications of this in clustering and we give some insight into a modified version of the MINT-AGE algorithm.
Submitted 22 February, 2025;
originally announced February 2025.
-
Statistics for Phylogenetic Trees in the Presence of Stickiness
Authors:
Lars Lammers,
Tom M. W. Nye,
Stephan F. Huckemann
Abstract:
Samples of phylogenetic trees arise in a variety of evolutionary and biomedical applications, and the Fréchet mean in Billera-Holmes-Vogtmann tree space is a summary tree shown to have advantages over other mean or consensus trees. However, use of the Fréchet mean raises computational and statistical issues which we explore in this paper. The Fréchet sample mean is known often to contain fewer internal edges than the trees in the sample, and in this circumstance calculating the mean by iterative schemes can be problematic due to slow convergence. We present new methods for identifying edges which must lie in the Fréchet sample mean and apply these to a data set of gene trees relating organisms from the apicomplexa which cause a variety of parasitic infections. When a sample of trees contains a significant level of heterogeneity in the branching patterns, or topologies, displayed by the trees then the Fréchet mean is often a star tree, lacking any internal edges. Not only in this situation, the population Fréchet mean is affected by a non-Euclidean phenomenon called stickiness which impacts upon asymptotics, and we examine two data sets for which the mean tree is a star tree. The first consists of trees representing the physical shape of artery structures in a sample of medical images of human brains in which the branching patterns are very diverse. The second consists of gene trees from a population of baboons in which there is evidence of substantial hybridization. We develop hypothesis tests which work in the presence of stickiness. The first is a test for the presence of a given edge in the Fréchet population mean; the second is a two-sample test for differences in two distributions which share the same sticky population mean.
Submitted 4 July, 2024;
originally announced July 2024.
-
Clustering Schemes on the Torus with Application to RNA Clashes
Authors:
Henrik Wiechers,
Benjamin Eltzner,
Stephan F. Huckemann,
Kanti V. Mardia
Abstract:
Molecular structures of RNA molecules reconstructed from X-ray crystallography frequently contain errors. Motivated by this problem we examine clustering on a torus since RNA shapes can be described by dihedral angles. A previously developed clustering method for torus data involves two tuning parameters and we assess clustering results for different parameter values in relation to the problem of so-called RNA clashes. This clustering problem is part of the dynamically evolving field of statistics on manifolds. Statistical problems on the torus highlight general challenges for statistics on manifolds. Therefore, the torus PCA and clustering methods we propose make an important contribution to directional statistics and statistics on manifolds in general.
Submitted 28 February, 2021;
originally announced April 2021.
-
Information geometry for phylogenetic trees
Authors:
Maryam K. Garba,
Tom M. W. Nye,
Jonas Lueg,
Stephan F. Huckemann
Abstract:
We propose a new space of phylogenetic trees which we call wald space. The motivation is to develop a space suitable for statistical analysis of phylogenies, but with a geometry based on more biologically principled assumptions than existing spaces: in wald space, trees are close if they induce similar distributions on genetic sequence data. As a point set, wald space contains the previously developed Billera-Holmes-Vogtmann (BHV) tree space; it also contains disconnected forests, like the edge-product (EP) space but without certain singularities of the EP space. We investigate two related geometries on wald space. The first is the geometry of the Fisher information metric of character distributions induced by the two-state symmetric Markov substitution process on each tree. Infinitesimally, the metric is proportional to the Kullback-Leibler divergence, or equivalently, as we show, to any f-divergence. The second geometry is obtained analogously but using a related continuous-valued Gaussian process on each tree, and it can be viewed as the trace metric of the affine-invariant metric for covariance matrices. We derive a gradient descent algorithm to project from the ambient space of covariance matrices to wald space. For both geometries we derive computational methods to compute geodesics in polynomial time and show numerically that the two information geometries (discrete and continuous) are very similar. In particular geodesics are approximated extrinsically. Comparison with the BHV geometry shows that our canonical and biologically motivated space is substantially different.
Submitted 17 September, 2020; v1 submitted 29 March, 2020;
originally announced March 2020.