-
Beyond level-1: Identifiability of a class of galled tree-child networks
Authors:
Elizabeth S. Allman,
Cécile Ané,
Hector Baños,
John A. Rhodes
Abstract:
Inference of phylogenetic networks is of increasing interest in the genomic era. However, the extent to which phylogenetic networks are identifiable from various types of data remains poorly understood, despite its crucial role in justifying methods. This work obtains strong identifiability results for large sub-classes of galled tree-child semidirected networks. Some of the conditions our proofs require, such as the identifiability of a network's tree of blobs or the circular order of 4 taxa around a cycle in a level-1 network, are already known to hold for many data types. We show that all these conditions hold for quartet concordance factor data under various gene tree models, yielding the strongest results from 2 or more samples per taxon. Although the network classes we consider have topological restrictions, they include non-planar networks of any level and are substantially more general than level-1 networks -- the only class previously known to enjoy identifiability from many data types. Our work establishes a route for proving future identifiability results for tree-child galled networks from data types other than quartet concordance factors, by checking that explicit conditions are met.
Submitted 29 April, 2025;
originally announced April 2025.
-
Identifying circular orders for blobs in phylogenetic networks
Authors:
John A. Rhodes,
Hector Baños,
Jingcheng Xu,
Cécile Ané
Abstract:
Interest in the inference of evolutionary networks relating species or populations has grown with the increasing recognition of the importance of hybridization, gene flow and admixture, and the availability of large-scale genomic data. However, what network features may be validly inferred from various data types under different models remains poorly understood. Previous work has largely focused on level-1 networks, in which reticulation events are well separated, and on a general network's tree of blobs, the tree obtained by contracting every blob to a node. An open question is the identifiability of the topology of a blob of unknown level. We consider the identifiability of the circular order in which subnetworks attach to a blob, first proving that this order is well-defined for outer-labeled planar blobs. For this class of blobs, we show that the circular order information from 4-taxon subnetworks identifies the full circular order of the blob. Similarly, the circular order from 3-taxon rooted subnetworks identifies the full circular order of a rooted blob. We then show that subnetwork circular information is identifiable from certain data types and evolutionary models. This provides a general positive result for high-level networks, on the identifiability of the ordering in which taxon blocks attach to blobs in outer-labeled planar networks. Finally, we give examples of blobs with different internal structures which cannot be distinguished under many models and data types.
Submitted 20 July, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
Identifiability of Level-1 Species Networks from Gene Tree Quartets
Authors:
Elizabeth S. Allman,
Hector Baños,
Marina Garrote-Lopez,
John A. Rhodes
Abstract:
When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors -- the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.
Submitted 11 January, 2024;
originally announced January 2024.
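The quartet concordance factors in the abstract above are model probabilities. From a finite sample of gene trees they are commonly estimated by relative frequencies. The following is a minimal sketch of that estimate only, not the paper's analysis. The function name `empirical_concordance_factors` and the string encoding of the three quartet topologies are assumptions made for illustration.

```python
from collections import Counter

# The three possible unrooted topologies on taxa a, b, c, d (hypothetical encoding).
QUARTET_TOPOLOGIES = ("ab|cd", "ac|bd", "ad|bc")


def empirical_concordance_factors(gene_quartets):
    """Return the relative frequency of each quartet topology in a sample.

    gene_quartets: iterable of strings, one per gene tree, each naming the
    quartet topology that the gene tree displays for taxa a, b, c, d.
    """
    counts = Counter(gene_quartets)
    unknown = set(counts) - set(QUARTET_TOPOLOGIES)
    if unknown:
        raise ValueError(f"unrecognized quartet topologies: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no gene trees supplied")
    return {topology: counts[topology] / total for topology in QUARTET_TOPOLOGIES}


# Usage: ten gene trees, six of which display ab|cd.
sample = ["ab|cd"] * 6 + ["ac|bd"] * 3 + ["ad|bc"] * 1
cfs = empirical_concordance_factors(sample)
```

The three estimates sum to 1, matching the fact that the concordance factors for a fixed set of four taxa form a probability distribution over the three topologies.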
-
Phylogenomic Models from Tree Symmetries
Authors:
Elizabeth S. Allman,
Colby Long,
John A. Rhodes
Abstract:
A model of genomic sequence evolution on a species tree should include not only a sequence substitution process, but also a coalescent process, since different sites may evolve on different gene trees due to incomplete lineage sorting. Chifman and Kubatko initiated the study of such models, leading to the development of the SVDquartets methods of species tree inference. A key observation was that symmetries in an ultrametric species tree led to symmetries in the joint distribution of bases at the taxa. In this work, we explore the implications of such symmetry more fully, defining new models incorporating only the symmetries of this distribution, regardless of the mechanism that might have produced them. The models are thus supermodels of many standard ones with mechanistic parameterizations. We study phylogenetic invariants for the models, and establish identifiability of species tree topologies using them.
Submitted 13 March, 2023;
originally announced March 2023.
-
A generalized AIC for models with singularities and boundaries
Authors:
Jonathan D. Mitchell,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The Akaike information criterion (AIC) is a common tool for model selection. It is frequently used in violation of regularity conditions at parameter space singularities and boundaries. The expected AIC is generally not asymptotically equivalent to its target at singularities and boundaries, and convergence to the target at nearby parameter points may be slow. We develop a generalized AIC for candidate models with or without singularities and boundaries. We show that the expectation of this generalized form converges everywhere in the parameter space, and its convergence can be faster than that of the AIC. We illustrate the generalized AIC on example models from phylogenomics, showing that it can outperform the AIC and gives rise to an interpolated effective number of model parameters, which can differ substantially from the number of parameters near singularities and boundaries. We outline methods for estimating the often unknown generating parameter and bias correction term of the generalized AIC.
Submitted 8 November, 2022;
originally announced November 2022.
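For reference, the standard AIC that the abstract above takes as its starting point is 2k - 2 ln L, where k is the number of free parameters and L is the maximized likelihood. The sketch below computes only this standard criterion; the generalized AIC developed in the paper is not reproduced here. The function names and the example values are hypothetical.

```python
def aic(max_log_likelihood, num_parameters):
    """Standard Akaike information criterion: 2k - 2 ln L."""
    return 2 * num_parameters - 2 * max_log_likelihood


def select_by_aic(candidates):
    """Pick the candidate with the smallest AIC.

    candidates: dict mapping a model name to (max_log_likelihood, num_parameters).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Usage with made-up values: the larger model fits slightly better
# but pays a penalty of 2 for each extra parameter.
models = {"small": (-100.0, 3), "large": (-99.5, 5)}
best = select_by_aic(models)
```

The penalty term 2k counts every parameter fully, which is exactly the quantity the abstract says can be misleading near singularities and boundaries.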
-
Parameter Identifiability of a Multitype Pure-Birth Model of Speciation
Authors:
Dakota Dragomir,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
Diversification models describe the random growth of evolutionary trees, modeling the historical relationships of species through speciation and extinction events. One class of such models allows for independently changing traits, or types, of the species within the tree, upon which speciation and extinction rates depend. Although identifiability of parameters is necessary to justify parameter estimation with a model, it has not been formally established for these models, despite their adoption for inference. This work establishes generic identifiability up to label swapping for the parameters of one of the simpler forms of such a model, a multitype pure birth model of speciation, from an asymptotic distribution derived from a single tree observation as its depth goes to infinity. Crucially for applications to available data, no observation of types is needed at any internal points in the tree, nor even at the leaves.
Submitted 17 June, 2022;
originally announced June 2022.
-
The Tree of Blobs of a Species Network: Identifiability under the Coalescent
Authors:
Elizabeth S. Allman,
Hector Baños,
Jonathan D. Mitchell,
John A. Rhodes
Abstract:
Inference of species networks from genomic data under the Network Multispecies Coalescent Model is currently severely limited by heavy computational demands. It also remains unclear how complicated networks can be for consistent inference to be possible. As a step toward inferring a general species network, this work considers its tree of blobs, in which non-cut edges are contracted to nodes, so only tree-like relationships between the taxa are shown. An identifiability theorem, that most features of the unrooted tree of blobs can be determined from the distribution of gene quartet topologies, is established. This depends upon an analysis of gene quartet concordance factors under the model, together with a new combinatorial inference rule. The arguments for this theoretical result suggest a practical algorithm for tree of blobs inference, to be fully developed in a subsequent work.
Submitted 6 May, 2022;
originally announced May 2022.
-
Identifiability of species network topologies from genomic sequences using the logDet distance
Authors:
Elizabeth S. Allman,
Hector Baños,
John A. Rhodes
Abstract:
Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees, but does not depend on partitioning sequences by genes. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.
Submitted 3 August, 2021;
originally announced August 2021.
-
Parameter identifiability for a profile mixture model of protein evolution
Authors:
Samaneh Yourdkhani,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
A Profile Mixture Model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend in part on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here we show that a Profile Mixture Model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.
Submitted 4 July, 2020;
originally announced July 2020.
-
Inferring metric trees from weighted quartets via an intertaxon distance
Authors:
Samaneh Yourdkhani,
John A. Rhodes
Abstract:
A metric phylogenetic tree relating a collection of taxa induces weighted rooted triples and weighted quartets for all subsets of three and four taxa, respectively. New intertaxon distances are defined that can be calculated from these weights, and shown to exactly fit the same tree topology, but with edge weights rescaled by certain factors dependent on the associated split size. These distances are analogs for metric trees of similar ones recently introduced for topological trees that are based on induced unweighted rooted triples and quartets. The distances introduced here lead to new statistically consistent methods of inferring a metric species tree from a collection of topological gene trees generated under the multispecies coalescent model of incomplete lineage sorting. Simulations provide insight into their potential.
Submitted 11 February, 2020;
originally announced February 2020.
-
NJst and ASTRID are not statistically consistent under a random model of missing data
Authors:
John A. Rhodes,
Michael G. Nute,
Tandy Warnow
Abstract:
Species tree estimation from multi-locus datasets is statistically challenging for multiple reasons, including gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Species tree estimation methods have been developed that operate by estimating gene trees and then using those gene trees to estimate the species tree. Several of these methods (e.g., ASTRAL, ASTRID, and NJst) are provably statistically consistent under the multi-species coalescent (MSC) model, provided that the gene trees are estimated correctly, and there is no missing data. Recently, Nute et al. (BMC Genomics 2018) addressed the question of whether these methods remain statistically consistent under random models of taxon deletion, and asserted that they do so. Here we provide a counterexample to one of these theorems, and establish that ASTRID and NJst are not statistically consistent under an i.i.d. model of taxon deletion.
Submitted 21 January, 2020;
originally announced January 2020.
-
Testing Multispecies Coalescent Simulators using Summary Statistics
Authors:
Elizabeth S. Allman,
Hector Baños,
John A. Rhodes
Abstract:
As genomic scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process are essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure they give valid samples from the coalescent process. In this work we develop several statistical tools using summary statistics to evaluate the fit of a simulated gene tree sample to the MSC model. Using these tests on samples from four published simulators, we uncover flaws in several. The tests are implemented as an R package, so that both developers and users will be able to easily check proper performance of future simulators.
Submitted 4 August, 2019;
originally announced August 2019.
-
Hypothesis testing near singularities and boundaries
Authors:
Jonathan D. Mitchell,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The likelihood ratio statistic, with its asymptotic $\chi^2$ distribution at regular model points, is often used for hypothesis testing. At model singularities and boundaries, however, the asymptotic distribution may not be $\chi^2$, as highlighted by recent work of Drton. Indeed, poor behavior of a $\chi^2$ for testing near singularities and boundaries is apparent in simulations, and can lead to conservative or anti-conservative tests. Here we develop a new distribution designed for use in hypothesis testing near singularities and boundaries, which asymptotically agrees with that of the likelihood ratio statistic. For two example trinomial models, arising in the context of inference of evolutionary trees, we show the new distributions outperform a $\chi^2$.
Submitted 21 June, 2018;
originally announced June 2018.
-
Species tree inference from genomic sequences using the log-det distance
Authors:
Elizabeth S. Allman,
Colby Long,
John A. Rhodes
Abstract:
The log-det distance between two aligned DNA sequences was introduced as a tool for statistically consistent inference of a gene tree under simple non-mixture models of sequence evolution. Here we prove that the log-det distance, coupled with a distance-based tree construction method, also permits consistent inference of species trees under mixture models appropriate to aligned genomic-scale sequence data. Data may include sites from many genetic loci, which evolved on different gene trees due to incomplete lineage sorting on an ultrametric species tree, with different time-reversible substitution processes. The simplicity and speed of distance-based inference suggest that log-det based methods should serve as benchmarks for judging more elaborate and computationally intensive species tree inference methods.
Submitted 13 June, 2018;
originally announced June 2018.
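As an illustration of the log-det distance named in the abstract above, the sketch below computes it for two aligned DNA sequences from the 4x4 matrix F of relative site-pattern frequencies. It is a minimal sketch under stated assumptions, not the paper's implementation: the scaling by 1/4 and the normalizing term built from the row and column sums of F follow one common convention, sites with characters other than A, C, G, T are skipped, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def determinant(m):
    """Determinant by cofactor expansion along the first row (adequate for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * determinant([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )


def logdet_distance(seq_x, seq_y):
    """Log-det distance, here -(1/4) * (ln|det F| - (1/2) ln(g_x * g_y)).

    g_x and g_y are the products of the row sums and column sums of F,
    i.e. of the base frequencies in each sequence (assumed convention).
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    index = {base: i for i, base in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq_x, seq_y):
        if a in index and b in index:  # skip gaps and ambiguity codes
            counts[index[a]][index[b]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("no comparable sites")
    freq = [[c / total for c in row] for row in counts]
    row_prod = math.prod(sum(row) for row in freq)
    col_prod = math.prod(sum(col) for col in zip(*freq))
    det_f = abs(determinant(freq))
    if det_f == 0 or row_prod == 0 or col_prod == 0:
        raise ValueError("log-det distance is undefined for these sequences")
    return -0.25 * (math.log(det_f) - 0.5 * math.log(row_prod * col_prod))


# Usage: two short aligned sequences differing at the last site.
d = logdet_distance("ACGTACGTAC", "ACGTACGTAA")
```

With this normalization the distance from a sequence to itself is zero, and the distance is symmetric because a matrix and its transpose have the same determinant.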
-
Split probabilities and species tree inference under the multispecies coalescent model
Authors:
Elizabeth S. Allman,
James H. Degnan,
John A. Rhodes
Abstract:
Using topological summaries of gene trees as a basis for species tree inference is a promising approach to obtain acceptable speed on genomic-scale datasets, and to avoid some undesirable modeling assumptions. Here we study the probabilities of splits on gene trees under the multispecies coalescent model, and how their features might inform species tree inference. After investigating the behavior of split consensus methods, we investigate split invariants --- that is, polynomial relationships between split probabilities. These invariants are then used to show that, even though a split is an unrooted notion, split probabilities retain enough information to identify the rooted species tree topology for trees of more than 5 taxa, with one possible 6-taxon exception.
Submitted 13 April, 2017;
originally announced April 2017.
-
Topological metrizations of trees, and new quartet methods of tree inference
Authors:
John A. Rhodes
Abstract:
Topological phylogenetic trees can be assigned edge weights in several natural ways, highlighting different aspects of the tree. Here the rooted triple and quartet metrizations are introduced, and applied to formulate novel fast methods of inferring large trees from rooted triple and quartet data. These methods can be applied in new statistically consistent procedures for inference of a species tree from gene trees under the multispecies coalescent model.
Submitted 13 May, 2019; v1 submitted 6 April, 2017;
originally announced April 2017.
-
Split scores: a tool to quantify phylogenetic signal in genome-scale data
Authors:
Elizabeth S. Allman,
Laura S. Kubatko,
John A. Rhodes
Abstract:
Detecting variation in the evolutionary process along chromosomes is increasingly important as whole-genome data becomes more widely available. For example, factors such as incomplete lineage sorting, horizontal gene transfer, and chromosomal inversion are expected to result in changes in the underlying gene trees along a chromosome, while changes in selective pressure and mutational rates for different genomic regions may lead to shifts in the underlying mutational process. We propose the split score as a general method for quantifying support for a particular phylogenetic relationship within a genomic data set. Because the split score is based on algebraic properties of a matrix of site pattern frequencies, it can be rapidly computed, even for data sets that are large in the number of taxa and/or in the length of the alignment, providing an advantage over other methods (e.g., maximum likelihood) that are often used to assess such support. Using simulation we explore the properties of the split score, including its dependence on sequence length, branch length, size of a split and its ability to detect true splits in the underlying tree. Using a sliding window analysis, we show that split scores can be used to detect changes in the underlying evolutionary process for genome-scale data from primates, mosquitoes, and viruses in a computationally efficient manner. Computation of the split score has been implemented in the software package SplitSup.
Submitted 30 December, 2016; v1 submitted 2 August, 2016;
originally announced August 2016.
-
Phylogenetic trees and Euclidean embeddings
Authors:
Mark Layer,
John A. Rhodes
Abstract:
It was recently observed by de Vienne et al. that a simple square root transformation of distances between taxa on a phylogenetic tree allowed for an embedding of the taxa into Euclidean space. While the justification for this was based on a diffusion model of continuous character evolution along the tree, here we give a direct and elementary explanation for it that provides substantial additional insight. We use this embedding to reinterpret the differences between the NJ and BIONJ tree building algorithms, providing one illustration of how this embedding reflects tree structures in data.
Submitted 3 May, 2016;
originally announced May 2016.
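One elementary way to see why square roots of tree distances embed in Euclidean space is to give every edge its own coordinate axis and place each taxon at sqrt(edge length) along the axes of the edges on its path from a fixed root. Two taxa then differ exactly on the edges of the path between them, so the squared Euclidean distance equals the tree distance. The sketch below checks this numerically. It is an illustrative construction and may differ from the argument in the paper; the tree, its edge names, and the function names are hypothetical.

```python
import math
from itertools import combinations


def embed_leaves(root_paths):
    """Map each leaf to a vector with sqrt(length) on every edge of its root path.

    root_paths: dict mapping a leaf name to a list of (edge_id, length)
    pairs for the edges between a fixed root and that leaf.
    """
    edges = sorted({edge for path in root_paths.values() for edge, _ in path})
    axis = {edge: i for i, edge in enumerate(edges)}
    coords = {}
    for leaf, path in root_paths.items():
        vector = [0.0] * len(edges)
        for edge, length in path:
            vector[axis[edge]] = math.sqrt(length)
        coords[leaf] = vector
    return coords


def tree_distance(path_x, path_y):
    """Sum of lengths of edges on exactly one of the two root paths."""
    ex, ey = dict(path_x), dict(path_y)
    return sum(w for e, w in ex.items() if e not in ey) + sum(
        w for e, w in ey.items() if e not in ex
    )


# Usage: the quartet tree ((a,b),(c,d)), rooted at the node joining a and b.
paths = {
    "a": [("ea", 1.0)],
    "b": [("eb", 2.0)],
    "c": [("mid", 0.5), ("ec", 1.5)],
    "d": [("mid", 0.5), ("ed", 3.0)],
}
coords = embed_leaves(paths)
max_error = max(
    abs(math.dist(coords[x], coords[y]) - math.sqrt(tree_distance(paths[x], paths[y])))
    for x, y in combinations(paths, 2)
)
```

The value `max_error` is the largest gap between a Euclidean distance in the embedding and the square root of the corresponding tree distance; it should be zero up to floating-point rounding.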
-
Species tree inference from gene splits by Unrooted STAR methods
Authors:
Elizabeth S. Allman,
James H. Degnan,
John A. Rhodes
Abstract:
The $\text{NJ}_{st}$ method was proposed by Liu and Yu to infer a species tree topology from unrooted topological gene trees. While its statistical consistency under the multispecies coalescent model was established only for a 4-taxon tree, simulations demonstrated its good performance on gene trees inferred from sequences for many taxa. Here we prove the statistical consistency of the method for an arbitrarily large species tree. Our approach connects $\text{NJ}_{st}$ to a generalization of the STAR method of Liu, Pearl and Edwards, and a previous theoretical analysis of it. We further show $\text{NJ}_{st}$ utilizes only the distribution of splits in the gene trees, and not their individual topologies. Finally, we discuss how multiple samples per taxon per gene should be handled for statistical consistency.
Submitted 18 April, 2016;
originally announced April 2016.
-
Statistically-Consistent k-mer Methods for Phylogenetic Tree Reconstruction
Authors:
Elizabeth S. Allman,
John A. Rhodes,
Seth Sullivant
Abstract:
Frequencies of $k$-mers in sequences are sometimes used as a basis for inferring phylogenetic trees without first obtaining a multiple sequence alignment. We show that a standard approach of using the squared-Euclidean distance between $k$-mer vectors to approximate a tree metric can be statistically inconsistent. To remedy this, we derive model-based distance corrections for orthologous sequences without gaps, which lead to consistent tree inference. The identifiability of model parameters from $k$-mer frequencies is also studied. Finally, we report simulations showing the corrected distance out-performs many other $k$-mer methods, even when sequences are generated with an insertion and deletion process. These results have implications for multiple sequence alignment as well, since $k$-mer methods are usually the first step in constructing a guide tree for such algorithms.
Submitted 14 January, 2016; v1 submitted 5 November, 2015;
originally announced November 2015.
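The uncorrected quantity discussed in the abstract above, the squared Euclidean distance between $k$-mer frequency vectors, can be sketched as follows. This shows only the baseline distance that the abstract says can be statistically inconsistent; the model-based correction derived in the paper is not reproduced. The function names are hypothetical, and $k$-mers containing characters outside the assumed alphabet `ACGT` are ignored.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k, alphabet="ACGT"):
    """Relative frequency of every k-mer over the alphabet, in a fixed order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        raise ValueError("sequence contains no k-mers over the alphabet")
    return {kmer: counts[kmer] / total for kmer in kmers}


def squared_euclidean_kmer_distance(seq_x, seq_y, k):
    """Uncorrected squared Euclidean distance between k-mer frequency vectors."""
    fx = kmer_frequencies(seq_x, k)
    fy = kmer_frequencies(seq_y, k)
    return sum((fx[kmer] - fy[kmer]) ** 2 for kmer in fx)


# Usage: unaligned sequences of different lengths can be compared directly.
d = squared_euclidean_kmer_distance("ACGTACGTTGCA", "ACGTTTGCAAC", 2)
```

Because only frequency vectors are compared, no alignment is needed, which is the appeal of such methods noted in the abstract.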
-
There are no caterpillars in a wicked forest
Authors:
James H. Degnan,
John A. Rhodes
Abstract:
Species trees represent the historical divergences of populations or species, while gene trees trace the ancestry of individual gene copies sampled within those populations. In cases involving rapid speciation, gene trees with topologies that differ from that of the species tree can be most probable under the standard multispecies coalescent model, making species tree inference more difficult. Such anomalous gene trees are not well understood except for some small cases. In this work, we establish one constraint that applies to trees of any size: gene trees with "caterpillar" topologies cannot be anomalous. The proof of this involves a new combinatorial object, called a population history, which keeps track of the number of coalescent events in each ancestral population.
Submitted 26 August, 2015;
originally announced August 2015.
-
A semialgebraic description of the general Markov model on phylogenetic trees
Authors:
Elizabeth S. Allman,
John A. Rhodes,
Amelia Taylor
Abstract:
Many of the stochastic models used in inference of phylogenetic trees from biological sequence data have polynomial parameterization maps. The image of such a map --- the collection of joint distributions for a model --- forms the model space. Since the parameterization is polynomial, the Zariski closure of the model space is an algebraic variety which is typically much larger than the model space, but has been usefully studied with algebraic methods. Of ultimate interest, however, is not the full variety, but only the model space. Here we develop complete semialgebraic descriptions of the model space arising from the k-state general Markov model on a tree, with slightly restricted parameters. Our approach depends upon both recently-formulated analogs of Cayley's hyperdeterminant, and the construction of certain quadratic forms from the joint distribution whose positive (semi-)definiteness encodes information about parameter values. We additionally investigate the use of Sturm sequences for obtaining similar results.
Submitted 5 December, 2012;
originally announced December 2012.
-
Tensor Rank, Invariants, Inequalities, and Applications
Authors:
Elizabeth S. Allman,
Peter D. Jarvis,
John A. Rhodes,
Jeremy G. Sumner
Abstract:
Though algebraic geometry over $\mathbb C$ is often used to describe the closure of the tensors of a given size and complex rank, this variety includes tensors of both smaller and larger rank. Here we focus on the $n\times n\times n$ tensors of rank $n$ over $\mathbb C$, which has as a dense subset the orbit of a single tensor under a natural group action. We construct polynomial invariants under this group action whose non-vanishing distinguishes this orbit from points only in its closure. Together with an explicit subset of the defining polynomials of the variety, this gives a semialgebraic description of the tensors of rank $n$ and multilinear rank $(n,n,n)$. The polynomials we construct coincide with Cayley's hyperdeterminant in the case $n=2$, and thus generalize it. Though our construction is direct and explicit, we also recast our functions in the language of representation theory for additional insights.
We give three applications in different directions: First, we develop basic topological understanding of how the real tensors of complex rank $n$ and multilinear rank $(n,n,n)$ form a collection of path-connected subsets, one of which contains tensors of real rank $n$. Second, we use the invariants to develop a semialgebraic description of the set of probability distributions that can arise from a simple stochastic model with a hidden variable, a model that is important in phylogenetics and other fields. Third, we construct simple examples of tensors of rank $2n-1$ which lie in the closure of those of rank $n$.
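For reference, the $n=2$ case mentioned above is Cayley's hyperdeterminant of a $2\times 2\times 2$ tensor $A=(a_{ijk})$. The standard formula is written out below as a worked equation; the entry notation $a_{ijk}$ with $i,j,k\in\{0,1\}$ is an assumption of this note, not taken from the abstract.

```latex
\begin{aligned}
\operatorname{Det}(A) ={}& a_{000}^2 a_{111}^2 + a_{001}^2 a_{110}^2
                          + a_{010}^2 a_{101}^2 + a_{100}^2 a_{011}^2 \\
 & - 2\bigl( a_{000}a_{001}a_{110}a_{111} + a_{000}a_{010}a_{101}a_{111}
           + a_{000}a_{100}a_{011}a_{111} \\
 & \qquad + a_{001}a_{010}a_{101}a_{110} + a_{001}a_{100}a_{011}a_{110}
           + a_{010}a_{100}a_{011}a_{101} \bigr) \\
 & + 4\bigl( a_{000}a_{011}a_{101}a_{110} + a_{001}a_{010}a_{100}a_{111} \bigr)
\end{aligned}
```

As a quick check, for a rank-one tensor $a_{ijk}=x_i y_j z_k$ every monomial equals $(x_0x_1y_0y_1z_0z_1)^2$, so the sum is $4-12+8=0$ times that quantity and the hyperdeterminant vanishes.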
Submitted 14 November, 2012;
originally announced November 2012.
-
Species tree inference by the STAR method, and generalizations
Authors:
Elizabeth S. Allman,
James H. Degnan,
John A. Rhodes
Abstract:
The multispecies coalescent model describes the generation of gene trees from a rooted metric species tree, and thus provides a framework for the inference of species trees from sampled gene trees. We prove that the STAR method of Liu et al., and generalizations of it, are statistically consistent methods of topological species tree inference under this model. We discuss the impact of gene tree sampling schemes for species tree inference using generalized STAR methods, and reinterpret the original STAR as a consensus method based on clades.
Submitted 19 April, 2012;
originally announced April 2012.
-
When Do Phylogenetic Mixture Models Mimic Other Phylogenetic Models?
Authors:
Elizabeth S. Allman,
John A. Rhodes,
Seth Sullivant
Abstract:
Phylogenetic mixture models, in which the sites in sequences undergo different substitution processes along the same or different trees, allow the description of heterogeneous evolutionary processes. As data sets consisting of longer sequences become available, it is important to understand such models, for both theoretical insights and use in statistical analyses. Some recent articles have highlighted disturbing "mimicking" behavior in which a distribution from a mixture model is identical to one arising on a different tree or trees. Other works have indicated such problems are unlikely to occur in practice, as they require very special parameter choices.
After surveying some of these works on mixture models, we give several new results. In general, if the number of components in a generating mixture is not too large and we disallow zero or infinite branch lengths, then it cannot mimic the behavior of a non-mixture on a different tree. On the other hand, if the mixture model is locally over-parameterized, it is possible for a phylogenetic mixture model to mimic distributions of another tree model. Though theoretical questions remain, these sorts of results can serve as a guide to when the use of mixture models in either ML or Bayesian frameworks is likely to lead to statistically consistent inference, and when mimicking due to heterogeneity should be considered a realistic possibility.
Submitted 16 July, 2012; v1 submitted 10 February, 2012;
originally announced February 2012.
-
Determining species tree topologies from clade probabilities under the coalescent
Authors:
Elizabeth S. Allman,
James H. Degnan,
John A. Rhodes
Abstract:
One approach to estimating a species tree from a collection of gene trees is to first estimate probabilities of clades from the gene trees, and then to construct the species tree from the estimated clade probabilities. While a greedy consensus algorithm, which consecutively accepts the most probable clades compatible with previously accepted clades, can be used for this second stage, this method is known to be statistically inconsistent under the multispecies coalescent model. This raises the question of whether it is theoretically possible to reconstruct the species tree from known probabilities of clades on gene trees. We investigate clade probabilities arising from the multispecies coalescent model, with an eye toward identifying features of the species tree. Clades on gene trees with probability greater than 1/3 are shown to reflect clades on the species tree, while those with smaller probabilities may not. Linear invariants of clade probabilities are studied both computationally and theoretically, with certain linear invariants giving insight into the clade structure of the species tree. For species trees with generic edge lengths, these invariants can be used to identify the species tree topology. These theoretical results both confirm that clade probabilities contain full information on the species tree topology and suggest future directions of study for developing statistically consistent inference methods from clade frequencies on gene trees.
Submitted 2 March, 2011;
originally announced March 2011.
-
Identifiability of Large Phylogenetic Mixture Models
Authors:
John A. Rhodes,
Seth Sullivant
Abstract:
Phylogenetic mixture models are statistical models of character evolution allowing for heterogeneity. Each of the classes in some unknown partition of the characters may evolve by different processes, or even along different trees. The fundamental question of whether parameters of such a model are identifiable is difficult to address, due to the complexity of the parameterization. We analyze mixture models on large trees, with many mixture components, showing that both numerical and tree parameters are indeed identifiable in these models when all trees are the same. We also explore the extent to which our algebraic techniques can be employed to extend the result to mixtures on different trees.
Submitted 17 November, 2010;
originally announced November 2010.
-
Identifying the Rooted Species Tree from the Distribution of Unrooted Gene Trees under the Coalescent
Authors:
Elizabeth S. Allman,
James H. Degnan,
John A. Rhodes
Abstract:
Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals -- each with many genes -- splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted.
We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are 4 species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendent branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.
Submitted 29 July, 2010; v1 submitted 22 December, 2009;
originally announced December 2009.
-
Identifiability of 2-tree mixtures for group-based models
Authors:
Elizabeth S. Allman,
Sonja Petrović,
John A. Rhodes,
Seth Sullivant
Abstract:
Phylogenetic data arising on two possibly different tree topologies might be mixed through several biological mechanisms, including incomplete lineage sorting or horizontal gene transfer in the case of different topologies, or simply different substitution processes on characters in the case of the same topology. Recent work on a 2-state symmetric model of character change showed such a mixture model has non-identifiable parameters, and thus it is theoretically impossible to determine the two tree topologies from any amount of data under such circumstances. Here the question of identifiability is investigated for 2-tree mixtures of the 4-state group-based models, which are more relevant to DNA sequence data. Using algebraic techniques, we show that the tree parameters are identifiable for the JC and K2P models. We also prove that generic substitution parameters for the JC mixture models are identifiable, and for the K2P and K3P models obtain generic identifiability results for mixtures on the same tree. This indicates that the full phylogenetic signal remains in such mixtures, and that the 2-state symmetric result is thus a misleading guide to the behavior of other models.
△ Less
Submitted 18 December, 2009; v1 submitted 9 September, 2009;
originally announced September 2009.
-
Estimating Trees from Filtered Data: Identifiability of Models for Morphological Phylogenetics
Authors:
Elizabeth S. Allman,
Mark T. Holder,
John A. Rhodes
Abstract:
As an alternative to parsimony analyses, stochastic models have been proposed (Lewis, 2001), (Nylander, et al., 2004) for morphological characters, so that maximum likelihood or Bayesian analyses may be used for phylogenetic inference. A key feature of these models is that they account for ascertainment bias, in that only varying, or parsimony-informative characters are observed. However, statistical consistency of such model-based inference requires that the model parameters be identifiable from the joint distribution they entail, and this issue has not been addressed.
Here we prove that parameters for several such models, with finite state spaces of arbitrary size, are identifiable, provided the tree has at least 8 leaves. If the tree topology is already known, then 7 leaves suffice for identifiability of the numerical parameters. The method of proof involves first inferring a full distribution of both parsimony-informative and non-informative pattern joint probabilities from the parsimony-informative ones, using phylogenetic invariants. The failure of identifiability of the tree parameter for 4-taxon trees is also investigated.
Submitted 20 December, 2009; v1 submitted 27 May, 2009;
originally announced May 2009.
-
The Identifiability of Covarion Models in Phylogenetics
Authors:
Elizabeth S. Allman,
John A. Rhodes
Abstract:
Covarion models of character evolution describe inhomogeneities in substitution processes through time. In phylogenetics, such models are used to describe changing functional constraints or selection regimes during the evolution of biological sequences. In this work the identifiability of such models for generic parameters on a known phylogenetic tree is established, provided the number of covarion classes does not exceed the size of the observable state space. `Generic parameters' as used here means all parameters except possibly those in a set of measure zero within the parameter space. Combined with earlier results, this implies both the tree and generic numerical parameters are identifiable if the number of classes is strictly smaller than the number of observable states.
Submitted 26 May, 2008; v1 submitted 18 January, 2008;
originally announced January 2008.
-
Identifiability of a Markovian model of molecular evolution with Gamma-distributed rates
Authors:
Elizabeth S. Allman,
Cecile Ane,
John A. Rhodes
Abstract:
Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although non-identifiability was proven for a semi-parametric model and an incorrect proof of identifiability was published for a general parametric model (GTR+Gamma+I). Here we prove that one of the most widely used models (GTR+Gamma) is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
Submitted 1 February, 2008; v1 submitted 4 September, 2007;
originally announced September 2007.
-
Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites
Authors:
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The general Markov plus invariable sites (GM+I) model of biological sequence evolution is a two-class model in which an unknown proportion of sites are not allowed to change, while the remainder undergo substitutions according to a Markov process on a tree. For statistical use it is important to know if the model is identifiable; can both the tree topology and the numerical parameters be determined from a joint distribution describing sequences only at the leaves of the tree? We establish that for generic parameters both the tree and all numerical parameter values can be recovered, up to clearly understood issues of `label swapping.' The method of analysis is algebraic, using phylogenetic invariants to study the variety defined by the model. Simple rational formulas, expressed in terms of determinantal ratios, are found for recovering numerical parameters describing the invariable sites.
Submitted 23 February, 2007;
originally announced February 2007.
-
The identifiability of tree topology for phylogenetic models, including covarion and mixture models
Authors:
Elizabeth S. Allman,
John A. Rhodes
Abstract:
For a model of molecular evolution to be useful for phylogenetic inference, the topology of evolutionary trees must be identifiable. That is, from a joint distribution the model predicts, it must be possible to recover the tree parameter. We establish tree identifiability for a number of phylogenetic models, including a covarion model and a variety of mixture models with a limited number of classes. The proof is based on the introduction of a more general model, allowing more states at internal nodes of the tree than at leaves, and the study of the algebraic variety formed by the joint distributions to which it gives rise. Tree identifiability is first established for this general model through the use of certain phylogenetic invariants.
Submitted 9 November, 2005;
originally announced November 2005.
-
Phylogenetic ideals and varieties for the general Markov model
Authors:
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The general Markov model of the evolution of biological sequences along a tree leads to a parameterization of an algebraic variety. Understanding this variety and the polynomials, called phylogenetic invariants, which vanish on it, is a problem within the broader area of Algebraic Statistics.
For an arbitrary trivalent tree, we determine the full ideal of invariants for the 2-state model, establishing a conjecture of Pachter-Sturmfels. For the $\kappa$-state model, we reduce the problem of determining a defining set of polynomials to that of determining a defining set for a 3-leaved tree.
Along the way, we prove several new cases of a conjecture of Garcia-Stillman-Sturmfels on certain statistical models on star trees, and reduce their conjecture to a family of subcases.
Submitted 6 October, 2006; v1 submitted 28 October, 2004;
originally announced October 2004.
-
Phylogenetic invariants for stationary base composition
Authors:
Elizabeth S. Allman,
John A. Rhodes
Abstract:
Changing base composition during the evolution of biological sequences can mislead some of the phylogenetic inference techniques in current use. However, detecting whether such a process has occurred may be difficult, since convergent evolution may lead to similar base frequencies emerging from different lineages.
To study this situation, algebraic models of biological sequence evolution are introduced in which the base composition is fixed throughout evolution. Basic properties of the associated algebraic varieties are investigated, including the construction of some phylogenetic invariants.
Submitted 26 July, 2004;
originally announced July 2004.