-
Beyond level-1: Identifiability of a class of galled tree-child networks
Authors:
Elizabeth S. Allman,
Cecile Ane,
Hector Banos,
John A. Rhodes
Abstract:
Inference of phylogenetic networks is of increasing interest in the genomic era. However, the extent to which phylogenetic networks are identifiable from various types of data remains poorly understood, despite its crucial role in justifying methods. This work obtains strong identifiability results for large sub-classes of galled tree-child semidirected networks. Some of the conditions our proofs require, such as the identifiability of a network's tree of blobs or the circular order of 4 taxa around a cycle in a level-1 network, are already known to hold for many data types. We show that all these conditions hold for quartet concordance factor data under various gene tree models, yielding the strongest results from 2 or more samples per taxon. Although the network classes we consider have topological restrictions, they include non-planar networks of any level and are substantially more general than level-1 networks -- the only class previously known to enjoy identifiability from many data types. Our work establishes a route for proving future identifiability results for tree-child galled networks from data types other than quartet concordance factors, by checking that explicit conditions are met.
Submitted 29 April, 2025;
originally announced April 2025.
-
A consistent least-squares criterion for calibrating edge lengths in phylogenetic networks
Authors:
Jingcheng Xu,
Cécile Ané
Abstract:
In phylogenetic networks, it is desirable to estimate edge lengths in substitutions per site or calendar time. Yet, there is a lack of scalable methods that provide such estimates. Here we consider the problem of obtaining edge length estimates from genetic distances, in the presence of rate variation across genes and lineages, when the network topology is known. We propose a novel criterion based on least-squares that is both consistent and computationally tractable. The crux of our approach is to decompose the genetic distances into two parts, one of which is invariant across displayed trees of the network. The scaled genetic distances are then fitted to the invariant part, while the average scaled genetic distances are fitted to the non-invariant part. We show that this criterion is consistent provided that there exists a tree path between some pair of tips in the network, and that edge lengths in the network are identifiable from average distances. We also provide a constrained variant of this criterion assuming a molecular clock, which can be used to obtain relative edge lengths in calendar time.
Submitted 2 August, 2024; v1 submitted 27 July, 2024;
originally announced July 2024.
-
A dissimilarity measure for semidirected networks
Authors:
Michael Maxfield,
Jingcheng Xu,
Cécile Ané
Abstract:
Semidirected networks have received interest in evolutionary biology as the appropriate generalization of unrooted trees to networks, in which some but not all edges are directed. Yet these networks lack proper theoretical study. We define here a general class of semidirected phylogenetic networks, with a stable set of leaves, tree nodes and hybrid nodes. We prove that for these networks, if we locally choose the direction of one edge, then globally the set of directed paths starting with this edge is stable across all choices to root the network. We define an edge-based representation of semidirected phylogenetic networks and use it to define a dissimilarity between networks, which can be efficiently computed in near-quadratic time. Our dissimilarity extends the widely used Robinson-Foulds distance on both rooted trees and unrooted trees. After generalizing the notion of tree-child networks to semidirected networks, we prove that our edge-based dissimilarity is in fact a distance on the space of tree-child semidirected phylogenetic networks.
Submitted 10 October, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Leveraging graphical model techniques to study evolution on phylogenetic networks
Authors:
Benjamin Teo,
Paul Bastide,
Cécile Ané
Abstract:
The evolution of molecular and phenotypic traits is commonly modelled using Markov processes along a phylogeny. This phylogeny can be a tree, or a network if it includes reticulations, representing events such as hybridization or admixture. Computing the likelihood of data observed at the leaves is costly as the size and complexity of the phylogeny grow. Efficient algorithms exist for trees, but cannot be applied to networks. We show that a vast array of models for trait evolution along phylogenetic networks can be reformulated as graphical models, for which efficient belief propagation algorithms exist. We provide a brief review of belief propagation on general graphical models, then focus on linear Gaussian models for continuous traits. We show how belief propagation techniques can be applied for exact or approximate (but more scalable) likelihood and gradient calculations, and prove novel results for efficient parameter inference of some models. We highlight the possible fruitful interactions between graphical models and phylogenetic methods. For example, approximate likelihood approaches have the potential to greatly reduce computational costs for phylogenies with reticulations.
Submitted 26 August, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Identifying circular orders for blobs in phylogenetic networks
Authors:
John A. Rhodes,
Hector Banos,
Jingcheng Xu,
Cécile Ané
Abstract:
Interest in the inference of evolutionary networks relating species or populations has grown with the increasing recognition of the importance of hybridization, gene flow and admixture, and the availability of large-scale genomic data. However, what network features may be validly inferred from various data types under different models remains poorly understood. Previous work has largely focused on level-1 networks, in which reticulation events are well separated, and on a general network's tree of blobs, the tree obtained by contracting every blob to a node. An open question is the identifiability of the topology of a blob of unknown level. We consider the identifiability of the circular order in which subnetworks attach to a blob, first proving that this order is well-defined for outer-labeled planar blobs. For this class of blobs, we show that the circular order information from 4-taxon subnetworks identifies the full circular order of the blob. Similarly, the circular order from 3-taxon rooted subnetworks identifies the full circular order of a rooted blob. We then show that subnetwork circular information is identifiable from certain data types and evolutionary models. This provides a general positive result for high-level networks, on the identifiability of the ordering in which taxon blocks attach to blobs in outer-labeled planar networks. Finally, we give examples of blobs with different internal structures which cannot be distinguished under many models and data types.
Submitted 20 July, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
Identifiability of local and global features of phylogenetic networks from average distances
Authors:
Jingcheng Xu,
Cécile Ané
Abstract:
Phylogenetic networks extend phylogenetic trees to model non-vertical inheritance, by which a lineage inherits material from multiple parents. The computational complexity of estimating phylogenetic networks from genome-wide data with likelihood-based methods limits the size of networks that can be handled. Methods based on pairwise distances could offer faster alternatives. We study here the information that average pairwise distances contain on the underlying phylogenetic network, by characterizing local and global features that can or cannot be identified. For general networks, we clarify that the root and edge lengths adjacent to reticulations are not identifiable, and then focus on the class of zipped-up semidirected networks. We provide a criterion to swap subgraphs locally, such as 3-cycles, resulting in indistinguishable networks. We propose the "distance split tree", which can be constructed from pairwise distances, and prove that it is a refinement of the network's tree of blobs, capturing the tree-like features of the network. For level-1 networks, this distance split tree is equal to the tree of blobs refined to separate polytomies from blobs, and we prove that the mixed representation of the network is identifiable. The information loss is localized around 4-cycles, for which the placement of the reticulation is unidentifiable. The mixed representation combines split edges for 4-cycles, regular tree and hybrid edges from the semidirected network, and edge parameters that encode all information identifiable from average pairwise distances.
Submitted 25 June, 2022; v1 submitted 22 October, 2021;
originally announced October 2021.
-
On the Identifiability of Phylogenetic Networks under a Pseudolikelihood model
Authors:
Claudia Solis-Lemus,
Arrigo Coen,
Cecile Ane
Abstract:
The Tree of Life is the graphical structure that represents the evolutionary process from single-cell organisms at the origin of life to the vast biodiversity we see today. Reconstructing this tree from genomic sequences is challenging due to the variety of biological forces that shape the signal in the data, and many of those processes, such as incomplete lineage sorting and hybridization, can produce confounding information. Here, we present the mathematical version of the identifiability proofs of phylogenetic networks under the pseudolikelihood model in SNaQ. We establish that the ability to detect different hybridization events depends on the number of nodes on the hybridization blob, with small blobs (corresponding to closely related species) being the hardest to detect. Our work focuses on level-1 networks, but draws attention to the importance of identifiability studies on phylogenetic inference methods for broader classes of networks.
Submitted 4 October, 2020;
originally announced October 2020.
-
Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting
Authors:
Claudia Solís-Lemus,
Cécile Ané
Abstract:
Phylogenetic networks are necessary to represent the tree of life expanded by edges to represent events such as horizontal gene transfers, hybridizations or gene flow. Not all species follow the paradigm of vertical inheritance of their genetic material. While a great deal of research has flourished into the inference of phylogenetic trees, statistical methods to infer phylogenetic networks are still limited and under development. The main disadvantage of existing methods is a lack of scalability. Here, we present a statistical method to infer phylogenetic networks from multi-locus genetic data in a pseudolikelihood framework. Our model accounts for incomplete lineage sorting through the coalescent model, and for horizontal inheritance of genes through reticulation nodes in the network. Computation of the pseudolikelihood is fast and simple, and it avoids the burdensome calculation of the full likelihood which can be intractable with many species. Moreover, estimation at the quartet-level has the added computational benefit that it is easily parallelizable. Simulation studies comparing our method to a full likelihood approach show that our pseudolikelihood approach is much faster without compromising accuracy. We applied our method to reconstruct the evolutionary relationships among swordtails and platyfishes ($Xiphophorus$: Poeciliidae), which is characterized by widespread hybridizations.
Submitted 12 February, 2016; v1 submitted 20 September, 2015;
originally announced September 2015.
-
Phase transition on the convergence rate of parameter estimation under an Ornstein-Uhlenbeck diffusion on a tree
Authors:
Cécile Ané,
Lam Si Tung Ho,
Sebastien Roch
Abstract:
Diffusion processes on trees are commonly used in evolutionary biology to model the joint distribution of continuous traits, such as body mass, across species. Estimating the parameters of such processes from tip values presents challenges because of the intrinsic correlation between the observations produced by the shared evolutionary history, thus violating the standard independence assumption of large-sample theory. For instance Ho and Ané \cite{HoAne13} recently proved that the mean (also known in this context as selection optimum) of an Ornstein-Uhlenbeck process on a tree cannot be estimated consistently from an increasing number of tip observations if the tree height is bounded. Here, using a fruitful connection to the so-called reconstruction problem in probability theory, we study the convergence rate of parameter estimation in the unbounded height case. For the mean of the process, we provide a necessary and sufficient condition for the consistency of the maximum likelihood estimator (MLE) and establish a phase transition on its convergence rate in terms of the growth of the tree. In particular we show that a loss of $\sqrt{n}$-consistency (i.e., the variance of the MLE becomes $Ω(n^{-1})$, where $n$ is the number of tips) occurs when the tree growth is larger than a threshold related to the phase transition of the reconstruction problem. For the covariance parameters, we give a novel, efficient estimation method which achieves $\sqrt{n}$-consistency under natural assumptions on the tree.
Submitted 25 May, 2016; v1 submitted 5 June, 2014;
originally announced June 2014.
-
Asymptotic theory with hierarchical autocorrelation: Ornstein-Uhlenbeck tree models
Authors:
Lam Si Tung Ho,
Cécile Ané
Abstract:
Hierarchical autocorrelation in the error term of linear models arises when sampling units are related to each other according to a tree. The residual covariance is parametrized using the tree-distance between sampling units. When observations are modeled using an Ornstein-Uhlenbeck (OU) process along the tree, the autocorrelation between two tips decreases exponentially with their tree distance. These models are most often applied in evolutionary biology, when tips represent biological species and the OU process parameters represent the strength and direction of natural selection. For these models, we show that the mean is not microergodic: no estimator can ever be consistent for this parameter. We also provide a lower bound for the variance of its MLE. For covariance parameters, we give a general sufficient condition ensuring microergodicity. This condition suggests that some parameters may not be estimated at the same rate as others. We show that, indeed, maximum likelihood estimators of the autocorrelation parameter converge at a slower rate than that of generally microergodic parameters. We show this theoretically in a symmetric tree asymptotic framework and through simulations on a large real tree comprising 4507 mammal species.
Submitted 6 June, 2013;
originally announced June 2013.
-
Analysis of comparative data with hierarchical autocorrelation
Authors:
Cécile Ané
Abstract:
The asymptotic behavior of estimates and information criteria in linear models is studied in the context of hierarchically correlated sampling units. The work is motivated by biological data collected on species where autocorrelation is based on the species' genealogical tree. Hierarchical autocorrelation is also found in many other kinds of data, such as from microarray experiments or human languages. Similar correlation also arises in ANOVA models with nested effects. I show that the best linear unbiased estimators are almost surely convergent but may not be consistent for some parameters such as the intercept and lineage effects, in the context of Brownian motion evolution on the genealogical tree. For the purpose of model selection I show that the usual BIC does not provide an appropriate approximation to the posterior probability of a model. To correct for this, an effective sample size is introduced for parameters that are inconsistently estimated. For biological studies, this work implies that tree-aware sampling design is desirable; adding more sampling units may not help ancestral reconstruction and only strong lineage effects may be detected with high power.
Submitted 14 November, 2008; v1 submitted 19 April, 2008;
originally announced April 2008.
-
Identifiability of a Markovian model of molecular evolution with Gamma-distributed rates
Authors:
Elizabeth S. Allman,
Cecile Ane,
John A. Rhodes
Abstract:
Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although non-identifiability was proven for a semi-parametric model and an incorrect proof of identifiability was published for a general parametric model (GTR+Gamma+I). Here we prove that one of the most widely used models (GTR+Gamma) is identifiable for generic parameters, and for all parameter choices in the case of 4-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
Submitted 1 February, 2008; v1 submitted 4 September, 2007;
originally announced September 2007.