-
Convergence-divergence models: Generalizations of phylogenetic trees modeling gene flow over time
Authors:
Jonathan D. Mitchell,
Barbara R. Holland
Abstract:
Phylogenetic trees are simple models of evolutionary processes. They describe conditionally independent divergent evolution of taxa from common ancestors. Phylogenetic trees are often not flexible enough to model all evolutionary processes adequately, for example introgressive hybridization, where genes can flow from one taxon to another. Phylogenetic networks model evolution not fully described by a phylogenetic tree. However, many phylogenetic network models assume ancestral taxa merge instantaneously to form ``hybrid'' descendant taxa. In contrast, our convergence-divergence models retain a single underlying ``principal'' tree, but permit gene flow over arbitrary time frames. Alternatively, convergence-divergence models can describe other biological processes leading to taxa becoming more similar over a time frame, such as replicated evolution. Here we present novel maximum likelihood-based algorithms to infer most aspects of $N$-taxon convergence-divergence models, many consistently, using a quartet-based approach. The algorithms can be applied to multiple sequence alignments restricted to genes or genomic windows or to gene presence/absence datasets.
Submitted 9 April, 2025;
originally announced April 2025.
-
From trees to traits: A review of advances in PhyloG2P methods and future directions
Authors:
Arlie R. Macdonald,
Maddie E. James,
Jonathan D. Mitchell,
Barbara R. Holland
Abstract:
Mapping genotypes to phenotypes (G2P) is a fundamental goal in biology. So-called PhyloG2P methods are a relatively new set of tools that leverage replicated evolution in phylogenetically independent lineages to identify genomic regions associated with traits of interest. Here, we review recent developments in PhyloG2P methods, focusing on three key areas: methods based on replicated amino acid substitutions, methods detecting changes in evolutionary rates, and methods analysing gene duplication and loss. We discuss how the definition and measurement of traits impact the utility of these methods, arguing that focusing on simple rather than compound traits will lead to more meaningful genotype-phenotype associations. We advocate for the use of methods that work with continuous traits directly rather than collapsing them to binary representations. We examine the strengths and limitations of different approaches to modeling genetic replication, highlighting the importance of explicit modeling of evolutionary processes. Finally, we outline promising future directions, including the integration of population-level variation, as well as epigenetic and environmental information. No one method is likely to identify all genomic regions of interest, so we encourage users to apply multiple methods that are capable of detecting a wide range of associations. The overall aim of this review is to provide practitioners with a roadmap for understanding and applying PhyloG2P methods.
Submitted 12 January, 2025;
originally announced January 2025.
-
A generalized AIC for models with singularities and boundaries
Authors:
Jonathan D. Mitchell,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The Akaike information criterion (AIC) is a common tool for model selection. It is frequently used in violation of regularity conditions at parameter space singularities and boundaries. The expected AIC is generally not asymptotically equivalent to its target at singularities and boundaries, and convergence to the target at nearby parameter points may be slow. We develop a generalized AIC for candidate models with or without singularities and boundaries. We show that the expectation of this generalized form converges everywhere in the parameter space, and its convergence can be faster than that of the AIC. We illustrate the generalized AIC on example models from phylogenomics, showing that it can outperform the AIC and gives rise to an interpolated effective number of model parameters, which can differ substantially from the number of parameters near singularities and boundaries. We outline methods for estimating the often unknown generating parameter and bias correction term of the generalized AIC.
Submitted 8 November, 2022;
originally announced November 2022.
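For orientation, the abstract above builds on the standard AIC, $\mathrm{AIC} = 2k - 2\ln\hat{L}$, where $k$ is the number of model parameters and $\hat{L}$ is the maximized likelihood. The sketch below computes only this standard criterion, not the generalized AIC developed in the paper; the model names and log-likelihood values are hypothetical and chosen purely for illustration.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Standard AIC: 2k - 2 ln(L-hat). Lower values are preferred."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "model_a": (-1234.5, 3),
    "model_b": (-1230.1, 5),
}

# Score every candidate and select the one with the smallest AIC.
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items()):
    print(f"{name}: AIC = {score:.1f}")
print("selected:", best)
```

Here the penalty term is the plain parameter count $k$; the abstract describes replacing it near singularities and boundaries with an interpolated effective number of parameters, which this sketch does not attempt.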
-
The Tree of Blobs of a Species Network: Identifiability under the Coalescent
Authors:
Elizabeth S. Allman,
Hector Baños,
Jonathan D. Mitchell,
John A. Rhodes
Abstract:
Inference of species networks from genomic data under the Network Multispecies Coalescent Model is currently severely limited by heavy computational demands. It also remains unclear how complicated networks can be for consistent inference to be possible. As a step toward inferring a general species network, this work considers its tree of blobs, in which non-cut edges are contracted to nodes, so only tree-like relationships between the taxa are shown. An identifiability theorem, that most features of the unrooted tree of blobs can be determined from the distribution of gene quartet topologies, is established. This depends upon an analysis of gene quartet concordance factors under the model, together with a new combinatorial inference rule. The arguments for this theoretical result suggest a practical algorithm for tree of blobs inference, to be fully developed in a subsequent work.
Submitted 6 May, 2022;
originally announced May 2022.
-
Hypothesis testing near singularities and boundaries
Authors:
Jonathan D. Mitchell,
Elizabeth S. Allman,
John A. Rhodes
Abstract:
The likelihood ratio statistic, with its asymptotic $χ^2$ distribution at regular model points, is often used for hypothesis testing. At model singularities and boundaries, however, the asymptotic distribution may not be $χ^2$, as highlighted by recent work of Drton. Indeed, poor behavior of a $χ^2$ for testing near singularities and boundaries is apparent in simulations, and can lead to conservative or anti-conservative tests. Here we develop a new distribution designed for use in hypothesis testing near singularities and boundaries, which asymptotically agrees with that of the likelihood ratio statistic. For two example trinomial models, arising in the context of inference of evolutionary trees, we show the new distributions outperform a $χ^2$.
Submitted 21 June, 2018;
originally announced June 2018.
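As a baseline for the abstract above, the following sketch computes the likelihood ratio statistic for a trinomial sample against a fully specified null and compares it with a $χ^2$ distribution with 2 degrees of freedom. This is the regular-point case only, not the new distribution proposed in the paper. The counts and null probabilities are hypothetical, and the function names are illustrative.

```python
import math

def likelihood_ratio_statistic(counts, null_probs):
    """2 * sum n_i * ln(n_i / (N * p_i)) for multinomial counts vs. a fixed null."""
    total = sum(counts)
    stat = 0.0
    for n, p in zip(counts, null_probs):
        if n > 0:  # a zero count contributes nothing to the sum
            stat += 2.0 * n * math.log(n / (total * p))
    return stat

def chi2_sf_df2(x):
    """Survival function of chi-square with 2 degrees of freedom: exp(-x/2)."""
    return math.exp(-x / 2.0)

# Hypothetical trinomial data and a null in the interior of the simplex.
counts = [48, 30, 22]
null_probs = [0.5, 0.3, 0.2]

stat = likelihood_ratio_statistic(counts, null_probs)
p_value = chi2_sf_df2(stat)
print(f"statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

The 2 degrees of freedom come from the two free parameters of an unrestricted trinomial. The abstract's point is that this $χ^2$ reference can be a poor approximation when the null lies at or near a singularity or boundary of the model.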
-
Distinguishing between convergent evolution and violation of the molecular clock
Authors:
Jonathan D. Mitchell,
Jeremy G. Sumner,
Barbara R. Holland
Abstract:
We give a non-technical introduction to convergence-divergence models, a new modeling approach for phylogenetic data that allows for the usual divergence of species post speciation but also allows for species to converge, i.e. become more similar over time. By examining the $3$-taxon case in some detail we illustrate that phylogeneticists have been "spoiled" in the sense of not having to think about the structural parameters in their models by virtue of the strong assumption that evolution is treelike. We show that there are not always good statistical reasons to prefer the usual class of treelike models over more general convergence-divergence models. Specifically we show many $3$-taxon datasets can be equally well explained by supposing violation of the molecular clock due to change in the rate of evolution along different edges, or by keeping the assumption of a constant rate of evolution but instead assuming that evolution is not a purely divergent process. Given the abundance of evidence that evolution is not strictly treelike, our discussion is an illustration that as phylogeneticists we often need to think clearly about the structural form of the models we use.
Submitted 13 September, 2017;
originally announced September 2017.