-
Recoverability of Ancestral Recombination Graph Topologies
Authors:
Elizabeth Hayman,
Anastasia Ignatieva,
Jotun Hein
Abstract:
Recombination is a powerful evolutionary process that shapes the genetic diversity observed in the populations of many species. Reconstructing genealogies in the presence of recombination from sequencing data is a very challenging problem, as this relies on mutations having occurred on the correct lineages in order to detect the recombination and resolve the placement of edges in the local trees. We investigate the probability of recovering the true topology of ancestral recombination graphs (ARGs) under the coalescent with recombination and gene conversion. We explore how sample size and mutation rate affect the inherent uncertainty in reconstructed ARGs; this sheds light on the theoretical limitations of ARG reconstruction methods. We illustrate our results using estimates of evolutionary rates for several biological organisms; in particular, we find that for parameter values that are realistic for SARS-CoV-2, the probability of reconstructing genealogies that are close to the truth is low.
Submitted 30 May, 2022; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Combinatorics of polymer models of early metabolism
Authors:
Oliver Weller-Davies,
Mike Steel,
Jotun Hein
Abstract:
Polymer models are a widely used tool to study the prebiotic formation of metabolism at the origins of life. Counts of the number of reactions in these models are often crucial in probabilistic arguments concerning the emergence of autocatalytic networks. In the first part of this paper, we provide the first exact description of the number of reactions under widely applied model assumptions. Conclusions from earlier studies rely on either approximations or asymptotic counting, and we show that the exact counts lead to similar, though not always identical, asymptotic results. In the second part of the paper, we investigate a novel model assumption whereby polymers are invariant under spatial rotation. We outline the biochemical relevance of this condition and again give exact enumerative and asymptotic formulae for the number of reactions.
Submitted 12 April, 2021;
originally announced April 2021.
-
KwARG: Parsimonious reconstruction of ancestral recombination graphs with recurrent mutation
Authors:
Anastasia Ignatieva,
Rune B. Lyngsø,
Paul A. Jenkins,
Jotun Hein
Abstract:
The reconstruction of possible histories given a sample of genetic data in the presence of recombination and recurrent mutation is a challenging problem, but can provide key insights into the evolution of a population. We present KwARG, which implements a parsimony-based greedy heuristic algorithm for finding plausible genealogical histories (ancestral recombination graphs) that are minimal or near-minimal in the number of posited recombination and mutation events. Given an input dataset of aligned sequences, KwARG outputs a list of possible candidate solutions, each comprising a list of mutation and recombination events that could have generated the dataset; the relative proportion of recombinations and recurrent mutations in a solution can be controlled via specifying a set of 'cost' parameters. We demonstrate that the algorithm performs well when compared against existing methods. The software is made available on GitHub.
Submitted 13 May, 2021; v1 submitted 17 December, 2020;
originally announced December 2020.
-
A characterisation of the reconstructed birth-death process through time rescaling
Authors:
Anastasia Ignatieva,
Jotun Hein,
Paul A. Jenkins
Abstract:
The dynamics of a population exhibiting exponential growth can be modelled as a birth-death process, which naturally captures the stochastic variation in population size over time. In this article, we consider a supercritical birth-death process, started at a random time in the past, and conditioned to have n sampled individuals at the present. The genealogy of individuals sampled at the present time is then described by the reversed reconstructed process (RRP), which traces the ancestry of the sample backwards from the present. We show that a simple, analytic, time rescaling of the RRP provides a straightforward way to derive its inter-event times. The same rescaling characterises other distributions underlying this process, obtained elsewhere in the literature via more cumbersome calculations. We also consider the case of incomplete sampling of the population, in which each leaf of the genealogy is retained with an independent Bernoulli trial with probability $ψ$, and we show that corresponding results for Bernoulli-sampled RRPs can be derived using time rescaling, for any values of the underlying parameters. A central result is the derivation of a scaling limit as $ψ$ approaches 0, corresponding to the underlying population growing to infinity, using the time rescaling formalism. We show that in this setting, after a linear time rescaling, the event times are the order statistics of $n$ logistic random variables with mode $\log(1/ψ)$; moreover, we show that the inter-event times are approximately exponentially distributed.
Submitted 6 May, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
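The scaling-limit statement in the abstract above can be illustrated with a minimal sketch: draw $n$ logistic random variables with mode $\log(1/ψ)$ and sort them to obtain event times on the rescaled time axis. The abstract does not give the scale parameter or the exact form of the linear rescaling, so the unit scale used here, the function name `sample_event_times`, and the example values of `n` and `psi` are all assumptions made purely for illustration.

```python
import numpy as np


def sample_event_times(n, psi, scale=1.0, seed=None):
    """Return n sorted logistic draws with mode log(1/psi).

    Assumption: unit scale by default; the abstract only specifies the mode.
    """
    if not 0.0 < psi < 1.0:
        raise ValueError("psi must lie strictly between 0 and 1")
    rng = np.random.default_rng(seed)
    mode = np.log(1.0 / psi)  # the logistic distribution is symmetric, so mode == loc
    draws = rng.logistic(loc=mode, scale=scale, size=n)
    return np.sort(draws)  # order statistics = event times in rescaled time


# Hypothetical example values, not taken from the paper.
times = sample_event_times(n=10, psi=1e-3, seed=0)
gaps = np.diff(times)  # inter-event times on the rescaled axis
```

Here `gaps` holds the spacings between consecutive order statistics, which is the quantity the abstract describes as approximately exponentially distributed in this limit.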
-
Combinatorial results for network-based models of metabolic origins
Authors:
Oliver Weller-Davies,
Mike Steel,
Jotun Hein
Abstract:
A key step in the origin of life is the emergence of a primitive metabolism. This requires the formation of a subset of chemical reactions that is both self-sustaining and collectively autocatalytic. A generic theory to study such processes (called 'RAF theory') has provided a precise and computationally effective way to address these questions, both on simulated data and in laboratory studies. One of the classic applications of this theory (arising from Stuart Kauffman's pioneering work in the 1980s) involves networks of polymers under cleavage and ligation reactions; in the first part of this paper, we provide the first exact description of the number of such reactions under various model assumptions. Conclusions from earlier studies relied on either approximations or asymptotic counting, and we show that the exact counts lead to similar (though not always identical) asymptotic results. In the second part of the paper, we solve some questions posed in more recent papers concerning the computational complexity of some key questions in RAF theory. In particular, although there is a fast algorithm to determine whether or not a catalytic reaction network contains a subset that is both self-sustaining and autocatalytic (and, if so, find one), determining whether or not sets exist that satisfy certain additional constraints turns out to be NP-complete.
Submitted 20 October, 2019;
originally announced October 2019.
-
Toroidal diffusions and protein structure evolution
Authors:
Eduardo García-Portugués,
Michael Golden,
Michael Sørensen,
Kanti V. Mardia,
Thomas Hamelryck,
Jotun Hein
Abstract:
This chapter shows how toroidal diffusions are convenient methodological tools for modelling protein evolution in a probabilistic framework. The chapter addresses the construction of ergodic diffusions with stationary distributions equal to well-known directional distributions, which can be regarded as toroidal analogues of the Ornstein-Uhlenbeck process. The important challenges that arise in the estimation of the diffusion parameters require the consideration of tractable approximate likelihoods and, among the several approaches introduced, the one yielding a specific approximation to the transition density of the wrapped normal process is shown to give the best empirical performance on average. This provides the methodological building block for Evolutionary Torus Dynamic Bayesian Network (ETDBN), a hidden Markov model for protein evolution that emits a wrapped normal process and two continuous-time Markov chains per hidden state. The chapter describes the main features of ETDBN, which allows for both "smooth" conformational changes and "catastrophic" conformational jumps, and several empirical benchmarks. The insights into the relationship between sequence and structure evolution that ETDBN provides are illustrated in a case study.
Submitted 21 September, 2020; v1 submitted 1 April, 2018;
originally announced April 2018.
-
A generative angular model of protein structure evolution
Authors:
Michael Golden,
Eduardo García-Portugués,
Michael Sørensen,
Kanti V. Mardia,
Thomas Hamelryck,
Jotun Hein
Abstract:
Recently described stochastic models of protein evolution have demonstrated that the inclusion of structural information in addition to amino acid sequences leads to a more reliable estimation of evolutionary parameters. We present a generative, evolutionary model of protein structure and sequence that is valid on a local length scale. The model concerns the local dependencies between sequence and structure evolution in a pair of homologous proteins. The evolutionary trajectory between the two structures in the protein pair is treated as a random walk in dihedral angle space, which is modelled using a novel angular diffusion process on the two-dimensional torus. Coupling sequence and structure evolution in our model allows for modelling both "smooth" conformational changes and "catastrophic" conformational jumps, conditioned on the amino acid changes. The model has interpretable parameters and is comparatively more realistic than previous stochastic models, providing new insights into the relationship between sequence and structure evolution. For example, using the trained model we were able to identify an apparent sequence-structure evolutionary motif present in a large number of homologous protein pairs. The generative nature of our model enables us to evaluate its validity and its ability to simulate aspects of protein evolution conditioned on an amino acid sequence, a related amino acid sequence, a related structure or any combination thereof.
Submitted 21 September, 2020; v1 submitted 30 December, 2016;
originally announced December 2016.
-
Approximate statistical alignment by iterative sampling of substitution matrices
Authors:
Joseph L. Herman,
Adrienn Szabó,
István Miklós,
Jotun Hein
Abstract:
We outline a procedure for jointly sampling substitution matrices and multiple sequence alignments, according to an approximate posterior distribution, using an MCMC-based algorithm. This procedure provides an efficient and simple method by which to generate alternative alignments according to their expected accuracy, and allows appropriate parameters for substitution matrices to be selected in an automated fashion. In the cases considered here, the sampled alignments with the highest likelihood have an accuracy consistently higher than alignments generated using the standard BLOSUM62 matrix.
Submitted 19 January, 2015;
originally announced January 2015.