Search | arXiv e-print repository

GenSpectrum Chat: Data Exploration in Public Health Using Large Language Models

Abstract: Introduction: The COVID-19 pandemic highlighted the importance of making epidemiological data and scientific insights easily accessible and explorable for public health agencies, the general public, and researchers. State-of-the-art approaches for sharing data and insights included regularly updated reports and web dashboards. However, they face a trade-off between the simplicity and flexibility of data exploration. With the capabilities of recent large language models (LLMs) such as GPT-4, this trade-off can be overcome. Results: We developed the chatbot "GenSpectrum Chat" (https://cov-spectrum.org/chat) which uses GPT-4 as the underlying large language model (LLM) to explore SARS-CoV-2 genomic sequencing data. Out of 500 inputs from real-world users, the chatbot provided a correct answer for 453 prompts, an incorrect answer for 13 prompts, and no answer although the question was within scope for 34 prompts. We also tested the chatbot with inputs from 10 different languages, and despite being provided solely with English instructions and examples, it successfully processed prompts in all tested languages. Conclusion: LLMs enable new ways of interacting with information systems. In the field of public health, GenSpectrum Chat can facilitate the analysis of real-time pathogen genomic data. With our chatbot supporting interactive exploration in different languages, we envision quick and direct access to the latest evidence for policymakers around the world.

Submitted 23 May, 2023; originally announced May 2023.
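
The evaluation counts reported in the GenSpectrum Chat abstract above can be turned into rates with a few lines of arithmetic. The following is a minimal sketch in Python; the function name `outcome_rates` and the dictionary labels are illustrative assumptions, and only the three counts come from the abstract.

```python
def outcome_rates(counts):
    """Convert outcome counts into percentages of the total number of prompts."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Counts as stated in the abstract; the labels are hypothetical.
counts = {"correct": 453, "incorrect": 13, "no_answer_in_scope": 34}
rates = outcome_rates(counts)

for label, pct in rates.items():
    print(f"{label}: {pct:.1f}%")
```

The three categories sum to the 500 evaluated inputs, so the percentages sum to 100.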

arXiv:2206.01210 [pdf, other]

LAPIS is a fast web API for massive open virus sequencing databases

Authors: Chaoran Chen, Alexander Taepper, Fabian Engelniederhammer, Jonas Kellerer, Cornelius Roemer, Tanja Stadler

Abstract: Background: Recent epidemic outbreaks such as the SARS-CoV-2 pandemic and the mpox outbreak in 2022 have demonstrated the value of genomic sequencing data for tracking the origin and spread of pathogens. Laboratories around the globe generated new sequences at unprecedented speed and volume and bioinformaticians developed new tools and dashboards to analyze this wealth of data. However, a major challenge that remains is the lack of simple and efficient approaches for accessing and processing sequencing data. Results: The Lightweight API for Sequences (LAPIS) facilitates rapid retrieval and analysis of genomic sequencing data through a REST API. It supports complex mutation- and metadata-based queries and can perform aggregation operations on massive datasets. LAPIS is optimized for typical questions relevant to genomic epidemiology. Using a newly-developed in-memory database engine, it has a high speed and throughput: between 25 January and 4 February 2023, the SARS-CoV-2 instance of LAPIS, which contains 14.5 million sequences, processed over 20 million requests with a mean response time of 411 ms and a median response time of 1 ms. LAPIS is the core engine behind our dashboards on genspectrum.org and we currently maintain public LAPIS instances for SARS-CoV-2 and mpox. Conclusions: Powered by an optimized database engine and available through a web API, LAPIS enhances the accessibility of genomic sequencing data. It is designed to serve as a common backend for dashboards and analyses with the potential to be integrated into common database platforms such as GenBank.

Submitted 18 May, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2106.08106 [pdf, ps, other]

doi 10.1093/bioinformatics/btab856

CoV-Spectrum: Analysis of globally shared SARS-CoV-2 data to Identify and Characterize New Variants

Authors: Chaoran Chen, Sarah Nadeau, Michael Yared, Philippe Voinov, Ning Xie, Cornelius Roemer, Tanja Stadler

Abstract: Summary: The CoV-Spectrum website supports the identification of new SARS-CoV-2 variants of concern and the tracking of known variants. Its flexible amino acid and nucleotide mutation search allows querying of variants before they are designated by a lineage nomenclature system. The platform brings together SARS-CoV-2 data from different sources and applies analyses. Results include the proportion of different variants over time, their demographic and geographic distributions, common mutations, hospitalization and mortality probabilities, estimates for transmission fitness advantage and insights obtained from wastewater samples. Availability and Implementation: CoV-Spectrum is available at https://cov-spectrum.ethz.ch. The code is released under the GPL-3.0 license at https://github.com/cevo-public/cov-spectrum-website

Submitted 28 November, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:1809.09014 [pdf, other]

doi 10.1016/j.tpb.2019.11.005

Fast likelihood evaluation for multivariate phylogenetic comparative methods: the PCMBase R package

Authors: Venelin Mitov, Krzysztof Bartoszek, Georgios Asimomitis, Tanja Stadler

Abstract: We introduce an R package, PCMBase, to rapidly calculate the likelihood for multivariate phylogenetic comparative methods. The package is not specific to particular models but offers the user the functionality to very easily implement a wide range of models where the transition along a branch is multivariate normal. We demonstrate the package's possibilities on the now standard, multitrait Ornstein-Uhlenbeck process as well as the novel multivariate punctuated equilibrium model. The package can handle trees of various types (e.g. ultrametric, nonultrametric, polytomies, etc.), as well as measurement error, missing measurements or non-existing traits for some of the species in the tree.

Submitted 24 September, 2018; originally announced September 2018.

Comments: 34 pages, 6 figures

arXiv:1706.10106 [pdf, other]

The fossilized birth-death model for the analysis of stratigraphic range data under different speciation concepts

Authors: Tanja Stadler, Alexandra Gavryushkina, Rachel C. M. Warnock, Alexei J. Drummond, Tracy A. Heath

Abstract: A birth-death-sampling model gives rise to phylogenetic trees with samples from the past and the present. Interpreting "birth" as branching speciation, "death" as extinction, and "sampling" as fossil preservation and recovery, this model -- also referred to as the fossilized birth-death (FBD) model -- gives rise to phylogenetic trees on extant and fossil samples. The model has been mathematically analyzed and successfully applied to a range of datasets on different taxonomic levels, such as penguins, plants, and insects. However, the current mathematical treatment of this model does not allow for a group of temporally distinct fossil specimens to be assigned to the same species. In this paper, we provide a general mathematical FBD modeling framework that explicitly takes "stratigraphic ranges" into account, with a stratigraphic range being defined as the lineage interval associated with a single species, ranging through time from the first to the last fossil appearance of the species. To assign a sequence of fossil samples in the phylogenetic tree to the same species, i.e., to specify a stratigraphic range, we need to define the mode of speciation. We provide expressions to account for three common speciation modes: budding (or asymmetric) speciation, bifurcating (or symmetric) speciation, and anagenetic speciation. Our equations allow for flexible joint Bayesian analysis of paleontological and neontological data. Furthermore, our framework is directly applicable to epidemiology, where a stratigraphic range is the observed duration of infection of a single patient, "birth" via budding is transmission, "death" is recovery, and "sampling" is sequencing the pathogen of a patient. Thus, we present a model that allows for incorporation of multiple observations through time from a single patient.

Submitted 9 March, 2018; v1 submitted 30 June, 2017; originally announced June 2017.

arXiv:1601.07447 [pdf, other]

Bayesian phylogenetic estimation of fossil ages

Authors: Alexei J. Drummond, Tanja Stadler

Abstract: Recent advances have allowed for both morphological fossil evidence and molecular sequences to be integrated into a single combined inference of divergence dates under the rule of Bayesian probability. In particular the fossilized birth-death tree prior and the Lewis-Mk model of discrete morphological evolution allow for the estimation of both divergence times and phylogenetic relationships between fossil and extant taxa. We exploit this statistical framework to investigate the internal consistency of these models by producing phylogenetic estimates of the age of each fossil in turn, within two rich and well-characterized data sets of fossil and extant species (penguins and canids). We find that the estimation accuracy of fossil ages is generally high with credible intervals seldom excluding the true age and median relative error in the two data sets of 5.7% and 13.2% respectively. The median relative standard error (RSD) was 9.2% and 7.2% respectively, suggesting good precision, although with some outliers. In fact, in the two data sets we analyze, the phylogenetic estimate of fossil age is on average < 2 My from the midpoint age of the geological strata from which it was excavated. The high level of internal consistency found in our analyses suggests that the Bayesian statistical model employed is an adequate fit for both the geological and morphological data, and provides evidence from real data that the framework used can accurately model the evolution of discrete morphological traits coded from fossil and extant taxa. We anticipate that this approach will have diverse applications beyond divergence time dating, including dating fossils that are temporally unconstrained, testing of the "morphological clock", and uncovering potential model misspecification and/or data errors when controversial phylogenetic hypotheses are obtained based on combined divergence dating analyses.

Submitted 3 May, 2016; v1 submitted 27 January, 2016; originally announced January 2016.

Comments: 28 pages, 8 figures

arXiv:1506.04797 [pdf, other]

doi 10.1093/sysbio/syw060

Bayesian total evidence dating reveals the recent crown radiation of penguins

Authors: Alexandra Gavryushkina, Tracy A. Heath, Daniel T. Ksepka, Tanja Stadler, David Welch, Alexei J. Drummond

Abstract: The total-evidence approach to divergence-time dating uses molecular and morphological data from extant and fossil species to infer phylogenetic relationships, species divergence times, and macroevolutionary parameters in a single coherent framework. Current model-based implementations of this approach lack an appropriate model for the tree describing the diversification and fossilization process and can produce estimates that lead to erroneous conclusions. We address this shortcoming by providing a total-evidence method implemented in a Bayesian framework. This approach uses a mechanistic tree prior to describe the underlying diversification process that generated the tree of extant and fossil taxa. Previous attempts to apply the total-evidence approach have used tree priors that do not account for the possibility that fossil samples may be direct ancestors of other samples. The fossilized birth-death (FBD) process explicitly models the diversification, fossilization, and sampling processes and naturally allows for sampled ancestors. This model was recently applied to estimate divergence times based on molecular data and fossil occurrence dates. We incorporate the FBD model and a model of morphological trait evolution into a Bayesian total-evidence approach to dating species phylogenies. We apply this method to extant and fossil penguins and show that the modern penguins radiated much more recently than has been previously estimated, with the basal divergence in the crown clade occurring at ~12.7 Ma and most splits leading to extant species occurring in the last 2 million years. Our results demonstrate that including stem-fossil diversity can greatly improve the estimates of the divergence times of crown taxa. The method is available in BEAST2 (v. 2.4) www.beast2.org with packages SA (v. at least 1.1.4) and morph-models (v. at least 1.0.4).

Submitted 24 January, 2017; v1 submitted 15 June, 2015; originally announced June 2015.

Comments: 50 pages, 6 figures

arXiv:1407.1792 [pdf, other]

doi 10.1534/genetics.114.172791

Inferring epidemiological dynamics with Bayesian coalescent inference: The merits of deterministic and stochastic models

Authors: Alex Popinga, Tim Vaughan, Tanja Stadler, Alexei Drummond

Abstract: Estimation of epidemiological and population parameters from molecular sequence data has become central to the understanding of infectious disease dynamics. Various models have been proposed to infer details of the dynamics that describe epidemic progression. These include inference approaches derived from Kingman's coalescent theory. Here, we use recently described coalescent theory for epidemic dynamics to develop stochastic and deterministic coalescent SIR tree priors. We implement these in a Bayesian phylogenetic inference framework to permit joint estimation of SIR epidemic parameters and the sample genealogy. We assess the performance of the two coalescent models and also juxtapose results obtained with BDSIR, a recently published birth-death-sampling model for epidemic inference. Comparisons are made by analyzing sets of genealogies simulated under precisely known epidemiological parameters. Additionally, we analyze influenza A (H1N1) sequence data sampled in the Canterbury region of New Zealand and HIV-1 sequence data obtained from known UK infection clusters. We show that both coalescent SIR models are effective at estimating epidemiological parameters from data with large fundamental reproductive number $R_0$ and large population size $S_0$. Furthermore, we find that the stochastic variant generally outperforms its deterministic counterpart in terms of error, bias, and highest posterior density coverage, particularly for smaller $R_0$ and $S_0$. However, each of these inference models is shown to have undesirable properties in certain circumstances, especially for epidemic outbreaks with $R_0$ close to one or with small effective susceptible populations.

Submitted 19 December, 2014; v1 submitted 7 July, 2014; originally announced July 2014.

Comments: Submitted

arXiv:1406.4573 [pdf, other]

doi 10.1371/journal.pcbi.1003919

Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration

Authors: Alexandra Gavryushkina, David Welch, Tanja Stadler, Alexei Drummond

Abstract: Phylogenetic analyses which include fossils or molecular sequences that are sampled through time require models that allow one sample to be a direct ancestor of another sample. As previously available phylogenetic inference tools assume that all samples are tips, they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models where individuals may remain in the tree process after the sampling; in particular, we extend the birth-death skyline model [Stadler et al, 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that sampled ancestor birth-death models where all samples come from different time points are non-identifiable and thus require one parameter to be known in order to infer other parameters. We apply this method to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included among the species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages as argued in the literature. The sampler is available as an open-source BEAST2 package (https://github.com/gavryushkina/sampled-ancestors).

Submitted 24 June, 2014; v1 submitted 17 June, 2014; originally announced June 2014.

Comments: 34 pages (including Supporting Information), 8 figures, 1 table. Part of the work presented at Epidemics 2013 and The 18th Annual New Zealand Phylogenomics Meeting, 2014

Journal ref: PLoS Comput Biol 10(12): e1003919, Published: December 4, 2014

arXiv:1310.2968 [pdf, ps, other]

doi 10.1073/pnas.1319091111

The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation

Authors: Tracy A. Heath, John P. Huelsenbeck, Tanja Stadler

Abstract: Time-calibrated species phylogenies are critical for addressing a wide range of questions in evolutionary biology, such as those that elucidate historical biogeography or uncover patterns of coevolution and diversification. Because molecular sequence data are not informative on absolute time, external data, most commonly fossil age estimates, are required to calibrate estimates of species divergence dates. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily chosen parametric distributions on internal nodes, often disregarding most of the information in the fossil record. We introduce the 'fossilized birth-death' (FBD) process, a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are part of the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in robust and accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods. We then used this model to estimate the speciation times for a dataset composed of all living bears, indicating that the genus Ursus diversified in the late Miocene to mid Pliocene.

Submitted 18 October, 2013; v1 submitted 10 October, 2013; originally announced October 2013.

Comments: 42 total pages including: 29 text pages, 5 tables, and 12 figures. Work presented at Evolution 2013 (http://www.slideshare.net/trayc7/heath-evolution-2013)

arXiv:1308.5140 [pdf, other]

doi 10.1098/rsif.2013.1106

Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model

Authors: Denise Kühnert, Tanja Stadler, Timothy G. Vaughan, Alexei J. Drummond

Abstract: The evolution of RNA viruses such as HIV, Hepatitis C and Influenza virus occurs so rapidly that the viruses' genomes contain information on past ecological dynamics. Hence, we develop a phylodynamic method that enables the joint estimation of epidemiological parameters and phylogenetic history. Based on a compartmental susceptible-infected-removed (SIR) model, this method provides separate information on incidence and prevalence of infections. Detailed information on the interaction of host population dynamics and evolutionary history can inform decisions on how to contain or entirely avoid disease outbreaks. We apply our Birth-Death SIR method (BDSIR) to two viral data sets. First, five human immunodeficiency virus type 1 clusters sampled in the United Kingdom between 1999 and 2003 are analyzed. The estimated basic reproduction ratios range from 1.9 to 3.2 among the clusters. All clusters show a decline in the growth rate of the local epidemic in the middle or end of the 90's. The analysis of a hepatitis C virus (HCV) genotype 2c data set shows that the local epidemic in the Córdoban city Cruz del Eje originated around 1906 (median), coinciding with an immigration wave from Europe to central Argentina that dates from 1880--1920. The estimated time of epidemic peak is around 1970.

Submitted 21 March, 2014; v1 submitted 23 August, 2013; originally announced August 2013.

Comments: Journal link: http://rsif.royalsocietypublishing.org/content/11/94/20131106.full

Journal ref: J. R. Soc. Interface 2014 11, 20131106, published 26 February 2014

arXiv:1308.1289 [pdf, ps, other]

Macro-evolutionary models and coalescent point processes: The shape and probability of reconstructed phylogenies

Authors: Amaury Lambert, Tanja Stadler

Abstract: Forward-time models of diversification (i.e., speciation and extinction) produce phylogenetic trees that grow "vertically" as time goes by. Pruning the extinct lineages out of such trees leads to natural models for reconstructed trees (i.e., phylogenies of extant species). Alternatively, reconstructed trees can be modelled by coalescent point processes (CPP), where trees grow "horizontally" by the sequential addition of vertical edges. Each new edge starts at some random speciation time and ends at the present time; speciation times are drawn from the same distribution independently. CPP lead to extremely fast computation of tree likelihoods and simulation of reconstructed trees. Their topology always follows the uniform distribution on ranked tree shapes (URT). We characterize which forward-time models lead to URT reconstructed trees and among these, which lead to CPP reconstructed trees. We show that for any "asymmetric" diversification model in which speciation rates only depend on time and extinction rates only depend on time and on a non-heritable trait (e.g., age), the reconstructed tree is CPP, even if extant species are incompletely sampled. If rates additionally depend on the number of species, the reconstructed tree is (only) URT (but not CPP). We characterize the common distribution of speciation times in the CPP description, and discuss incomplete species sampling as well as three special model cases in detail: 1) extinction rate does not depend on a trait; 2) rates do not depend on time; 3) mass extinctions may happen additionally at certain points in the past.

Submitted 6 August, 2013; originally announced August 2013.
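
The "horizontal" growth described in the coalescent point process abstract above can be illustrated with a short simulation. This is a minimal sketch, not the authors' implementation: node depths are drawn independently from one distribution until a draw exceeds the stem age, and two tips diverged at the deepest node lying between them. The function names and the exponential depth distribution used in the example are assumptions chosen for illustration; the paper derives the actual depth distribution from the forward-time model.

```python
import random

def sample_cpp_depths(stem_age, draw_depth, rng):
    """Draw node depths i.i.d. until one exceeds stem_age; keep those before it.

    A tree with n tips is encoded by n - 1 depths: depths[k] is the time before
    the present at which the edge leading to tip k + 1 attaches to the tree.
    """
    depths = []
    while True:
        h = draw_depth(rng)
        if h > stem_age:
            return depths
        depths.append(h)

def divergence_time(depths, i, j):
    """Time before the present at which tips i < j diverged."""
    return max(depths[i:j])

rng = random.Random(0)
# Hypothetical depth distribution: exponential with rate 0.5.
depths = sample_cpp_depths(5.0, lambda r: r.expovariate(0.5), rng)
print(len(depths) + 1, "tips")
```

Because each tip needs only one independent draw, simulating a reconstructed tree this way costs time linear in the number of tips, which is the speed advantage the abstract refers to.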

arXiv:1306.3427 [pdf, ps, other]

Phylogenetic analysis accounting for age-dependent death and sampling with applications to epidemics

Authors: Amaury Lambert, Helen K. Alexander, Tanja Stadler

Abstract: The reconstruction of phylogenetic trees based on viral genetic sequence data sequentially sampled from an epidemic provides estimates of the past transmission dynamics, by fitting epidemiological models to these trees. To our knowledge, none of the epidemiological models currently used in phylogenetics can account for recovery rates and sampling rates dependent on the time elapsed since transmission. Here we introduce an epidemiological model where infectives leave the epidemic, either by recovery or sampling, after some random time which may follow an arbitrary distribution. We derive an expression for the likelihood of the phylogenetic tree of sampled infectives under our general epidemiological model. The analytic concept developed in this paper will facilitate inference of past epidemiological dynamics and provide an analytical framework for performing very efficient simulations of phylogenetic trees under our model. The main idea of our analytic study is that the non-Markovian epidemiological model giving rise to phylogenetic trees growing vertically as time goes by, can be represented by a Markovian "coalescent point process" growing horizontally by the sequential addition of pairs of coalescence and sampling times. As examples, we discuss two special cases of our general model, namely an application to influenza and an application to HIV. Though phrased in epidemiological terms, our framework can also be used for instance to fit macroevolutionary models to phylogenies of extant and extinct species, accounting for general species lifetime distributions.

Submitted 14 June, 2013; originally announced June 2013.

Comments: 30 pages, 2 figures

arXiv:1203.0204 [pdf, ps, other]

A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree

Authors: Tanja Stadler, James H. Degnan

Abstract: In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree topology with an exponential number of rankings for a given species tree remains unknown.

Submitted 1 March, 2012; originally announced March 2012.

arXiv:1011.5539 [pdf, other]

Branch lengths on Yule trees and the expected loss of phylogenetic diversity

Authors: Arne Mooers, Olivier Gascuel, Tanja Stadler, Heyang Li, Mike Steel

Abstract: Diversification is nested, and early models suggested this could lead to a great deal of evolutionary redundancy in the Tree of Life. This result is based on a particular set of branch lengths produced by the common coalescent, where pendant branches leading to tips can be very short compared to branches deeper in the tree. Here, we analyze alternative and more realistic Yule and birth-death models. We show how censoring at the present both makes average branches one half what we might expect and makes pendant and interior branches roughly equal in length. Although dependent on whether we condition on the size of the tree, its age, or both, these results hold both for the Yule model and for birth-death models with moderate extinction. Importantly, the rough equivalency in interior and exterior branch lengths means the loss of evolutionary history with loss of species can be roughly linear. Under these models, the Tree of Life may offer limited redundancy in the face of ongoing species loss.

Submitted 29 July, 2011; v1 submitted 24 November, 2010; originally announced November 2010.

Comments: 3 figures
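
The distinction between pendant and interior branches in the Yule-tree abstract above can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's analysis: it grows a pure-birth tree from a crown node, conditions on the tree's age only (the abstract notes that results depend on the conditioning), and cuts pendant branches off at the present. The function name and parameter values are hypothetical.

```python
import random

def simulate_yule(birth_rate, age, rng):
    """Grow a Yule (pure-birth) tree from a crown node for `age` time units.

    Returns (pendant, interior): lengths of branches leading to tips, which are
    cut off ("censored") at the present, and lengths of completed branches.
    """
    t = 0.0
    starts = [0.0, 0.0]   # start times of the currently living lineages
    interior = []
    while True:
        # With k living lineages the next speciation arrives at rate k * birth_rate.
        t += rng.expovariate(birth_rate * len(starts))
        if t >= age:
            break
        parent_start = starts.pop(rng.randrange(len(starts)))
        interior.append(t - parent_start)   # the parent branch ends at the split
        starts += [t, t]                    # two daughter lineages begin
    pendant = [age - s for s in starts]
    return pendant, interior

rng = random.Random(1)
pendant, interior = simulate_yule(birth_rate=1.0, age=3.0, rng=rng)
print(len(pendant), "tips,", len(interior), "interior branches")
```

Averaging the two lists over many replicates is one way to compare pendant and interior branch lengths empirically under this particular conditioning.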

Showing 1–15 of 15 results for author: Stadler, T