-
Multiple Sequence Alignment is not a Solved Problem
Authors:
David A. Morrison
Abstract:
Multiple sequence alignment is a basic procedure in molecular biology, and it is often treated as being essentially a solved computational problem. However, this is not so, and here I review the evidence for this claim, and outline the requirements for a solution. The goal of alignment is often stated to be to juxtapose nucleotides (or their derivatives, such as amino acids) that have been inherited from a common ancestral nucleotide (although other goals are also possible). Unfortunately, this is not an operational definition, because homology (in this sense) refers to unique and unobservable historical events, and so there can be no objective mathematical function to optimize. Consequently, almost all algorithms developed for multiple sequence alignment are based on optimizing some sort of compositional similarity (similarity = homology + analogy). As a result, many, if not most, practitioners either manually modify computer-produced alignments or perform de novo manual alignment, especially in the field of phylogenetics. So, if homology is the goal, then multiple sequence alignment is not yet a solved computational problem. Several criteria have been developed by biologists to help them identify potential homologies (compositional, ontogenetic, topographical and functional similarity, plus conjunction and congruence), and these criteria can be applied to molecular data, in principle. Current computer programs do implement one (or occasionally two) of these criteria, but no program implements them all. What is needed is a program that evaluates all of the evidence for the sequence homologies, optimizes their combination, and thus produces the best hypotheses of homology. This is basically an inference problem, not an optimization problem.
Submitted 23 August, 2018;
originally announced August 2018.
-
Who is Who in Phylogenetic Networks: Articles, Authors and Programs
Authors:
Tushar Agarwal,
Philippe Gambette,
David Morrison
Abstract:
The phylogenetic network emerged in the 1990s as a new model to represent the evolution of species in the case where coexisting species transfer genetic information through hybridization, recombination, lateral gene transfer, etc. As is true for many rapidly evolving fields, there is considerable fragmentation and diversity in methodologies, standards and vocabulary in phylogenetic network research, thus creating the need for an integrated database of articles, authors, techniques, keywords and software. We describe such a database, "Who is Who in Phylogenetic Networks", available at http://phylnet.univ-mlv.fr. "Who is Who in Phylogenetic Networks" comprises more than 600 publications and 500 authors interlinked with a rich set of more than 200 keywords related to phylogenetic networks. The database is integrated with web-based tools to visualize authorship and collaboration networks and analyze these networks using common graph and social network metrics such as centrality (betweenness, eigenvector, degree and closeness) and clustering. We provide downloads of raw information about entries in the database, and a facility to suggest modifications and contribute new information to the database. We also present in this article common use cases of the database and identify trends in the research on phylogenetic networks using the information in the database and textual analysis.
Submitted 5 October, 2016;
originally announced October 2016.
-
Fighting network space: it is time for an SQL-type language to filter phylogenetic networks
Authors:
Steven Kelk,
Simone Linz,
David A. Morrison
Abstract:
The search space of rooted phylogenetic trees is vast, and a major research focus of recent decades has been the development of algorithms to effectively navigate this space. However, this space is tiny when compared with the space of rooted phylogenetic networks, and navigating this enlarged space remains a poorly understood problem. This, and the difficulty of biologically interpreting such networks, obstructs adoption of networks as tools for modelling reticulation. Here, we argue that the superimposition of biologically motivated constraints, via an SQL-style language, can both stimulate use of network software by biologists and potentially significantly prune the search space.
Submitted 25 October, 2013;
originally announced October 2013.
-
Estimating seed bank accumulation and dynamics in three obligate-seeder Proteaceae species
Authors:
Meaghan E. Jenkins,
David A. Morrison,
Tony D. Auld
Abstract:
The seed bank dynamics of the three co-occurring obligate-seeder (i.e. fire-sensitive) Proteaceae species, Banksia ericifolia, Banksia marginata and Petrophile pulchella, were examined at sites of varying time since the most recent fire (i.e. plant age) in the Sydney region. Significant variation among species was found in the number of cones produced, the position of the cones within the canopy, the percentage of barren cones produced (Banksia species only), the number of follicles/bracts produced per cone, and the number of seeds lost/released due to spontaneous fruit rupture. Thus, three different regeneration strategies were observed, highlighting the variation in reproductive strategies of co-occurring Proteaceae species. Ultimately, B. marginata potentially accumulated a seed bank of c. 3000 seeds per plant after 20 years, with c. 1500 seeds per plant for P. pulchella and c. 500 for B. ericifolia. Based on these data, B. marginata and B. ericifolia require a minimum fire-free period of 8-10 years, with 7-8 years for P. pulchella, to allow for an adequate seed bank to accumulate and thus ensure local persistence of these species in fire-prone habitats.
Submitted 27 January, 2010;
originally announced January 2010.
-
How and where to look for tRNAs in Metazoan mitochondrial genomes, and what you might find when you get there
Authors:
David A. Morrison
Abstract:
The ability to locate and annotate mitochondrial genes is an important practical issue, given the rapidly increasing number of mitogenomes appearing in the public databases. Unfortunately, tRNA genes in Metazoan mitochondria have proved to be problematic because they often vary in number (genes missing or duplicated) and also in the secondary structure of the transcribed tRNAs (T or D arms missing). I have performed a series of comparative analyses of the tRNA genes of a broad range of Metazoan mitogenomes in order to address this issue. I conclude that no single computer program is necessarily capable of finding all of the tRNA genes in any given mitogenome, and that use of both the ARWEN and DOGMA programs is sometimes necessary because they produce complementary false negatives. There are apparently a very large number of erroneous annotations in the databased mitogenome sequences, including missed genes, wrongly annotated locations, false complements, and inconsistent criteria for assigning the 5' and 3' boundaries; and I have listed many of these. The extent of overlap between genes is often greatly exaggerated due to inconsistent annotations, although notable overlaps involving tRNAs are apparently real. Finally, three novel hypotheses were examined and found to have support from the comparative analyses: (1) some organisms have mitogenomic locations that simultaneously code for multiple tRNAs; (2) some organisms have mitogenomic locations that simultaneously code for tRNAs and proteins (but not rRNAs); and (3) one group of nematodes has several genes that code for tRNAs lacking both the D and T arms.
Submitted 3 February, 2012; v1 submitted 21 January, 2010;
originally announced January 2010.
-
Bayesian posterior probabilities: revisited
Authors:
David A. Morrison
Abstract:
Huelsenbeck and Rannala (2004, Systematic Biology 53, 904-913) presented a series of simulations in order to assess the extent to which the Bayesian posterior probabilities associated with phylogenetic trees represent the standard frequentist statistical interpretation. They concluded that when the analysis model matches the generating model, the Bayesian posterior probabilities are correct, but that the probabilities are much too large when the model is under-specified and slightly too small when the model is over-specified. Here, I take issue with the first conclusion, and instead contend that their simulation data show that the posterior probabilities are still slightly too large even when the models match. Furthermore, I suggest that the data show that the degree of this over-estimation increases as the sequence length increases, and that it might increase as model complexity increases. I also provide some comments on the authors' conclusions concerning whether bootstrap proportions over- or under-estimate the true probabilities.
Submitted 20 January, 2010;
originally announced January 2010.
-
Counting chickens before they hatch: reciprocal consistency of calibration points for estimating divergence dates
Authors:
David A. Morrison
Abstract:
There has been concern in the literature about the methodology of using secondary calibration timepoints when estimating evolutionary divergence dates. Such timepoints are divergence time estimates that have been derived from one molecular data set on the basis of a primary external calibration timepoint, and which are then used independently on a second data set. Logically, the primary and secondary calibration points must be mutually consistent, in the sense that it must be possible to predict each time point from the other. However, the attempt by Shaul and Graur (2002, Gene 300: 59-61) to assess the reliability of secondary timepoints is flawed because they presented time estimates without presenting confidence intervals on those estimates, and so it was not possible to make any explicit hypothesis tests of divergence times. Also, they inappropriately excluded some of the data, which leads to a very biased estimate of one of the divergence times. Here, I present a re-analysis of the same data set, with more appropriate methodology, and come to the conclusion that no inconsistencies are involved. However, it is clear from the analysis that molecular data often have such large confidence intervals that they are uninformative, and thus cannot be used for reliable hypothesis tests.
Submitted 20 January, 2010;
originally announced January 2010.