-
Heuristic Algorithms for Best Match Graph Editing
Authors:
David Schaller,
Manuela Geiß,
Marc Hellmuth,
Peter F. Stadler
Abstract:
Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics and can be approximated with the help of similarity measures between gene sequences, albeit not without errors. The corresponding graph editing problem can be used as a means of error correction. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data. Since BMGs have a characterization in terms of consistency of a certain set of rooted triples, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing.
Submitted 10 March, 2021;
originally announced March 2021.
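The abstract above refers to consistency of rooted triple sets and to Aho's supertree algorithm. As a minimal sketch of that classical consistency check only (not of the authors' editing heuristics), the following Python function builds a tree displaying a set of triples, or reports inconsistency. The function name `build`, the encoding of a triple `ab|c` as the tuple `(a, b, c)`, and the nested-tuple tree output are assumptions made for illustration.

```python
def build(leaves, triples):
    """Return a nested-tuple tree displaying all triples, or None if inconsistent.

    A triple (a, b, c) encodes ab|c: a and b are closer to each other than to c.
    Leaves are assumed to be strings (hypothetical encoding for this sketch).
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Auxiliary (Aho) graph: join a and b for every triple ab|c
    # whose three leaves are all in the current leaf set.
    adj = {v: set() for v in leaves}
    for a, b, c in triples:
        if a in leaves and b in leaves and c in leaves:
            adj[a].add(b)
            adj[b].add(a)

    # Connected components of the auxiliary graph via depth-first search.
    components, seen = [], set()
    for start in sorted(leaves):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)

    # A connected auxiliary graph on two or more leaves means inconsistency.
    if len(components) == 1:
        return None

    # Recurse on each component; the components become the children of the root.
    children = []
    for comp in components:
        subtree = build(comp, triples)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)
```

For example, `build(["a", "b", "c"], [("a", "b", "c")])` yields a tree in which `a` and `b` form a clade, while adding the conflicting triple `("b", "c", "a")` makes the auxiliary graph connected and the function returns `None`. A top-down heuristic in the spirit of the abstract would have to choose a partition at exactly the step where this sketch gives up; how that is done is not specified here.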
-
Arc-Completion of 2-Colored Best Match Graphs to Binary-Explainable Best Match Graphs
Authors:
David Schaller,
Manuela Geiß,
Marc Hellmuth,
Peter F. Stadler
Abstract:
Best match graphs (BMGs) are vertex-colored digraphs that naturally arise in mathematical phylogenetics to formalize the notion of evolutionary closest genes w.r.t. an a priori unknown phylogenetic tree. BMGs are explained by unique least resolved trees. We prove that the property of a rooted, leaf-colored tree to be least resolved for some BMG is preserved by the contraction of inner edges. For the special case of two-colored BMGs, this leads to a characterization of the least resolved trees (LRTs) of binary-explainable trees and a simple, polynomial-time algorithm for the minimum cardinality completion of the arc set of a BMG to reach a BMG that can be explained by a binary tree.
Submitted 11 March, 2021;
originally announced March 2021.
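The abstract above describes best matches as evolutionarily closest genes with respect to a tree. A minimal sketch of one common way to state this, assumed here rather than taken from the listing: a leaf y is a best match of a leaf x if y has a different color (species) than x and no other leaf of y's color has a last common ancestor with x that lies strictly below lca(x, y). The representation of the tree as a child-to-parent dictionary and the name `best_match_arcs` are hypothetical.

```python
def best_match_arcs(parent, color):
    """Return the set of arcs (x, y) such that y is a best match of x.

    parent: dict mapping every non-root vertex to its parent (assumed format).
    color:  dict mapping every leaf to its color, e.g. the species of a gene.
    """
    def path_to_root(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path

    leaves = list(color)
    arcs = set()
    for x in leaves:
        # Position of each ancestor of x on the path from x up to the root.
        index_on_x_path = {v: i for i, v in enumerate(path_to_root(x))}
        best = {}  # color -> (lowest lca position, leaves attaining it)
        for y in leaves:
            if color[y] == color[x]:
                continue
            # lca(x, y) is the first vertex on y's root path that is also on x's.
            lca_index = next(
                index_on_x_path[v] for v in path_to_root(y) if v in index_on_x_path
            )
            idx, ys = best.get(color[y], (lca_index, []))
            if lca_index < idx:
                idx, ys = lca_index, []
            if lca_index == idx:
                ys.append(y)
            best[color[y]] = (idx, ys)
        for _, ys in best.values():
            arcs.update((x, y) for y in ys)
    return arcs
```

With the tree ((a1, b1), b2), where a1 has color A and b1, b2 have color B, the sketch returns the arcs (a1, b1), (b1, a1) and (b2, a1). Only a1 and b1 are reciprocal best matches.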
-
Least resolved trees for two-colored best match graphs
Authors:
David Schaller,
Manuela Geiß,
Marc Hellmuth,
Peter F. Stadler
Abstract:
2-colored best match graphs (2-BMGs) form a subclass of sink-free bi-transitive graphs that appears in phylogenetic combinatorics. There, 2-BMGs describe evolutionarily most closely related genes between a pair of species. They are explained by a unique least resolved tree (LRT). Introducing the concept of support vertices, we derive an $O(|V|+|E|\log^2|V|)$-time algorithm to recognize 2-BMGs and to construct their LRTs. The approach can be extended to also recognize binary-explainable 2-BMGs with the same complexity. An empirical comparison emphasizes the efficiency of the new algorithm.
Submitted 18 January, 2021;
originally announced January 2021.
-
Best Match Graphs with Binary Trees
Authors:
David Schaller,
Manuela Geiß,
Marc Hellmuth,
Peter F. Stadler
Abstract:
Best match graphs (BMGs) are a key intermediate in graph-based orthology detection and contain a large amount of information on the gene tree. We provide a near-cubic algorithm to determine whether a BMG is binary-explainable, i.e., whether it can be explained by a fully resolved gene tree and, if so, to construct such a tree. Moreover, we show that all such binary trees are refinements of the unique binary-resolvable tree (BRT), which in general is a substantial refinement of the also unique least resolved tree of a BMG. Finally, we show that the problem of editing an arbitrary vertex-colored graph to a binary-explainable BMG is NP-complete and provide an integer linear program formulation for this task.
Submitted 8 March, 2021; v1 submitted 1 November, 2020;
originally announced November 2020.
-
Complete Characterization of Incorrect Orthology Assignments in Best Match Graphs
Authors:
David Schaller,
Manuela Geiß,
Peter F. Stadler,
Marc Hellmuth
Abstract:
Genome-scale orthology assignments are usually based on reciprocal best matches. In the absence of horizontal gene transfer (HGT), every pair of orthologs forms a reciprocal best match. Incorrect orthology assignments therefore are always false positives in the reciprocal best match graph. We consider duplication/loss scenarios and characterize unambiguous false-positive (u-fp) orthology assignments, that is, edges in the best match graphs (BMGs) that cannot correspond to orthologs for any gene tree that explains the BMG. Moreover, we provide a polynomial-time algorithm to identify all u-fp orthology assignments in a BMG. Simulations show that at least $75\%$ of all incorrect orthology assignments can be detected in this manner. All results rely only on the structure of the BMGs and not on any a priori knowledge about underlying gene or species trees.
Submitted 26 November, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Hierarchical and Modularly-Minimal Vertex Colorings
Authors:
Dulce I. Valdivia,
Manuela Geiß,
Maribel Hernández Rosales,
Peter F. Stadler,
Marc Hellmuth
Abstract:
Cographs are exactly the hereditarily well-colored graphs, i.e., the graphs for which a greedy vertex coloring of every induced subgraph uses only the minimally necessary number of colors $χ(G)$. We show that greedy colorings are a special case of the more general hierarchical vertex colorings, which recently were introduced in phylogenetic combinatorics. Replacing cotrees by modular decomposition trees generalizes the concept of hierarchical colorings to arbitrary graphs. We show that every graph has a modularly-minimal coloring $σ$ satisfying $|σ(M)|=χ(M)$ for every strong module $M$ of $G$. This, in particular, shows that modularly-minimal colorings provide a useful device to design efficient coloring algorithms for certain hereditary graph classes. For cographs, the hierarchical colorings coincide with the modularly-minimal coloring. As a by-product, we obtain a simple linear-time algorithm to compute a modularly-minimal coloring of $P_4$-sparse graphs.
Submitted 14 April, 2020;
originally announced April 2020.
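The abstract above states that greedy vertex coloring is optimal on cographs for every vertex order. A small sketch can make this concrete on two four-vertex graphs: the cycle C4, which is a cograph, and the path P4, which is the forbidden induced subgraph of cographs. The adjacency-dictionary format and the helper names are assumptions for illustration; the brute-force loop over all vertex orders is only feasible for tiny graphs.

```python
from itertools import permutations


def greedy_coloring(adj, order):
    """Assign to each vertex, in the given order, the smallest unused color."""
    coloring = {}
    for v in order:
        used = {coloring[u] for u in adj[v] if u in coloring}
        c = 0
        while c in used:
            c += 1
        coloring[v] = c
    return coloring


def colors_used(adj, order):
    """Number of distinct colors the greedy coloring needs for this order."""
    return len(set(greedy_coloring(adj, order).values()))


# C4 (a cograph) and P4 (not a cograph); both have chromatic number 2.
c4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
p4 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

worst_c4 = max(colors_used(c4, order) for order in permutations(c4))
worst_p4 = max(colors_used(p4, order) for order in permutations(p4))
```

Here `worst_c4` is 2, matching the chromatic number, whereas `worst_p4` is 3: the order a, d, b, c forces a third color on the path.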
-
From Best Hits to Best Matches
Authors:
Peter F. Stadler,
Manuela Geiß,
David Schaller,
Alitzel López Sánchez,
Marcos E. González,
Dulce I. Valdivia,
Marc Hellmuth,
Maribel Hernández Rosales
Abstract:
Many of the commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.
If additive distances between genes are known, then evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.
Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.
Submitted 3 January, 2020;
originally announced January 2020.
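The abstract above takes reciprocal best hits, i.e., mutually most similar pairs of genes from different species, as its starting point. A minimal sketch of that notion follows; it does not cover the quartet-based inference the abstract goes on to describe. The nested-dictionary dissimilarity format, the gene names, and the use of exact equality to detect ties are assumptions for illustration.

```python
def reciprocal_best_hits(dist, species):
    """Return pairs (x, y), x < y, that are mutual best hits across species.

    dist:    symmetric nested dict, dist[x][y] = dissimilarity of genes x and y.
    species: dict mapping each gene to its species.
    Ties are kept: every gene attaining the minimum counts as a best hit.
    """
    def best_hits(x, target):
        candidates = [y for y in species if species[y] == target]
        d_min = min(dist[x][y] for y in candidates)
        return {y for y in candidates if dist[x][y] == d_min}

    genes = sorted(species)
    pairs = set()
    for x in genes:
        for y in genes:
            if x < y and species[x] != species[y]:
                if y in best_hits(x, species[y]) and x in best_hits(y, species[x]):
                    pairs.add((x, y))
    return pairs


# Toy data (hypothetical): one gene in species A, two genes in species B.
species = {"a1": "A", "b1": "B", "b2": "B"}
dist = {
    "a1": {"b1": 2, "b2": 5},
    "b1": {"a1": 2, "b2": 4},
    "b2": {"a1": 5, "b1": 4},
}
rbh = reciprocal_best_hits(dist, species)
```

In this toy input, a1 is the best hit of both b1 and b2 in species A, but only b1 is the best hit of a1 in species B, so `rbh` contains the single pair (a1, b1).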
-
Best Match Graphs and Reconciliation of Gene Trees with Species Trees
Authors:
Manuela Geiß,
Marcos E. González,
Alitzel López Sánchez,
Dulce I. Valdivia,
Marc Hellmuth,
Maribel Hernández Rosales,
Peter F. Stadler
Abstract:
A wide variety of problems in computational biology, most notably the assessment of orthology, are solved with the help of reciprocal best matches. Using an evolutionary definition of best matches that captures the intuition behind the concept, we clarify rigorously the relationships between reciprocal best matches, orthology, and evolutionary events under the assumption of duplication/loss scenarios. We show that the orthology graph is a subgraph of the reciprocal best match graph (RBMG). We furthermore give conditions under which an RBMG that is a cograph identifies the correct orthology relation. Using computer simulations, we find that most false-positive orthology assignments can be identified as so-called good quartets, and thus corrected, in the absence of horizontal transfer. Horizontal transfer, however, may also introduce false-negative orthology assignments.
Submitted 12 December, 2019; v1 submitted 26 April, 2019;
originally announced April 2019.
-
Reciprocal Best Match Graphs
Authors:
Manuela Geiß,
Peter F. Stadler,
Marc Hellmuth
Abstract:
Reciprocal best matches play an important role in numerous applications in computational biology, in particular as the basis of many widely used tools for orthology assessment. Nevertheless, very little is known about their mathematical structure. Here, we investigate the structure of reciprocal best match graphs (RBMGs). In order to abstract from the details of measuring distances, we define reciprocal best matches here as pairwise most closely related leaves in a gene tree, arguing that conceptually this is the notion that is pragmatically approximated by distance- or similarity-based heuristics. We start by showing that a graph $G$ is an RBMG if and only if its quotient graph w.r.t. a certain thinness relation is an RBMG. Furthermore, it is necessary and sufficient that all connected components of $G$ are RBMGs. The main result of this contribution is a complete characterization of RBMGs with 3 colors/species that can be checked in polynomial time. For 3 colors, there are three distinct classes of trees that are related to the structure of the phylogenetic trees explaining them. We derive an approach to recognize RBMGs with an arbitrary number of colors; it remains open, however, whether a polynomial-time algorithm for RBMG recognition exists. In addition, we show that RBMGs that at the same time are cographs (co-RBMGs) can be recognized in polynomial time. Co-RBMGs are characterized in terms of hierarchically colored cographs, a particular class of vertex-colored cographs that is introduced here. The (least resolved) trees that explain co-RBMGs can be constructed in polynomial time.
Submitted 29 August, 2019; v1 submitted 19 March, 2019;
originally announced March 2019.