Exact Algorithms for No-Rainbow Coloring and Phylogenetic Decisiveness

Authors: Ghazaleh Parvini, David Fernández-Baca

Abstract: The input to the no-rainbow hypergraph coloring problem is a hypergraph $H$ where every hyperedge has $r$ nodes. The question is whether there exists an $r$-coloring of the nodes of $H$ such that all $r$ colors are used and there is no rainbow hyperedge, i.e., no hyperedge uses all $r$ colors. The no-rainbow hypergraph $r$-coloring problem is known to be NP-complete for $r \geq 3$. The special case of $r=4$ is the complement of the phylogenetic decisiveness problem. Here we present a deterministic algorithm that solves the no-rainbow $r$-coloring problem in $O^*((r-1)^{(r-1)n/r})$ time and a randomized algorithm that solves the problem in $O^*((\frac{r}{2})^n)$ time.

Submitted 5 April, 2021; originally announced April 2021.

MSC Class: 05C15; 68Q25; 68W20; 68W40; 92D15
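
The problem statement in the abstract above can be made concrete with a naive exhaustive search. This is only an illustrative sketch, not the deterministic or randomized algorithm of the paper: it tries all $r^n$ colorings, and the function name `find_no_rainbow_coloring` and the encoding of nodes as integers `0..n-1` are assumptions made for the example.

```python
from itertools import product


def find_no_rainbow_coloring(n, r, hyperedges):
    """Brute-force search for a no-rainbow r-coloring (illustrative only).

    Nodes are 0..n-1 and each hyperedge is a tuple of r nodes.
    Returns a tuple `coloring` with coloring[v] in 0..r-1 such that all
    r colors are used and no hyperedge sees all r colors, or None if no
    such coloring exists.
    """
    for coloring in product(range(r), repeat=n):
        # The coloring must use every one of the r colors.
        if len(set(coloring)) != r:
            continue
        # A hyperedge is rainbow when its nodes carry r distinct colors.
        if all(len({coloring[v] for v in edge}) < r for edge in hyperedges):
            return coloring
    return None
```

For example, with four nodes, $r = 3$, and the single hyperedge `(0, 1, 2)`, a valid coloring exists; if every 3-subset of the four nodes is a hyperedge, none does, because any coloring that uses all three colors makes some hyperedge rainbow.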

arXiv:2002.09725 [pdf, other]

Testing the Agreement of Trees with Internal Labels

Authors: David Fernández-Baca, Lei Liu

Abstract: The input to the agreement problem is a collection $P = \{T_1, T_2, \dots , T_k\}$ of phylogenetic trees, called input trees, over partially overlapping sets of taxa. The question is whether there exists a tree $T$, called an agreement tree, whose taxon set is the union of the taxon sets of the input trees, such that for each $i \in \{1, 2, \dots , k\}$, the restriction of $T$ to the taxon set of $T_i$ is isomorphic to $T_i$. We give an $O(n k (\sum_{i \in [k]} d_i + \log^2(nk)))$ algorithm for a generalization of the agreement problem in which the input trees may have internal labels, where $n$ is the total number of distinct taxa in $P$, $k$ is the number of trees in $P$, and $d_i$ is the maximum number of children of a node in $T_i$.

Submitted 22 February, 2020; originally announced February 2020.

ACM Class: F.2; J.3
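
The definition of an agreement tree in the abstract above can be illustrated with a small checker that verifies a given candidate tree. This is a sketch under stated assumptions, not the algorithm of the paper: it only checks a candidate (it does not construct one), it ignores internal labels, and it assumes rooted trees encoded as nested tuples whose leaves are distinct taxon names. All function names are hypothetical.

```python
def leaves(tree):
    """Set of taxon names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, taxa):
    """Restrict a tree to `taxa`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [c for c in (restrict(child, taxa) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def canonical(tree):
    """Child-order-independent form, so equality means isomorphism."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def is_agreement_tree(candidate, input_trees):
    """True if `candidate` spans all taxa and restricts to each input tree."""
    all_taxa = set().union(*(leaves(t) for t in input_trees))
    if leaves(candidate) != all_taxa:
        return False
    return all(
        canonical(restrict(candidate, leaves(t))) == canonical(t)
        for t in input_trees
    )
```

For instance, `((("a", "b"), "c"), "d")` is an agreement tree for the inputs `(("a", "b"), "c")` and `(("a", "c"), "d")`, but not once `(("a", "d"), "c")` is among the inputs.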

arXiv:2002.09722 [pdf, ps, other]

Checking Phylogenetic Decisiveness in Theory and in Practice

Authors: Ghazaleh Parvini, Katherine Braught, David Fernández-Baca

Abstract: Suppose we have a set $X$ consisting of $n$ taxa and we are given information from $k$ loci from which to construct a phylogeny for $X$. Each locus offers information for only a fraction of the taxa. The question is whether this data suffices to construct a reliable phylogeny. The decisiveness problem expresses this question combinatorially. Although a precise characterization of decisiveness is known, the complexity of the problem is open. Here we relate decisiveness to a hypergraph coloring problem. We use this idea to (1) obtain lower bounds on the amount of coverage needed to achieve decisiveness, (2) devise an exact algorithm for decisiveness, (3) develop problem reduction rules, and use them to obtain efficient algorithms for inputs with few loci, and (4) devise an integer linear programming formulation of the decisiveness problem, which allows us to analyze data sets that arise in practice.

Submitted 22 February, 2020; originally announced February 2020.

MSC Class: 05C15; 05C65 ACM Class: F.2; J.3

arXiv:1910.07819 [pdf, other]

EvoZip: Efficient Compression of Large Collections of Evolutionary Trees

Authors: Balanand Jha, David Fernández-Baca, Akshay Deepak, Kumar Abhishek

Abstract: Phylogenetic trees represent evolutionary relationships among sets of organisms. Popular phylogenetic reconstruction approaches typically yield hundreds to thousands of trees on a common leaf set. Storing and sharing such large collections of trees requires a considerable amount of space and bandwidth. Furthermore, the huge size of phylogenetic tree databases can make search and retrieval operations time-consuming. Phylogenetic compression techniques are specialized compression techniques that exploit redundant topological information to achieve better compression of phylogenetic trees. Here, we present EvoZip, a new approach for phylogenetic tree compression. On average, EvoZip achieves 71.6% better compression and takes 80.71% less compression time and 60.47% less decompression time than TreeZip, the current state-of-the-art algorithm for phylogenetic tree compression. While EvoZip is based on TreeZip, it betters TreeZip due to (a) an improved bipartition and support list encoding scheme, (b) use of the Deflate compression algorithm, and (c) use of an efficient tree reconstruction algorithm. EvoZip is freely available online for use by the scientific community.

Submitted 17 October, 2019; originally announced October 2019.

arXiv:1811.01338 [pdf, other]

doi 10.1109/TCBB.2019.2911609

Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences

Authors: Ashish Ranjan, Md Shah Fahad, David Fernandez-Baca, Akshay Deepak, Sudhakar Tripathi

Abstract: The amino acid sequence is the most intrinsic form of a protein and expresses its primary structure. The order of amino acids in a sequence enables a protein to acquire a particular stable conformation that is responsible for the functions of the protein. This relationship between a sequence and its function motivates the need to analyse sequences for predicting protein functions. Early-generation computational methods using BLAST, FASTA, etc. perform function transfer based on sequence similarity with existing databases and are computationally slow. Although machine-learning-based approaches are fast, they fail to perform well for long protein sequences (i.e., protein sequences with more than 300 amino acid residues). In this paper, we introduce a novel method for constructing two separate feature sets for protein sequences based on analysis of 1) single fixed-sized segments and 2) multi-sized segments, using a bi-directional long short-term memory network. Further, a model based on the proposed feature set is combined with the state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features-based model to improve accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate promising results for both single-sized and multi-sized segment-based feature sets. While the former showed an improvement of +3.37% and +5.48%, the latter produced an improvement of +5.38% and +8.00%, respectively, for the two datasets over the state-of-the-art MLDA-based classifier. After combining the two models, there is a significant improvement of +7.41% and +9.21%, respectively, for the two datasets compared to the MLDA-based classifier. Specifically, the proposed approach performed well for long protein sequences and showed superior overall performance.

Submitted 19 June, 2019; v1 submitted 4 November, 2018; originally announced November 2018.

Journal ref: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019

arXiv:1605.02045 [pdf, other]

Fast Compatibility Testing for Phylogenies with Nested Taxa

Authors: Yun Deng, David Fernández-Baca

Abstract: Semi-labeled trees are phylogenies whose internal nodes may be labeled by higher-order taxa. Thus, a leaf labeled Mus musculus could nest within a subtree whose root node is labeled Rodentia, which itself could nest within a subtree whose root is labeled Mammalia. Suppose we are given a collection $\mathcal P$ of semi-labeled trees over various subsets of a set of taxa. The ancestral compatibility problem asks whether there is a semi-labeled tree $\mathcal T$ that respects the clusterings and the ancestor/descendant relationships implied by the trees in $\mathcal P$. We give a $\tilde{O}(M_{\mathcal{P}})$ algorithm for the ancestral compatibility problem, where $M_{\mathcal{P}}$ is the total number of nodes and edges in the trees in $\mathcal P$. Unlike the best previous algorithm, the running time of our method does not depend on the degrees of the nodes in the input trees.

Submitted 6 May, 2016; originally announced May 2016.

Comments: 3 figures

MSC Class: 05C85; 68Q25; 68W40; 92D15 ACM Class: F.2.2; G.2.2; J.3

arXiv:1510.07758 [pdf, other]

Fast Compatibility Testing for Rooted Phylogenetic Trees

Authors: Yun Deng, David Fernández-Baca

Abstract: We consider the following basic problem in phylogenetic tree construction. Let $\mathcal{P} = \{T_1, \ldots, T_k\}$ be a collection of rooted phylogenetic trees over various subsets of a set of species. The tree compatibility problem asks whether there is a tree $T$ with the following property: for each $i \in \{1, \dots, k\}$, $T_i$ can be obtained from the restriction of $T$ to the species set of $T_i$ by contracting zero or more edges. If such a tree $T$ exists, we say that $\mathcal{P}$ is compatible. We give a $\tilde{O}(M_\mathcal{P})$ algorithm for the tree compatibility problem, where $M_\mathcal{P}$ is the total number of nodes and edges in $\mathcal{P}$. Unlike previous algorithms for this problem, the running time of our method does not depend on the degrees of the nodes in the input trees. Thus, it is equally fast on highly resolved and highly unresolved trees.

Submitted 26 October, 2015; originally announced October 2015.

ACM Class: F.2.0

arXiv:1503.03877 [pdf, ps, other]

Constructing and Employing Tree Alignment Graphs for Phylogenetic Synthesis

Authors: Ruchi Chaudhary, David Fernandez-Baca, J. Gordon Burleigh

Abstract: Tree alignment graphs (TAGs) provide an intuitive data structure for storing phylogenetic trees that exhibits the relationships of the individual input trees and can potentially account for nested taxonomic relationships. This paper provides a theoretical foundation for the use of TAGs in phylogenetics. We provide a formal definition of a TAG that, unlike the previous definition, does not depend on the order in which input trees are provided. In the consensus case, when all input trees have the same leaf labels, we describe algorithms for constructing majority-rule and strict consensus trees using the TAG. When the input trees do not have identical sets of leaf labels, we describe how to determine if the input trees are compatible and, if they are compatible, to construct a supertree that contains the input trees.

Submitted 12 March, 2015; originally announced March 2015.

arXiv:1307.7828 [pdf, ps, other]

Characterizing Compatibility and Agreement of Unrooted Trees via Cuts in Graphs

Authors: Sudheer Vakati, David Fernández-Baca

Abstract: Deciding whether there is a single tree (a supertree) that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees. Tree compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. Here, we show that the latter characterization yields a natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement.

Submitted 30 July, 2013; originally announced July 2013.

Comments: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

arXiv:1210.3762 [pdf, ps, other]

On Two Graph-Theoretic Characterizations of Tree Compatibility

Authors: Sudheer Vakati, David Fernández-Baca

Abstract: Deciding whether a collection of unrooted trees is compatible is a fundamental problem in phylogenetics. Two different graph-theoretic characterizations of tree compatibility have recently been proposed. In one of these, tree compatibility is characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. An alternative characterization expresses the tree compatibility problem as a chordal graph sandwich problem in a structure known as the edge label intersection graph. In this paper we show that the characterization using edge label intersection graphs transforms to a characterization in terms of minimal cuts of the display graph. We show how these two characterizations are related to compatibility of splits. We also show how the characterization in terms of minimal cuts of the display graph is related to the characterization in terms of triangulation of the display graph.

Submitted 14 October, 2012; originally announced October 2012.

MSC Class: 68R10; 92B10 ACM Class: F.2.2; G.2.2; J.3

arXiv:1210.2665 [pdf, other]

Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

Authors: Ruchi Chaudhary, J. Gordon Burleigh, David Fernández-Baca

Abstract: We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.

Submitted 9 October, 2012; originally announced October 2012.

Comments: 16 pages, 11 figures
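
For orientation, the ordinary Robinson-Foulds distance that the abstract above generalizes can be sketched for singly-labeled trees. This is not the mul-tree generalization or the MulRF heuristic from the paper; it is the common cluster-based variant for rooted trees, and the nested-tuple encoding and the function names are assumptions made for the example.

```python
def clusters(tree):
    """Leaf sets below the internal nodes of a rooted nested-tuple tree.

    The root cluster (all leaves) is omitted because it is shared by any
    two trees on the same leaf set.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    root = visit(tree)
    found.discard(root)
    return found


def rf_distance(t1, t2):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))
```

With `t1 = ((("a", "b"), "c"), "d")` and `t2 = ((("a", "c"), "b"), "d")`, the trees share the cluster `{a, b, c}` and differ in `{a, b}` versus `{a, c}`, so `rf_distance(t1, t2)` is 2.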

arXiv:1205.6359 [pdf, other]

Extracting Conflict-free Information from Multi-labeled Trees

Authors: Akshay Deepak, David Fernández-Baca, Michelle M. McMahon

Abstract: A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation in MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved.

Submitted 28 June, 2012; v1 submitted 29 May, 2012; originally announced May 2012.

Comments: Submitted to the Workshop on Algorithms in Bioinformatics 2012 (http://algo12.fri.uni-lj.si/?file=wabi)

arXiv:1205.5779 [pdf, ps, other]

Improved Lower Bounds on the Compatibility of Multi-State Characters

Authors: Brad Shutters, Sudheer Vakati, David Fernández-Baca

Abstract: We study a long-standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: There exists a function $f(r)$ such that, for any set $C$ of $r$-state characters, $C$ is compatible if and only if every subset of $f(r)$ characters of $C$ is compatible. We show that for every $r \ge 2$, there exists an incompatible set $C$ of $\lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ $r$-state characters such that every proper subset of $C$ is compatible. Thus, $f(r) \ge \lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ for every $r \ge 2$. This improves the previous lower bound of $f(r) \ge r$ given by Meacham (1983), and generalizes the construction showing that $f(4) \ge 5$ given by Habib and To (2011). We prove our result via a result on quartet compatibility that may be of independent interest: For every integer $n \ge 4$, there exists an incompatible set $Q$ of $\lfloor\frac{n-2}{2}\rfloor\cdot\lceil\frac{n-2}{2}\rceil + 1$ quartets over $n$ labels such that every proper subset of $Q$ is compatible. We contrast this with a result on the compatibility of triplets: For every $n \ge 3$, if $R$ is an incompatible set of more than $n-1$ triplets over $n$ labels, then some proper subset of $R$ is incompatible. We show this upper bound is tight by exhibiting, for every $n \ge 3$, a set $R$ of $n-1$ triplets over $n$ taxa such that $R$ is incompatible, but every proper subset of $R$ is compatible.

Submitted 25 May, 2012; originally announced May 2012.
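
As a quick worked check of the bound stated in the abstract above (plain arithmetic on the stated formula, not additional results from the paper): for $r = 4$, $\lfloor\frac{4}{2}\rfloor\cdot\lceil\frac{4}{2}\rceil + 1 = 2 \cdot 2 + 1 = 5$, which agrees with the bound $f(4) \ge 5$ attributed to Habib and To (2011). For $r = 5$, $\lfloor\frac{5}{2}\rfloor\cdot\lceil\frac{5}{2}\rceil + 1 = 2 \cdot 3 + 1 = 7$, compared with $f(5) \ge 5$ from the earlier bound $f(r) \ge r$. The quartet bound behaves the same way with $n - 2$ in place of $r$: for $n = 6$ labels it gives $\lfloor\frac{4}{2}\rfloor\cdot\lceil\frac{4}{2}\rceil + 1 = 5$ quartets.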

arXiv:1106.0874 [pdf, ps, other]

A Simple Characterization of the Minimal Obstruction Sets for Three-State Perfect Phylogenies

Authors: Brad Shutters, David Fernández-Baca

Abstract: Lam, Gusfield, and Sridhar (2009) showed that a set of three-state characters has a perfect phylogeny if and only if every subset of three characters has a perfect phylogeny. They also gave a complete characterization of the sets of three three-state characters that do not have a perfect phylogeny. However, it is not clear from their characterization how to find a subset of three characters that does not have a perfect phylogeny without testing all triples of characters. In this note, we build upon their result by giving a simple characterization of when a set of three-state characters does not have a perfect phylogeny that can be inferred from testing all pairs of characters.

Submitted 5 June, 2011; originally announced June 2011.

arXiv:1004.4196 [pdf, ps, other]

Graph Triangulations and the Compatibility of Unrooted Phylogenetic Trees

Authors: Sudheer Vakati, David Fernández-Baca

Abstract: We characterize the compatibility of a collection of unrooted phylogenetic trees as a question of determining whether a graph derived from these trees (the display graph) has a specific kind of triangulation, which we call legal. Our result is a counterpart to the well-known triangulation-based characterization of the compatibility of undirected multi-state characters.

Submitted 23 April, 2010; originally announced April 2010.

MSC Class: 68R10; 92B10 ACM Class: F.2.2; G.2.2; J.3

arXiv:0906.5089 [pdf, ps, other]

Comparing and Aggregating Partially Resolved Trees

Authors: Mukul S. Bansal, Jianrong Dong, David Fernández-Baca

Abstract: We define, analyze, and give efficient algorithms for two kinds of distance measures for rooted and unrooted phylogenies. For rooted trees, our measures are based on the topologies the input trees induce on triplets; that is, on three-element subsets of the set of species. For unrooted trees, the measures are based on quartets (four-element subsets). Triplet and quartet-based distances provide a robust and fine-grained measure of the similarities between trees. The distinguishing feature of our distance measures relative to traditional quartet and triplet distances is their ability to deal cleanly with the presence of unresolved nodes, also called polytomies. For rooted trees, these are nodes with more than two children; for unrooted trees, they are nodes of degree greater than three. Our first class of measures are parametric distances, where there is a parameter that weighs the difference between an unresolved triplet/quartet topology and a resolved one. Our second class of measures are based on Hausdorff distance. Each tree is viewed as a set of all possible ways in which the tree could be refined to eliminate unresolved nodes. The distance between the original (unresolved) trees is then taken to be the Hausdorff distance between the associated sets of fully resolved trees, where the distance between trees in the sets is the triplet or quartet distance, as appropriate.

Submitted 27 June, 2009; originally announced June 2009.

Comments: 34 pages

ACM Class: F.2.2; G.2; J.3

Showing 1–16 of 16 results for author: Fernandez-Baca, D