-
Extraction of Deep Phylogenetic Signal and Improved Resolution of Evolutionary Events within the recA/RAD51 Phylogeny
Authors:
Sree V. Chintapalli,
Gaurav Bhardwaj,
Jagadish Babu,
Loukia Hadjiyianni,
Yoojin Hong,
Zhenhai Zhang,
Xiaofan Zhou,
Hong Ma,
Andriy Anishkin,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
The recA/RAD51 gene family encodes a diverse set of recombinase proteins that effect homologous recombination, DNA repair, and genome stability. The recA gene family is expressed in almost all species of Eubacteria, Archaea, and Eukaryotes, and even in some viruses. To date, efforts to resolve the deep evolutionary origins of this ancient protein family have been hindered, in part, by the high sequence divergence between families (i.e. ~30% identity between paralogous groups). Through (i) large taxon sampling, (ii) the use of a phylogenetic algorithm designed for measuring highly divergent paralogs, and (iii) novel Evolutionary Spatial Dynamics simulation and analytical tools, we obtained a robust, parsimonious and more refined phylogenetic history of the recA/RAD51 superfamily. Taken together, our model for the evolution of the recA/RAD51 family provides a better understanding of the ancient origin of recA proteins and of the multiple events leading to the diversification of recA homologs in eukaryotes, including the discovery of additional RAD51 sub-families.
Submitted 14 June, 2012;
originally announced June 2012.
-
Towards Solving the Inverse Protein Folding Problem
Authors:
Yoojin Hong,
Kyung Dae Ko,
Gaurav Bhardwaj,
Zhenhai Zhang,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
Accurately assigning folds for divergent protein sequences is a major obstacle to structural studies and underlies the inverse protein folding problem. Herein, we outline our theories for fold-recognition in the "twilight-zone" of sequence similarity (<25% identity). Our analyses demonstrate that structural sequence profiles built using Position-Specific Scoring Matrices (PSSMs) significantly outperform multiple popular homology-modeling algorithms for relating and predicting structures given only their amino acid sequences. Importantly, structural sequence profiles reconstitute SCOP fold classifications in control and test datasets. Results from our experiments suggest that structural sequence profiles can be used to rapidly annotate protein folds at proteomic scales. We propose that encoding the entire Protein DataBank (~1070 folds) into structural sequence profiles would extract interoperable information capable of improving most if not all methods of structural modeling.
Submitted 29 August, 2010;
originally announced August 2010.
-
Theories on PHYlogenetic ReconstructioN (PHYRN)
Authors:
Gaurav Bhardwaj,
Zhenhai Zhang,
Yoojin Hong,
Kyung Dae Ko,
Gue Su Chang,
Evan J. Smith,
Lindsay A. Kline,
D. Nicholas Hartranft,
Edward C. Holmes,
Randen L. Patterson,
Damian B. van Rossum
Abstract:
The inability to resolve deep node relationships of highly divergent/rapidly evolving protein families is a major factor that stymies evolutionary studies. In this manuscript, we propose a Multiple Sequence Alignment (MSA)-independent method to infer evolutionary relationships. We previously demonstrated that phylogenetic profiles built using position specific scoring matrices (PSSMs) are capable of constructing informative evolutionary histories (1;2). In this manuscript, we theorize that PSSMs derived specifically from the query sequences used to construct the phylogenetic tree will improve this method for the study of rapidly evolving proteins. To test this theory, we performed phylogenetic analyses of a benchmark protein superfamily (reverse transcriptases (RT)) as well as simulated datasets. When we compare the results obtained from our method, PHYlogenetic ReconstructioN (PHYRN), with other MSA-dependent methods, we observe that PHYRN provides a 4- to 100-fold increase in accurate measurements at deep nodes. As phylogenetic profiles are used as the information source, rather than MSA, we propose PHYRN as a paradigm shift in studying evolution when MSA approaches fail. Perhaps most importantly, due to the improvements in our computational approach and the availability of vast amounts of sequencing data, PHYRN is scalable to thousands of sequences. Taken together with PHYRN's adaptability to any protein family, this method can serve as a tool for resolving ambiguities in evolutionary studies of rapidly evolving/highly divergent protein families.
Submitted 26 February, 2010; v1 submitted 2 February, 2010;
originally announced February 2010.
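As a rough illustration of how profiles, rather than an alignment, could serve as the information source for a tree, the sketch below compares sequences by the distance between their profile vectors. It is only a sketch under stated assumptions: the abstract does not give PHYRN's scoring or distance function, so the Euclidean distance, the sequence names, and the score values here are all hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length profile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(profiles):
    """Pairwise distances between all named profiles, keyed by name pair."""
    names = list(profiles)
    return {(p, q): euclidean(profiles[p], profiles[q])
            for p in names for q in names}

# Hypothetical profiles: each row holds one query sequence's scores
# against three PSSMs. The numbers are made up for illustration.
profiles = {
    "seq1": [0.9, 0.1, 0.0],
    "seq2": [0.8, 0.2, 0.0],
    "seq3": [0.0, 0.1, 0.9],
}

distances = distance_matrix(profiles)
# seq1 and seq2 have similar profiles, so they end up closer to each
# other than either is to seq3.
print(distances[("seq1", "seq2")] < distances[("seq1", "seq3")])
```

A matrix like `distances` could then be handed to any distance-based tree-building method; which one PHYRN uses is not stated in the abstract.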
-
Mapping Complex Networks: Exploring Boolean Modeling of Signal Transduction Pathways
Authors:
Gaurav Bhardwaj,
Christine P. Wells,
Reka Albert,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
In this study, we explored the utility of a descriptive and predictive bionetwork model for phospholipase C-coupled calcium signaling pathways, built with non-kinetic experimental information. Boolean models generated from these data yield oscillatory activity patterns for both the endoplasmic reticulum resident inositol-1,4,5-trisphosphate receptor (IP3R) and the plasma-membrane resident canonical transient receptor potential channel 3 (TRPC3). These results are specific, as randomization of the Boolean operators ablates oscillatory pattern formation. Furthermore, knock-out simulations of the IP3R, TRPC3, and multiple other proteins recapitulate experimentally derived results. The potential of this approach can be observed in its ability to predict previously undescribed cellular phenotypes using in vitro experimental data. Indeed, our cellular analysis of the developmental and calcium-regulatory protein, DANGER1a, confirms the counter-intuitive predictions from our Boolean models in two highly relevant cellular models. Based on these results, we theorize that with sufficient legacy knowledge and/or computational biology predictions, Boolean networks provide a robust method for predictive modeling of any biological system.
Submitted 14 December, 2009; v1 submitted 3 November, 2009;
originally announced November 2009.
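To make the ideas of oscillation and knock-out simulation in a Boolean model concrete, here is a minimal sketch of a synchronously updated Boolean network. It assumes a hypothetical three-node negative feedback loop with generic node names (`A`, `B`, `C`); it is not the phospholipase C network from the paper, and synchronous updating is an assumption, since the abstract does not state the update scheme.

```python
def step(state, rules):
    """Synchronously update every node from the previous state."""
    return {node: rule(state) for node, rule in rules.items()}

def find_cycle(state, rules, max_steps=100):
    """Return (transient_length, period) of the attractor that is reached,
    or None if no state repeats within max_steps."""
    seen = {}
    for t in range(max_steps):
        key = tuple(sorted(state.items()))
        if key in seen:
            return seen[key], t - seen[key]
        seen[key] = t
        state = step(state, rules)
    return None

# Hypothetical negative feedback loop: A activates B, B activates C,
# and C inhibits A.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
start = {"A": True, "B": False, "C": False}

# Knock-out of A: its rule is replaced by a constant False.
knockout_rules = {**rules, "A": lambda s: False}
knockout_start = {"A": False, "B": False, "C": False}

print(find_cycle(start, rules))                    # period > 1: oscillation
print(find_cycle(knockout_start, knockout_rules))  # period 1: steady state
```

In this toy loop the intact network cycles through several states, while fixing one node off collapses it to a steady state, which is the kind of qualitative comparison a knock-out simulation makes.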
-
Brainstorming through the Sequence Universe: Theories on the Protein Problem
Authors:
Kyung Dae Ko,
Yoojin Hong,
Gaurav Bhardwaj,
Teresa M. Killick,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
Just as physicists strive to develop a TOE (theory of everything), which explains and unifies the physical laws of the universe, the life-scientist wishes to uncover the TOE as it relates to cellular systems. This can only be achieved with a quantitative platform that can comprehensively deduce and relate protein structure, function, and evolution of genomes and proteomes in a comparative fashion. Were this perfected, proper analyses would start to uncover the underlying physical laws governing the emergent behavior of biological systems and the evolutionary pressures responsible for functional innovation. In the near term, such methodology would allow the vast quantities of uncharacterized primary amino acid sequences (e.g. from metagenomic samples) to be rapidly decoded. Analogous to natural products found in the Amazon, genomes of living organisms contain large numbers of proteins that would prove useful as new therapeutics for human health, energy sources, and/or waste management solutions if they could be identified and characterized. We previously theorized that phylogenetic profiles could provide a quantitative platform for obtaining unified measures of structure, function, and evolution (SF&E) (1). In the present manuscript, we present data that support this theory and demonstrate how refinements of our analysis algorithms improve the performance of phylogenetic profiles for deriving structural/functional relationships.
Submitted 3 November, 2009;
originally announced November 2009.
-
Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding
Authors:
Yoojin Hong,
Jaewoo Kang,
Dongwon Lee,
Randen L. Patterson,
Damian B. van Rossum
Abstract:
We theorize that phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of phylogenetic profiles is the interoperable data format (e.g. alignment information, physicochemical information, genomic information, etc). Indeed, we have previously demonstrated that Position Specific Scoring Matrices (PSSMs) are an informative M-dimension which can be scored from quantitative measures of embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, even in the twilight zone of sequence similarity (<25% identity) (1-5). Although powerful, our previous embedding strategy suffered from contaminating alignments (embedded AND unmodified) and computational expense. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy (Adaptive GDDA-BLAST, Ada-BLAST). Ada-BLAST is, on average, up to ~19-fold faster and has similar sensitivity to our previous method. Further, we provide data demonstrating the benefits of embedded alignment measurements for isolating secondary structural elements and for classifying transmembrane-domain structure/function. We theorize that sequence-embedding is one of multiple ways that low-identity alignments can be measured and incorporated into high-performance PSSM-based phylogenetic profiles.
Submitted 3 November, 2009;
originally announced November 2009.
-
Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution
Authors:
Kyung Dae Ko,
Yoojin Hong,
Gue Su Chang,
Gaurav Bhardwaj,
Damian B. van Rossum,
Randen L. Patterson
Abstract:
The sequence of amino acids in a protein is believed to determine its native state structure, which in turn is related to the functionality of the protein. In addition, information pertaining to evolutionary relationships is contained in homologous sequences. One powerful method for inferring these sequence attributes is through comparison of a query sequence with reference sequences that contain significant homology and whose structure, function, and/or evolutionary relationships are already known. In spite of decades of concerted work, there is no simple framework for deducing structure, function, and evolutionary (SF&E) relationships directly from sequence information alone, especially when the pair-wise identity is less than a threshold figure of ~25% [1,2]. However, recent research has shown that sequence identity as low as 8% is sufficient to yield common structure/function relationships and sequence identities as large as 88% may yet result in distinct structure and function [3,4]. Starting with a basic premise that protein sequence encodes information about SF&E, one might ask how one could tease out these measures in an unbiased manner. Here we present a unified framework for inferring SF&E from sequence information using a knowledge-based approach which generates phylogenetic profiles in an unbiased manner. We illustrate the power of phylogenetic profiles generated using the Gestalt Domain Detection Algorithm Basic Local Alignment Search Tool (GDDA-BLAST) to derive structural domains, functional annotation, and evolutionary relationships for a host of ion-channels and human proteins of unknown function. These data are in excellent accord with published data and new experiments. Our results suggest that there is a wealth of previously unexplored information in protein sequence.
Submitted 15 June, 2008;
originally announced June 2008.