-
An Investigation in Optimal Encoding of Protein Primary Sequence for Structure Prediction by Artificial Neural Networks
Authors:
Aaron Hein,
Casey Cole,
Homayoun Valafar
Abstract:
Machine learning and the use of neural networks have increased precipitously over the past few years, primarily due to the ever-increasing accessibility of data and the growth of computational power. It has become increasingly easy to harness the power of machine learning for predictive tasks. Protein structure prediction is one area where neural networks are becoming increasingly popular and successful. Although very powerful, artificial neural networks (ANNs) require selection of the most appropriate input/output encoding, architecture, and class to produce optimal results. In this investigation we have explored and evaluated the effect of several conventional and newly proposed input encodings and selected an optimal architecture. We considered 11 variations of input encoding, 11 alternative window sizes, and 7 different architectures. In total, we evaluated 2,541 permutations in application to the training and testing of more than 10,000 protein structures over the course of 3 months. Our investigations concluded that one-hot encoding, the use of LSTMs, and window sizes of 9, 11, and 15 produce the optimal outcome. Through this optimization, we were able to improve the quality of protein structure prediction by predicting the φ dihedrals to within 14°-16° and ψ dihedrals to within 23°-25°. This is a notable improvement compared to previous similar investigations.
Submitted 2 August, 2020;
originally announced August 2020.
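The one-hot, sliding-window input encoding named in the abstract above can be illustrated with a minimal sketch. The residue ordering, the extra padding channel, and the names `AMINO_ACIDS`, `one_hot`, and `one_hot_windows` are assumptions made for illustration; the abstract does not describe the authors' actual implementation.

```python
# Illustrative sketch only: alphabet order and padding scheme are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = len(AMINO_ACIDS)                # index of the extra "padding" channel
N_CHANNELS = PAD + 1


def one_hot(index):
    """Return a list of N_CHANNELS zeros with a single 1 at `index`."""
    vec = [0] * N_CHANNELS
    vec[index] = 1
    return vec


def one_hot_windows(sequence, window=9):
    """Encode each residue as a window of one-hot vectors centred on it.

    Positions that fall off either end of the sequence use the padding
    channel. Raises ValueError for even window sizes or unknown residues.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    codes = [PAD] * half + [AMINO_ACIDS.index(aa) for aa in sequence] + [PAD] * half
    return [[one_hot(c) for c in codes[i:i + window]] for i in range(len(sequence))]


windows = one_hot_windows("ACDE", window=3)
```

Each residue yields a `window` x 21 block, so a network that predicts per-residue dihedral angles sees the residue together with its sequence neighbours.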
-
A Preliminary Investigation in the Molecular Basis of Host Shutoff Mechanism in SARS-CoV
Authors:
Niharika Pandala,
Casey A. Cole,
Devaun McFarland,
Anita Nag,
Homayoun Valafar
Abstract:
Recent events leading to the worldwide pandemic of COVID-19 have demonstrated the effective use of genomic sequencing technologies to establish the genetic sequence of this virus. In contrast, the COVID-19 pandemic has demonstrated the absence of computational approaches to understand the molecular basis of this infection rapidly. Here we present an integrated approach to the study of the nsp1 protein in SARS-CoV-1, which plays an essential role in maintaining the expression of viral proteins and further disabling the host protein expression, also known as the host shutoff mechanism. We present three independent methods of evaluating two potential binding sites speculated to participate in host shutoff by nsp1. We have combined results from computed models of nsp1, with deep mining of all existing protein structures (using PDBMine), and binding site recognition (using msTALI) to examine the two sites consisting of residues 55-59 and 73-80. Based on our preliminary results, we conclude that the residues 73-80 appear as the regions that facilitate the critical initial steps in the function of nsp1. Given the 90% sequence identity between nsp1 from SARS-CoV-1 and SARS-CoV-2, we conjecture the same critical initiation step in the function of COVID-19 nsp1.
Submitted 23 July, 2020;
originally announced July 2020.
-
De Novo Assembly of Uca minax Transcriptome from Next Generation Sequencing
Authors:
Hanin Omar,
Casey A. Cole,
Arjang Fahim,
Giuliana Gusmaroli,
Stephen Borgianini,
Homayoun Valafar
Abstract:
High-throughput cDNA sequencing (RNA-seq) is a very powerful technique to quantify gene expression in an unbiased way. The Crustacean family is among the groups of organisms sparsely represented in current genomic databases. Here we present transcriptome data from Uca minax (red-jointed fiddler crab) as an opportunity to extend our knowledge. Next generation sequencing was performed on six tissue samples from Uca minax using the Illumina HiSeq system. Six transcriptome libraries were created using Trinity, a free, open-source software tool for de novo transcriptome assembly of high-throughput mRNA sequencing (RNA-seq) data in the absence of a reference genome. In addition, several tools that aid in management of data were used, such as RSEM, Bowtie, Blast, and IGV, a tool for visualizing RNA-seq analysis results. Fast quality control (FastQC) analysis of the raw sequenced files revealed that both adapter and PCR primer sequences were prevalent, which may require a preprocessing step.
Submitted 9 January, 2020;
originally announced January 2020.
-
An Investigation of Minimum Data Requirement for Successful Structure Determination of Pf2048.1 with REDCRAFT
Authors:
Casey A. Cole,
Daniela Ishimaru,
Mirko Hennig,
Homayoun Valafar
Abstract:
Traditional approaches to elucidation of protein structures by NMR spectroscopy rely on distance restraints also known as nuclear Overhauser effects (NOEs). The use of NOEs as the primary source of structure determination by NMR spectroscopy is time-consuming and expensive. Residual Dipolar Couplings (RDCs) have become an alternate approach for structure calculation by NMR spectroscopy. In this work we report our results for structure calculation of the novel protein PF2048.1 from RDC data and establish the minimum data requirement for successful structure calculation using the software package REDCRAFT. Our investigations start with utilizing four sets of synthetic RDC data in two alignment media and proceed by reducing the RDC data to the final limit of {CN, NH} and {NH} from two alignment media respectively. Our results indicate that structure elucidation of this protein is possible with as little as {CN, NH} and {NH} to within 0.533 Å of the target structure.
Submitted 9 January, 2020;
originally announced January 2020.
-
PDBMine: A Reformulation of the Protein Data Bank to Facilitate Structural Data Mining
Authors:
Casey A Cole,
Christopher Ott,
Diego Valdes,
Homayoun Valafar
Abstract:
Large-scale initiatives such as the Human Genome Project, Structural Genomics, and individual research teams have provided large deposits of genomic and proteomic data. The transfer of data to knowledge has become one of the existing challenges, which is a consequence of capturing data in databases that are optimally designed for archiving and not mining. In this research, we have targeted the Protein Data Bank (PDB) and demonstrated a transformation of its content, named PDBMine, that reduces storage space by an order of magnitude and allows for powerful mining in relation to the topic of protein structure determination. We have demonstrated the utility of PDBMine in exploring the prevalence of dimeric and trimeric amino acid sequences and provided a mechanism of predicting protein structure.
Submitted 19 November, 2019;
originally announced November 2019.
-
Improvements of the REDCRAFT Software Package
Authors:
Casey A Cole,
Caleb Parks,
Julian Rachele,
Homayoun Valafar
Abstract:
Traditional approaches to elucidation of protein structures by NMR spectroscopy rely on distance restraints also known as nuclear Overhauser effects (NOEs). The use of NOEs as the primary source of structure determination by NMR spectroscopy is time-consuming and expensive. Residual Dipolar Couplings (RDCs) have become an alternate approach for structure calculation by NMR spectroscopy. In previous works, the software package REDCRAFT has been presented as a means of harnessing the information contained in RDCs for structure calculation of proteins. In this work, we present significant improvements to the REDCRAFT package including: refinement of the decimation procedure, the inclusion of a graphical user interface, adoption of NEF standards, and addition of scripts for enhanced protein modeling options. The improvements to REDCRAFT have resulted in the ability to fold proteins that the previous versions were unable to fold. For instance, we report the results of folding of the protein 1A1Z in the presence of highly erroneous data.
Submitted 19 November, 2019;
originally announced November 2019.
-
Evaluation of tools for differential gene expression analysis by RNA-seq on a 48 biological replicate experiment
Authors:
Nicholas J. Schurch,
Pieta Schofield,
Marek Gierliński,
Christian Cole,
Alexander Sherstnev,
Vijender Singh,
Nicola Wrobel,
Karim Gharbi,
Gordon G. Simpson,
Tom Owen-Hughes,
Mark Blaxter,
Geoffrey J. Barton
Abstract:
An RNA-seq experiment with 48 biological replicates in each of 2 conditions was performed to determine the number of biological replicates ($n_r$) required, and to identify the most effective statistical analysis tools for identifying differential gene expression (DGE). When $n_r=3$, seven of the nine tools evaluated give true positive rates (TPR) of only 20 to 40 percent. For high fold-change genes ($|\log_{2}(FC)|\gt2$) the TPR is $\gt85$ percent. Two tools performed poorly, over- or under-predicting the number of differentially expressed genes. Increasing replication gives a large increase in TPR when considering all DE genes but only a small increase for high fold-change genes. Achieving a TPR $\gt85$ percent across all fold-changes requires $n_r\gt20$. For future RNA-seq experiments these results suggest $n_r\gt6$, rising to $n_r\gt12$ when identifying DGE irrespective of fold-change is important. For $6 \lt n_r \lt 12$, superior TPR makes edgeR the leading tool tested. For $n_r \ge12$, minimizing false positives is more important and DESeq outperforms the other tools.
Submitted 8 June, 2015; v1 submitted 8 May, 2015;
originally announced May 2015.
-
Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
Authors:
Marek Gierliński,
Christian Cole,
Pietà Schofield,
Nicholas J. Schurch,
Alexander Sherstnev,
Vijender Singh,
Nicola Wrobel,
Karim Gharbi,
Gordon Simpson,
Tom Owen-Hughes,
Mark Blaxter,
Geoffrey J. Barton
Abstract:
High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. A 48-replicate RNA-seq experiment in yeast was performed and the data were tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ~0.01. The high-replicate data also allowed for strict quality control and screening of bad replicates, which can drastically affect the gene read-count distribution. RNA-seq data have been submitted to the ENA archive with project ID PRJEB5348.
Submitted 4 May, 2015;
originally announced May 2015.
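The "line of constant dispersion" mentioned in the abstract above can be made concrete with a short sketch. It assumes the negative binomial parametrization commonly used for RNA-seq counts, in which the variance is mean + dispersion x mean^2; the function names `nb_variance` and `nb_cv` are hypothetical, and only the dispersion value of about 0.01 comes from the abstract.

```python
import math


def nb_variance(mean, dispersion=0.01):
    """Variance of a negative binomial count with the given mean.

    Assumes the parametrization var = mean + dispersion * mean**2, so
    dispersion = 0 reduces to the Poisson case (var = mean).
    """
    return mean + dispersion * mean ** 2


def nb_cv(mean, dispersion=0.01):
    """Coefficient of variation (standard deviation divided by mean)."""
    return math.sqrt(nb_variance(mean, dispersion)) / mean


# Print the mean-variance relation at a few expression levels.
for mu in (10.0, 100.0, 1000.0, 1e6):
    print(mu, nb_variance(mu), nb_cv(mu))
```

Under this assumption the Poisson term dominates for low counts, while for highly expressed genes the coefficient of variation approaches the square root of the dispersion, roughly 10 percent for a dispersion of 0.01.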
-
Ancient human genomes suggest three ancestral populations for present-day Europeans
Authors:
Iosif Lazaridis,
Nick Patterson,
Alissa Mittnik,
Gabriel Renaud,
Swapan Mallick,
Karola Kirsanow,
Peter H. Sudmant,
Joshua G. Schraiber,
Sergi Castellano,
Mark Lipson,
Bonnie Berger,
Christos Economou,
Ruth Bollongino,
Qiaomei Fu,
Kirsten I. Bos,
Susanne Nordenfelt,
Heng Li,
Cesare de Filippo,
Kay Prüfer,
Susanna Sawyer,
Cosimo Posth,
Wolfgang Haak,
Fredrik Hallgren,
Elin Fornander,
Nadin Rohland
, et al. (95 additional authors not shown)
Abstract:
We sequenced genomes from a $\sim$7,000 year old early farmer from Stuttgart in Germany, an $\sim$8,000 year old hunter-gatherer from Luxembourg, and seven $\sim$8,000 year old hunter-gatherers from southern Sweden. We analyzed these data together with other ancient genomes and 2,345 contemporary humans to show that the great majority of present-day Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE), who were most closely related to Upper Paleolithic Siberians and contributed to both Europeans and Near Easterners; and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry. We model these populations' deep relationships and show that EEF had $\sim$44% ancestry from a "Basal Eurasian" lineage that split prior to the diversification of all other non-African lineages.
Submitted 1 April, 2014; v1 submitted 23 December, 2013;
originally announced December 2013.
-
Improved annotation of 3-prime untranslated regions and complex loci by combination of strand-specific Direct RNA Sequencing, RNA-seq and ESTs
Authors:
Nick Schurch,
Christian Cole,
Alexander Sherstnev,
Junfang Song,
Céline Duc,
Kate G. Storey,
W. H. Irwin McLean,
Sara J. Brown,
Gordon G. Simpson,
Geoffrey J. Barton
Abstract:
The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression; and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.
Submitted 11 November, 2013;
originally announced November 2013.