-
An expanded evaluation of protein function prediction methods shows an improvement in accuracy
Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)
Abstract:
Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to a discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.
Submitted 2 January, 2016;
originally announced January 2016.
-
ProbMetab: an R package for Bayesian probabilistic annotation of LC-MS based metabolomics
Authors: Ricardo R. Silva, Fabien Jourdan, Diego M. Salvanha, Fabien Letisse, Emilien L. Jamin, Simone Guidetti-Gonzalez, Carlos A. Labate, Ricardo Z. N. Vêncio
Abstract:
We present ProbMetab, an R package which promotes substantial improvement in automatic probabilistic LC-MS based metabolome annotation. The inference engine core is based on a Bayesian model implemented to: (i) allow diverse sources of experimental data and metadata to be systematically incorporated into the model with alternative ways to calculate the likelihood function; and (ii) allow sensitive selection of biologically meaningful biochemical reaction databases as a Dirichlet-categorical prior distribution. Additionally, to ensure result interpretation by systems biologists, we display the annotation in a network where observed mass peaks are connected if their candidate metabolites are substrate/product of known biochemical reactions. This graph can be overlaid with other graph-based analyses, such as partial correlation networks, in a visualization scheme exported to Cytoscape, with web and standalone versions. ProbMetab was implemented in a modular fashion to fit together with established upstream (xcms, CAMERA, AStream, mzMatch.R, etc.) and downstream R package tools (GeneNet, RCytoscape, DiffCorr, etc.). ProbMetab, along with extensive documentation and case studies, is freely available under a GNU license at: http://labpib.fmrp.usp.br/methods/probmetab/.
Submitted 27 October, 2013;
originally announced October 2013.
-
Simcluster: clustering enumeration gene expression data on the simplex space
Authors: Ricardo ZN Vêncio, Leonardo Varuzza, Carlos AB Pereira, Helena Brentani, Ilya Shmulevich
Abstract:
Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space. Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster/. Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
Submitted 22 June, 2007; v1 submitted 2 March, 2007;
originally announced March 2007.
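To illustrate why Euclidean methods can mislead on the compositional data described in the Simcluster abstract, here is a minimal sketch of the Aitchison distance, a standard distance in compositional data analysis. The abstract does not say which distance Simcluster uses, so this choice is an assumption, and the function names `clr` and `aitchison_distance` and the `pseudocount` parameter are hypothetical, not part of Simcluster.

```python
import math


def clr(counts, pseudocount=0.0):
    """Centered log-ratio transform of a vector of tag counts.

    `pseudocount` is a hypothetical parameter: pass a positive value
    when the vector contains zeros, since log(0) is undefined.
    """
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    # Close the vector to proportions, then take logs.
    logs = [math.log(x / total) for x in shifted]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log removes the dependence on the total count.
    return [v - mean_log for v in logs]


def aitchison_distance(a, b, pseudocount=0.0):
    """Euclidean distance between the clr-transformed vectors."""
    ca, cb = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def euclidean_distance(a, b):
    """Plain Euclidean distance on the raw counts, for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two libraries with the same relative expression but a tenfold
# difference in sequencing depth.
shallow = [10, 20, 30]
deep = [100, 200, 300]
print(aitchison_distance(shallow, deep))
print(euclidean_distance(shallow, deep))
```

The two example libraries have identical proportions, so the Aitchison distance between them is zero up to floating-point error, while the raw Euclidean distance is large because it reflects only sequencing depth.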