-
Stacking Models for Nearly Optimal Link Prediction in Complex Networks
Authors:
Amir Ghasemian,
Homa Hosseinmardi,
Aram Galstyan,
Edoardo M. Airoldi,
Aaron Clauset
Abstract:
Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results.
Submitted 17 September, 2019;
originally announced September 2019.
-
Quantifying homologous proteins and proteoforms
Authors:
Dmitry Malioutov,
Tianchi Chen,
Jacob Jaffe,
Edoardo Airoldi,
Steven Carr,
Bogdan Budnik,
Nikolai Slavov
Abstract:
Many proteoforms - arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes - have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code.
HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/
Submitted 5 August, 2017;
originally announced August 2017.
-
Post-transcriptional regulation across human tissues
Authors:
Alexander Franks,
Edoardo Airoldi,
Nikolai Slavov
Abstract:
Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
Submitted 2 May, 2017; v1 submitted 31 May, 2015;
originally announced June 2015.
-
Estimating cellular pathways from an ensemble of heterogeneous data sources
Authors:
Alexander Franks,
Florian Markowetz,
Edoardo Airoldi
Abstract:
Building better models of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of high-throughput studies. Moreover, the available data sources are heterogeneous and need to be combined in a way specific to the part of the pathway in which they are most informative. Here, we present a compartment-specific strategy to integrate edge, node and path data for the refinement of a network hypothesis. Specifically, we use a local-move Gibbs sampler for refining pathway hypotheses from a compendium of heterogeneous data sources, including novel methodology for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae.
Submitted 22 June, 2014;
originally announced June 2014.
-
Differential stoichiometry among core ribosomal proteins
Authors:
Nikolai Slavov,
Stefan Semrau,
Edoardo Airoldi,
Bogdan Budnik,
Alexander van Oudenaarden
Abstract:
Understanding the regulation and structure of ribosomes is essential to understanding protein synthesis and its deregulation in disease. While ribosomes are believed to have a fixed stoichiometry among their core ribosomal proteins (RPs), some experiments suggest a more variable composition. Testing such variability requires direct and precise quantification of RPs. We used mass-spectrometry to directly quantify RPs across monosomes and polysomes of mouse embryonic stem cells (ESC) and budding yeast. Our data show that the stoichiometry among core RPs in wild-type yeast cells and ESC depends both on the growth conditions and on the number of ribosomes bound per mRNA. Furthermore, we find that the fitness of cells with a deleted RP-gene is inversely proportional to the enrichment of the corresponding RP in polysomes. Together, our findings support the existence of ribosomes with distinct protein composition and physiological function.
Submitted 15 April, 2015; v1 submitted 2 June, 2014;
originally announced June 2014.
-
Learning modular structures from network data and node variables
Authors:
Elham Azizi,
James E. Galagan,
Edoardo M. Airoldi
Abstract:
A standard technique for understanding underlying dependency structures among a set of variables posits a shared conditional probability distribution for the variables measured on individuals within a group. This approach is often referred to as module networks, where individuals are represented by nodes in a network, groups are termed modules, and the focus is on estimating the network structure among modules. However, estimation solely from node-specific variables can lead to spurious dependencies, and unverifiable structural assumptions are often used for regularization. Here, we propose an extended model that leverages direct observations about the network in addition to node-specific variables. By integrating complementary data types, we avoid the need for structural assumptions. We illustrate the theoretical and practical significance of the model and develop a reversible-jump MCMC procedure for learning modules and model parameters. We demonstrate the method's accuracy in predicting modular structures from synthetic data and its capability to learn influence structures in Twitter data and regulatory modules in the Mycobacterium tuberculosis gene regulatory network.
Submitted 11 May, 2014;
originally announced May 2014.
-
Sashimi plots: Quantitative visualization of RNA sequencing read alignments
Authors:
Yarden Katz,
Eric T. Wang,
Jacob Silterra,
Schraga Schwartz,
Bang Wong,
Jill P. Mesirov,
Edoardo M. Airoldi,
Christopher B. Burge
Abstract:
We introduce Sashimi plots, a quantitative multi-sample visualization of mRNA sequencing reads aligned to gene annotations. Sashimi plots are made using alignments (stored in the SAM/BAM format) and gene model annotations (in GFF format), which can be custom-made by the user or obtained from databases such as Ensembl or UCSC. We describe two implementations of Sashimi plots: (1) a stand-alone command line implementation aimed at making customizable publication-quality figures, and (2) an implementation built into the Integrative Genomics Viewer (IGV) browser, which enables rapid and dynamic creation of Sashimi plots for any genomic region of interest, suitable for exploratory analysis of alternatively spliced regions of the transcriptome. Isoform expression estimates output by the MISO program can optionally be plotted along with Sashimi plots. Sashimi plots can be used to quickly screen differentially spliced exons along genomic regions of interest and can be used in publication-quality figures. The Sashimi plot software and documentation are available from: http://genes.mit.edu/burgelab/miso/docs/sashimi.html
Submitted 14 June, 2013;
originally announced June 2013.
-
Mapping Dynamic Histone Acetylation Patterns to Gene Expression in Nanog-depleted Murine Embryonic Stem Cells
Authors:
Florian Markowetz,
Klaas W Mulder,
Edoardo M Airoldi,
Ihor R Lemischka,
Olga G Troyanskaya
Abstract:
Embryonic stem cells (ESC) have the potential to self-renew indefinitely and to differentiate into any of the three germ layers. The molecular mechanisms for self-renewal, maintenance of pluripotency and lineage specification are poorly understood, but recent results point to a key role for epigenetic mechanisms. In this study, we focus on quantifying the impact of histone 3 acetylation (H3K9,14ac) on gene expression in murine embryonic stem cells. We analyze genome-wide histone acetylation patterns and gene expression profiles measured over the first five days of cell differentiation triggered by silencing Nanog, a key transcription factor in ESC regulation. We explore the temporal and spatial dynamics of histone acetylation data and its correlation with gene expression using supervised and unsupervised statistical models. On a genome-wide scale, changes in acetylation are significantly correlated with changes in mRNA expression and, surprisingly, this coherence increases over time. We quantify the predictive power of histone acetylation for gene expression changes in a balanced cross-validation procedure. In an in-depth study we focus on genes central to the regulatory network of mouse ESC, including those identified in a recent genome-wide RNAi screen and in the PluriNet, a computationally derived stem cell signature. We find that compared to the rest of the genome, ESC-specific genes show significantly more acetylation signal and a much stronger decrease in acetylation over time, which is often not reflected in a concordant expression change. These results shed light on the complexity of the relationship between histone acetylation and gene expression and are a step toward dissecting the multilayer regulatory mechanisms that determine stem cell fate.
Submitted 15 October, 2010;
originally announced October 2010.
-
A survey of statistical network models
Authors:
Anna Goldenberg,
Alice X Zheng,
Stephen E Fienberg,
Edoardo M Airoldi
Abstract:
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, along with a host of more specialized professional network communities, have intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
Submitted 29 December, 2009;
originally announced December 2009.
-
Ranking relations using analogies in biological and information networks
Authors:
Ricardo Silva,
Katherine Heller,
Zoubin Ghahramani,
Edoardo M. Airoldi
Abstract:
Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects $\mathbf{S}=\{A^{(1)}:B^{(1)},A^{(2)}:B^{(2)},\ldots,A^{(N)}:B^{(N)}\}$, measures how well other pairs $A:B$ fit in with the set $\mathbf{S}$. Our work addresses the following question: is the relation between objects $A$ and $B$ analogous to those relations found in $\mathbf{S}$? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application to discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if only a small set of protein pairs is provided.
Submitted 29 August, 2013; v1 submitted 28 December, 2009;
originally announced December 2009.
-
Mixed membership analysis of genome-wide expression data
Authors:
Edoardo M Airoldi,
Stephen E Fienberg,
Eric P Xing
Abstract:
Learning latent expression themes that best express complex patterns in a sample is a central problem in data mining and scientific research. For example, in computational biology we seek a set of salient gene expression themes that explain a biological process, extracting them from a large pool of gene expression profiles. In this paper, we introduce probabilistic models to learn such latent themes in an unsupervised fashion. Our models capture contagion, i.e., dependence among multiple occurrences of the same feature, using a hierarchical Bayesian scheme. Contagion is a convenient analytical formalism to characterize semantic themes underlying observed feature patterns, such as biological context. We present model variants tailored to different properties of biological data, and we outline a general variational inference scheme for approximate posterior inference. We validate our methods on both simulated data and realistic high-throughput gene expression profiles via SAGE. Our results show improved predictions of gene functions over existing methods based on stronger independence assumptions, and demonstrate feasibility of a promising hierarchical Bayesian formalism for soft clustering and latent aspects analysis.
Submitted 15 November, 2007;
originally announced November 2007.
-
Getting started in probabilistic graphical models
Authors:
Edoardo M Airoldi
Abstract:
Probabilistic graphical models (PGMs) have become a popular tool for computational analysis of biological data in a variety of domains. But what exactly are they, and how do they work? How can we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help us formulate new hypotheses that are testable at the bench? This note sketches out some answers and illustrates the main ideas behind the statistical approach to biological pattern discovery.
Submitted 10 November, 2007; v1 submitted 14 June, 2007;
originally announced June 2007.
-
Mixed membership analysis of high-throughput interaction studies: Relational data
Authors:
Edoardo M Airoldi,
David M Blei,
Stephen E Fienberg,
Eric P Xing
Abstract:
In this paper, we consider the statistical analysis of a protein interaction network. We propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way proteins interact with one another in order to: (i) identify the number of non-observable functional modules; (ii) estimate the degree of membership of proteins in modules; and (iii) estimate typical interaction patterns among the functional modules themselves. Our model describes a large amount of (relational) data using a relatively small set of parameters that we can reliably estimate with an efficient inference algorithm. We apply our methodology to data on protein-to-protein interactions in Saccharomyces cerevisiae to reveal proteins' diverse functional roles. The case study provides the basis for an overview of which scientific questions can be addressed using our methods, and for a discussion of technical issues.
Submitted 15 November, 2007; v1 submitted 2 June, 2007;
originally announced June 2007.