Search | arXiv e-print repository

arXiv:1909.07578 [pdf, other]

doi 10.1073/pnas.1914950117

Stacking Models for Nearly Optimal Link Prediction in Complex Networks

Authors: Amir Ghasemian, Homa Hosseinmardi, Aram Galstyan, Edoardo M. Airoldi, Aaron Clauset

Abstract: Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speedup the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability v… ▽ More Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speedup the collection of network data and improve the validity of network models. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 548 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity via meta-learning to construct a series of "stacked" models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are also superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state-of-the-art for link prediction comes from combining individual algorithms, which achieves nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvement of these results. △ Less

Submitted 17 September, 2019; originally announced September 2019.

Comments: 30 pages, 9 figures, 22 tables

Journal ref: Proc. Natl. Acad. Sci. USA 117(38), 23393-23400 (2020)

arXiv:1708.01772 [pdf, other]

doi 10.1074/mcp.TIR118.000947

Quantifying homologous proteins and proteoforms

Authors: Dmitry Malioutov, Tianchi Chen, Jacob Jaffe, Edoardo Airoldi, Steven Carr, Bogdan Budnik, Nikolai Slavov

Abstract: Many proteoforms - arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes - have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) fo… ▽ More Many proteoforms - arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes - have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and demonstrated experimentally high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code. HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/ △ Less

Submitted 5 August, 2017; originally announced August 2017.

Report number: mcp.TIR118.000947

Journal ref: Molecular & Cellular Proteomics, 2018

arXiv:1506.00219 [pdf, other]

doi 10.1371/journal.pcbi.1005535

Post-transcriptional regulation across human tissues

Authors: Alexander Franks, Edoardo Airoldi, Nikolai Slavov

Abstract: Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability and, (ii) the factors determining the physio… ▽ More Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability and, (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes. △ Less

Submitted 2 May, 2017; v1 submitted 31 May, 2015; originally announced June 2015.

Comments: 30 pages, 4 figures

Journal ref: PLoS Comput Biol 13(5): e1005535 (2017)

arXiv:1406.5799 [pdf, other]

Estimating cellular pathways from an ensemble of heterogeneous data sources

Authors: Alexander Franks, Florian Markowetz, Edoardo Airoldi

Abstract: Building better models of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of high-throughput studies. Moreover, the available data sources are heterogeneous and need to be combined in a way specific for the part of the pathway in which they are most inform… ▽ More Building better models of cellular pathways is one of the major challenges of systems biology and functional genomics. There is a need for methods to build on established expert knowledge and reconcile it with results of high-throughput studies. Moreover, the available data sources are heterogeneous and need to be combined in a way specific for the part of the pathway in which they are most informative. Here, we present a compartment specific strategy to integrate edge, node and path data for the refinement of a network hypothesis. Specifically, we use a local-move Gibbs sampler for refining pathway hypotheses from a compendium of heterogeneous data sources, including novel methodology for integrating protein attributes. We demonstrate the utility of this approach in a case study of the pheromone response MAPK pathway in the yeast S. cerevisiae. △ Less

Submitted 22 June, 2014; originally announced June 2014.

arXiv:1406.0399 [pdf, other]

doi 10.1016/j.celrep.2015.09.056

Differential stoichiometry among core ribosomal proteins

Authors: Nikolai Slavov, Sefan Semrau, Edoardo Airoldi, Bogdan Budnik, Alexander van Oudenaarden

Abstract: Understanding the regulation and structure of ribosomes is essential to understanding protein synthesis and its deregulation in disease. While ribosomes are believed to have a fixed stoichiometry among their core ribosomal proteins (RPs), some experiments suggest a more variable composition. Testing such variability requires direct and precise quantification of RPs. We used mass-spectrometry to di… ▽ More Understanding the regulation and structure of ribosomes is essential to understanding protein synthesis and its deregulation in disease. While ribosomes are believed to have a fixed stoichiometry among their core ribosomal proteins (RPs), some experiments suggest a more variable composition. Testing such variability requires direct and precise quantification of RPs. We used mass-spectrometry to directly quantify RPs across monosomes and polysomes of mouse embryonic stem cells (ESC) and budding yeast. Our data show that the stoichiometry among core RPs in wild-type yeast cells and ESC depends both on the growth conditions and on the number of ribosomes bound per mRNA. Furthermore, we find that the fitness of cells with a deleted RP-gene is inversely proportional to the enrichment of the corresponding RP in polysomes. Together, our findings support the existence of ribosomes with distinct protein composition and physiological function. △ Less

Submitted 15 April, 2015; v1 submitted 2 June, 2014; originally announced June 2014.

Comments: 31 pages, 8 figures

Journal ref: Cell Reports 13: 865 - 873, 2015

arXiv:1405.2566 [pdf, other]

Learning modular structures from network data and node variables

Authors: Elham Azizi, James E. Galagan, Edoardo M. Airoldi

Abstract: A standard technique for understanding underlying dependency structures among a set of variables posits a shared conditional probability distribution for the variables measured on individuals within a group. This approach is often referred to as module networks, where individuals are represented by nodes in a network, groups are termed modules, and the focus is on estimating the network structure… ▽ More A standard technique for understanding underlying dependency structures among a set of variables posits a shared conditional probability distribution for the variables measured on individuals within a group. This approach is often referred to as module networks, where individuals are represented by nodes in a network, groups are termed modules, and the focus is on estimating the network structure among modules. However, estimation solely from node-specific variables can lead to spurious dependencies, and unverifiable structural assumptions are often used for regularization. Here, we propose an extended model that leverages direct observations about the network in addition to node-specific variables. By integrating complementary data types, we avoid the need for structural assumptions. We illustrate theoretical and practical significance of the model and develop a reversible-jump MCMC learning procedure for learning modules and model parameters. We demonstrate the method accuracy in predicting modular structures from synthetic data and capability to learn influence structures in twitter data and regulatory modules in the Mycobacterium tuberculosis gene regulatory network. △ Less

Submitted 11 May, 2014; originally announced May 2014.

Comments: 22 pages, 6 figures, 3 tables, 3 algorithms

arXiv:1306.3466 [pdf, other]

doi 10.1093/bioinformatics/btv034

Sashimi plots: Quantitative visualization of RNA sequencing read alignments

Authors: Yarden Katz, Eric T. Wang, Jacob Silterra, Schraga Schwartz, Bang Wong, Jill P. Mesirov, Edoardo M. Airoldi, Christopher B. Burge

Abstract: We introduce Sashimi plots, a quantitative multi-sample visualization of mRNA sequencing reads aligned to gene annotations. Sashimi plots are made using alignments (stored in the SAM/BAM format) and gene model annotations (in GFF format), which can be custom-made by the user or obtained from databases such as Ensembl or UCSC. We describe two implementations of Sashimi plots: (1) a stand-alone comm… ▽ More We introduce Sashimi plots, a quantitative multi-sample visualization of mRNA sequencing reads aligned to gene annotations. Sashimi plots are made using alignments (stored in the SAM/BAM format) and gene model annotations (in GFF format), which can be custom-made by the user or obtained from databases such as Ensembl or UCSC. We describe two implementations of Sashimi plots: (1) a stand-alone command line implementation aimed at making customizable publication quality figures, and (2) an implementation built into the Integrated Genome Viewer (IGV) browser, which enables rapid and dynamic creation of Sashimi plots for any genomic region of interest, suitable for exploratory analysis of alternatively spliced regions of the transcriptome. Isoform expression estimates outputted by the MISO program can be optionally plotted along with Sashimi plots. Sashimi plots can be used to quickly screen differentially spliced exons along genomic regions of interest and can be used in publication quality figures. The Sashimi plot software and documentation is available from: http://genes.mit.edu/burgelab/miso/docs/sashimi.html △ Less

Submitted 14 June, 2013; originally announced June 2013.

Comments: 2 figures

arXiv:1010.3268 [pdf, other]

doi 10.1371/journal.pcbi.1001034

Mapping Dynamic Histone Acetylation Patterns to Gene Expression in Nanog-depleted Murine Embryonic Stem Cells

Authors: Florian Markowetz, Klaas W Mulder, Edoardo M Airoldi, Ihor R Lemischka, Olga G Troyanskaya

Abstract: Embryonic stem cells (ESC) have the potential to self-renew indefinitely and to differentiate into any of the three germ layers. The molecular mechanisms for self-renewal, maintenance of pluripotency and lineage specification are poorly understood, but recent results point to a key role for epigenetic mechanisms. In this study, we focus on quantifying the impact of histone 3 acetylation (H3K9,14ac… ▽ More Embryonic stem cells (ESC) have the potential to self-renew indefinitely and to differentiate into any of the three germ layers. The molecular mechanisms for self-renewal, maintenance of pluripotency and lineage specification are poorly understood, but recent results point to a key role for epigenetic mechanisms. In this study, we focus on quantifying the impact of histone 3 acetylation (H3K9,14ac) on gene expression in murine embryonic stem cells. We analyze genome-wide histone acetylation patterns and gene expression profiles measured over the first five days of cell differentiation triggered by silencing Nanog, a key transcription factor in ESC regulation. We explore the temporal and spatial dynamics of histone acetylation data and its correlation with gene expression using supervised and unsupervised statistical models. On a genome-wide scale, changes in acetylation are significantly correlated to changes in mRNA expression and, surprisingly, this coherence increases over time. We quantify the predictive power of histone acetylation for gene expression changes in a balanced cross-validation procedure. In an in-depth study we focus on genes central to the regulatory network of Mouse ESC, including those identified in a recent genome-wide RNAi screen and in the PluriNet, a computationally derived stem cell signature. We find that compared to the rest of the genome, ESC-specific genes show significantly more acetylation signal and a much stronger decrease in acetylation over time, which is often not reflected in an concordant expression change. These results shed light on the complexity of the relationship between histone acetylation and gene expression and are a step forward to dissect the multilayer regulatory mechanisms that determine stem cell fate. △ Less

Submitted 15 October, 2010; originally announced October 2010.

Comments: accepted at PLoS Computational Biology

Journal ref: PLoS Comp Bio, 2010 Dec 16;6(12):e1001034

arXiv:0912.5410 [pdf, other]

A survey of statistical network models

Authors: Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, Edoardo M Airoldi

Abstract: Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociolog… ▽ More Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. △ Less

Submitted 29 December, 2009; originally announced December 2009.

Comments: 96 pages, 14 figures, 333 references

Journal ref: Foundations and Trends in Machine Learning, 2(2):1-117, 2009

arXiv:0912.5193 [pdf, ps, other]

doi 10.1214/09-AOAS321

Ranking relations using analogies in biological and information networks

Authors: Ricardo Silva, Katherine Heller, Zoubin Ghahramani, Edoardo M. Airoldi

Abstract: Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects $\mathbf{S}=\{A^{(1)}:B^{(1)},A^{(2)}:B^{(2)},\ldots,A^{(N)}:B ^{(N)}\}$, measures how well other pairs A:B fit in with the set $\mathbf{S}$. Our work addresses the following question: is the relation… ▽ More Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects $\mathbf{S}=\{A^{(1)}:B^{(1)},A^{(2)}:B^{(2)},\ldots,A^{(N)}:B ^{(N)}\}$, measures how well other pairs A:B fit in with the set $\mathbf{S}$. Our work addresses the following question: is the relation between objects A and B analogous to those relations found in $\mathbf{S}$? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided. △ Less

Submitted 29 August, 2013; v1 submitted 28 December, 2009; originally announced December 2009.

Comments: Published in at http://dx.doi.org/10.1214/09-AOAS321 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS321

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 2, 615-644

arXiv:0711.2520 [pdf, other]

Mixed membership analysis of genome-wide expression data

Authors: Edoardo M Airoldi, Stephen E Fienberg, Eric P Xing

Abstract: Learning latent expression themes that best express complex patterns in a sample is a central problem in data mining and scientific research. For example, in computational biology we seek a set of salient gene expression themes that explain a biological process, extracting them from a large pool of gene expression profiles. In this paper, we introduce probabilistic models to learn such latent th… ▽ More Learning latent expression themes that best express complex patterns in a sample is a central problem in data mining and scientific research. For example, in computational biology we seek a set of salient gene expression themes that explain a biological process, extracting them from a large pool of gene expression profiles. In this paper, we introduce probabilistic models to learn such latent themes in an unsupervised fashion. Our models capture contagion, i.e., dependence among multiple occurrences of the same feature, using a hierarchical Bayesian scheme. Contagion is a convenient analytical formalism to characterize semantic themes underlying observed feature patterns, such as biological context. We present model variants tailored to different properties of biological data, and we outline a general variational inference scheme for approximate posterior inference. We validate our methods on both simulated data and realistic high-throughput gene expression profiles via SAGE. Our results show improved predictions of gene functions over existing methods based on stronger independence assumptions, and demonstrate feasibility of a promising hierarchical Bayesian formalism for soft clustering and latent aspects analysis. △ Less

Submitted 15 November, 2007; originally announced November 2007.

Comments: 22 pages, 4 figures

arXiv:0706.2040 [pdf, other]

doi 10.1371/journal.pcbi.0030252

Getting started in probabilistic graphical models

Authors: Edoardo M Airoldi

Abstract: Probabilistic graphical models (PGMs) have become a popular tool for computational analysis of biological data in a variety of domains. But, what exactly are they and how do they work? How can we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help us formulate new hypotheses that are testable at the bench? This note sketches out some answers and illustr… ▽ More Probabilistic graphical models (PGMs) have become a popular tool for computational analysis of biological data in a variety of domains. But, what exactly are they and how do they work? How can we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help us formulate new hypotheses that are testable at the bench? This note sketches out some answers and illustrates the main ideas behind the statistical approach to biological pattern discovery. △ Less

Submitted 10 November, 2007; v1 submitted 14 June, 2007; originally announced June 2007.

Comments: 12 pages, 1 figure

Journal ref: Airoldi EM (2007) Getting started in probabilistic graphical models. PLoS Comput Biol 3(12): e252

arXiv:0706.0294 [pdf, other]

Mixed membership analysis of high-throughput interaction studies: Relational data

Authors: Edoardo M Airoldi, David M Blei, Stephen E Fienberg, Eric P Xing

Abstract: In this paper, we consider the statistical analysis of a protein interaction network. We propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way proteins interact with one another in order to: (i) identify the number of non-observable functional modules; (ii) estimate the degree of membership of proteins to modules; and (iii) estimate typical interaction pattern… ▽ More In this paper, we consider the statistical analysis of a protein interaction network. We propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way proteins interact with one another in order to: (i) identify the number of non-observable functional modules; (ii) estimate the degree of membership of proteins to modules; and (iii) estimate typical interaction patterns among the functional modules themselves. Our model describes large amount of (relational) data using a relatively small set of parameters that we can reliably estimate with an efficient inference algorithm. We apply our methodology to data on protein-to-protein interactions in saccharomyces cerevisiae to reveal proteins' diverse functional roles. The case study provides the basis for an overview of which scientific questions can be addressed using our methods, and for a discussion of technical issues. △ Less

Submitted 15 November, 2007; v1 submitted 2 June, 2007; originally announced June 2007.

Comments: 22 pages, 6 figures, 2 tables

Showing 1–13 of 13 results for author: Airoldi, E