-
Fluctuations and the limit of predictability in protein evolution
Authors:
Saverio Rossi,
Leonardo Di Bari,
Martin Weigt,
Francesco Zamponi
Abstract:
Protein evolution involves mutations occurring across a wide range of time scales. In analogy with other disordered systems, this dynamical heterogeneity suggests strong correlations between mutations happening at distinct sites and times. To quantify these correlations, we examine the role of various fluctuation sources in protein evolution, simulated using a data-driven epistatic landscape. By applying spatio-temporal correlation functions inspired by statistical physics, we disentangle fluctuations originating from the ancestral protein sequence from those driven by stochastic mutations along independent evolutionary paths. Our analysis shows that, in diverse protein families, fluctuations from the ancestral sequence predominate at shorter time scales. This allows us to identify a time scale over which ancestral sequence information persists, enabling its reconstruction. We link this persistence to the strength of epistatic interactions: ancestral sequences with stronger epistatic signatures impact evolutionary trajectories over extended periods. At longer time scales, however, ancestral influence fades as epistatically constrained sites evolve collectively. To confirm this idea, we apply a standard ancestral sequence reconstruction algorithm and verify that the time-dependent recovery error is influenced by the properties of the ancestor itself.
Submitted 12 December, 2024; v1 submitted 2 December, 2024;
originally announced December 2024.
-
Emergent time scales of epistasis in protein evolution
Authors:
Leonardo Di Bari,
Matteo Bisardi,
Sabrina Cotogno,
Martin Weigt,
Francesco Zamponi
Abstract:
We introduce a data-driven epistatic model of protein evolution, capable of generating evolutionary trajectories spanning very different time scales reaching from individual mutations to diverged homologs. Our in silico evolution encompasses random nucleotide mutations, insertions and deletions, and models selection using a fitness landscape, which is inferred via a generative probabilistic model for protein families. We show that the proposed framework accurately reproduces the sequence statistics of both short-time (experimental) and long-time (natural) protein evolution, suggesting applicability also to relatively data-poor intermediate evolutionary time scales, which are currently inaccessible to evolution experiments. Our model uncovers a highly collective nature of epistasis, gradually changing the fitness effect of mutations in a diverging sequence context, rather than acting via strong interactions between individual mutations. This collective nature triggers the emergence of a long evolutionary time scale, separating fast mutational processes inside a given sequence context, from the slow evolution of the context itself. The model quantitatively reproduces epistatic phenomena such as contingency and entrenchment, as well as the loss of predictability in protein evolution observed in deep mutational scanning experiments of distant homologs. It thereby deepens our understanding of the interplay between mutation and selection in shaping protein diversity and novel functions, allows one to statistically forecast evolution, and challenges the prevailing independent-site models of protein evolution, which are unable to capture the fundamental importance of epistasis.
Submitted 27 September, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Towards Parsimonious Generative Modeling of RNA Families
Authors:
Francesco Calvanese,
Camille N. Lambert,
Philippe Nghe,
Francesco Zamponi,
Martin Weigt
Abstract:
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately $\mathbf{10^{39}}$ functional nucleotide sequences. While huge compared to the known $< \mathbf{4,000}$ natural sequences, this number represents only a tiny fraction of the vast pool of nearly $\mathbf{10^{82}}$ possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
Submitted 19 October, 2023;
originally announced October 2023.
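The sequence-space figures quoted in the eaDCA abstract above can be checked with a few lines of arithmetic. The sketch below assumes a four-letter nucleotide alphabet with gaps not counted. The variable names are illustrative and do not come from the paper.

```python
import math

LENGTH = 136            # sequence length in nucleotides, as quoted in the abstract
ALPHABET = 4            # A, C, G, U (assumption: alignment gaps are not counted)
LOG10_FUNCTIONAL = 39   # ~10^39 functional sequences, as quoted in the abstract

# log10 of the number of all nucleotide sequences of this length: 4^136
log10_total = LENGTH * math.log10(ALPHABET)

# log10 of the fraction of sequence space estimated to be functional
log10_fraction = LOG10_FUNCTIONAL - log10_total

print(f"log10(all sequences)       = {log10_total:.1f}")
print(f"log10(functional fraction) = {log10_fraction:.1f}")
```

Under these assumptions the total comes out close to the quoted $10^{82}$, so the estimated functional set would be roughly a $10^{-43}$ fraction of all sequences of that length.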
-
Machine-learning-assisted Monte Carlo fails at sampling computationally hard problems
Authors:
Simone Ciarella,
Jeanne Trinquier,
Martin Weigt,
Francesco Zamponi
Abstract:
Several strategies have been recently proposed in order to improve Monte Carlo sampling efficiency using machine learning tools. Here, we challenge these methods by considering a class of problems that are known to be exponentially hard to sample using conventional local Monte Carlo at low enough temperatures. In particular, we study the antiferromagnetic Potts model on a random graph, which reduces to the coloring of random graphs at zero temperature. We test several machine-learning-assisted Monte Carlo approaches, and we find that they all fail. Our work thus provides good benchmarks for future proposals for smart sampling algorithms.
Submitted 10 March, 2023; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins
Authors:
Carlos A. Gandarilla-Perez,
Sergio Pinilla,
Anne-Florence Bitbol,
Martin Weigt
Abstract:
Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.
Submitted 24 August, 2022;
originally announced August 2022.
-
Statistical-physics approaches to RNA molecules, families and networks
Authors:
Simona Cocco,
Andrea De Martino,
Andrea Pagnani,
Martin Weigt
Abstract:
This contribution focuses on the fascinating RNA molecule, its sequence-dependent folding driven by base-pairing interactions, the interplay between these interactions and natural evolution, and its multiple regulatory roles. The four of us have dug into these topics using the tools and the spirit of the statistical physics of disordered systems, and in particular the concept of a disordered (energy/fitness) landscape. After an introduction to RNA molecules and the perspectives they open not only in evolutionary and synthetic biology but also in medicine, we will introduce the important notions of energy and fitness landscapes for these molecules. In Section III we will review some models and algorithms for RNA sequence-to-secondary-structure mapping. Section IV discusses how the secondary-structure energy landscape can be derived from unzipping data. Section V deals with the inference of RNA structure from evolutionary sequence data sampled in different organisms. This will shift the focus from the 'sequence-to-structure' mapping described in Section III to a 'sequence-to-function' landscape that can be inferred from laboratory evolutionary data on DNA aptamers. Finally, in Section VI, we shall discuss the rich theoretical picture linking networks of interacting RNA molecules to the organization of robust, systemic regulatory programs. Along this path, we will therefore explore phenomena across multiple scales in space, number of molecules and time, showing how the biological complexity of the RNA world can be captured by the unifying concepts of statistical physics.
Submitted 27 July, 2022;
originally announced July 2022.
-
adabmDCA: Adaptive Boltzmann machine learning for biological sequences
Authors:
Anna Paola Muntoni,
Andrea Pagnani,
Martin Weigt,
Francesco Zamponi
Abstract:
Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionarily related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has also been assessed in terms of their ability to predict mutational effects and to generate in silico functional sequences. Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA. As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.
Submitted 2 November, 2021; v1 submitted 9 September, 2021;
originally announced September 2021.
-
Modeling sequence-space exploration and emergence of epistatic signals in protein evolution
Authors:
Matteo Bisardi,
Juan Rodriguez-Rivas,
Francesco Zamponi,
Martin Weigt
Abstract:
During their evolution, proteins explore sequence space via an interplay between random mutations and phenotypic selection. Here we build upon recent progress in reconstructing data-driven fitness landscapes for families of homologous proteins, to propose stochastic models of experimental protein evolution. These models predict quantitatively important features of experimentally evolved sequence libraries, like fitness distributions and position-specific mutational spectra. They also allow us to efficiently simulate sequence libraries for a vast array of combinations of experimental parameters like sequence divergence, selection strength and library size. We showcase the potential of the approach in re-analyzing two recent experiments to determine protein structure from signals of epistasis emerging in experimental sequence libraries. To be detectable, these signals require sufficiently large and sufficiently diverged libraries. Our modeling framework offers a quantitative explanation for the variable success of recently published experiments. Furthermore, we can forecast the outcome of time- and resource-intensive evolution experiments, opening thereby a way to computationally optimize experimental protocols.
Submitted 27 January, 2022; v1 submitted 4 June, 2021;
originally announced June 2021.
-
Efficient generative modeling of protein sequences using simple autoregressive models
Authors:
Jeanne Trinquier,
Guido Uguzzoni,
Andrea Pagnani,
Francesco Zamponi,
Martin Weigt
Abstract:
Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between $10^2$ and $10^3$). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. $10^{68}$ possible sequences, which nevertheless constitute only the astronomically small fraction $10^{-80}$ of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
Submitted 9 November, 2021; v1 submitted 4 March, 2021;
originally announced March 2021.
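The two numbers quoted for response regulators in the abstract above jointly fix the size of the full sequence space, and from that an implied sequence length. The sketch below only rearranges the quoted figures. The alphabet sizes (20 amino acids, or 21 states including a gap symbol) are assumptions, since the abstract does not state the length or alphabet used.

```python
import math

LOG10_FUNCTIONAL = 68   # ca. 10^68 functional sequences, as quoted in the abstract
LOG10_FRACTION = -80    # functional fraction 10^-80, as quoted in the abstract

# functional = fraction * total  =>  log10(total) = log10(functional) - log10(fraction)
log10_total = LOG10_FUNCTIONAL - LOG10_FRACTION

# Implied sequence length L from q^L = total, for two assumed alphabet sizes q
implied_len_20 = log10_total / math.log10(20)   # 20 amino acids
implied_len_21 = log10_total / math.log10(21)   # 20 amino acids plus a gap state

print(f"log10(all sequences) = {log10_total}")
print(f"implied length, q=20: {implied_len_20:.1f}")
print(f"implied length, q=21: {implied_len_21:.1f}")
```

Either assumption gives a length a little above 110 positions. Because the quoted exponents are rounded, this is only an order-of-magnitude consistency check.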
-
Global multivariate model learning from hierarchically correlated data
Authors:
Edwin Rodriguez Horta,
Alejandro Lage,
Martin Weigt,
Pierre Barrat-Charlaix
Abstract:
Inverse statistical physics aims at inferring models compatible with a set of empirical averages estimated from a high-dimensional dataset of independently distributed equilibrium configurations of a given system. However, in several applications such as biology, data result from stochastic evolutionary processes, and configurations are related through a hierarchical structure, typically represented by a tree, and therefore not independent. In turn, empirical averages of observables superpose intrinsic signal related to the equilibrium distribution of the studied system and spurious historical (or phylogenetic) signal resulting from the structure underlying the data-generating process. The naive application of inverse statistical physics techniques therefore leads to systematic biases and an effective reduction of the sample size. To advance on the currently open task of extracting intrinsic signals from correlated data, we study a system described by a multivariate Ornstein-Uhlenbeck process defined on a finite tree. Using a Bayesian framework, we can disentangle covariances in the data corresponding to their multivariate Gaussian equilibrium distribution from those resulting from the historical correlations. Our approach leads to a clear gain in accuracy in the inferred equilibrium distribution, which corresponds to an effective two- to fourfold increase in sample size.
Submitted 11 February, 2021;
originally announced February 2021.
-
Sparse generative modeling via parameter-reduction of Boltzmann machines: application to protein-sequence families
Authors:
Pierre Barrat-Charlaix,
Anna Paola Muntoni,
Kai Shimagaki,
Martin Weigt,
Francesco Zamponi
Abstract:
Boltzmann machines (BM) are widely used as generative models. For example, pairwise Potts models (PM), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and in generating new functional sequences. However, the resulting PM suffers from important over-fitting effects: many couplings are small, noisy and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than $90\%$ of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
Submitted 30 July, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Aligning biological sequences by exploiting residue conservation and coevolution
Authors:
Anna Paola Muntoni,
Andrea Pagnani,
Martin Weigt,
Francesco Zamponi
Abstract:
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e. arranging sequences from different organisms in such a way to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position-specificities like conservation in sequences, but assume an independent evolution of different positions. Over the last years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles; and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.
Submitted 13 November, 2020; v1 submitted 18 May, 2020;
originally announced May 2020.
-
Statistical physics of interacting proteins: impact of dataset size and quality assessed in synthetic sequences
Authors:
Carlos A. Gandarilla-Pérez,
Pierre Mergny,
Martin Weigt,
Anne-Florence Bitbol
Abstract:
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins, and inter-block couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte-Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets are available, and that an iterative pairing algorithm (IPA) allows one to make predictions even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
Submitted 16 March, 2020; v1 submitted 23 December, 2019;
originally announced December 2019.
-
Phylogenetic correlations can suffice to infer protein partners from sequences
Authors:
Guillaume Marmier,
Martin Weigt,
Anne-Florence Bitbol
Abstract:
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) allows one to predict pairs of evolutionary partners without a training set. We demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We discuss how to distinguish physically interacting proteins from those only sharing evolutionary history.
Submitted 4 September, 2019; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Authors:
Kai Shimagaki,
Martin Weigt
Abstract:
Statistical models for families of evolutionarily related proteins have recently gained interest: in particular pairwise Potts models, such as those inferred by the Direct-Coupling Analysis, have been able to extract information about the three-dimensional structure of folded proteins, and about the effect of amino-acid substitutions in proteins. These models are typically required to reproduce the one- and two-point statistics of the amino-acid usage in a protein family, i.e. to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models achieving this. While being successful, these models depend on huge numbers of ad hoc introduced parameters, which have to be estimated from a finite amount of data and whose biophysical interpretation remains unclear. Here we propose an approach to parameter reduction, which is based on selecting collective sequence motifs. It naturally leads to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to cluster amino-acid sequences into functional sub-families. However, the distributed collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models to predict contact maps, showing the necessity of developing models going beyond the Hopfield-Potts models discussed here.
Submitted 5 September, 2019; v1 submitted 28 May, 2019;
originally announced May 2019.
-
How pairwise coevolutionary models capture the collective residue variability in proteins
Authors:
Matteo Figliuzzi,
Pierre Barrat-Charlaix,
Martin Weigt
Abstract:
Global coevolutionary models of homologous protein families, as constructed by direct coupling analysis (DCA), have recently gained popularity in particular due to their capacity to accurately predict residue-residue contacts from sequence information alone, and thereby to facilitate tertiary and quaternary protein structure prediction. More recently, they have also been used to predict fitness effects of amino-acid substitutions in proteins, and to predict evolutionary conserved protein-protein interactions. These models are based on two currently unjustified hypotheses: (a) correlations in the amino-acid usage of different positions are resulting collectively from networks of direct couplings; and (b) pairwise couplings are sufficient to capture the amino-acid variability. Here we propose a highly precise inference scheme based on Boltzmann-machine learning, which allows us to systematically address these hypotheses. We show how correlations are built up in a highly collective way by a large number of coupling paths, which are based on the protein's three-dimensional structure. We further find that pairwise coevolutionary models capture the collective residue variability across homologous proteins even for quantities which are not imposed by the inference procedure, like three-residue correlations, the clustered structure of protein families in sequence space or the sequence distances between homologs. These findings strongly suggest that pairwise coevolutionary models are actually sufficient to accurately capture the residue variability in homologous protein families.
Submitted 12 January, 2018;
originally announced January 2018.
-
Inverse Statistical Physics of Protein Sequences: A Key Issues Review
Authors:
Simona Cocco,
Christoph Feinauer,
Matteo Figliuzzi,
Remi Monasson,
Martin Weigt
Abstract:
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e., evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Submitted 3 March, 2017;
originally announced March 2017.
-
Improving landscape inference by integrating heterogeneous data in the inverse Ising problem
Authors:
Pierre Barrat-Charlaix,
Matteo Figliuzzi,
Martin Weigt
Abstract:
The inverse Ising problem and its generalizations to Potts and continuous spin models have recently attracted much attention thanks to their successful applications in the statistical modeling of biological data. In the standard setting, the parameters of an Ising model (couplings and fields) are inferred using a sample of equilibrium configurations drawn from the Boltzmann distribution. However, in the context of biological applications, quantitative information for a limited number of microscopic spin configurations has recently become available. In this paper, we extend the usual setting of the inverse Ising model by developing an integrative approach combining the equilibrium sample with (possibly noisy) measurements of the energy performed for a number of arbitrary configurations. Using simulated data, we show that our integrative approach outperforms standard inference based only on the equilibrium sample or the energy measurements, including error correction of noisy energy measurements. As a biological proof-of-concept application, we show that mutational fitness landscapes in proteins can be better described when combining evolutionary sequence data with complementary structural information about mutant sequences.
Submitted 5 November, 2016; v1 submitted 19 September, 2016;
originally announced September 2016.
-
Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis
Authors:
Thomas Gueudré,
Carlo Baldassi,
Marco Zamparo,
Martin Weigt,
Andrea Pagnani
Abstract:
Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.
Submitted 12 May, 2016;
originally announced May 2016.
-
Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1
Authors:
Matteo Figliuzzi,
Hervé Jacquier,
Alexander Schug,
Olivier Tenaillon,
Martin Weigt
Abstract:
The quantitative characterization of mutational landscapes is a task of outstanding importance in evolutionary and medical biology: It is, e.g., of central importance for our understanding of the phenotypic effect of mutations related to disease and antibiotic drug resistance. Here we develop a novel inference scheme for mutational landscapes, which is based on the statistical analysis of large alignments of homologs of the protein of interest. Our method is able to capture epistatic couplings between residues, and therefore to assess the dependence of mutational effects on the sequence context where they appear. Compared to recent large-scale mutagenesis data of the beta-lactamase TEM-1, a protein providing resistance against beta-lactam antibiotics, our method leads to an increase of about 40% in explicative power as compared to approaches neglecting epistasis. We find that the informative sequence context extends to residues at native distances of about 20 Å from the mutated site, reaching thus far beyond residues in direct physical contact.
Submitted 12 October, 2015;
originally announced October 2015.
-
Fast and accurate multivariate Gaussian modeling of protein families: Predicting residue contacts and protein-interaction partners
Authors:
Carlo Baldassi,
Marco Zamparo,
Christoph Feinauer,
Andrea Procaccini,
Riccardo Zecchina,
Martin Weigt,
Andrea Pagnani
Abstract:
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code
Submitted 4 April, 2014;
originally announced April 2014.
-
From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction
Authors:
Simona Cocco,
Remi Monasson,
Martin Weigt
Abstract:
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
Submitted 27 August, 2013; v1 submitted 13 December, 2012;
originally announced December 2012.
-
Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models
Authors:
Magnus Ekeberg,
Cecilia Lövkvist,
Yueheng Lan,
Martin Weigt,
Erik Aurell
Abstract:
Spatially proximate amino acids in a protein tend to coevolve. A protein's three-dimensional (3D) structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.
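The pseudolikelihood objective mentioned in this abstract can be illustrated with a minimal sketch. It is written under stated assumptions and is not the code distributed by the authors: the function name `neg_log_pseudolikelihood` and the array layout (`h` of shape `(L, q)`, `J` of shape `(L, L, q, q)`, sequences encoded as integers in `0..q-1`) are hypothetical, and the sketch omits regularization, any sequence reweighting, and the contact score computed from the inferred couplings.

```python
import numpy as np

def neg_log_pseudolikelihood(seqs, h, J):
    """Average negative log-pseudolikelihood of a Potts model.

    seqs : (M, L) integer array, entries in 0..q-1 (hypothetical encoding)
    h    : (L, q) array of local fields
    J    : (L, L, q, q) array of couplings; J[i, j, a, b] couples state a at
           site i with state b at site j (J[i, i] is ignored)
    """
    M, L = seqs.shape
    rows = np.arange(M)
    total = 0.0
    for i in range(L):
        # logits[m, a] = h_i(a) + sum_{j != i} J_ij(a, a_j^m)
        logits = np.repeat(h[i][None, :], M, axis=0).astype(float)
        for j in range(L):
            if j != i:
                # J[i, j][:, seqs[:, j]] has shape (q, M); transpose to (M, q)
                logits += J[i, j][:, seqs[:, j]].T
        # Numerically stable log-softmax over the q states of site i
        logits -= logits.max(axis=1, keepdims=True)
        log_norm = np.log(np.exp(logits).sum(axis=1))
        # Conditional log-probability of the observed state at site i
        total += (logits[rows, seqs[:, i]] - log_norm).sum()
    return -total / M
```

Minimizing this quantity over `h` and `J` (for instance with a generic gradient-based optimizer) replaces the intractable partition function of the full model with one normalization per site, which is what makes the approach practical for 21-state models.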
Submitted 12 January, 2013; v1 submitted 6 November, 2012;
originally announced November 2012.
-
Direct-coupling analysis of residue co-evolution captures native contacts across many protein families
Authors:
Faruck Morcos,
Andrea Pagnani,
Bryan Lunt,
Arianna Bertolino,
Debora S. Marks,
Chris Sander,
Riccardo Zecchina,
José N. Onuchic,
Terence Hwa,
Martin Weigt
Abstract:
The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced Direct Coupling Analysis (DCA) (Weigt et al. (2009) Proc Natl Acad Sci 106:67). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intra-domain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and inter-domain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, provided the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
Submitted 25 October, 2011; v1 submitted 24 October, 2011;
originally announced October 2011.
-
Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks
Authors:
Andrea Procaccini,
Bryan Lunt,
Hendrik Szurmant,
Terence Hwa,
Martin Weigt
Abstract:
Predictive understanding of the myriad signal transduction pathways in a cell is an outstanding challenge of systems biology. Such pathways are primarily mediated by specific but transient protein-protein interactions, which are difficult to study experimentally. In this study, we dissect the specificity of protein-protein interactions governing two-component signaling (TCS) systems ubiquitously used in bacteria. Exploiting the large number of sequenced bacterial genomes and an operon structure which packages many pairs of interacting TCS proteins together, we developed a computational approach to extract a molecular interaction code capturing the preferences of a small but critical number of directly interacting residue pairs. This code is found to reflect physical interaction mechanisms, with the strongest signal coming from charged amino acids. It is used to predict the specificity of TCS interaction: Our results compare favorably to most available experimental results, including the prediction of 7 (out of 8 known) interaction partners of orphan signaling proteins in Caulobacter crescentus. Surveying the available bacterial genomes, our results suggest that 15-25% of the TCS proteins could participate in out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking candidates, expanding from the anecdotally known examples in model organisms. The tools and results presented here can be used to guide experimental studies towards a system-level understanding of two-component signaling.
Submitted 17 May, 2011;
originally announced May 2011.
-
Classification and sparse-signature extraction from gene-expression data
Authors:
Andrea Pagnani,
Francesca Tria,
Martin Weigt
Abstract:
In this work we suggest a statistical mechanics approach to the classification of high-dimensional data according to a binary label. We propose an algorithm whose aim is twofold: First it learns a classifier from a relatively small number of data, second it extracts a sparse signature, i.e., a lower-dimensional subspace carrying the information needed for the classification. In particular the second part of the task is NP-hard, therefore we propose a statistical-mechanics based message-passing approach. The resulting algorithm is first tested on artificial data to prove its validity, but also to elucidate possible limitations.
As an important application, we consider the classification of gene-expression data measured in various types of cancer tissues. We find that, despite the currently low quantity and quality of available data (the number of available samples is much smaller than the number of measured genes, limiting thus strongly the predictive capacities), the algorithm performs slightly better than many state-of-the-art approaches in bioinformatics.
Submitted 21 July, 2009;
originally announced July 2009.
-
Statistical mechanics of sparse generalization and model selection
Authors:
Alejandro Lage-Castellanos,
Andrea Pagnani,
Martin Weigt
Abstract:
One of the crucial tasks in many inference problems is the extraction of sparse information out of a given number of high-dimensional measurements. In machine learning, this is frequently achieved using, as a penalty term, the $L_p$ norm of the model parameters, with $p\leq 1$ for efficient dilution. Here we propose a statistical-mechanics analysis of the problem in the setting of perceptron memorization and generalization. Using a replica approach, we are able to evaluate the relative performance of naive dilution (obtained by learning without dilution, followed by applying a threshold to the model parameters), $L_1$ dilution (which is frequently used in convex optimization) and $L_0$ dilution (which is optimal but computationally hard to implement). Whereas both $L_p$ diluted approaches clearly outperform the naive approach, we find a small region where $L_0$ works almost perfectly and strongly outperforms the simpler to implement $L_1$ dilution.
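The penalized learning problem referred to in this abstract can be written schematically as follows. The symbols $E$ (training error), $\lambda$ (penalty strength), $w$ (model parameters) and $N$ (their number) are assumed here for illustration and are not taken from the paper, whose perceptron setting may be formulated differently.

```latex
\min_{w \in \mathbb{R}^N} \; E(w) + \lambda \sum_{i=1}^{N} |w_i|^{p},
\qquad 0 \le p \le 1

% p = 1: convex penalty, \|w\|_1 = \sum_i |w_i|
% p = 0: counts the nonzero entries, \|w\|_0 = \#\{\, i : w_i \neq 0 \,\}
%        (with the convention 0^0 = 0)
```

In this notation, naive dilution corresponds to minimizing $E(w)$ alone and afterwards setting to zero every $w_i$ whose magnitude falls below a chosen threshold.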
Submitted 18 July, 2009;
originally announced July 2009.
-
Aligning graphs and finding substructures by a cavity approach
Authors:
S. Bradde,
A. Braunstein,
H. Mahmoudi,
F. Tria,
M. Weigt,
R. Zecchina
Abstract:
We introduce a new distributed algorithm for aligning graphs or finding substructures within a given graph. It is based on the cavity method and is used to study the maximum-clique and the graph-alignment problems in random graphs. The algorithm allows one to analyze large graphs and may find applications in fields such as computational biology. As a proof of concept we use our algorithm to align the similarity graphs of two interacting protein families involved in bacterial signal transduction, and to predict actually interacting protein partners between these families.
Submitted 1 April, 2010; v1 submitted 12 May, 2009;
originally announced May 2009.
-
Identification of direct residue contacts in protein-protein interaction by message passing
Authors:
M. Weigt,
R. A. White,
H. Szurmant,
J. A. Hoch,
T. Hwa
Abstract:
Understanding the molecular determinants of specificity in protein-protein interaction is an outstanding challenge of postgenome biology. The availability of large protein databases generated from sequences of hundreds of bacterial genomes enables various statistical approaches to this problem. In this context covariance-based methods have been used to identify correlation between amino acid positions in interacting proteins. However, these methods have an important shortcoming, in that they cannot distinguish between directly and indirectly correlated residues. We developed a method that combines covariance analysis with global inference analysis, adopted from use in statistical physics. Applied to a set of >2,500 representatives of the bacterial two-component signal transduction system, the combination of covariance with global inference successfully and robustly identified residue pairs that are proximal in space without resorting to ad hoc tuning parameters, both for heterointeractions between sensor kinase (SK) and response regulator (RR) proteins and for homointeractions between RR proteins. The spectacular success of this approach illustrates the effectiveness of the global inference approach in identifying direct interaction based on sequence information alone. We expect this method to be applicable soon to interaction surfaces between proteins present in only 1 copy per genome as the number of sequenced genomes continues to expand. Use of this method could significantly increase the potential targets for therapeutic intervention, shed light on the mechanism of protein-protein interaction, and establish the foundation for the accurate prediction of interacting protein partners.
Submitted 9 January, 2009;
originally announced January 2009.
-
Inference algorithms for gene networks: a statistical mechanics analysis
Authors:
A. Braunstein,
A. Pagnani,
M. Weigt,
R. Zecchina
Abstract:
The inference of gene regulatory networks from high throughput gene expression data is one of the major challenges in systems biology. This paper aims at analysing and comparing two different algorithmic approaches. The first approach uses pairwise correlations between regulated and regulating genes; the second one uses message-passing techniques for inferring activating and inhibiting regulatory interactions. The performance of these two algorithms can be analysed theoretically on well-defined test sets, using tools from the statistical physics of disordered systems like the replica method. We find that the second algorithm outperforms the first one since it takes into account collective effects of multiple regulators.
Submitted 4 December, 2008;
originally announced December 2008.
-
Gene-network inference by message passing
Authors:
A. Braunstein,
A. Pagnani,
M. Weigt,
R. Zecchina
Abstract:
The inference of gene-regulatory processes from gene-expression data belongs to the major challenges of computational systems biology. Here we address the problem from a statistical-physics perspective and develop a message-passing algorithm which is able to infer sparse, directed and combinatorial regulatory mechanisms. Using the replica technique, the algorithmic performance can be characterized analytically for artificially generated data. The algorithm is applied to genome-wide expression data of baker's yeast under various environmental conditions. We find clear cases of combinatorial control, and enrichment in common functional annotations of regulated genes and their regulators.
Submitted 4 December, 2008;
originally announced December 2008.
-
A thermodynamic model for agglomeration of DNA-looping proteins
Authors:
Sumedha,
Martin Weigt
Abstract:
In this paper, we propose a thermodynamic mechanism for the formation of transcriptional foci via the joint agglomeration of DNA-looping proteins and protein-binding domains on DNA: The competition between the gain in protein-DNA binding free energy and the entropy loss due to DNA looping is argued to result in an effective attraction between loops. A mean-field approximation can be described analytically via a mapping to a restricted random-graph ensemble having local degree constraints and global constraints on the number of connected components. It shows the emergence of protein clusters containing a finite fraction of all looping proteins. If the entropy loss due to a single DNA loop is high enough, this transition is found to be of first order.
Submitted 20 October, 2008; v1 submitted 9 January, 2008;
originally announced January 2008.
-
Unsupervised and semi-supervised clustering by message passing: Soft-constraint affinity propagation
Authors:
Michele Leone,
Sumedha,
Martin Weigt
Abstract:
Soft-constraint affinity propagation (SCAP) is a new statistical-physics based clustering technique. First we give the derivation of a simplified version of the algorithm and discuss possibilities of time- and memory-efficient implementations. Later we give a detailed analysis of the performance of SCAP on artificial data, showing that the algorithm efficiently unveils clustered and hierarchical data structures. We generalize the algorithm to the problem of semi-supervised clustering, where data are already partially labeled, and clustering assigns labels to previously unlabeled points. SCAP uses both the geometrical organization of the data and the available labels assigned to few points in a computationally efficient way, as is shown on artificial and biological benchmark data.
Submitted 15 September, 2008; v1 submitted 7 December, 2007;
originally announced December 2007.
-
Clustering by soft-constraint affinity propagation: Applications to gene-expression data
Authors:
Michele Leone,
Sumedha,
Martin Weigt
Abstract:
Motivation: Similarity-measure based clustering is a crucial problem appearing throughout scientific data analysis. Recently, a powerful new algorithm called Affinity Propagation (AP), based on message-passing techniques, was proposed by Frey and Dueck [Frey07]. In AP, each cluster is identified by a common exemplar to which all other data points of the same cluster refer, and exemplars have to refer to themselves. Despite its proven power, AP in its present form suffers from a number of drawbacks. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly shaped clusters, and leads to suboptimal performance, e.g., in analyzing gene expression data. Results: This limitation can be overcome by relaxing the AP hard constraints. A new parameter controls the importance of the constraints compared to the aim of maximizing the overall similarity, and allows one to interpolate between the simple case where each data point selects its closest neighbor as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) is more informative and accurate, and leads to more stable clustering. Even though a new a priori free parameter is introduced, the overall dependence of the algorithm on external tuning is reduced, as robustness is increased and an optimal strategy for parameter selection emerges more naturally. SCAP is tested on biological benchmark data, including in particular microarray data related to various cancer types. We show that the algorithm efficiently unveils the hierarchical cluster structure present in the data sets. Furthermore, it allows one to extract sparse gene expression signatures for each cluster.
Submitted 29 November, 2007; v1 submitted 18 May, 2007;
originally announced May 2007.
-
Propagation of external regulation and asynchronous dynamics in random Boolean networks
Authors:
Hamed Mahmoudi,
Andrea Pagnani,
Martin Weigt,
Riccardo Zecchina
Abstract:
Boolean networks and their dynamics are of great interest as abstract modeling schemes in various disciplines, ranging from biology to computer science. Whereas parallel update schemes have been studied extensively in past years, the level of understanding of asynchronous update schemes is still very poor. In this paper we study the propagation of external information given by regulatory input variables into a random Boolean network. We compute both analytically and numerically the time evolution and the asymptotic behavior of this propagation of external regulation (PER). In particular, this allows us to identify variables which are completely determined by this external information. All those variables in the network which are not directly fixed by PER form a core which contains in particular all non-trivial feedback loops. We design a message-passing approach that allows us to characterize the statistical properties of these cores as a function of the Boolean network and the external condition. Finally, we establish a link between PER dynamics and the full random asynchronous dynamics of a Boolean network.
Submitted 25 April, 2007;
originally announced April 2007.
-
Finitely coordinated models for low-temperature phases of amorphous systems
Authors:
Reimer Kuehn,
Jort van Mourik,
Martin Weigt,
Annette Zippelius
Abstract:
We introduce models of heterogeneous systems with finite connectivity defined on random graphs to capture finite-coordination effects on the low-temperature behavior of finite dimensional systems. Our models use a description in terms of small deviations of particle coordinates from a set of reference positions, particularly appropriate for the description of low-temperature phenomena. A Born-von-Karman type expansion with random coefficients is used to model effects of frozen heterogeneities. The key quantity appearing in the theoretical description is a full distribution of effective single-site potentials which needs to be determined self-consistently. If microscopic interactions are harmonic, the effective single-site potentials turn out to be harmonic as well, and the distribution of these single-site potentials is equivalent to a distribution of localization lengths used earlier in the description of chemical gels. For structural glasses characterized by frustration and anharmonicities in the microscopic interactions, the distribution of single-site potentials involves anharmonicities of all orders, and both single-well and double well potentials are observed, the latter with a broad spectrum of barrier heights. The appearance of glassy phases at low temperatures is marked by the appearance of asymmetries in the distribution of single-site potentials, as previously observed for fully connected systems. Double-well potentials with a broad spectrum of barrier heights and asymmetries would give rise to the well known universal glassy low temperature anomalies when quantum effects are taken into account.
Submitted 20 March, 2007;
originally announced March 2007.
-
Statistical mechanics of combinatorial auctions
Authors:
Tobias Galla,
Michele Leone,
Matteo Marsili,
Mauro Sellitto,
Martin Weigt,
Riccardo Zecchina
Abstract:
Combinatorial auctions are formulated as frustrated lattice gases on sparse random graphs, allowing the determination of the optimal revenue by methods of statistical physics. Transitions between computationally easy and hard regimes are found and interpreted in terms of the geometric structure of the space of solutions. We introduce an iterative algorithm to solve intermediate and large instances, and discuss competing states of optimal revenue and maximal number of satisfied bidders. The algorithm can be generalized to the hard phase and to more sophisticated auction protocols.
Submitted 24 August, 2006; v1 submitted 25 May, 2006;
originally announced May 2006.
-
Message passing for vertex covers
Authors:
Martin Weigt,
Haijun Zhou
Abstract:
Constructing a minimal vertex cover of a graph can be seen as a prototype for a combinatorial optimization problem under hard constraints. In this paper, we develop and analyze message passing techniques, namely warning and survey propagation, which serve as efficient heuristic algorithms for solving these computationally hard problems. We also show how previously obtained results on the typical-case behavior of vertex covers of random graphs can be recovered starting from the message passing equations, and how they can be extended.
Submitted 8 September, 2006; v1 submitted 8 May, 2006;
originally announced May 2006.
-
Sudden emergence of q-regular subgraphs in random graphs
Authors:
Marco Pretti,
Martin Weigt
Abstract:
We investigate the computationally hard problem of whether a random graph of finite average vertex degree has an extensively large $q$-regular subgraph, i.e., a subgraph with all vertices having degree equal to $q$. We reformulate this problem as a constraint-satisfaction problem, and solve it using the cavity method of statistical physics at zero temperature. For $q=3$, we find that the first large $q$-regular subgraphs appear discontinuously at an average vertex degree $c_{3\mathrm{-reg}} \simeq 3.3546$ and immediately contain about 24% of all vertices in the graph. This transition is extremely close to (but different from) the well-known 3-core percolation point $c_{3\mathrm{-core}} \simeq 3.3509$. For $q>3$, the $q$-regular subgraph percolation threshold is found to coincide with that of the $q$-core.
Submitted 30 March, 2006;
originally announced March 2006.
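The $q$-core referred to in the abstract above is the largest subgraph in which every vertex has degree at least $q$; it can be found by repeatedly deleting vertices of degree below $q$. The following is a minimal sketch of that standard pruning procedure, not the cavity calculation of the paper. The function name `q_core` and the adjacency-dictionary input format are assumptions made for illustration.

```python
from collections import deque


def q_core(adj, q):
    """Return the vertex set of the q-core of an undirected graph.

    adj is a dict mapping each vertex to the set of its neighbours
    (a hypothetical input format chosen for this sketch).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    # Start from every vertex whose degree is already below q.
    queue = deque(v for v, d in degree.items() if d < q)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        # Deleting v lowers the degree of each surviving neighbour.
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < q:
                    queue.append(u)
    return set(adj) - removed


# Complete graph on {0, 1, 2, 3} with a pendant vertex 4 attached to 3.
example = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
core3 = q_core(example, 3)
```

In the example, the pendant vertex is pruned and the four mutually connected vertices remain as the 3-core. The order in which low-degree vertices are removed does not change the result.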
-
Introduction to graphs
Authors:
Alexander K. Hartmann,
Martin Weigt
Abstract:
Graph theory provides fundamental concepts for many fields of science, such as statistical physics, network analysis and theoretical computer science. Here we give a pedagogical introduction to graph theory, divided into three sections. In the first, we introduce some basic notations and graph theoretical problems, e.g., Eulerian circuits, vertex covers, and graph colorings. The second section describes some fundamental algorithmic concepts to solve basic graph problems numerically, e.g., depth-first search, calculation of strongly connected components, and minimum-spanning tree algorithms. The last section introduces random graphs and probabilistic tools to analyze the emergence of a giant component and a giant q-core in these graphs.
The presented text is published as the third chapter of the book "Phase Transitions in Combinatorial Optimization Problems" (Wiley-VCH 2005). Together with introductions to algorithms, to complexity theory and to basic statistical mechanics over random structures, it provides the technical basis for the more advanced chapters. These cover the analysis of phase transitions in combinatorial optimization problems, algorithmic and analytical approaches based on statistical physics tools (replica and cavity methods), the analysis of various search algorithms and the development of efficient heuristic algorithms based on message passing techniques (warning, belief, and survey propagation).
Submitted 6 February, 2006;
originally announced February 2006.
-
Computational core and fixed-point organisation in Boolean networks
Authors:
L. Correale,
M. Leone,
A. Pagnani,
M. Weigt,
R. Zecchina
Abstract:
In this paper, we analyse large random Boolean networks in terms of a constraint satisfaction problem. We first develop an algorithmic scheme which allows us to prune simple logical cascades and under-determined variables, thereby returning the computational core of the network. Second, we apply the cavity method to analyse the number and organisation of fixed points. We find in particular a phase transition between an easy and a complex regulatory phase, the latter being characterised by the existence of an exponential number of macroscopically separated fixed-point clusters. The different techniques developed are reinterpreted as algorithms for the analysis of single Boolean networks, and they are applied to analysis and in silico experiments on the gene-regulatory networks of baker's yeast (Saccharomyces cerevisiae) and the segment-polarity genes of the fruit fly Drosophila melanogaster.
Submitted 6 March, 2006; v1 submitted 5 December, 2005;
originally announced December 2005.
-
Cavity Approach to the Random Solid State
Authors:
Xiaoming Mao,
Paul M. Goldbart,
Marc Mezard,
Martin Weigt
Abstract:
The cavity approach is used to address the physical properties of random solids in equilibrium. Particular attention is paid to the fraction of localized particles and the distribution of localization lengths characterizing their thermal motion. This approach is of relevance to a wide class of random solids, including rubbery media (formed via the vulcanization of polymer fluids) and chemical gels (formed by the random covalent bonding of fluids of atoms or small molecules). The cavity approach confirms results that have been obtained previously via replica mean-field theory, doing so in a way that sheds new light on their physical origin.
Submitted 8 June, 2005;
originally announced June 2005.
-
A hard-sphere model on generalised Bethe lattices: Dynamics
Authors:
Hendrik Hansen-Goos,
Martin Weigt
Abstract:
We analyse the dynamics of a hard-sphere lattice gas on generalised Bethe lattices using a projective approximation scheme (PAS). The latter consists in mapping the system's dynamics to a finite set of global observables; closure of the resulting equations is obtained by approximating the true non-equilibrium state by a pseudo-equilibrium based only on the value of the observables under consideration. We study the liquid-crystal as well as the liquid-spin-glass transitions; special attention is given to the prediction of equilibration times and their divergence close to the phase transitions. Analytical results are corroborated by Monte-Carlo simulations.
Submitted 9 May, 2005;
originally announced May 2005.
-
A hard-sphere model on generalized Bethe lattices: Statics
Authors:
Hendrik Hansen-Goos,
Martin Weigt
Abstract:
We analyze the phase diagram of a model of hard spheres of chemical radius one, which is defined over a generalized Bethe lattice containing short loops. We find a liquid, two different crystalline, a glassy and an unusual crystalline glassy phase. Special attention is also paid to the close-packing limit in the glassy phase. All analytical results are cross-checked by numerical Monte-Carlo simulations.
Submitted 10 May, 2005; v1 submitted 24 January, 2005;
originally announced January 2005.
-
Core percolation and onset of complexity in Boolean networks
Authors:
L. Correale,
M. Leone,
A. Pagnani,
M. Weigt,
R. Zecchina
Abstract:
The determination and classification of fixed points of large Boolean networks is addressed in terms of a constraint satisfaction problem. We develop a general simplification scheme that, by removing all those variables and functions belonging to trivial logical cascades, returns the computational core of the network. The onset of an easy-to-complex regulatory phase is introduced as a function of the parameters of the model, identifying both theoretically and algorithmically the relevant regulatory variables.
Submitted 22 November, 2005; v1 submitted 16 December, 2004;
originally announced December 2004.
-
Threshold values, stability analysis and high-q asymptotics for the coloring problem on random graphs
Authors:
Florent Krzakala,
Andrea Pagnani,
Martin Weigt
Abstract:
We consider the problem of coloring Erdős-Rényi and regular random graphs of finite connectivity using q colors. It has been studied so far using the cavity approach within the so-called one-step replica symmetry breaking (1RSB) ansatz. We derive a general criterion for the validity of this ansatz and, applying it to the ground state, we provide evidence that the 1RSB solution gives exact threshold values c_q for the q-COL/UNCOL phase transition. We also study the asymptotic thresholds for q >> 1, finding c_q = 2q log(q) - log(q) - 1 + o(1) in perfect agreement with rigorous mathematical bounds, as well as the nature of excited states, and give a global phase diagram of the problem.
Submitted 28 July, 2004; v1 submitted 30 March, 2004;
originally announced March 2004.
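As a small numerical illustration of the large-q formula quoted in the abstract above, the sketch below evaluates its leading terms, c_q ≈ 2q log(q) - log(q) - 1, dropping the o(1) correction. The function name is hypothetical, and reading log as the natural logarithm is an assumption of this sketch; the values are only meaningful for q >> 1.

```python
import math


def coloring_threshold_large_q(q):
    """Leading terms of the large-q threshold: 2 q log q - log q - 1.

    The o(1) correction from the abstract is dropped, and log is taken
    to be the natural logarithm (an assumption of this sketch).
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    return 2 * q * math.log(q) - math.log(q) - 1


# Tabulate the estimate for a few numbers of colors.
estimates = {q: coloring_threshold_large_q(q) for q in (10, 100, 1000)}
```

The dominant term 2q log(q) grows slightly faster than linearly in q, so the estimated critical connectivity per color increases slowly with the number of colors.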
-
Approximation schemes for the dynamics of diluted spin models: the Ising ferromagnet on a Bethe lattice
Authors:
Guilhem Semerjian,
Martin Weigt
Abstract:
We discuss analytical approximation schemes for the dynamics of diluted spin models. The original dynamics of the complete set of degrees of freedom is replaced by a hierarchy of equations including an increasing number of global observables, which can be closed approximately at different levels of the hierarchy. We illustrate this method on the simple example of the Ising ferromagnet on a Bethe lattice, investigating the first three possible closures, which are all exact in the long time limit, and which yield more and more accurate predictions for the finite-time behavior. We also investigate the critical region around the phase transition, and the behavior of two-time correlation functions. We finally underline the close relationship between this approach and the dynamical replica theory under the assumption of replica symmetry.
Submitted 17 February, 2004;
originally announced February 2004.
-
Statistical mechanics of the vertex-cover problem
Authors:
Alexander K. Hartmann,
Martin Weigt
Abstract:
We review recent progress in the study of the vertex-cover problem (VC). VC belongs to the class of NP-complete graph theoretical problems, which plays a central role in theoretical computer science. On ensembles of random graphs, VC exhibits a coverable-uncoverable phase transition. Very close to this transition, depending on the solution algorithm, easy-hard transitions in the typical running time of the algorithms occur.
We explain a statistical mechanics approach, which works by mapping VC to a hard-core lattice gas, and then applying techniques like the replica trick or the cavity approach. Using these methods, the phase diagram of VC could be obtained exactly for connectivities $c<e$, where VC is replica symmetric. Recently, this result could be confirmed using traditional mathematical techniques. For $c>e$, the solution of VC exhibits full replica symmetry breaking.
The statistical mechanics approach can also be used to study analytically the typical running time of simple complete and incomplete algorithms for VC. Finally, we describe recent results for VC when studied on other ensembles of finite- and infinite-dimensional graphs.
Submitted 10 July, 2003;
originally announced July 2003.
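The mapping of VC to a hard-core lattice gas mentioned in the abstract above rests on a simple complementarity: a vertex set covers every edge exactly when the uncovered vertices, viewed as particles, never occupy both ends of an edge. The sketch below checks this on a small graph. The function names and the edge-list representation are assumptions made for illustration.

```python
def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in cover."""
    return all(u in cover or v in cover for u, v in edges)


def is_hard_core_configuration(edges, occupied):
    """True if no edge has both endpoints occupied by a particle."""
    return not any(u in occupied and v in occupied for u, v in edges)


# A 4-cycle: 0-1-2-3-0.
vertices = {0, 1, 2, 3}
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

cover = {0, 2}
particles = vertices - cover  # uncovered vertices become gas particles

cover_ok = is_vertex_cover(edges, cover)
gas_ok = is_hard_core_configuration(edges, particles)
```

Minimizing the size of the cover is then the same as maximizing the number of particles in the gas, which is what makes statistical-mechanics tools applicable.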
-
Polynomial iterative algorithms for coloring and analyzing random graphs
Authors:
A. Braunstein,
R. Mulet,
A. Pagnani,
M. Weigt,
R. Zecchina
Abstract:
We study the graph coloring problem over random graphs of finite average connectivity $c$. Given a number $q$ of available colors, we find that graphs with low connectivity almost always admit a proper coloring whereas graphs with high connectivity are uncolorable. Depending on $q$, we find the precise value of the critical average connectivity $c_q$. Moreover, we show that below $c_q$ there exists a clustering phase $c\in [c_d,c_q]$ in which ground states spontaneously divide into an exponential number of clusters. Furthermore, we extend our considerations to the case of single instances, showing consistent results. This leads us to propose a new algorithm able to color in polynomial time random graphs in the hard but colorable region, i.e., when $c\in [c_d,c_q]$.
Submitted 24 April, 2003;
originally announced April 2003.
-
Solving satisfiability problems by fluctuations: The dynamics of stochastic local search algorithms
Authors:
Wolfgang Barthel,
Alexander K. Hartmann,
Martin Weigt
Abstract:
Stochastic local search algorithms are frequently used to numerically solve hard combinatorial optimization or decision problems. We give numerical and approximate analytical descriptions of the dynamics of such algorithms applied to random satisfiability problems. We find two different dynamical regimes, depending on the number of constraints per variable: For low constraintness, the problems are solved efficiently, i.e. in linear time. For higher constraintness, the solution times become exponential. We observe that the dynamical behavior is characterized by a fast equilibration and fluctuations around this equilibrium. If the algorithm runs long enough, an exponentially rare fluctuation towards a solution appears.
Submitted 7 May, 2003; v1 submitted 15 January, 2003;
originally announced January 2003.
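To make the notion of a stochastic local search for satisfiability concrete, here is a minimal sketch of a plain random-walk heuristic: pick an unsatisfied clause at random, flip one of its variables at random, and repeat. The abstract above does not specify which algorithms are analysed, so this particular variant, the function name `random_walk_sat`, the signed-integer clause encoding, and the `max_flips` and `seed` parameters are all assumptions made for illustration.

```python
import random


def random_walk_sat(clauses, n_vars, max_flips=10_000, seed=0):
    """Random-walk search for a satisfying assignment.

    clauses: iterable of tuples of non-zero ints; literal k means
    variable k is True, -k means variable k is False (assumed encoding).
    Returns a list of n_vars booleans, or None if no solution was found.
    """
    rng = random.Random(seed)
    # Index 0 is unused so that variable k lives at assign[k].
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]

    def satisfied(clause):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign[1:]
        # Fluctuation step: flip a random variable of a random violated clause.
        clause = rng.choice(unsat)
        var = abs(rng.choice(clause))
        assign[var] = not assign[var]
    # Final check in case the last flip produced a solution.
    return assign[1:] if all(satisfied(c) for c in clauses) else None


formula = [(1, 2, 3), (-1, 2, -3), (1, -2, 3), (-1, -2, 3)]
solution = random_walk_sat(formula, n_vars=3)
```

Each flip fixes the chosen clause but may break others, so the number of violated clauses fluctuates rather than decreasing monotonically; a solution is reached when a fluctuation brings that number to zero.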