-
Small Coupling Expansion for Multiple Sequence Alignment
Authors:
Louise Budzynski,
Andrea Pagnani
Abstract:
The alignment of biological sequences such as DNA, RNA, and proteins is one of the basic tools for detecting evolutionary patterns, as well as functional/structural characterizations, between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics tools are based on profile models that assume the statistical independence of the different sites of the sequences. Over the last years, it has become increasingly clear that homologous sequences show complex patterns of long-range correlations over the primary sequence as a consequence of the natural evolution process that selects genetic variants under the constraint of preserving the functional/structural determinants of the sequence. Here, we present a new alignment algorithm based on message passing techniques that overcomes the limitations of profile models. Our method is based on a new perturbative small-coupling expansion of the free energy of the model that assumes a linear chain approximation as the $0^\mathrm{th}$-order of the expansion. We test the potential of the algorithm against standard competing strategies on several biological sequences.
Submitted 27 April, 2023; v1 submitted 7 October, 2022;
originally announced October 2022.
-
Statistical-physics approaches to RNA molecules, families and networks
Authors:
Simona Cocco,
Andrea De Martino,
Andrea Pagnani,
Martin Weigt
Abstract:
This contribution focuses on the fascinating RNA molecule, its sequence-dependent folding driven by base-pairing interactions, the interplay between these interactions and natural evolution, and its multiple regulatory roles. The four of us have dug into these topics using the tools and the spirit of the statistical physics of disordered systems, and in particular the concept of a disordered (energy/fitness) landscape. After an introduction to RNA molecules and the perspectives they open not only in evolutionary and synthetic biology but also in medicine, we will introduce the important notions of energy and fitness landscapes for these molecules. In Section III we will review some models and algorithms for RNA sequence-to-secondary-structure mapping. Section IV discusses how the secondary-structure energy landscape can be derived from unzipping data. Section V deals with the inference of RNA structure from evolutionary sequence data sampled in different organisms. This will shift the focus from the `sequence-to-structure' mapping described in Section III to a `sequence-to-function' landscape that can be inferred from laboratory evolutionary data on DNA aptamers. Finally, in Section VI, we shall discuss the rich theoretical picture linking networks of interacting RNA molecules to the organization of robust, systemic regulatory programs. Along this path, we will therefore explore phenomena across multiple scales in space, number of molecules and time, showing how the biological complexity of the RNA world can be captured by the unifying concepts of statistical physics.
Submitted 27 July, 2022;
originally announced July 2022.
-
Relationship between fitness and heterogeneity in exponentially growing microbial populations
Authors:
Anna Paola Muntoni,
Alfredo Braunstein,
Andrea Pagnani,
Daniele De Martino,
Andrea De Martino
Abstract:
Despite major environmental and genetic differences, microbial metabolic networks are known to generate consistent physiological outcomes across vastly different organisms. This remarkable robustness suggests that, at least in bacteria, metabolic activity may be guided by universal principles. The constrained optimization of evolutionarily-motivated objective functions like the growth rate has emerged as the key theoretical assumption for the study of bacterial metabolism. While conceptually and practically useful in many situations, the idea that certain functions are optimized is hard to validate in data. Moreover, it is not always clear how optimality can be reconciled with the high degree of single-cell variability observed in experiments within microbial populations. To shed light on these issues, we develop an inverse modeling framework that connects the fitness of a population of cells (represented by the mean single-cell growth rate) to the underlying metabolic variability through the Maximum-Entropy inference of the distribution of metabolic phenotypes from data. While no clear objective function emerges, we find that, as the medium gets richer, the fitness and inferred variability for Escherichia coli populations follow and slowly approach the theoretically optimal bound defined by minimal reduction of variability at given fitness. These results suggest that bacterial metabolism may be crucially shaped by a population-level trade-off between growth and heterogeneity.
Submitted 7 April, 2022; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Aligning biological sequences by exploiting residue conservation and coevolution
Authors:
Anna Paola Muntoni,
Andrea Pagnani,
Martin Weigt,
Francesco Zamponi
Abstract:
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e. arranging sequences from different organisms in such a way as to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position-specificities like conservation in sequences, but assume an independent evolution of different positions. Over the last years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles, and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need for complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.
Submitted 13 November, 2020; v1 submitted 18 May, 2020;
originally announced May 2020.
-
Compressed sensing reconstruction using Expectation Propagation
Authors:
Alfredo Braunstein,
Anna Paola Muntoni,
Andrea Pagnani,
Mirko Pieropan
Abstract:
Many interesting problems in fields ranging from telecommunications to computational biology can be formalized in terms of large underdetermined systems of linear equations with additional constraints or regularizers. One of the most studied ones, the Compressed Sensing problem (CS), consists in finding the solution with the smallest number of non-zero components of a given system of linear equations $\boldsymbol y = \mathbf{F} \boldsymbol{w}$ for known measurement vector $\boldsymbol{y}$ and sensing matrix $\mathbf{F}$. Here, we will address the compressed sensing problem within a Bayesian inference framework where the sparsity constraint is remapped into a singular prior distribution (called Spike-and-Slab or Bernoulli-Gauss). Solution to the problem is attempted through the computation of marginal distributions via Expectation Propagation (EP), an iterative computational scheme originally developed in Statistical Physics. We will show that this strategy is comparatively more accurate than the alternatives in solving instances of CS generated from statistically correlated measurement matrices. For computational strategies based on the Bayesian framework such as variants of Belief Propagation, this is to be expected, as they implicitly rely on the hypothesis of statistical independence among the entries of the sensing matrix. Perhaps surprisingly, the method also uniformly outperforms all the other state-of-the-art methods in our tests.
Submitted 3 August, 2019; v1 submitted 10 April, 2019;
originally announced April 2019.
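A minimal sketch of the problem setup described in the abstract above, in plain Python. It only builds an instance $\boldsymbol y = \mathbf{F} \boldsymbol{w}$ with a sparse $\boldsymbol{w}$ drawn from a Bernoulli-Gauss (spike-and-slab) prior; it does not implement Expectation Propagation. The function names, the sizes, and the density `rho` are illustrative assumptions, not taken from the paper.

```python
import random

def sample_spike_and_slab(n, rho, rng):
    """Draw n weights: standard Gaussian with probability rho, exactly zero otherwise."""
    return [rng.gauss(0.0, 1.0) if rng.random() < rho else 0.0 for _ in range(n)]

def measure(F, w):
    """Return y = F w for a dense matrix F given as a list of rows."""
    return [sum(f_ij * w_j for f_ij, w_j in zip(row, w)) for row in F]

rng = random.Random(0)
n, m, rho = 20, 8, 0.2  # hypothetical sizes; m < n makes the system underdetermined
w_true = sample_spike_and_slab(n, rho, rng)
F = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]  # uncorrelated sensing matrix
y = measure(F, w_true)
```

The reconstruction task is then to recover `w_true` from `y` and `F` alone. The abstract's claim concerns correlated sensing matrices, which this uncorrelated toy instance does not reproduce.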
-
Kinetic modelling of competition and depletion of shared miRNAs by competing endogenous RNAs
Authors:
Araks Martirosyan,
Marco Del Giudice,
Chiara Enrico Bena,
Andrea Pagnani,
Carla Bosia,
Andrea De Martino
Abstract:
Non-coding RNAs play a key role in the post-transcriptional regulation of mRNA translation and turnover in eukaryotes. miRNAs, in particular, interact with their target RNAs through protein-mediated, sequence-specific binding, giving rise to extended and highly heterogeneous miRNA-RNA interaction networks. Within such networks, competition to bind miRNAs can generate an effective positive coupling between their targets. Competing endogenous RNAs (ceRNAs) can in turn regulate each other through miRNA-mediated crosstalk. Albeit potentially weak, ceRNA interactions can occur both dynamically, affecting e.g. the regulatory clock, and at stationarity, in which case ceRNA networks as a whole can be implicated in the composition of the cell's proteome. Many features of ceRNA interactions, including the conditions under which they become significant, can be unraveled by mathematical and in silico models. We review the understanding of the ceRNA effect obtained within such frameworks, focusing on the methods employed to quantify it, its role in the processing of gene expression noise, and how network topology can determine its reach.
Submitted 22 December, 2018;
originally announced December 2018.
-
Improved Pseudolikelihood Regularization and Decimation methods on Non-linearly Interacting Systems with Continuous Variables
Authors:
Alessia Marruzzo,
Payal Tyagi,
Fabrizio Antenucci,
Andrea Pagnani,
Luca Leuzzi
Abstract:
We propose and test improvements to state-of-the-art techniques of Bayesian statistical inference based on pseudolikelihood maximization with $\ell_1$ regularization and with decimation. In particular, we present a method to determine the best value of the regularizer parameter starting from a hypothesis testing technique. Concerning the decimation, we also analyze the worst case scenario in which there is no sharp peak in the tilted-pseudolikelihood function, first defined as a criterion to stop the decimation. Techniques are applied to noisy systems with non-linear dynamics, mapped onto multi-variable interacting Hamiltonian effective models for waves and phasors. Results are analyzed varying the number of available samples and the externally tunable temperature-like parameter mimicking real data noise. Finally, the behavior of the inference procedures described is tested against a wrong hypothesis: non-linearly generated data are analyzed with a pairwise interacting hypothesis. Our analysis shows that, looking at the behavior of the inverse graphical problem as data size increases, the methods presented make it possible to rule out a wrong hypothesis.
Submitted 23 May, 2018; v1 submitted 2 August, 2017;
originally announced August 2017.
-
An analytic approximation of the feasible space of metabolic networks
Authors:
Alfredo Braunstein,
Anna Paola Muntoni,
Andrea Pagnani
Abstract:
Assuming a steady-state condition within a cell, metabolic fluxes satisfy an under-determined linear system of stoichiometric equations. Characterizing the space of fluxes that satisfy such equations along with given bounds (and possibly additional relevant constraints) is considered of utmost importance for the understanding of cellular metabolism. Extreme values for each individual flux can be computed with Linear Programming (as in Flux Balance Analysis), and their marginal distributions can be approximately computed with Monte-Carlo sampling. Here we present an approximate analytic method for the latter task based on Expectation Propagation equations that does not involve sampling and can achieve much better predictions than other existing analytic methods. The method is iterative, and its computation time is dominated by one matrix inversion per iteration. With respect to sampling, we show through extensive simulation that it has some advantages, including computation time and the ability to efficiently fix empirically estimated distributions of fluxes.
Submitted 6 April, 2017; v1 submitted 17 February, 2017;
originally announced February 2017.
-
Inverse problem for multi-body interaction of nonlinear waves
Authors:
Alessia Marruzzo,
Payal Tyagi,
Fabrizio Antenucci,
Andrea Pagnani,
Luca Leuzzi
Abstract:
The inverse problem is studied in multi-body systems with nonlinear dynamics representing, e.g., phase-locked wave systems, standard multimode and random lasers. Using a general model for four-body interacting complex-valued variables, we test two methods based on pseudolikelihood, respectively with regularization and with decimation, to determine the coupling constants from sets of measured configurations. We test statistical inference predictions for an increasing number of sampled configurations and for an externally tunable temperature-like parameter mimicking real data noise and helping minimization procedures. The analyzed models with phasors and rotors are generalizations of real-valued spherical problems (e.g., density fluctuations), discrete spins (Ising and vectorial Potts) or finite number of states (standard Potts): the inference methods presented here can then be straightforwardly applied to a large class of inverse problems.
Submitted 1 January, 2017; v1 submitted 28 July, 2016;
originally announced July 2016.
-
Regularization and decimation pseudolikelihood approaches to statistical inference in $XY$-spin models
Authors:
Payal Tyagi,
Alessia Marruzzo,
Andrea Pagnani,
Fabrizio Antenucci,
Luca Leuzzi
Abstract:
We apply a pseudolikelihood approach with $\ell_2$-regularization, as well as the recently introduced pseudolikelihood with decimation procedure, to the inverse problem in continuous spin models on arbitrary networks, with arbitrarily disordered couplings. The performance of the approaches is tested against data produced by Monte Carlo numerical simulations and also compared with previously studied fully-connected mean-field-based inference techniques. The results clearly show that the best network reconstruction is obtained through the decimation scheme, which also allows the inference to be extended to lower temperature regimes. Possible applications to phasor models for light propagation in random media are proposed and discussed.
Submitted 23 March, 2016; v1 submitted 16 March, 2016;
originally announced March 2016.
-
Inference for interacting linear waves in ordered and random media
Authors:
P. Tyagi,
A. Pagnani,
F. Antenucci,
M. Ibáñez Berganza,
L. Leuzzi
Abstract:
A statistical inference method is developed and tested for pairwise interacting systems whose degrees of freedom are continuous angular variables, such as planar spins in magnetic systems or wave phases in optics and acoustics. We investigate systems with both deterministic and quenched disordered couplings on two extreme topologies: complete and sparse graphs. To match further applications in optics, complex couplings and external fields are also considered, and general inference formulas are derived for real and imaginary parts of Hermitian coupling matrices from real and imaginary parts of complex correlation functions. The whole procedure is finally tested on correlation functions and local magnetizations generated numerically by means of Monte Carlo simulations.
Submitted 31 January, 2015;
originally announced February 2015.
-
Identifying all irreducible conserved metabolite pools in genome-scale metabolic networks: a general method and the case of Escherichia coli
Authors:
A. De Martino,
D. De Martino,
R. Mulet,
A. Pagnani
Abstract:
The stoichiometry of metabolic networks usually gives rise to a family of conservation laws for the aggregate concentration of specific pools of metabolites, which not only constrain the dynamics of the network, but also provide key insight into a cell's production capabilities. When the conserved quantity identifies with a chemical moiety, extracting all such conservation laws from the stoichiometry amounts to finding all integer solutions to an NP-hard programming problem. Here we propose a novel and efficient computational strategy that combines Monte Carlo, message passing, and relaxation algorithms to compute the complete set of irreducible integer conservation laws of a given stoichiometric matrix, also providing a certificate for correctness and maximality of the solution. The method is deployed for the analysis of the complete set of irreducible integer pools of two large-scale reconstructions of the metabolism of the bacterium Escherichia coli in different growth media. In addition, we uncover a scaling relation that links the size of the irreducible pool basis to the number of metabolites, for which we present an analytical explanation.
Submitted 10 September, 2013;
originally announced September 2013.
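A minimal sketch of what a conservation law means for a stoichiometric matrix, in plain Python. It assumes the standard convention that rows of `S` are metabolites and columns are reactions, so that an integer vector `z` defines a conserved pool when $z^{T} S = 0$. It only checks a candidate pool; it does not implement the Monte Carlo, message-passing, and relaxation strategy of the paper. The toy network and all names are hypothetical.

```python
def is_conserved_pool(S, z):
    """True if z^T S = 0, i.e. every reaction leaves sum_i z_i * c_i unchanged."""
    n_metabolites = len(S)
    n_reactions = len(S[0])
    return all(
        sum(z[i] * S[i][r] for i in range(n_metabolites)) == 0
        for r in range(n_reactions)
    )

# Hypothetical toy network with metabolites (ATP, ADP, G) and three reactions:
#   r0: ATP -> ADP,   r1: ADP -> ATP,   r2: G + ATP -> ADP
S = [
    [-1,  1, -1],  # ATP
    [ 1, -1,  1],  # ADP
    [ 0,  0, -1],  # G
]

adenylate_pool = [1, 1, 0]  # ATP + ADP
atp_only = [1, 0, 0]        # ATP alone
```

Here `is_conserved_pool(S, adenylate_pool)` holds because each reaction converts ATP and ADP into one another, while `atp_only` fails because reaction r0 consumes ATP.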
-
Modeling competing endogenous RNAs networks
Authors:
Carla Bosia,
Andrea Pagnani,
Riccardo Zecchina
Abstract:
MicroRNAs (miRNAs) are small RNA molecules, about 22 nucleotides long, which post-transcriptionally regulate their target messenger RNAs (mRNAs). They play key roles in gene regulatory networks, ranging from signaling pathways to tissue morphogenesis, and their aberrant behavior is often associated with the development of various diseases. Recently it has been shown that, in analogy with the better understood case of small RNAs in bacteria, the way miRNAs interact with their targets can be described in terms of a titration mechanism characterized by threshold effects, hypersensitivity of the system near the threshold, and prioritized cross-talk among targets. The latter characteristic has lately been identified as the competing endogenous RNA (ceRNA) effect, marking those indirect interactions among targets that compete for a common pool of miRNAs. Here we analyze the equilibrium and out-of-equilibrium properties of a general stochastic model of $M$ miRNAs interacting with $N$ mRNA targets. In particular, we are able to describe in detail the peculiar equilibrium and non-equilibrium phenomena that the system displays around the threshold: (i) maximal cross-talk and correlation between targets, (ii) robustness of the ceRNA effect with respect to the model's parameters and in particular to the catalyticity of the miRNA-mRNA interaction, and (iii) anomalous response-time to external perturbations.
Submitted 8 October, 2012;
originally announced October 2012.
-
3D Protein Structure Predicted from Sequence
Authors:
Debora S. Marks,
Lucy J. Colwell,
Robert Sheridan,
Thomas A. Hopf,
Andrea Pagnani,
Riccardo Zecchina,
Chris Sander
Abstract:
The evolutionary trajectory of a protein through sequence space is constrained by function and three-dimensional (3D) structure. Residues in spatial proximity tend to co-evolve, yet attempts to invert the evolutionary record to identify these constraints and use them to computationally fold proteins have so far been unsuccessful. Here, we show that co-variation of residue pairs, observed in a large protein family, provides sufficient information to determine 3D protein structure. Using a data-constrained maximum entropy model of the multiple sequence alignment, we identify pairs of statistically coupled residue positions which are expected to be close in the protein fold, termed contacts inferred from evolutionary information (EICs). To assess the amount of information about the protein fold contained in these coupled pairs, we evaluate the accuracy of predicted 3D structures for proteins of 50-260 residues, from 15 diverse protein families, including a G-protein coupled receptor. These structure predictions are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The resulting low Cα-RMSD error range of 2.7-5.1Å, over at least 75% of the protein, indicates the potential for predicting essentially correct 3D structures for the thousands of protein families that have no known structure, provided they include a sufficiently large number of divergent sample sequences. With the current enormous growth in sequence information based on new sequencing technology, this opens the door to a comprehensive survey of protein 3D structures, including many not currently accessible to the experimental methods of structural genomics. This advance has potential applications in many biological contexts, such as synthetic biology, identification of functional sites in proteins and interpretation of the functional impact of genetic variants.
Submitted 25 October, 2011; v1 submitted 23 October, 2011;
originally announced October 2011.
-
Statistical mechanics of sparse generalization and model selection
Authors:
Alejandro Lage-Castellanos,
Andrea Pagnani,
Martin Weigt
Abstract:
One of the crucial tasks in many inference problems is the extraction of sparse information out of a given number of high-dimensional measurements. In machine learning, this is frequently achieved using, as a penality term, the $L_p$ norm of the model parameters, with $p\leq 1$ for efficient dilution. Here we propose a statistical-mechanics analysis of the problem in the setting of perceptron me…
▽ More
One of the crucial tasks in many inference problems is the extraction of sparse information out of a given number of high-dimensional measurements. In machine learning, this is frequently achieved using, as a penalty term, the $L_p$ norm of the model parameters, with $p\leq 1$ for efficient dilution. Here we propose a statistical-mechanics analysis of the problem in the setting of perceptron memorization and generalization. Using a replica approach, we are able to evaluate the relative performance of naive dilution (obtained by learning without dilution, followed by applying a threshold to the model parameters), $L_1$ dilution (which is frequently used in convex optimization) and $L_0$ dilution (which is optimal but computationally hard to implement). Whereas both $L_p$ diluted approaches clearly outperform the naive approach, we find a small region where $L_0$ works almost perfectly and strongly outperforms the simpler-to-implement $L_1$ dilution.
Submitted 18 July, 2009;
originally announced July 2009.
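A minimal sketch of the three dilution notions named in the abstract above, in plain Python. It assumes the common convention that the $L_p$ penalty is $\sum_i |w_i|^p$, with $p = 0$ counting the non-zero parameters, and that naive dilution zeroes every parameter whose magnitude falls below a threshold. The function names, the example vector, and the threshold value are hypothetical; the replica analysis of the paper is not reproduced here.

```python
def lp_penalty(w, p):
    """L_p-type penalty sum_i |w_i|^p; for p = 0 it counts the non-zero parameters."""
    if p == 0:
        return sum(1 for x in w if x != 0)
    return sum(abs(x) ** p for x in w)

def naive_dilution(w, threshold):
    """Zero out every parameter whose magnitude is below the threshold."""
    return [x if abs(x) >= threshold else 0.0 for x in w]

w = [0.9, -0.05, 0.0, 1.5, 0.02]  # hypothetical parameters learned without dilution
w_diluted = naive_dilution(w, threshold=0.1)

l0_before = lp_penalty(w, 0)          # number of non-zero parameters before thresholding
l0_after = lp_penalty(w_diluted, 0)   # number of non-zero parameters after thresholding
l1_after = lp_penalty(w_diluted, 1)   # sum of absolute values after thresholding
```

Naive dilution applies the threshold only after learning, whereas $L_1$ and $L_0$ dilution would add the corresponding penalty to the training objective itself.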