-
Predicting protein folding dynamics using sequence information
Authors:
Ezequiel A. Galpern,
Federico Caamaño,
Diego U. Ferreiro
Abstract:
Natural protein sequences somehow encode the structural forms that these molecules adopt. Recent developments in structure prediction are agnostic to the mechanisms by which proteins fold and represent them as static objects. However, the amino acid sequences also encode information about how the folding process can happen, and how variations in the sequences affect the populations of the distinct structural forms that proteins acquire. Here we present a method to infer protein folding dynamics based only on sequence information. For this, we first rely on obtaining a precise 'evolutionary field' from the observed variations in the sequences of homologous proteins. We then show how to map the energetics to a coarse-grained folding model where the protein is treated as a string of interacting foldons. We then describe how, for any given protein sequence of a family, the equilibrium folding curve can be computed and how the emergence of protein folding sub-domains can be identified. We finally present protocols to analyze how mutations perturb both the folding stability and the cooperativity, which represent predictions for a deep-mutational scan of a protein of interest.
Submitted 22 May, 2025;
originally announced May 2025.
-
Frustration, dynamics and catalysis
Authors:
R. Gonzalo Parra,
Diego U. Ferreiro
Abstract:
The controlled dissipation of chemical potentials is the fundamental way cells make a living. Enzyme-mediated catalysis allows the various transformations to proceed at biologically relevant rates with remarkable precision and efficiency. Theory, experiments and computational studies coincide to show that local frustration is a useful concept to relate protein dynamics with catalytic power. Local frustration gives rise to the asperities of the energy landscapes that can harness the thermal fluctuations to guide the functional protein motions. We review here recent advances into these relationships from various fields of protein science. The biologically relevant dynamics is tuned by the evolution of protein sequences that modulate the local frustration patterns to near optimal values.
Submitted 1 May, 2025;
originally announced May 2025.
-
Frustration In Physiology And Molecular Medicine
Authors:
R. Gonzalo Parra,
Elizabeth A. Komives,
Peter G. Wolynes,
Diego U. Ferreiro
Abstract:
Molecules provide the ultimate language in terms of which physiology and pathology must be understood. Myriads of proteins participate in elaborate networks of interactions and perform chemical activities coordinating the life of cells. To perform these often amazing tasks, proteins must move and we must think of them as dynamic ensembles of three-dimensional structures formed first by folding the polypeptide chains so as to minimize the conflicts between the interactions of their constituent amino acids. It is apparent, however, that even when completely folded, not all conflicting interactions have been resolved, so the structure remains "locally frustrated". Over the last decades it has become clearer that this local frustration is not just a random accident but plays an essential part in the inner workings of protein molecules. We review here the physical origins of the frustration concept and the evidence that local frustration is important for protein physiology, protein-protein recognition, catalysis and allostery. Also, we highlight examples showing how alterations in the local frustration patterns can be linked to distinct pathologies. Finally, we explore the extensions of the impact of frustration in higher order levels of organization of systems including gene regulatory networks and the neural networks of the brain.
Submitted 6 February, 2025;
originally announced February 2025.
-
Inferring protein folding mechanisms from natural sequence diversity
Authors:
Ezequiel A. Galpern,
Ernesto A. Roman,
Diego U. Ferreiro
Abstract:
Protein sequences serve as a natural record of the evolutionary constraints that shape their functional structures. We show that it is possible to use only sequence information to go beyond predicting native structures and global stability to infer the folding mechanisms of globular proteins. The one- and two-body evolutionary energy fields at the amino-acid level are mapped to a coarse-grained description of folding, where proteins are divided into contiguous folding elements, commonly referred to as foldons. For 15 diverse protein families, we calculated the folding mechanisms of hundreds of proteins by simulating an Ising chain of foldons, with their energetics determined by the amino acid sequences. We show that protein topology imposes limits on the variability of folding cooperativity within a family. While most beta and alpha/beta structures exhibit only a few possible mechanisms despite high sequence diversity, alpha topologies allow for diverse folding scenarios among family members. We show that both the stability and cooperativity changes induced by mutations can be computed directly using sequence-based evolutionary models.
Submitted 24 June, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
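As a rough illustration of the kind of model named in the abstract above (an Ising chain of foldons), the following Python sketch computes an equilibrium folding curve by exact enumeration of all chain states. It is not the authors' implementation: the function names, the two-state (0 = unfolded, 1 = folded) foldon variable, the form of the free energy, and all parameter values are assumptions chosen for illustration, with energies in units where k_B = 1.

```python
from itertools import product
from math import exp


def state_free_energy(state, h, s, J, T):
    """Free energy of one chain configuration (hypothetical energy form).

    state: tuple of 0/1 flags, one per foldon (1 = folded).
    h, s:  per-foldon folding enthalpy and entropy change (assumed values).
    J:     coupling between neighbouring foldons, counted only when both are folded.
    """
    local = sum(x * (h_i - T * s_i) for x, h_i, s_i in zip(state, h, s))
    coupling = sum(J[i] * state[i] * state[i + 1] for i in range(len(state) - 1))
    return local + coupling


def folded_fraction(h, s, J, T):
    """Boltzmann-averaged fraction of folded foldons at temperature T."""
    n = len(h)
    partition = 0.0
    weighted = 0.0
    # Exact enumeration of the 2**n states; only practical for short chains.
    for state in product((0, 1), repeat=n):
        w = exp(-state_free_energy(state, h, s, J, T) / T)
        partition += w
        weighted += w * sum(state) / n
    return weighted / partition


# Illustrative parameters only: four identical foldons with favourable
# enthalpy, an entropic cost of folding, and attractive neighbour coupling.
h = [-1.0] * 4
s = [-1.0] * 4
J = [-0.5] * 3

curve = [(T, folded_fraction(h, s, J, T)) for T in (0.2, 0.5, 1.0, 2.0, 5.0)]
```

In a sequence-based application, the per-foldon and coupling terms would presumably be derived from the evolutionary energy field rather than set by hand, as they are here. Making the coupling `J` more negative relative to `h` should sharpen the transition in `curve`, which is one simple way to picture folding cooperativity.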
-
Reassessing the Exon-Foldon correspondence using Frustration Analysis
Authors:
Ezequiel A. Galpern,
Hana Jaafari,
Carlos Bueno,
Peter G. Wolynes,
Diego U. Ferreiro
Abstract:
Protein folding and evolution are intimately linked phenomena. Here, we revisit the concept of exons as potential protein folding modules across 38 abundant and conserved protein families. Taking advantage of genomic exon-intron organization and extensive protein sequence data, we explore exon boundary conservation and assess their foldon-like behavior using energy landscape theoretic measurements. We found deviations in exon size distribution from exponential decay indicating selection in evolution. We describe that there is a pronounced independent foldability of segments corresponding to conserved exons, supporting the exon-foldon correspondence. We further develop a systematic partitioning of protein domains using exon boundary hot spots, unveiling minimal common exons consisting of uninterrupted alpha and/or beta elements for the majority but not all of the studied families.
Submitted 4 January, 2024;
originally announced January 2024.
-
Solvent constraints for biopolymer folding and evolution in extraterrestrial environments
Authors:
Ignacio E. Sánchez,
Ezequiel A. Galpern,
Diego U. Ferreiro
Abstract:
We propose that spontaneous folding and molecular evolution of biopolymers are two universal aspects that must concur for life to happen. These aspects are fundamentally related to the chemical composition of biopolymers and crucially depend on the solvent in which they are embedded. We show that molecular information theory and energy landscape theory allow us to explore the limits that solvents impose on biopolymer existence. We consider 54 solvents, including water, alcohols, hydrocarbons, halogenated solvents, aromatic solvents, and low molecular weight substances made up of elements abundant in the universe, which may potentially take part in alternative biochemistries. We find that along with water, there are many solvents for which the liquid regime is compatible with biopolymer folding and evolution. We present a ranking of the solvents in terms of biopolymer compatibility. Many of these solvents have been found in molecular clouds or may be expected to occur in extrasolar planets.
Submitted 29 September, 2023;
originally announced October 2023.
-
Molecular information theory meets protein folding
Authors:
Ignacio E. Sánchez,
Ezequiel A. Galpern,
Martín M. Garibaldi,
Diego U. Ferreiro
Abstract:
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold, ~2.2 $\pm$ 0.3 bits/(site operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state, at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70%, but much higher than that of human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy and the energetics of protein folding.
Submitted 29 June, 2022;
originally announced June 2022.
-
From evolution to folding of repeat proteins
Authors:
Ezequiel A. Galpern,
Jacopo Marchi,
Thierry Mora,
Aleksandra M. Walczak,
Diego U. Ferreiro
Abstract:
Repeat proteins are made with tandem copies of similar amino acid stretches that fold into elongated architectures. Due to their symmetry, these proteins constitute excellent model systems to investigate how evolution relates to structure, folding and function. Here, we propose a scheme to map evolutionary information at the sequence level to a coarse-grained model for repeat-protein folding and use it to investigate the folding of thousands of repeat-proteins. We model the energetics by a combination of an inverse Potts model scheme with an explicit mechanistic model of duplications and deletions of repeats to calculate the evolutionary parameters of the system at single residue level. This is used to inform an Ising-like model that allows for the generation of folding curves, apparent domain emergence and occupation of intermediate states that are highly compatible with experimental data in specific case studies. We analyzed the folding of thousands of natural Ankyrin-repeat proteins and found that a multiplicity of folding mechanisms are possible. Fully cooperative all-or-none transitions are obtained for arrays with enough sequence-similar elements and strong interactions between them, while non-cooperative element-by-element intermittent folding arises if the elements are dissimilar and the interactions between them are energetically weak. In between, we characterised nucleation-propagation and multi-domain folding mechanisms. Finally, we showed that stability and cooperativity of a repeat-array can be quantitatively predicted from a simple energy score, paving the way for guiding protein folding design with a co-evolutionary model.
Submitted 24 February, 2022;
originally announced February 2022.
-
A structural model for the Coronavirus Nucleocapsid
Authors:
Federico Coscio,
Alejandro D. Nadra,
Diego U. Ferreiro
Abstract:
We propose a mesoscale model structure for the coronavirus nucleocapsid, assembled from the high resolution structures of the basic building blocks of the N-protein, CryoEM imaging and mathematical constraints for an overall quasi-spherical particle. The structure is a truncated octahedron that accommodates two layers: an outer shell composed of triangular and quadrangular lattices of the N-terminal domain and an inner shell of equivalent lattices of coiled parallel helices of the C-terminal domain. The model is consistent with the dimensions expected for packaging large viral genomes and provides a rationale to interpret the apparent pleomorphic nature of coronaviruses.
Submitted 25 May, 2020;
originally announced May 2020.
-
Size and structure of the sequence space of repeat proteins
Authors:
Jacopo Marchi,
Ezequiel A. Galpern,
Rocio Espada,
Diego U. Ferreiro,
Aleksandra M. Walczak,
Thierry Mora
Abstract:
The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family--the total number of sequences in that family--can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ~30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
Submitted 3 July, 2019; v1 submitted 11 May, 2019;
originally announced May 2019.
-
Localization of Energetic Frustration in Proteins
Authors:
A. Brenda Guzovsky,
Nicholas P. Schafer,
Peter G. Wolynes,
Diego U. Ferreiro
Abstract:
We present a detailed heuristic method to quantify the degree of local energetic frustration manifested by protein molecules. Current applications are realized in computational experiments where a protein structure is visualized highlighting the energetic conflicts or the concordance of the local interactions in that structure. Minimally frustrated linkages highlight the stable folding core of the molecule. Sites of high local frustration, in contrast, often indicate functionally relevant regions such as binding, active or allosteric sites.
Submitted 14 December, 2018;
originally announced December 2018.
-
On the Natural Structure of Amino Acid Patterns in Families of Protein Sequences
Authors:
Pablo Turjanski,
Diego U. Ferreiro
Abstract:
All known terrestrial proteins are coded as continuous strings of ~20 amino acids. The patterns formed by the repetitions of elements in groups of finite sequences describe the natural architectures of protein families. We present a method to search for patterns and groupings of patterns in protein sequences using a mathematically precise definition for 'repetition', an efficient algorithmic implementation and a robust scoring system with no adjustable parameters. We show that the sequence patterns can be well-separated into disjoint classes according to their recurrence in nested structures. The statistics of pattern occurrences indicate that short repetitions are enough to account for the differences between natural families and randomized groups by more than 10 standard deviations, while patterns shorter than 5 residues are effectively random. A small subset of patterns is sufficient to account for a robust ''familiarity'' definition of arbitrary sets of sequences.
Submitted 26 July, 2018;
originally announced July 2018.
-
Frustration, function and folding
Authors:
Diego U. Ferreiro,
Elizabeth A. Komives,
Peter G. Wolynes
Abstract:
Natural protein molecules are exceptional polymers. Encoded in apparently random strings of amino-acids, these objects perform clear physical tasks that are rare to find by simple chance. Accurate folding, specific binding, powerful catalysis, are examples of basic chemical activities that the great majority of polypeptides do not display, and are thought to be the outcome of the natural history of proteins. Function, a concept genuine to Biology, is at the core of evolution and often conflicts with the physical constraints. Locating the frustration between discrepant goals in a recurrent system leads to fundamental insights about the chances and necessities that shape the encoding of biological information.
Submitted 6 October, 2017;
originally announced October 2017.
-
Inferring repeat protein energetics from evolutionary information
Authors:
Rocío Espada,
R. Gonzalo Parra,
Thierry Mora,
Aleksandra M. Walczak,
Diego U. Ferreiro
Abstract:
Natural protein sequences contain a record of their history. A common constraint in a given protein family is the ability to fold to specific structures, and it has been shown possible to infer the main native ensemble by analyzing covariations in extant sequences. Still, many natural proteins that fold into the same structural topology show different stabilization energies, and these are often related to their physiological behavior. We propose a description for the energetic variation given by sequence modifications in repeat proteins, systems for which the overall problem is simplified by their inherent symmetry. We explicitly account for single amino acid and pair-wise interactions and treat higher order correlations with a single term. We show that the resulting force field can be interpreted with structural detail. We trace the variations in the energetic scores of natural proteins and relate them to their experimental characterization. The resulting energetic force field allows the prediction of the folding free energy change for several mutants, and can be used to generate synthetic sequences that are statistically indistinguishable from the natural counterparts.
Submitted 15 March, 2017; v1 submitted 9 March, 2017;
originally announced March 2017.
-
Protein Repeats from First Principles
Authors:
Pablo Turjanski,
R. Gonzalo Parra,
Rocío Espada,
Verónica Becher,
Diego U. Ferreiro
Abstract:
Some natural proteins display recurrent structural patterns. Despite being highly similar at the tertiary structure level, repetitions within a single repeat protein can be extremely variable at the sequence level. We propose a mathematical definition of a repeat and investigate the occurrences of these in different protein families. We found that long stretches of perfect repetitions are infrequent in individual natural proteins, even for those which are known to fold into structures of recurrent structural motifs. We found that natural repeat proteins are indeed repetitive in their families, exhibiting abundant stretches of 6 amino acids or longer that are perfect repetitions in the reference family. We provide a systematic quantification for this repetitiveness, and show that this form of repetitiveness is not exclusive to repeat proteins, but also occurs in globular domains. A by-product of this work is a fast classifier of proteins into families, which yields a likelihood value for a given protein belonging to a given family.
Submitted 8 October, 2015;
originally announced October 2015.
-
Amino acid metabolism conflicts with protein diversity
Authors:
Teresa Krick,
David A. Shub,
Nina Verstraete,
Diego U. Ferreiro,
Leonardo G. Alonso,
Michael Shub,
Ignacio E. Sanchez
Abstract:
The twenty protein coding amino acids are found in proteomes with different relative abundances. The most abundant amino acid, leucine, is nearly an order of magnitude more prevalent than the least abundant amino acid, cysteine. Amino acid metabolic costs differ similarly, constraining their incorporation into proteins. On the other hand, sequence diversity is necessary for protein folding, function and evolution. Here we present a simple model for a cost-diversity trade-off postulating that natural proteomes minimize amino acid metabolic flux while maximizing sequence entropy. The model explains the relative abundances of amino acids across a diverse set of proteomes. We found that the data is remarkably well explained when the cost function accounts for amino acid chemical decay. More than one hundred proteomes reach comparable solutions to the trade-off by different combinations of cost and diversity. Quantifying the interplay between proteome size and entropy shows that proteomes can get optimally large and diverse.
Submitted 19 March, 2014; v1 submitted 13 March, 2014;
originally announced March 2014.
-
Frustration in Biomolecules
Authors:
Diego U. Ferreiro,
Elizabeth A. Komives,
Peter G. Wolynes
Abstract:
Biomolecules are the prime information processing elements of living matter. Most of these inanimate systems are polymers that compute their structures and dynamics using as input seemingly random character strings of their sequence, following which they coalesce and perform integrated cellular functions. In large computational systems with finite interaction codes, the appearance of conflicting goals is inevitable. Simple conflicting forces can lead to quite complex structures and behaviors, leading to the concept of "frustration" in condensed matter. We present here some basic ideas about frustration in biomolecules and how the frustration concept leads to a better appreciation of many aspects of the architecture of biomolecules, and how structure connects to function. These ideas are simultaneously both seductively simple and perilously subtle to grasp completely. The energy landscape theory of protein folding provides a framework for quantifying frustration in large systems and has been implemented at many levels of description. We first review the notion of frustration from the areas of abstract logic and its uses in simple condensed matter systems. We then discuss how the frustration concept applies specifically to heteropolymers, testing folding landscape theory in computer simulations of protein models and in experimentally accessible systems. Studying the aspects of frustration averaged over many proteins provides ways to infer energy functions useful for reliable structure prediction. We discuss how frustration affects folding, how a large part of the biological functions of proteins are related to subtle local frustration effects and how frustration influences the appearance of metastable states, the nature of binding processes, catalysis and allosteric transitions. We hope to illustrate how frustration is a fundamental concept in relating function to structural biology.
Submitted 3 December, 2013;
originally announced December 2013.
-
Detecting Repetitions and Periodicities in Proteins by Tiling the Structural Space
Authors:
R. Gonzalo Parra,
Rocío Espada,
Ignacio E. Sánchez,
Manfred J. Sippl,
Diego U. Ferreiro
Abstract:
The notion of energy landscapes provides conceptual tools for understanding the complexities of protein folding and function. Energy Landscape Theory indicates that it is much easier to find sequences that satisfy the "Principle of Minimal Frustration" when the folded structure is symmetric (Wolynes, P. G. Symmetry and the Energy Landscapes of Biomolecules. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 14249-14255). Similarly, repeats and structural mosaics may be fundamentally related to landscapes with multiple embedded funnels. Here we present analytical tools to detect and compare structural repetitions in protein molecules. By an exhaustive analysis of the distribution of structural repeats using a robust metric we define those portions of a protein molecule that best describe the overall structure as a tessellation of basic units. The patterns produced by such tessellations provide intuitive representations of the repeating regions and their association towards higher order arrangements. We find that some protein architectures can be described as nearly periodic, while in others clear separations between repetitions exist. Since the method is independent of amino acid sequence information we can identify structural units that can be encoded by a variety of distinct amino acid sequences.
Submitted 12 June, 2013;
originally announced June 2013.