-
Two fitness inference schemes compared using allele frequencies from 1,068,391 sequences sampled in the UK during the COVID-19 pandemic
Authors:
Hong-Li Zeng,
Cheng-Long Yang,
Bo Jing,
John Barton,
Erik Aurell
Abstract:
Throughout the course of the SARS-CoV-2 pandemic, genetic variation has contributed to the spread and persistence of the virus. For example, various mutations have allowed SARS-CoV-2 to escape antibody neutralization or to bind more strongly to the receptors that it uses to enter human cells. Here, we compared two methods that estimate the fitness effects of viral mutations using the abundant sequence data gathered over the course of the pandemic. Both approaches are grounded in population genetics theory but with different assumptions. One approach, tQLE, features an epistatic fitness landscape and assumes that alleles are nearly in linkage equilibrium. Another approach, MPL, assumes a simple, additive fitness landscape, but allows for any level of correlation between alleles. We characterized differences in the distributions of fitness values inferred by each approach and in the ranks of fitness values that they assign to sequences across time. We find that in a large fraction of weeks the two methods are in good agreement as to their top-ranked sequences, i.e., as to which sequences observed that week are most fit. We also find that agreement between ranking of sequences varies with genetic unimodality in the population in a given week.
Submitted 21 March, 2024;
originally announced March 2024.
-
Temporal epistasis inference from more than 3,500,000 SARS-CoV-2 Genomic Sequences
Authors:
Hong-Li Zeng,
Yue Liu,
Vito Dichio,
Erik Aurell
Abstract:
We use Direct Coupling Analysis (DCA) to determine epistatic interactions between loci of variability of the SARS-CoV-2 virus, segmenting genomes by month of sampling. We use full-length, high-quality genomes from the GISAID repository up to October 2021, in total over 3,500,000 genomes. We find that DCA terms are more stable over time than correlations, but nevertheless change over time as mutations disappear from the global population or reach fixation. Correlations are enriched for phylogenetic effects, and in particular for statistical dependencies at short genomic distances, while DCA brings out links at longer genomic distances. We discuss the validity of a DCA analysis under these conditions in terms of a transient Quasi-Linkage Equilibrium state. We identify putative epistatically interacting mutations involving loci in Spike.
Submitted 5 June, 2022; v1 submitted 24 December, 2021;
originally announced December 2021.
-
Mutation frequency time series reveal complex mixtures of clones in the world-wide SARS-CoV-2 viral population
Authors:
Hong-Li Zeng,
Yue Liu,
Vito Dichio,
Kaisa Thorell,
Rickard Nordén,
Erik Aurell
Abstract:
We compute the allele frequencies of the alpha (B.1.1.7), beta (B.1.351) and delta (B.1.617.2) variants of SARS-CoV-2 from almost two million genome sequences on the GISAID repository. We find that the frequencies of a majority of the defining mutations in alpha rose towards the end of 2020 but drifted apart during spring 2021, a similar pattern being followed by delta during summer of 2021. For beta we find a more complex scenario with frequencies of some mutations rising and some remaining close to zero. Our results indicate that what is generally reported as a single variant is in fact a collection of variants with different genetic characteristics. For all three variants we further find some alleles with a clearly deviating time series.
Submitted 7 September, 2021;
originally announced September 2021.
-
Statistical Genetics in and out of Quasi-Linkage Equilibrium (Extended)
Authors:
Vito Dichio,
Hong-Li Zeng,
Erik Aurell
Abstract:
This review is about statistical genetics, an interdisciplinary topic between statistical physics and population biology. The focus is on the phase of quasi-linkage equilibrium (QLE). Our goals here are to clarify under which conditions the QLE phase can be expected to hold in population biology and how the stability of the QLE phase is lost. The QLE state, which has many similarities to a thermal equilibrium state in statistical mechanics, was discovered by M. Kimura for a two-locus two-allele model, and was extended and generalized to the global genome scale by Neher and Shraiman (2011). What we will refer to as the Kimura-Neher-Shraiman (KNS) theory describes a population evolving due to mutation, recombination, natural selection and possibly genetic drift. A QLE phase exists at sufficiently high recombination and/or mutation rates with respect to selection strength. We show how in QLE it is possible to infer the epistatic parameters of the fitness function from the knowledge of the (dynamical) distribution of genotypes in a population. We further consider the breakdown of the QLE regime for high enough selection strength. We review recent results for the selection-mutation and selection-recombination dynamics. Finally, we identify and characterize a new phase, which we call the non-random coexistence (NRC), where variability persists in the population without either fixating or disappearing.
Submitted 3 February, 2023; v1 submitted 4 May, 2021;
originally announced May 2021.
-
Global analysis of more than 50,000 SARS-Cov-2 genomes reveals epistasis between 8 viral genes
Authors:
Hong-Li Zeng,
Vito Dichio,
Edwin Rodríguez Horta,
Kaisa Thorell,
Erik Aurell
Abstract:
Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to a deeper understanding of microbial pathogenesis. We have considered all complete SARS-CoV-2 genomes deposited in the GISAID repository until four different cut-off dates, and used Direct Coupling Analysis together with an assumption of Quasi-Linkage Equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three are between pairs where one locus lies in gene ORF3a, both loci holding non-synonymous mutations. We also find interactions between two loci in gene nsp13, both holding non-synonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether we infer interactions between loci in viral genes ORF3a and nsp2, nsp12 and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13 and nsp14. The paper opens the prospect of using prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.
Submitted 2 October, 2020;
originally announced October 2020.
-
Inferring epistasis from genomic data with comparable mutation and outcrossing rate
Authors:
Hong-Li Zeng,
Eugenio Mauri,
Vito Dichio,
Simona Cocco,
Remi Monasson,
Erik Aurell
Abstract:
We consider a population evolving due to mutation, selection and recombination, where selection includes single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). We further consider the problem of inferring fitness in the evolutionary dynamics from one or several snapshots of the distribution of genotypes in the population. In the recent literature this has been done by applying the Quasi-Linkage Equilibrium (QLE) regime first obtained by Kimura in the limit of high recombination. Here we show that the approach also works in the interesting regime where the effects of mutations are comparable to or larger than recombination. This leads to a modified main epistatic fitness inference formula where the rates of mutation and recombination occur together. We also derive this formula using a previously developed Gaussian closure that formally remains valid when recombination is absent. The findings are validated through numerical simulations.
Submitted 4 May, 2021; v1 submitted 30 June, 2020;
originally announced June 2020.
-
Inferring genetic fitness from genomic data
Authors:
Hong-Li Zeng,
Erik Aurell
Abstract:
The genetic composition of a naturally developing population is considered as due to mutation, selection, genetic drift and recombination. Selection is modeled as single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). The problem is posed to infer epistatic fitness from population-wide whole-genome data from a time series of a developing population. We generate such data in silico, and show that in the Quasi-Linkage Equilibrium (QLE) phase of Kimura, Neher and Shraiman, that pertains at high enough recombination rates and low enough mutation rates, epistatic fitness can be quantitatively correctly inferred using inverse Ising/Potts methods.
Submitted 7 January, 2020;
originally announced January 2020.
-
DCA for genome-wide epistasis analysis: the statistical genetics perspective
Authors:
Chen-Yi Gao,
Fabio Cecconi,
Angelo Vulpiani,
Hai-Jun Zhou,
Erik Aurell
Abstract:
Direct Coupling Analysis (DCA) is a now widely used method to leverage statistical information from many similar biological systems to draw meaningful conclusions on each system separately. DCA has been applied with great success to sequences of homologous proteins, and also more recently to whole-genome population-wide sequencing data. We here argue that the use of DCA on the genome scale is contingent on fundamental issues of population genetics. DCA can be expected to yield meaningful results when a population is in the Quasi-Linkage Equilibrium (QLE) phase studied by Kimura and others, but not, for instance, in a phase of Clonal Competition. We discuss how the exponential (Potts model) distributions emerge in QLE, and compare couplings to correlations obtained in a study of about 3,000 genomes of the human pathogen Streptococcus pneumoniae.
Submitted 10 August, 2018;
originally announced August 2018.
-
Correlation-Compressed Direct Coupling Analysis
Authors:
Chen-Yi Gao,
Hai-Jun Zhou,
Erik Aurell
Abstract:
Learning Ising or Potts models from data has become an important topic in statistical physics and computational biology, with applications to predictions of structural contacts in proteins and other areas of biological data analysis. The corresponding inference problems are challenging since the normalization constant (partition function) of the Ising/Potts distributions cannot be computed efficiently on large instances. Different ways to address this issue have hence given rise to a substantial methodological literature. In this paper we investigate how these methods could be used on much larger datasets than studied previously. We focus on a central aspect, that in practice these inference problems are almost always severely under-sampled, and the operational result is almost always a small set of leading (largest) predictions. We therefore explore an approach where the data is pre-filtered based on empirical correlations, which can be computed directly even for very large problems. Inference is only used on the much smaller instance in a subsequent step of the analysis. We show that in several relevant model classes such a combined approach gives results of almost the same quality as the computationally much more demanding inference on the whole dataset. We also show that results on whole-genome epistatic couplings that were obtained in a recent computation-intensive study can be retrieved by the new approach. The method of this paper hence opens up the possibility to learn parameters describing pair-wise dependencies in whole genomes in a computationally feasible and expedient manner.
Submitted 13 October, 2017;
originally announced October 2017.
-
Steady diffusion in a drift field: a comparison of large deviation techniques and multiple-scale analysis
Authors:
Erik Aurell,
Stefano Bo
Abstract:
A particle with internal unobserved states diffusing in a force field will generally display effective advection-diffusion. The drift velocity is proportional to the mobility averaged over the internal states, or effective mobility, while the effective diffusion has two terms. One is of the equilibrium type and satisfies an Einstein relation with the effective mobility while the other is quadratic in the applied force. In this contribution we present two new methods to obtain these results, on the one hand using large deviation techniques, and on the other by a multiple-scale analysis, and compare the two. We consider both systems with discrete internal states and continuous internal states. We show that the auxiliary equations in the multiple-scale analysis can also be derived in second-order perturbation theory in a large deviation theory of a generating function (discrete internal states) or generating functional (continuous internal states). We discuss how measuring the two components of the effective diffusion gives a way to determine kinetic rates from only the first and second moments of the displacement in steady state.
Submitted 12 October, 2017; v1 submitted 2 June, 2017;
originally announced June 2017.
-
An observation of circular RNAs in bacterial RNA-seq data
Authors:
Nicolas Innocenti,
Hoang-Son Nguyen,
Aymeric Fouquier d'Hérouël,
Erik Aurell
Abstract:
Circular RNAs (circRNAs) are a class of RNA with an important role in microRNA (miRNA) regulation, recently discovered in humans and various other eukaryotes as well as in archaea. Here, we have analyzed RNA-seq data obtained from Enterococcus faecalis and Escherichia coli in a way similar to previous studies performed on eukaryotes. We report observations of circRNAs in RNA-seq data that are reproducible across multiple experiments performed with different protocols or growth conditions.
Submitted 14 June, 2016;
originally announced June 2016.
-
The Bulk and The Tail of Minimal Absent Words in Genome Sequences
Authors:
Erik Aurell,
Nicolas Innocenti,
Hai-Jun Zhou
Abstract:
Minimal absent words (MAWs) of a genomic sequence are subsequences that are absent themselves but whose subwords are all present in the sequence. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the novel concept of the cores of a minimal absent word, which are sequences present in the genome and closest to a given MAW. We show that in bacteria and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs (rRNAs). We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
Submitted 17 September, 2015;
originally announced September 2015.
-
Whole genome mapping of 5' RNA ends in bacteria by tagged sequencing : A comprehensive view in Enterococcus faecalis
Authors:
Nicolas Innocenti,
Monica Golumbeanu,
Aymeric Fouquier d'Hérouël,
Caroline Lacoux,
Rémy A. Bonnin,
Sean P. Kennedy,
Françoise Wessner,
Pascale Serror,
Philippe Bouloc,
Francis Repoila,
Erik Aurell
Abstract:
Enterococcus faecalis is the third leading cause of nosocomial infections. To obtain the first comprehensive view of transcriptional organization in this bacterium, we used a modified RNA-seq approach that discriminates primary from processed 5' RNA ends. We also validated our approach by confirming known features in Escherichia coli.
We mapped 559 transcription start sites (TSSs) and 352 processing sites in E. faecalis. A blind motif search retrieved canonical features of SigA- and SigN-dependent promoters preceding the mapped TSSs. We discovered 95 novel putative regulatory RNAs, small and antisense RNAs, and 72 transcriptional antisense organisations.
Presented data constitute a significant insight into bacterial RNA landscapes and a step towards the inference of regulatory processes at transcriptional and post-transcriptional levels in a comprehensive manner.
Submitted 7 October, 2014;
originally announced October 2014.
-
SEK: Sparsity exploiting $k$-mer-based estimation of bacterial community composition
Authors:
Saikat Chatterjee,
David Koslicki,
Siyuan Dong,
Nicolas Innocenti,
Lu Cheng,
Yueheng Lan,
Mikko Vehkaperä,
Mikael Skoglund,
Lars K. Rasmussen,
Erik Aurell,
Jukka Corander
Abstract:
Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. Since the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically very time consuming in a desktop computing environment.
Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm for a fast solution. Our approach offers a reasonably fast community composition estimation method which is shown to be more robust to input data variation than a recently introduced related method.
Availability: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above website.
Submitted 1 July, 2014;
originally announced July 2014.
-
Improving contact prediction along three dimensions
Authors:
Christoph Feinauer,
Marcin J. Skwark,
Andrea Pagnani,
Erik Aurell
Abstract:
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to: (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
Submitted 5 March, 2014; v1 submitted 3 March, 2014;
originally announced March 2014.
-
Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences
Authors:
Magnus Ekeberg,
Tuomo Hartonen,
Erik Aurell
Abstract:
Direct-Coupling Analysis is a group of methods to harvest information about coevolving residues in a protein family by learning a generative model in an exponential family from data. In protein families of realistic size, this learning can only be done approximately, and there is a trade-off between inference precision and computational speed. We here show that an earlier introduced $l_2$-regularized pseudolikelihood maximization method called plmDCA can be modified so as to be easily parallelizable, as well as inherently faster on a single processor, at negligible difference in accuracy. We test the new incarnation of the method on 148 protein families from the Protein Families database (PFAM), one of the largest tests of this class of algorithms to date.
Submitted 20 January, 2014;
originally announced January 2014.
-
Lognormality and oscillations in the coverage of high-throughput transcriptomic data towards gene ends
Authors:
Nicolas Innocenti,
Erik Aurell
Abstract:
High-throughput transcriptomics experiments have reached the stage where the count of the number of reads alignable to a given position can be treated as an almost-continuous signal. This allows one to ask questions of a biophysical/biotechnical nature, but which may still have biological implications. Here we show that when sequencing RNA fragments from one end, as is the case on most platforms, an oscillation in the read count is observed at the other end. We further show that these oscillations can be well described by Kolmogorov's 1941 broken stick model. We investigate how the model can be used to improve predictions of gene ends (3' transcript ends) but conclude that with present data the improvement is only marginal. The results highlight subtle effects in high-throughput transcriptomics experiments which do not have a biological origin, but which may still be used to obtain biological information.
Submitted 28 August, 2013; v1 submitted 18 March, 2013;
originally announced March 2013.
-
Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models
Authors:
Magnus Ekeberg,
Cecilia Lövkvist,
Yueheng Lan,
Martin Weigt,
Erik Aurell
Abstract:
Spatially proximate amino acids in a protein tend to coevolve. A protein's three-dimensional (3D) structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.
Submitted 12 January, 2013; v1 submitted 6 November, 2012;
originally announced November 2012.
-
Quasi-potential landscape in complex multi-stable systems
Authors:
Joseph Xu Zhou,
M. D. S. Aliyu,
Erik Aurell,
Sui Huang
Abstract:
The developmental dynamics of multicellular organisms is a process that takes place in a multi-stable system in which each attractor state represents a cell type and attractor transitions correspond to cell differentiation paths. This new understanding has revived the idea of a quasi-potential landscape, first proposed by Waddington as a metaphor. To describe development one is interested in the "relative stabilities" of N attractors (N>2). Existing theories of state transition between local minima on some potential landscape deal with the exit in the transition between a pair of attractors but do not offer the notion of a global potential function that relates more than two attractors to each other. Several ad hoc methods have been used in systems biology to compute a landscape in non-gradient systems, such as gene regulatory networks. Here we present an overview of the currently available methods, discuss their limitations and propose a new decomposition of vector fields that permits the computation of a quasi-potential function that is equivalent to the Freidlin-Wentzell potential but is not limited to two attractors. Several examples of decomposition are given and the significance of such a quasi-potential function is discussed.
Submitted 11 June, 2012;
originally announced June 2012.
-
Statistical physics of pairwise probability models
Authors:
Yasser Roudi,
Erik Aurell,
John Hertz
Abstract:
Statistical models for describing the probability distribution over the states of biological systems are commonly used for dimensional reduction. Among these models, pairwise models are very attractive in part because they can be fit using a reasonable amount of data: knowledge of the means and correlations between pairs of elements in the system is sufficient. Not surprisingly, then, using pairwise models for studying neural data has been the focus of many studies in recent years. In this paper, we describe how tools from statistical physics can be employed for studying and using pairwise models. We build on our previous work on the subject and study the relation between different methods for fitting these models and evaluating their quality. In particular, using data from simulated cortical networks we study how the quality of various approximate methods for inferring the parameters in a pairwise model depends on the time bin chosen for binning the data. We also study the effect of the size of the time bin on the model quality itself, again using simulated data. We show that using finer time bins increases the quality of the pairwise model. We offer new ways of deriving the expressions reported in our previous work for assessing the quality of pairwise models.
Submitted 9 May, 2009;
originally announced May 2009.
-
A computational systems biology study of the lambda-lac mutants
Authors:
Maria Werner,
Erik Aurell
Abstract:
We present a comprehensive computational study of some 900 possible "lambda-lac" mutants of the lysogeny maintenance switch in phage lambda, of which 19 have to date been studied experimentally (Atsumi & Little, PNAS 103: 4558-4563, (2006)). We clarify that these mutants realise regulatory schemes quite different from wild-type lambda, and can therefore be expected to behave differently, within the conventional mechanistic setting in which this problem has often been framed. We verify that indeed, within this framework, across this wide selection of mutants the lambda-lac mutants for the most part either have no stable lytic states, or should only be inducible with difficulty. In particular, the computational results contradict the experimental finding that four lambda-lac mutants both show stable lysogeny and are inducible. This work hence suggests either that the four out of 900 mutants are special, or that lambda lysogeny and inducibility are holistic effects involving other molecular players or other mechanisms, or both. The approach illustrates the power and versatility of computational systems biology to systematically and quickly test a wide variety of examples and alternative hypotheses for future closer experimental studies.
Submitted 21 December, 2007;
originally announced December 2007.
-
Cooperative action in eukaryotic gene regulation: physical properties of a viral example
Authors:
Maria Werner,
LiZhe Zhu,
Erik Aurell
Abstract:
The Epstein-Barr virus (EBV) infects more than 90% of the human population, and is the cause of several diseases, both serious and mild. It is a tumor virus, and has been widely studied as a model system for gene (de)regulation in humans. A central feature of the EBV life cycle is its ability to persist in human B cells in states denoted latency I, II and III. In latency III the host cell is driven to cell proliferation and hence expansion of the viral population, but does not enter the lytic pathway, and no new virions are produced, while the latency I state is almost completely dormant. In this paper we study a physico-chemical model of the switch between latency I and latency III in EBV. We show that the unusually large number of binding sites of two competing transcription factors, one viral and one from the host, serves to make the switch sharper (higher Hill coefficient), either by cooperative binding between molecules of the same species when they bind, or by competition between the two species if there is sufficient steric hindrance.
Submitted 13 June, 2007;
originally announced June 2007.
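To make the link between the Hill coefficient and switch sharpness in the abstract above concrete, the sketch below uses the textbook Hill function rather than the physico-chemical model of the paper. The function names and the parameters K (half-saturation concentration) and n (Hill coefficient) are illustrative assumptions.

```python
def hill_occupancy(x, K, n):
    """Textbook Hill function: fraction of maximal response at concentration x."""
    return x**n / (K**n + x**n)

def switch_width(n):
    """Fold-change in concentration needed to go from 10% to 90% response.

    Setting the Hill function to 0.1 and 0.9 gives (x/K)**n = 1/9 and 9,
    so the ratio of the two concentrations is 81**(1/n).
    """
    return 81.0 ** (1.0 / n)

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"n = {n}: 10%-90% fold-change = {switch_width(n):.2f}")
```

Under this simplified picture, a non-cooperative response (n = 1) needs an 81-fold change in concentration to go from 10% to 90%, while n = 4 needs only a 3-fold change, which is the sense in which a higher Hill coefficient means a sharper switch.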
-
Noise-filtering features of transcription regulation in the yeast S. cerevisiae
Authors:
Erik Aurell,
Aymeric Fouquier d'Herouel,
Claes Malmnas,
Massimo Vergassola
Abstract:
Transcription regulation is largely governed by the profile and the dynamics of transcription factors' binding to DNA. Stochastic effects are intrinsic to these dynamics, and the binding to functional sites must be controlled with a certain specificity for living organisms to be able to elicit specific cellular responses. Specificity stems here from the interplay between binding affinity and cellular abundance of transcription factor proteins, and the binding of such proteins to DNA is thus controlled by their chemical potential.
We combine large-scale protein abundance data in the budding yeast with binding affinities for all transcription factors with known DNA binding site sequences to assess the behavior of their chemical potentials. A sizable fraction of transcription factors is apparently bound non-specifically to DNA and the observed abundances are marginally sufficient to ensure high occupations of the functional sites. We argue that a biological cause of this feature is related to its noise-filtering consequences: abundances below physiological levels do not yield significant binding of functional targets and mis-expressions of regulated genes are thus tamed.
Submitted 29 March, 2007;
originally announced March 2007.
-
Epigenetics as a first exit problem
Authors:
E. Aurell,
K. Sneppen
Abstract:
We develop a framework to discuss stability of epigenetic states as first exit problems in dynamical systems with noise. We consider in particular the stability of the lysogenic state of the lambda prophage, which is known to exhibit exceptionally large stability. The formalism defines a quantitative measure of robustness of inherited states.
In contrast to Kramers' well-known problem of escape from a potential well, the stability of inherited states in our formulation is not a numerically trivial problem. The most likely exit path does not go along a steepest descent of a potential -- there is no potential. Instead, such a path can be described as a zero-energy trajectory between two equilibria in an auxiliary classical mechanical system. Finding it is similar to e.g. computing heteroclinic orbits in celestial mechanics. The overall lesson of this study is that an examination of equilibria and their bifurcations with changing parameter values allows us to quantify both the stability and the robustness of particular states of a genetic control system.
Submitted 5 March, 2001;
originally announced March 2001.
-
Stability Puzzles in Phage Lambda
Authors:
Erik Aurell,
Stanley Brown,
Johan Johanson,
Kim Sneppen
Abstract:
The lysogeny maintenance switch in phage lambda is one of the simplest examples on the molecular level of computation, command and control in a living system. If, following infection of the bacterium E. coli, the virus enters the lysogenic pathway, it represses its developmental functions, and integrates its DNA into the host chromosome. In this state the prophage may be passively replicated for many generations of E. coli. In fact, this repressed state is intrinsically more stable than the gene encoding the repressor. We develop a mathematical formalism to predict the stability of such epigenetic states from affinities of the molecular components. We apply the model to the behavior of recently published mutants at the right operator complex of lambda, and find that the reported stability indicates that the current view of the switch is incomplete. The approach described here should be generally applicable to the stability of expressed states.
Submitted 19 October, 2000;
originally announced October 2000.