-
Folding@home: achievements from over twenty years of citizen science herald the exascale era
Authors:
Vincent A. Voelz,
Vijay S. Pande,
Gregory R. Bowman
Abstract:
Simulations of biomolecules have enormous potential to inform our understanding of biology but require extremely demanding calculations. For over twenty years, the Folding@home distributed computing project has pioneered a massively parallel approach to biomolecular simulation, harnessing the resources of citizen scientists across the globe. Here, we summarize the scientific and technical advances this perspective has enabled. As the project's name implies, the early years of Folding@home focused on driving advances in our understanding of protein folding by developing statistical methods for capturing long-timescale processes and facilitating insight into complex dynamical processes. Success laid a foundation for broadening the scope of Folding@home to address other functionally relevant conformational changes, such as receptor signaling, enzyme dynamics, and ligand binding. Continued algorithmic advances, hardware developments such as GPU-based computing, and the growing scale of Folding@home have enabled the project to focus on new areas where massively parallel sampling can be impactful. While previous work sought to expand toward larger proteins with slower conformational changes, new work focuses on large-scale comparative studies of different protein sequences and chemical compounds to better understand biology and inform the development of small molecule drugs. Progress on these fronts enabled the community to pivot quickly in response to the COVID-19 pandemic, expanding to become the world's first exascale computer and deploying this massive resource to provide insight into the inner workings of the SARS-CoV-2 virus and aid the development of new antivirals. This success provides a glimpse of what's to come as exascale supercomputers come online, and Folding@home continues its work.
Submitted 15 March, 2023;
originally announced March 2023.
-
Predicting Gene Expression Between Species with Neural Networks
Authors:
Peter Eastman,
Vijay S. Pande
Abstract:
We train a neural network to predict human gene expression levels based on experimental data for rat cells. The network is trained with paired human/rat samples from the Open TG-GATES database, where paired samples were treated with the same compound at the same dose. When evaluated on a test set of held out compounds, the network successfully predicts human expression levels. On the majority of the test compounds, the list of differentially expressed genes determined from predicted expression levels agrees well with the list of differentially expressed genes determined from actual human experimental data.
Submitted 5 July, 2019;
originally announced July 2019.
-
Predicting Toxicity from Gene Expression with Neural Networks
Authors:
Peter Eastman,
Vijay S. Pande
Abstract:
We train a neural network to predict chemical toxicity based on gene expression data. The input to the network is a full expression profile collected either in vitro from cultured cells or in vivo from live animals. The output is a set of fine grained predictions for the presence of a variety of pathological effects in treated animals. When trained on the Open TG-GATEs database it produces good results, outperforming classical models trained on the same data. This is a promising approach for efficiently screening chemicals for toxic effects, and for more accurately evaluating drug candidates based on preclinical data.
Submitted 31 January, 2019;
originally announced February 2019.
-
Binding Pathway of Opiates to $μ$ Opioid Receptors Revealed by Unsupervised Machine Learning
Authors:
Amir Barati Farimani,
Evan N. Feinberg,
Vijay S. Pande
Abstract:
Many important analgesics relieve pain by binding to the $μ$-Opioid Receptor ($μ$OR), which makes the $μ$OR among the most clinically relevant proteins of the G Protein Coupled Receptor (GPCR) family. Despite previous studies on the activation pathways of the GPCRs, the mechanism of opiate binding and the selectivity of $μ$OR are largely unknown. We performed extensive molecular dynamics (MD) simulation and analysis to find the selective allosteric binding sites of the $μ$OR and the path opiates take to bind to the orthosteric site. In this study, we predicted that the allosteric site is responsible for the attraction and selection of opiates. Using Markov state models and machine learning, we traced the pathway of opiates in binding to the orthosteric site, the main binding pocket. Our results have important implications in designing novel analgesics.
Submitted 22 April, 2018;
originally announced April 2018.
-
Machine Learning Harnesses Molecular Dynamics to Discover New $μ$ Opioid Chemotypes
Authors:
Evan N. Feinberg,
Amir Barati Farimani,
Rajendra Uprety,
Amanda Hunkele,
Gavril W. Pasternak,
Susruta Majumdar,
Vijay S. Pande
Abstract:
Computational chemists typically assay drug candidates by virtually screening compounds against crystal structures of a protein despite the fact that some targets, like the $μ$ Opioid Receptor and other members of the GPCR family, traverse many non-crystallographic states. We discover new conformational states of $μ$OR with molecular dynamics simulation and then machine learn ligand-structure relationships to predict opioid ligand function. These artificial intelligence models identified a novel $μ$ opioid chemotype.
Submitted 12 March, 2018;
originally announced March 2018.
-
SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Authors:
Jade Shi,
Rhiju Das,
Vijay S. Pande
Abstract:
Solving the RNA inverse folding problem is a critical prerequisite to RNA design, an emerging field in bioengineering with a broad range of applications from reaction catalysis to cancer therapy. Although significant progress has been made in developing machine-based inverse RNA folding algorithms, current approaches still have difficulty designing sequences for large or complex targets. On the other hand, human players of the online RNA design game EteRNA have consistently shown superior performance in this regard, being able to readily design sequences for targets that are challenging for machine algorithms. Here we present a novel approach to the RNA design problem, SentRNA, a design agent consisting of a fully-connected neural network trained end-to-end using human-designed RNA sequences. We show that through this approach, SentRNA can solve complex targets previously unsolvable by any machine-based approach and achieve state-of-the-art performance on two separate challenging test sets. Our results demonstrate that incorporating human design strategies into a design algorithm can significantly boost machine performance and suggest a new paradigm for machine-based RNA design.
Submitted 5 March, 2019; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Using Deep Learning for Segmentation and Counting within Microscopy Data
Authors:
Carlos X. Hernández,
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Cell counting is a ubiquitous, yet tedious task that would greatly benefit from automation. From basic biological questions to clinical trials, cell counts provide key quantitative feedback that drives research. Unfortunately, cell counting is most commonly a manual task and can be time-intensive. The task is made even more difficult due to overlapping cells, existence of multiple focal planes, and poor imaging quality, among other factors. Here, we describe a convolutional neural network approach, using a recently described feature pyramid network combined with a VGG-style neural network, for segmenting and subsequent counting of cells in a given microscopy image.
Submitted 28 February, 2018;
originally announced February 2018.
-
Automated design of collective variables using supervised machine learning
Authors:
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Selection of appropriate collective variables for enhancing sampling of molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is challenging in higher dimensions. Which atomic coordinates or transforms thereof from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two state systems and only increases in difficulty for multi-state systems. In this work, we solve the initial CV problem using a data-driven approach inspired by the field of supervised machine learning. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs (SML_cv) for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines' decision hyperplane, the output probability estimates from Logistic Regression, the outputs from deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.
Submitted 13 May, 2018; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Transferable neural networks for enhanced sampling of protein dynamics
Authors:
Mohammad M. Sultan,
Hannah K. Wayment-Steele,
Vijay S. Pande
Abstract:
Variational auto-encoder frameworks have demonstrated success in reducing complex nonlinear dynamics in molecular simulation to a single non-linear embedding. In this work, we illustrate how this non-linear latent embedding can be used as a collective variable for enhanced sampling, and present a simple modification that allows us to rapidly perform sampling in multiple related systems. We first demonstrate our method is able to describe the effects of force field changes in capped alanine dipeptide after learning a model using AMBER99. We further provide a simple extension to variational dynamics encoders that allows the model to be trained in a more efficient manner on larger systems by encoding the outputs of a linear transformation using time-structure based independent component analysis (tICA). Using this technique, we show how such a model trained for one protein, the WW domain, can efficiently be transferred to perform enhanced sampling on a related mutant protein, the GTT mutation. This method shows promise for its ability to rapidly sample related systems using a single transferable collective variable and is generally applicable to sets of related simulations, enabling us to probe the effects of variation in increasingly large systems of biophysical interest.
Submitted 2 January, 2018;
originally announced January 2018.
-
Unsupervised learning of dynamical and molecular similarity using variance minimization
Authors:
Brooke E. Husic,
Vijay S. Pande
Abstract:
In this report, we present an unsupervised machine learning method for determining groups of molecular systems according to similarity in their dynamics or structures using Ward's minimum variance objective function. We first apply the minimum variance clustering to a set of simulated tripeptides using the information theoretic Jensen-Shannon divergence between Markovian transition matrices in order to gain insight into how point mutations affect protein dynamics. Then, we extend the method to partition two chemoinformatic datasets according to structural similarity to motivate a train/validation/test split for supervised learning that avoids overfitting.
Submitted 20 December, 2017;
originally announced December 2017.
-
Variational Encoding of Complex Dynamics
Authors:
Carlos X. Hernández,
Hannah K. Wayment-Steele,
Mohammad M. Sultan,
Brooke E. Husic,
Vijay S. Pande
Abstract:
Often the analysis of time-dependent chemical and biophysical systems produces high-dimensional time-series data for which it can be difficult to interpret which individual features are most salient. While recent work from our group and others has demonstrated the utility of time-lagged co-variate models to study such systems, linearity assumptions can limit the compression of inherently nonlinear dynamics into just a few characteristic components. Recent work in the field of deep learning has led to the development of variational autoencoders (VAE), which are able to compress complex datasets into simpler manifolds. We present the use of a time-lagged VAE, or variational dynamics encoder (VDE), to reduce complex, nonlinear processes to a single embedding with high fidelity to the underlying dynamics. We demonstrate how the VDE is able to capture nontrivial dynamics in a variety of examples, including Brownian dynamics and atomistic protein folding. Additionally, we demonstrate a method for analyzing the VDE model, inspired by saliency mapping, to determine what features are selected by the VDE model to describe dynamics. The VDE presents an important step in applying techniques from deep learning to more accurately model and interpret complex biophysics.
Submitted 1 December, 2017; v1 submitted 23 November, 2017;
originally announced November 2017.
-
MSM lag time cannot be used for variational model selection
Authors:
Brooke E. Husic,
Vijay S. Pande
Abstract:
The variational principle for conformational dynamics has enabled the systematic construction of Markov state models through the optimization of hyperparameters by approximating the transfer operator. In this note we discuss why lag time of the operator being approximated must be held constant in the variational approach.
Submitted 27 August, 2017;
originally announced August 2017.
-
Theoretical restrictions on longest implicit timescales in Markov state models of biomolecular dynamics
Authors:
Anton V. Sinitskiy,
Vijay S. Pande
Abstract:
Markov state models (MSMs) have been widely used to analyze computer simulations of various biomolecular systems. They can capture conformational transitions much slower than an average or maximal length of a single molecular dynamics (MD) trajectory from the set of trajectories used to build the MSM. A rule of thumb claiming that the slowest implicit timescale captured by an MSM should be comparable in order of magnitude to the aggregate duration of all MD trajectories used to build this MSM has been known in the field. However, this rule has never been formally proved. In this work, we present analytical results for the slowest timescale in several types of MSMs, supporting the above rule. We conclude that the slowest implicit timescale equals the product of the aggregate sampling and four factors that quantify: (1) how much statistics on the conformational transitions corresponding to the longest implicit timescale is available, (2) how good the sampling of the destination Markov state is, (3) the gain in statistics from using a sliding window for counting transitions between Markov states, and (4) a bias in the estimate of the implicit timescale arising from finite sampling of the conformational transitions. We demonstrate that in many practically important cases all these four factors are on the order of unity, and we analyze possible scenarios that could lead to their significant deviation from unity. Overall, we provide for the first time analytical results on the slowest timescales captured by MSMs. These results can guide further practical applications of MSMs to biomolecular dynamics and allow for higher computational efficiency of simulations.
Submitted 9 August, 2017;
originally announced August 2017.
-
Computationally Discovered Potentiating Role of Glycans on NMDA Receptors
Authors:
Anton V. Sinitskiy,
Nathaniel H. Stanley,
David H. Hackos,
Jesse E. Hanson,
Benjamin D. Sellers,
Vijay S. Pande
Abstract:
N-methyl-D-aspartate receptors (NMDARs) are glycoproteins in the brain central to learning and memory. The effects of glycosylation on the structure and dynamics of NMDARs are largely unknown. In this work, we use extensive molecular dynamics simulations of GluN1 and GluN2B ligand binding domains (LBDs) of NMDARs to investigate these effects. Our simulations predict that intra-domain interactions involving the glycan attached to residue GluN1-N440 stabilize closed-clamshell conformations of the GluN1 LBD. The glycan on GluN2B-N688 shows a similar, though weaker, effect. Based on these results, and assuming the transferability of the results of LBD simulations to the full receptor, we predict that glycans at GluN1-N440 might play a potentiator role in NMDARs. To validate this prediction, we perform electrophysiological analysis of full-length NMDARs with a glycosylation-preventing GluN1-N440Q mutation, and demonstrate an increase in the glycine EC50 value. Overall, our results suggest an intramolecular potentiating role of glycans on NMDA receptors.
Submitted 19 December, 2016;
originally announced December 2016.
-
Identification of simple reaction coordinates from complex dynamics
Authors:
Robert T. McGibbon,
Brooke E. Husic,
Vijay S. Pande
Abstract:
Reaction coordinates are widely used throughout chemical physics to model and understand complex chemical transformations. We introduce a definition of the natural reaction coordinate, suitable for condensed phase and biomolecular systems, as a maximally predictive one-dimensional projection. We then show this criterion is uniquely satisfied by a dominant eigenfunction of an integral operator associated with the ensemble dynamics. We present a new sparse estimator for these eigenfunctions which can search through a large candidate pool of structural order parameters and build simple, interpretable approximations that employ only a small number of these order parameters. Example applications with a small molecule's rotational dynamics and simulations of protein conformational change and folding show that this approach can filter through statistical noise to identify simple reaction coordinates from complex dynamics.
Submitted 6 January, 2017; v1 submitted 28 February, 2016;
originally announced February 2016.
-
Efficient maximum likelihood parameterization of continuous-time Markov processes
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Continuous-time Markov processes over finite state-spaces are widely used to model dynamical processes in many fields of natural and social science. Here, we introduce a maximum likelihood estimator for constructing such models from data observed at a finite time interval. This estimator is dramatically more efficient than prior approaches, enables the calculation of deterministic confidence intervals in all model parameters, and can easily enforce important physical constraints on the models such as detailed balance. We demonstrate and discuss the advantages of these models over existing discrete-time Markov models for the analysis of molecular dynamics simulations.
Submitted 30 June, 2015; v1 submitted 7 April, 2015;
originally announced April 2015.
-
Perspective: Markov Models for Long-Timescale Biomolecular Dynamics
Authors:
Christian R. Schwantes,
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Molecular dynamics simulations have the potential to provide atomic-level detail and insight to important questions in chemical physics that cannot be observed in typical experiments. However, simply generating a long trajectory is insufficient, as researchers must be able to transform the data in a simulation trajectory into specific scientific insights. Although this analysis step has often been taken for granted, it deserves further attention as large-scale simulations become increasingly routine. In this perspective, we discuss the application of Markov models to the analysis of large-scale biomolecular simulations. We draw attention to recent improvements in the construction of these models as well as several important open issues. In addition, we highlight recent theoretical advances that pave the way for a new generation of models of molecular kinetics.
Submitted 22 August, 2014;
originally announced August 2014.
-
Efficient inference of protein structural ensembles
Authors:
Thomas J. Lane,
Christian R. Schwantes,
Kyle A. Beauchamp,
Vijay S. Pande
Abstract:
It is becoming clear that traditional, single-structure models of proteins are insufficient for understanding their biological function. Here, we outline one method for inferring, from experiments, not only the most common structure a protein adopts (native state), but the entire ensemble of conformations the system can adopt. Such ensemble models are necessary to understand intrinsically disordered proteins, enzyme catalysis, and signaling. We suggest that the most difficult aspect of generating such a model will be finding a small set of configurations to accurately model structural heterogeneity and present one way to overcome this challenge.
Submitted 1 August, 2014;
originally announced August 2014.
-
Variational cross-validation of slow dynamical modes in molecular kinetics
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Markov state models (MSMs) are a widely used method for approximating the eigenspectrum of the molecular dynamics propagator, yielding insight into the long-timescale statistical kinetics and slow dynamical modes of biomolecular systems. However, the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. Here, we consider cross-validation with a new objective function for estimators of these slow dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures the ability of a rank-$m$ projection operator to capture the slow subspace of the system. It is shown that a variational theorem bounds the GMRQ from above by the sum of the first $m$ eigenvalues of the system's propagator, but that this bound can be violated when the requisite matrix elements are estimated subject to statistical uncertainty. This overfitting can be detected and avoided through cross-validation. These results make it possible to construct Markov state models for protein dynamics in a way that appropriately captures the tradeoff between systematic and statistical errors.
Submitted 27 March, 2015; v1 submitted 30 July, 2014;
originally announced July 2014.
-
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models
Authors:
Robert T. McGibbon,
Bharath Ramsundar,
Mohammad M. Sultan,
Gert Kiss,
Vijay S. Pande
Abstract:
We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing a…
▽ More
We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criteria based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Bluewaters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.
Submitted 6 May, 2014;
originally announced May 2014.
-
Probing the Origins of Two-State Folding
Authors:
Thomas J. Lane,
Christian R. Schwantes,
Kyle A. Beauchamp,
Vijay S. Pande
Abstract:
Many protein systems fold in a two-state manner. Random models, however, rarely display two-state kinetics and thus such behavior should not be accepted as a default. To date, many theories for the prevalence of two-state kinetics have been presented, but none sufficiently explain the breadth of experimental observations. A model, making a minimum of assumptions, is introduced that suggests two-state behavior is likely for any system with an overwhelmingly populated native state. We show two-state folding is emergent and strengthened by increasing the occupancy population of the native state. Further, the model exhibits a hub-like behavior, with slow interconversions between unfolded states. Despite this, the unfolded state equilibrates quickly relative to the folding time. This apparent paradox is readily understood through this model. Finally, our results compare favorably with experimental measurements of protein folding rates as a function of chain length and Keq, and provide new insight into these results.
Submitted 4 May, 2013;
originally announced May 2013.
-
A robust approach to estimating rates from time-correlation functions
Authors:
John D. Chodera,
Phillip J. Elms,
William C. Swope,
Jan-Hendrik Prinz,
Susan Marqusee,
Carlos Bustamante,
Frank Noé,
Vijay S. Pande
Abstract:
While seemingly straightforward in principle, the reliable estimation of rate constants is seldom easy in practice. Numerous issues, such as the complication of poor reaction coordinates, cause obvious approaches to yield unreliable estimates. When a reliable order parameter is available, the reactive flux theory of Chandler allows the rate constant to be extracted from the plateau region of an appropriate reactive flux function. However, when applied to real data from single-molecule experiments or molecular dynamics simulations, the rate can sometimes be difficult to extract due to the numerical differentiation of a noisy empirical correlation function or difficulty in locating the plateau region at low sampling frequencies. We present a modified version of this theory which does not require numerical derivatives, allowing rate constants to be robustly estimated from the time-correlation function directly. We compare these approaches using single-molecule force spectroscopy measurements of an RNA hairpin.
Submitted 10 August, 2011;
originally announced August 2011.
-
Splitting probabilities as a test of reaction coordinate choice in single-molecule experiments
Authors:
John D. Chodera,
Vijay S. Pande
Abstract:
To explain the observed dynamics in equilibrium single-molecule measurements of biomolecules, the experimental observable is often chosen as a putative reaction coordinate along which kinetic behavior is presumed to be governed by diffusive dynamics. Here, we invoke the splitting probability as a test of the suitability of such a proposed reaction coordinate. Comparison of the observed splitting probability with that computed from the kinetic model provides a simple test to reject poor reaction coordinates. We demonstrate this test for a force spectroscopy measurement of a DNA hairpin.
Submitted 13 July, 2011; v1 submitted 3 May, 2011;
originally announced May 2011.
-
A simple theory of protein folding kinetics
Authors:
Vijay S. Pande
Abstract:
We present a simple model of protein folding dynamics that captures key qualitative elements recently seen in all-atom simulations. The goals of this theory are to serve as a simple formalism for gaining deeper insight into the physical properties seen in detailed simulations as well as to serve as a model to easily compare why these simulations suggest a different kinetic mechanism than previous simple models. Specifically, we find that non-native contacts play a key role in determining the mechanism, which can shift dramatically as the energetic strength of non-native interactions is changed. For protein-like non-native interactions, our model finds that the native state is a kinetic hub, connecting the strength of relevant interactions directly to the nature of folding kinetics.
Submitted 2 July, 2010;
originally announced July 2010.
-
Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology
Authors:
Stefan M. Larson,
Christopher D. Snow,
Michael Shirts,
Vijay S. Pande
Abstract:
For decades, researchers have been applying computer simulation to address problems in biology. However, many of these "grand challenges" in computational biology, such as simulating how proteins fold, remained unsolved due to their great complexity. Indeed, even to simulate the fastest folding protein would require decades on the fastest modern CPUs. Here, we review novel methods to fundamentally speed such previously intractable problems using a new computational paradigm: distributed computing. By efficiently harnessing tens of thousands of computers throughout the world, we have been able to break previous computational barriers. However, distributed computing brings new challenges, such as how to efficiently divide a complex calculation among many PCs that are connected by relatively slow networking. Moreover, even if the challenge of accurately reproducing reality can be conquered, a new challenge emerges: how can we take the results of these simulations (typically tens to hundreds of gigabytes of raw data) and gain some insight into the questions at hand? This challenge of the analysis of the sea of data resulting from large-scale simulation will likely remain for decades to come.
Submitted 7 January, 2009;
originally announced January 2009.
-
Potential for modulation of the hydrophobic effect inside chaperonins
Authors:
Jeremy L. England,
Vijay S. Pande
Abstract:
Despite the spontaneity of some in vitro protein folding reactions, native folding in vivo often requires the participation of barrel-shaped multimeric complexes known as chaperonins. Although it has long been known that chaperonin substrates fold upon sequestration inside the chaperonin barrel, the precise mechanism by which confinement within this space facilitates folding remains unknown. In this study, we examine the possibility that the chaperonin mediates a favorable reorganization of the solvent for the folding reaction. We begin by discussing the effect of electrostatic charge on solvent-mediated hydrophobic forces in an aqueous environment. Based on these initial physical arguments, we construct a simple, phenomenological theory for the thermodynamics of density and hydrogen bond order fluctuations in liquid water. Within the framework of this model, we investigate the effect of confinement within a chaperonin-like cavity on the configurational free energy of water by calculating solvent free energies for cavities corresponding to the different conformational states in the ATP-driven catalytic cycle of the prokaryotic chaperonin GroEL. Our findings suggest that one function of chaperonins may be to trap unfolded proteins and subsequently expose them to a micro-environment in which the hydrophobic effect, a crucial thermodynamic driving force for folding, is enhanced.
Submitted 4 February, 2008;
originally announced February 2008.
-
Freezing Transition of Compact Polyampholytes
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Chris Joerg,
Mehran Kardar,
Toyoichi Tanaka
Abstract:
Polyampholytes (PAs) are heteropolymers with long range Coulomb interactions. Unlike polymers with short range forces, PA energy levels have non-vanishing correlations and are thus very different from the Random Energy Model (REM). Nevertheless, if charges in the PA globule are screened as in a regular plasma, PAs freeze in REM fashion. Our results shed light on the potential role of Coulomb interactions in folding and evolution of proteins, which are weakly charged PAs, in particular making connection with the finding that sequences of charged amino acids in proteins are not random.
Submitted 5 September, 1996;
originally announced September 1996.
-
Is Heteropolymer Freezing Well Described by the Random Energy Model?
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Chris Joerg,
Toyoichi Tanaka
Abstract:
It is widely held that the Random Energy Model (REM) describes the freezing transition of a variety of types of heteropolymers. We demonstrate that the hallmark property of REM, statistical independence of the energies of states over disorder, is violated in different ways for models commonly employed in heteropolymer freezing studies. The implications for proteins are also discussed.
Submitted 23 April, 1996;
originally announced April 1996.
-
How Accurate Must Potentials Be for Successful Modeling of Protein Folding?
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Toyoichi Tanaka
Abstract:
Protein sequences are believed to have been selected to provide the stability of, and reliable renaturation to, an encoded unique spatial fold. In recently proposed theoretical schemes, this selection is modeled as "minimal frustration," or "optimal energy" of the desirable target conformation over all possible sequences, such that the "design" of the sequence is governed by the interactions between monomers. With replica mean field theory, we examine the possibility of reconstructing the renaturation, or freezing transition, of the "designed" heteropolymer given the inevitable errors in the determination of interaction energies, that is, the difference between sets (matrices) of interactions governing chain design and conformations, respectively. We find that the possibility of folding to the designed conformation is controlled by the correlations of the elements of the design and renaturation interaction matrices; unlike random heteropolymers, the ground state of designed heteropolymers is sufficiently stable, such that even a substantial error in the interaction energy should still yield correct renaturation.
Submitted 20 October, 1995;
originally announced October 1995.