-
OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials
Authors:
Peter Eastman,
Raimondas Galvelis,
Raúl P. Peláez,
Charlles R. A. Abreu,
Stephen E. Farr,
Emilio Gallicchio,
Anton Gorenko,
Michael M. Henry,
Frank Hu,
Jing Huang,
Andreas Krämer,
Julien Michel,
Joshua A. Mitchell,
Vijay S. Pande,
João PGLM Rodrigues,
Jaime Rodriguez-Guerra,
Andrew C. Simmonett,
Sukrit Singh,
Jason Swails,
Philip Turner,
Yuanqing Wang,
Ivy Zhang,
John D. Chodera,
Gianni De Fabritiis,
Thomas E. Markland
Abstract:
Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model their molecules of interest with general purpose, pretrained potential functions. A collection of optimized CUDA kernels and custom PyTorch operations greatly improves the speed of simulations. We demonstrate these features on simulations of cyclin-dependent kinase 8 (CDK8) and the green fluorescent protein (GFP) chromophore in water. Taken together, these features make it practical to use machine learning to improve the accuracy of simulations at only a modest increase in cost.
Submitted 29 November, 2023; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Folding@home: achievements from over twenty years of citizen science herald the exascale era
Authors:
Vincent A. Voelz,
Vijay S. Pande,
Gregory R. Bowman
Abstract:
Simulations of biomolecules have enormous potential to inform our understanding of biology but require extremely demanding calculations. For over twenty years, the Folding@home distributed computing project has pioneered a massively parallel approach to biomolecular simulation, harnessing the resources of citizen scientists across the globe. Here, we summarize the scientific and technical advances this perspective has enabled. As the project's name implies, the early years of Folding@home focused on driving advances in our understanding of protein folding by developing statistical methods for capturing long-timescale processes and facilitating insight into complex dynamical processes. Success laid a foundation for broadening the scope of Folding@home to address other functionally relevant conformational changes, such as receptor signaling, enzyme dynamics, and ligand binding. Continued algorithmic advances, hardware developments such as GPU-based computing, and the growing scale of Folding@home have enabled the project to focus on new areas where massively parallel sampling can be impactful. While previous work sought to expand toward larger proteins with slower conformational changes, new work focuses on large-scale comparative studies of different protein sequences and chemical compounds to better understand biology and inform the development of small molecule drugs. Progress on these fronts enabled the community to pivot quickly in response to the COVID-19 pandemic, expanding to become the world's first exascale computer and deploying this massive resource to provide insight into the inner workings of the SARS-CoV-2 virus and aid the development of new antivirals. This success provides a glimpse of what's to come as exascale supercomputers come online, and Folding@home continues its work.
Submitted 15 March, 2023;
originally announced March 2023.
-
Classical Quantum Optimization with Neural Network Quantum States
Authors:
Joseph Gomes,
Keri A. McKiernan,
Peter Eastman,
Vijay S. Pande
Abstract:
The classical simulation of quantum systems typically requires exponential resources. Recently, the introduction of a machine learning-based wavefunction ansatz has led to the ability to solve the quantum many-body problem in regimes that had previously been intractable for existing exact numerical methods. Here, we demonstrate the utility of the variational representation of quantum states based on artificial neural networks for performing quantum optimization. We show empirically that this methodology achieves high approximation ratio solutions with polynomial classical computing resources for a range of instances of the Maximum Cut (MaxCut) problem whose solutions have been encoded into the ground state of quantum many-body systems up to and including 256 qubits.
Submitted 23 October, 2019;
originally announced October 2019.
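The "approximation ratio" quoted in the abstract above is the cut value a method finds divided by the optimal cut value. The following is a minimal sketch of that metric only, not of the neural network quantum state method itself; the brute-force solver, the 5-node ring graph, and all function names are illustrative assumptions that do not come from the paper.

```python
from itertools import product


def cut_value(edges, assignment):
    """Count the edges whose endpoints lie on opposite sides of the partition."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def max_cut_brute_force(n_nodes, edges):
    """Exact MaxCut value by enumerating all 2^n partitions (tiny graphs only)."""
    return max(cut_value(edges, bits) for bits in product((0, 1), repeat=n_nodes))


def approximation_ratio(n_nodes, edges, assignment):
    """Cut value of a candidate partition divided by the optimal cut value."""
    return cut_value(edges, assignment) / max_cut_brute_force(n_nodes, edges)


# Hypothetical test instance: a ring of 5 nodes (an odd cycle).
RING_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

if __name__ == "__main__":
    candidate = (0, 0, 0, 1, 1)  # an arbitrary, deliberately poor partition
    print(approximation_ratio(5, RING_EDGES, candidate))
```

Exhaustive enumeration is only feasible for a handful of nodes; it serves here purely to supply the optimal value that the ratio is measured against.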
-
Physical machine learning outperforms "human learning" in Quantum Chemistry
Authors:
Anton V. Sinitskiy,
Vijay S. Pande
Abstract:
Two types of approaches to modeling molecular systems have demonstrated high practical efficiency. Density functional theory (DFT), the most widely used quantum chemical method, is a physical approach predicting energies and electron densities of molecules. Recently, numerous papers on machine learning (ML) of molecular properties have also been published. ML models greatly outperform DFT in terms of computational costs, and may even reach comparable accuracy, but they are missing physicality - a direct link to Quantum Physics - which limits their applicability. Here, we propose an approach that combines the strong sides of DFT and ML, namely, physicality and low computational cost. By generalizing the famous Hohenberg-Kohn theorems, we derive general equations for exact electron densities and energies that can naturally guide applications of ML in Quantum Chemistry. Based on these equations, we build a deep neural network that can compute electron densities and energies of a wide range of organic molecules not only much faster, but also closer to exact physical values than current versions of DFT. In particular, we reached a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, noticeably lower than those of DFT (down to ~3 kcal/mol on the same set of molecules) and ML (down to ~1.5 kcal/mol) methods. A simultaneous improvement in the accuracy of predictions of electron densities and energies suggests that the proposed approach describes the physics of molecules better than DFT functionals developed by "human learning" earlier. Thus, physics-based ML offers exciting opportunities for modeling, with high-theory-level quantum chemical accuracy, of much larger molecular systems than currently possible.
Submitted 27 February, 2020; v1 submitted 1 August, 2019;
originally announced August 2019.
-
Predicting Gene Expression Between Species with Neural Networks
Authors:
Peter Eastman,
Vijay S. Pande
Abstract:
We train a neural network to predict human gene expression levels based on experimental data for rat cells. The network is trained with paired human/rat samples from the Open TG-GATEs database, where paired samples were treated with the same compound at the same dose. When evaluated on a test set of held-out compounds, the network successfully predicts human expression levels. On the majority of the test compounds, the list of differentially expressed genes determined from predicted expression levels agrees well with the list of differentially expressed genes determined from actual human experimental data.
Submitted 5 July, 2019;
originally announced July 2019.
-
Step Change Improvement in ADMET Prediction with PotentialNet Deep Featurization
Authors:
Evan N. Feinberg,
Robert Sheridan,
Elizabeth Joshi,
Vijay S. Pande,
Alan C. Cheng
Abstract:
The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) properties of drug candidates are estimated to account for up to 50% of all clinical trial failures. Predicting ADMET properties has therefore been of great interest to the cheminformatics and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, whether the learner is a random forest or a deep neural network, leverage fixed fingerprint feature representations of molecules. In contrast, in this paper, we learn the features most relevant to each chemical task at hand by representing each molecule explicitly as a graph, where each node is an atom and each edge is a bond. By applying graph convolutions to this explicit molecular representation, we achieve, to our knowledge, unprecedented accuracy in prediction of ADMET properties. By challenging our methodology with rigorous cross-validation procedures and prospective analyses, we show that deep featurization better enables molecular predictors to not only interpolate but also extrapolate to new regions of chemical space.
Submitted 28 March, 2019;
originally announced March 2019.
-
Predicting Toxicity from Gene Expression with Neural Networks
Authors:
Peter Eastman,
Vijay S. Pande
Abstract:
We train a neural network to predict chemical toxicity based on gene expression data. The input to the network is a full expression profile collected either in vitro from cultured cells or in vivo from live animals. The output is a set of fine-grained predictions for the presence of a variety of pathological effects in treated animals. When trained on the Open TG-GATEs database, it produces good results, outperforming classical models trained on the same data. This is a promising approach for efficiently screening chemicals for toxic effects, and for more accurately evaluating drug candidates based on preclinical data.
Submitted 31 January, 2019;
originally announced February 2019.
-
Deep Neural Network Computes Electron Densities and Energies of a Large Set of Organic Molecules Faster than Density Functional Theory (DFT)
Authors:
Anton V. Sinitskiy,
Vijay S. Pande
Abstract:
Density functional theory (DFT) is one of the main methods in Quantum Chemistry that offers an attractive trade-off between the cost and accuracy of quantum chemical computations. The electron density plays a key role in DFT. In this work, we explore whether machine learning - more specifically, deep neural networks (DNNs) - can be trained to predict electron densities faster than DFT. First, we choose a practically efficient combination of a DFT functional and a basis set (PBE0/pcS-3) and use it to generate a database of DFT solutions for more than 133,000 organic molecules from a previously published database QM9. Next, we train a DNN to predict electron densities and energies of such molecules. The only input to the DNN is an approximate electron density computed with a cheap quantum chemical method in a small basis set (HF/cc-VDZ). We demonstrate that the DNN successfully learns differences in the electron densities arising both from electron correlation and small basis set artifacts in the HF computations. All qualitative features in density differences, including local minima on lone pairs, local maxima on nuclei, toroidal shapes around C-H and C-C bonds, complex shapes around aromatic and cyclopropane rings and CN group, etc., are captured by the DNN. Accuracy of energy predictions by the DNN is ~ 1 kcal/mol, on par with other models reported in the literature, while those models do not predict the electron density. Computations with the DNN, including HF computations, take much less time than DFT computations (by a factor of ~20-30 for most QM9 molecules in the current version, and it is clear how it could be further improved).
Submitted 7 September, 2018;
originally announced September 2018.
-
Binding Pathway of Opiates to $μ$ Opioid Receptors Revealed by Unsupervised Machine Learning
Authors:
Amir Barati Farimani,
Evan N. Feinberg,
Vijay S. Pande
Abstract:
Many important analgesics relieve pain by binding to the $μ$-Opioid Receptor ($μ$OR), which makes the $μ$OR among the most clinically relevant proteins of the G Protein Coupled Receptor (GPCR) family. Despite previous studies on the activation pathways of the GPCRs, the mechanism of opiate binding and the selectivity of $μ$OR are largely unknown. We performed extensive molecular dynamics (MD) simulation and analysis to find the selective allosteric binding sites of the $μ$OR and the path opiates take to bind to the orthosteric site. In this study, we predicted that the allosteric site is responsible for the attraction and selection of opiates. Using Markov state models and machine learning, we traced the pathway of opiates in binding to the orthosteric site, the main binding pocket. Our results have important implications in designing novel analgesics.
Submitted 22 April, 2018;
originally announced April 2018.
-
Deep Learning Phase Segregation
Authors:
Amir Barati Farimani,
Joseph Gomes,
Rishi Sharma,
Franklin L. Lee,
Vijay S. Pande
Abstract:
Phase segregation, the process by which the components of a binary mixture spontaneously separate, is a key process in the evolution and design of many chemical, mechanical, and biological systems. In this work, we present a data-driven approach for the learning, modeling, and prediction of phase segregation. A direct mapping between an initially dispersed, immiscible binary fluid and the equilibrium concentration field is learned by conditional generative convolutional neural networks. Concentration field predictions by the deep learning model conserve phase fraction, correctly predict phase transition, and reproduce area, perimeter, and total free energy distributions up to 98% accuracy.
Submitted 23 March, 2018;
originally announced March 2018.
-
Note: Variational Encoding of Protein Dynamics Benefits from Maximizing Latent Autocorrelation
Authors:
Hannah K. Wayment-Steele,
Vijay S. Pande
Abstract:
As deep Variational Auto-Encoder (VAE) frameworks become more widely used for modeling biomolecular simulation data, we emphasize the capability of the VAE architecture to concurrently maximize the timescale of the latent space while inferring a reduced coordinate, which assists in finding slow processes according to the variational approach to conformational dynamics. We additionally provide evidence that the VDE framework (Hernández et al., 2017), which uses this autocorrelation loss along with a time-lagged reconstruction loss, obtains a variationally optimized latent coordinate in comparison with related loss functions. We thus recommend leveraging the autocorrelation of the latent space while training neural network models of biomolecular simulation data to better represent slow processes.
Submitted 16 March, 2018;
originally announced March 2018.
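The quantity the abstract above recommends maximizing, the autocorrelation of the latent coordinate, can be illustrated on a plain one-dimensional time series. This is a rough sketch under stated assumptions: the normalized time-lagged estimator below is a standard textbook form, and the function name and toy trajectories are hypothetical, not taken from the VDE implementation.

```python
def lagged_autocorrelation(z, lag):
    """Normalized autocorrelation of a 1-D series z at a given positive lag.

    Values near 1 indicate a slowly varying coordinate; values near -1 or 0
    indicate fast, oscillating or uncorrelated motion.
    """
    n_pairs = len(z) - lag
    mean = sum(z) / len(z)
    # Covariance between the series and its time-shifted copy.
    cov = sum((a - mean) * (b - mean) for a, b in zip(z[:n_pairs], z[lag:])) / n_pairs
    # Variance of the full series, used for normalization.
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return cov / var


# Toy trajectories (assumptions for illustration only):
SLOW = [0.0] * 50 + [1.0] * 50  # a single slow two-state transition
FAST = [0.0, 1.0] * 50          # rapid flipping at every step

if __name__ == "__main__":
    print(lagged_autocorrelation(SLOW, 1))
    print(lagged_autocorrelation(FAST, 1))
```

In a training setting the negative of such a quantity could serve as a loss term, so that coordinates resembling the slow trajectory are preferred over those resembling the fast one.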
-
Machine Learning Harnesses Molecular Dynamics to Discover New $μ$ Opioid Chemotypes
Authors:
Evan N. Feinberg,
Amir Barati Farimani,
Rajendra Uprety,
Amanda Hunkele,
Gavril W. Pasternak,
Susruta Majumdar,
Vijay S. Pande
Abstract:
Computational chemists typically assay drug candidates by virtually screening compounds against crystal structures of a protein despite the fact that some targets, like the $μ$ Opioid Receptor and other members of the GPCR family, traverse many non-crystallographic states. We discover new conformational states of $μ$OR with molecular dynamics simulation and then machine learn ligand-structure relationships to predict opioid ligand function. These artificial intelligence models identified a novel $μ$ opioid chemotype.
Submitted 12 March, 2018;
originally announced March 2018.
-
PotentialNet for Molecular Property Prediction
Authors:
Evan N. Feinberg,
Debnil Sur,
Zhenqin Wu,
Brooke E. Husic,
Huanghao Mai,
Yang Li,
Saisai Sun,
Jianyi Yang,
Bharath Ramsundar,
Vijay S. Pande
Abstract:
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. The key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning---instead of feature engineering---deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor $EF_χ^{(R)}$, to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.
Submitted 22 October, 2018; v1 submitted 12 March, 2018;
originally announced March 2018.
-
SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Authors:
Jade Shi,
Rhiju Das,
Vijay S. Pande
Abstract:
Solving the RNA inverse folding problem is a critical prerequisite to RNA design, an emerging field in bioengineering with a broad range of applications from reaction catalysis to cancer therapy. Although significant progress has been made in developing machine-based inverse RNA folding algorithms, current approaches still have difficulty designing sequences for large or complex targets. On the other hand, human players of the online RNA design game EteRNA have consistently shown superior performance in this regard, being able to readily design sequences for targets that are challenging for machine algorithms. Here we present a novel approach to the RNA design problem, SentRNA, a design agent consisting of a fully-connected neural network trained end-to-end using human-designed RNA sequences. We show that through this approach, SentRNA can solve complex targets previously unsolvable by any machine-based approach and achieve state-of-the-art performance on two separate challenging test sets. Our results demonstrate that incorporating human design strategies into a design algorithm can significantly boost machine performance and suggest a new paradigm for machine-based RNA design.
Submitted 5 March, 2019; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Using Deep Learning for Segmentation and Counting within Microscopy Data
Authors:
Carlos X. Hernández,
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Cell counting is a ubiquitous, yet tedious task that would greatly benefit from automation. From basic biological questions to clinical trials, cell counts provide key quantitative feedback that drives research. Unfortunately, cell counting is most commonly a manual task and can be time-intensive. The task is made even more difficult by overlapping cells, the existence of multiple focal planes, and poor imaging quality, among other factors. Here, we describe a convolutional neural network approach, using a recently described feature pyramid network combined with a VGG-style neural network, for segmenting and subsequently counting cells in a given microscopy image.
Submitted 28 February, 2018;
originally announced February 2018.
-
Automated design of collective variables using supervised machine learning
Authors:
Mohammad M. Sultan,
Vijay S. Pande
Abstract:
Selection of appropriate collective variables for enhancing sampling of molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is especially challenging in higher dimensions. Which atomic coordinates or transforms thereof from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two-state systems and only increases in difficulty for multi-state systems. In this work, we solve the initial CV problem using a data-driven approach inspired by the field of supervised machine learning. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs (SML_cv) for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines' decision hyperplane, the output probability estimates from Logistic Regression, the outputs from deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.
Submitted 13 May, 2018; v1 submitted 28 February, 2018;
originally announced February 2018.
-
Adaptive Boundaries in Multiscale Simulations
Authors:
Jason A. Wagoner,
Vijay S. Pande
Abstract:
Combined-resolution simulations are an effective way to study molecular properties across a range of length- and time-scales. These simulations can benefit from adaptive boundaries that allow the high-resolution region to adapt (change size and/or shape) as the simulation progresses. The number of degrees of freedom required to accurately represent even a simple molecular process can vary by several orders of magnitude throughout the course of a simulation, and adaptive boundaries react to these changes to include an appropriate but not excessive amount of detail. Here, we derive the Hamiltonian and distribution function for such a molecular simulation. We also design an algorithm that can efficiently sample the boundary as a new coordinate of the system. We apply this framework to a mixed explicit/continuum representation of a peptide in solvent. We use this example to discuss the conditions necessary for a successful implementation of adaptive boundaries that is both efficient and accurate in reproducing molecular properties.
Submitted 3 April, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Transferable neural networks for enhanced sampling of protein dynamics
Authors:
Mohammad M. Sultan,
Hannah K. Wayment-Steele,
Vijay S. Pande
Abstract:
Variational auto-encoder frameworks have demonstrated success in reducing complex nonlinear dynamics in molecular simulation to a single non-linear embedding. In this work, we illustrate how this non-linear latent embedding can be used as a collective variable for enhanced sampling, and present a simple modification that allows us to rapidly perform sampling in multiple related systems. We first demonstrate our method is able to describe the effects of force field changes in capped alanine dipeptide after learning a model using AMBER99. We further provide a simple extension to variational dynamics encoders that allows the model to be trained in a more efficient manner on larger systems by encoding the outputs of a linear transformation using time-structure based independent component analysis (tICA). Using this technique, we show how such a model trained for one protein, the WW domain, can efficiently be transferred to perform enhanced sampling on a related mutant protein, the GTT mutation. This method shows promise for its ability to rapidly sample related systems using a single transferable collective variable and is generally applicable to sets of related simulations, enabling us to probe the effects of variation in increasingly large systems of biophysical interest.
Submitted 2 January, 2018;
originally announced January 2018.
-
Unsupervised learning of dynamical and molecular similarity using variance minimization
Authors:
Brooke E. Husic,
Vijay S. Pande
Abstract:
In this report, we present an unsupervised machine learning method for determining groups of molecular systems according to similarity in their dynamics or structures using Ward's minimum variance objective function. We first apply the minimum variance clustering to a set of simulated tripeptides using the information theoretic Jensen-Shannon divergence between Markovian transition matrices in order to gain insight into how point mutations affect protein dynamics. Then, we extend the method to partition two chemoinformatic datasets according to structural similarity to motivate a train/validation/test split for supervised learning that avoids overfitting.
Submitted 20 December, 2017;
originally announced December 2017.
-
Variational Encoding of Complex Dynamics
Authors:
Carlos X. Hernández,
Hannah K. Wayment-Steele,
Mohammad M. Sultan,
Brooke E. Husic,
Vijay S. Pande
Abstract:
Often the analysis of time-dependent chemical and biophysical systems produces high-dimensional time-series data for which it can be difficult to interpret which individual features are most salient. While recent work from our group and others has demonstrated the utility of time-lagged co-variate models to study such systems, linearity assumptions can limit the compression of inherently nonlinear dynamics into just a few characteristic components. Recent work in the field of deep learning has led to the development of variational autoencoders (VAE), which are able to compress complex datasets into simpler manifolds. We present the use of a time-lagged VAE, or variational dynamics encoder (VDE), to reduce complex, nonlinear processes to a single embedding with high fidelity to the underlying dynamics. We demonstrate how the VDE is able to capture nontrivial dynamics in a variety of examples, including Brownian dynamics and atomistic protein folding. Additionally, we demonstrate a method for analyzing the VDE model, inspired by saliency mapping, to determine what features are selected by the VDE model to describe dynamics. The VDE presents an important step in applying techniques from deep learning to more accurately model and interpret complex biophysics.
Submitted 1 December, 2017; v1 submitted 23 November, 2017;
originally announced November 2017.
-
Deep Learning the Physics of Transport Phenomena
Authors:
Amir Barati Farimani,
Joseph Gomes,
Vijay S. Pande
Abstract:
We have developed a new data-driven paradigm for the rapid inference, modeling and simulation of the physics of transport phenomena by deep learning. Using conditional generative adversarial networks (cGAN), we train models for the direct generation of solutions to steady state heat conduction and incompressible fluid flow purely on observation without knowledge of the underlying governing equations. Rather than using iterative numerical methods to approximate the solution of the constitutive equations, cGANs learn to directly generate the solutions to these phenomena, given arbitrary boundary conditions and domain, with high test accuracy (MAE < 1%) and state-of-the-art computational performance. The cGAN framework can be used to learn causal models directly from experimental observations where the underlying physical model is complex or unknown.
Submitted 7 September, 2017;
originally announced September 2017.
-
MSM lag time cannot be used for variational model selection
Authors:
Brooke E. Husic,
Vijay S. Pande
Abstract:
The variational principle for conformational dynamics has enabled the systematic construction of Markov state models through the optimization of hyperparameters by approximating the transfer operator. In this note we discuss why lag time of the operator being approximated must be held constant in the variational approach.
Submitted 27 August, 2017;
originally announced August 2017.
-
Theoretical restrictions on longest implicit timescales in Markov state models of biomolecular dynamics
Authors:
Anton V. Sinitskiy,
Vijay S. Pande
Abstract:
Markov state models (MSMs) have been widely used to analyze computer simulations of various biomolecular systems. They can capture conformational transitions much slower than an average or maximal length of a single molecular dynamics (MD) trajectory from the set of trajectories used to build the MSM. A rule of thumb claiming that the slowest implicit timescale captured by an MSM should be comparable in order of magnitude to the aggregate duration of all MD trajectories used to build this MSM has been known in the field. However, this rule has never been formally proved. In this work, we present analytical results for the slowest timescale in several types of MSMs, supporting the above rule. We conclude that the slowest implicit timescale equals the product of the aggregate sampling and four factors that quantify: (1) how much statistics on the conformational transitions corresponding to the longest implicit timescale is available, (2) how good the sampling of the destination Markov state is, (3) the gain in statistics from using a sliding window for counting transitions between Markov states, and (4) a bias in the estimate of the implicit timescale arising from finite sampling of the conformational transitions. We demonstrate that in many practically important cases all these four factors are on the order of unity, and we analyze possible scenarios that could lead to their significant deviation from unity. Overall, we provide for the first time analytical results on the slowest timescales captured by MSMs. These results can guide further practical applications of MSMs to biomolecular dynamics and allow for higher computational efficiency of simulations.
Submitted 9 August, 2017;
originally announced August 2017.
-
Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity
Authors:
Joseph Gomes,
Bharath Ramsundar,
Evan N. Feinberg,
Vijay S. Pande
Abstract:
Empirical scoring functions based on either molecular force fields or cheminformatics descriptors are widely used, in conjunction with molecular docking, during the early stages of drug discovery to predict potency and binding affinity of a drug-like molecule to a given target. These models require expert-level knowledge of physical chemistry and biology to be encoded as hand-tuned parameters or features rather than allowing the underlying model to select features in a data-driven procedure. Here, we develop a general 3-dimensional spatial convolution operation for learning atomic-level chemical interactions directly from atomic coordinates and demonstrate its application to structure-based bioactivity prediction. The atomic convolutional neural network is trained to predict the experimentally determined binding affinity of a protein-ligand complex by direct calculation of the energy associated with the complex, protein, and ligand given the crystal structure of the binding pose. Non-covalent interactions present in the complex that are absent in the protein-ligand sub-structures are identified and the model learns the interaction strength associated with these features. We test our model by predicting the binding free energy of a subset of protein-ligand complexes found in the PDBBind dataset and compare with state-of-the-art cheminformatics and machine learning-based approaches. We find that all methods achieve experimental accuracy and that atomic convolutional networks either outperform or perform competitively with the cheminformatics based methods. Unlike all previous protein-ligand prediction systems, atomic convolutional networks are end-to-end and fully-differentiable. They represent a new data-driven, physics-based deep learning model paradigm that offers a strong foundation for future improvements in structure-based bioactivity prediction.
Submitted 30 March, 2017;
originally announced March 2017.
-
Computationally Discovered Potentiating Role of Glycans on NMDA Receptors
Authors:
Anton V. Sinitskiy,
Nathaniel H. Stanley,
David H. Hackos,
Jesse E. Hanson,
Benjamin D. Sellers,
Vijay S. Pande
Abstract:
N-methyl-D-aspartate receptors (NMDARs) are glycoproteins in the brain central to learning and memory. The effects of glycosylation on the structure and dynamics of NMDARs are largely unknown. In this work, we use extensive molecular dynamics simulations of GluN1 and GluN2B ligand binding domains (LBDs) of NMDARs to investigate these effects. Our simulations predict that intra-domain interactions involving the glycan attached to residue GluN1-N440 stabilize closed-clamshell conformations of the GluN1 LBD. The glycan on GluN2B-N688 shows a similar, though weaker, effect. Based on these results, and assuming the transferability of the results of LBD simulations to the full receptor, we predict that glycans at GluN1-N440 might play a potentiator role in NMDARs. To validate this prediction, we perform electrophysiological analysis of full-length NMDARs with a glycosylation-preventing GluN1-N440Q mutation, and demonstrate an increase in the glycine EC50 value. Overall, our results suggest an intramolecular potentiating role of glycans on NMDA receptors.
Submitted 19 December, 2016;
originally announced December 2016.
-
Learning Protein Dynamics with Metastable Switching Systems
Authors:
Bharath Ramsundar,
Vijay S. Pande
Abstract:
We introduce a machine learning approach for extracting fine-grained representations of protein evolution from molecular dynamics datasets. Metastable switching linear dynamical systems extend standard switching models with a physically-inspired stability constraint. This constraint enables the learning of nuanced representations of protein dynamics that closely match physical reality. We derive an EM algorithm for learning, where the E-step extends the forward-backward algorithm for HMMs and the M-step requires the solution of large biconvex optimization problems. We construct an approximate semidefinite program solver based on the Frank-Wolfe algorithm and use it to solve the M-step. We apply our EM algorithm to learn accurate dynamics from large simulation datasets for the opioid peptide met-enkephalin and the proto-oncogene Src-kinase. Our learned models demonstrate significant improvements in temporal coherence over HMMs and standard switching models for met-enkephalin, and sample transition paths (possibly useful in rational drug design) for Src-kinase.
Submitted 5 October, 2016;
originally announced October 2016.
-
Identification of simple reaction coordinates from complex dynamics
Authors:
Robert T. McGibbon,
Brooke E. Husic,
Vijay S. Pande
Abstract:
Reaction coordinates are widely used throughout chemical physics to model and understand complex chemical transformations. We introduce a definition of the natural reaction coordinate, suitable for condensed phase and biomolecular systems, as a maximally predictive one-dimensional projection. We then show this criterion is uniquely satisfied by a dominant eigenfunction of an integral operator associated with the ensemble dynamics. We present a new sparse estimator for these eigenfunctions which can search through a large candidate pool of structural order parameters and build simple, interpretable approximations that employ only a small number of these order parameters. Example applications with a small molecule's rotational dynamics and simulations of protein conformational change and folding show that this approach can filter through statistical noise to identify simple reaction coordinates from complex dynamics.
Submitted 6 January, 2017; v1 submitted 28 February, 2016;
originally announced February 2016.
-
Efficient maximum likelihood parameterization of continuous-time Markov processes
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Continuous-time Markov processes over finite state-spaces are widely used to model dynamical processes in many fields of natural and social science. Here, we introduce a maximum likelihood estimator for constructing such models from data observed at a finite time interval. This estimator is dramatically more efficient than prior approaches, enables the calculation of deterministic confidence intervals in all model parameters, and can easily enforce important physical constraints on the models such as detailed balance. We demonstrate and discuss the advantages of these models over existing discrete-time Markov models for the analysis of molecular dynamics simulations.
Submitted 30 June, 2015; v1 submitted 7 April, 2015;
originally announced April 2015.
-
Perspective: Markov Models for Long-Timescale Biomolecular Dynamics
Authors:
Christian R. Schwantes,
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Molecular dynamics simulations have the potential to provide atomic-level detail and insight to important questions in chemical physics that cannot be observed in typical experiments. However, simply generating a long trajectory is insufficient, as researchers must be able to transform the data in a simulation trajectory into specific scientific insights. Although this analysis step has often been taken for granted, it deserves further attention as large-scale simulations become increasingly routine. In this perspective, we discuss the application of Markov models to the analysis of large-scale biomolecular simulations. We draw attention to recent improvements in the construction of these models as well as several important open issues. In addition, we highlight recent theoretical advances that pave the way for a new generation of models of molecular kinetics.
Submitted 22 August, 2014;
originally announced August 2014.
-
Efficient inference of protein structural ensembles
Authors:
Thomas J. Lane,
Christian R. Schwantes,
Kyle A. Beauchamp,
Vijay S. Pande
Abstract:
It is becoming clear that traditional, single-structure models of proteins are insufficient for understanding their biological function. Here, we outline one method for inferring, from experiments, not only the most common structure a protein adopts (native state), but the entire ensemble of conformations the system can adopt. Such ensemble models are necessary to understand intrinsically disordered proteins, enzyme catalysis, and signaling. We suggest that the most difficult aspect of generating such a model will be finding a small set of configurations to accurately model structural heterogeneity and present one way to overcome this challenge.
Submitted 1 August, 2014;
originally announced August 2014.
-
Variational cross-validation of slow dynamical modes in molecular kinetics
Authors:
Robert T. McGibbon,
Vijay S. Pande
Abstract:
Markov state models (MSMs) are a widely used method for approximating the eigenspectrum of the molecular dynamics propagator, yielding insight into the long-timescale statistical kinetics and slow dynamical modes of biomolecular systems. However, the lack of a unified theoretical framework for choosing between alternative models has hampered progress, especially for non-experts applying these methods to novel biological systems. Here, we consider cross-validation with a new objective function for estimators of these slow dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures the ability of a rank-$m$ projection operator to capture the slow subspace of the system. It is shown that a variational theorem bounds the GMRQ from above by the sum of the first $m$ eigenvalues of the system's propagator, but that this bound can be violated when the requisite matrix elements are estimated subject to statistical uncertainty. This overfitting can be detected and avoided through cross-validation. These results make it possible to construct Markov state models for protein dynamics in a way that appropriately captures the tradeoff between systematic and statistical errors.
Submitted 27 March, 2015; v1 submitted 30 July, 2014;
originally announced July 2014.
-
Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models
Authors:
Robert T. McGibbon,
Bharath Ramsundar,
Mohammad M. Sultan,
Gert Kiss,
Vijay S. Pande
Abstract:
We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Bluewaters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.
Submitted 6 May, 2014;
originally announced May 2014.
-
Probing the Origins of Two-State Folding
Authors:
Thomas J. Lane,
Christian R. Schwantes,
Kyle A. Beauchamp,
Vijay S. Pande
Abstract:
Many protein systems fold in a two-state manner. Random models, however, rarely display two-state kinetics and thus such behavior should not be accepted as a default. To date, many theories for the prevalence of two-state kinetics have been presented, but none sufficiently explain the breadth of experimental observations. A model, making a minimum of assumptions, is introduced that suggests two-state behavior is likely for any system with an overwhelmingly populated native state. We show two-state folding is emergent and strengthened by increasing the occupancy population of the native state. Further, the model exhibits a hub-like behavior, with slow interconversions between unfolded states. Despite this, the unfolded state equilibrates quickly relative to the folding time. This apparent paradox is readily understood through this model. Finally, our results compare favorably with experimental measurements of protein folding rates as a function of chain length and Keq, and provide new insight into these results.
Submitted 4 May, 2013;
originally announced May 2013.
-
Inferring the Rate-Length Law of Protein Folding
Authors:
Thomas J. Lane,
Vijay S. Pande
Abstract:
We investigate the rate-length scaling law of protein folding, a key undetermined scaling law in the analytical theory of protein folding. We demonstrate that chain length is a dominant factor determining folding times, and that the unambiguous determination of the way chain length correlates with folding times could provide key mechanistic insight into the folding process. Four specific proposed laws (power law, exponential, and two stretched exponentials) are tested against one another, and it is found that the power law best explains the data. At the same time, the fit power law results in rates that are very fast, nearly unreasonably so in a biological context. We show that any of the proposed forms are viable, conclude that more data is necessary to unequivocally infer the rate-length law, and that such data could be obtained through a small number of protein folding experiments on large protein domains.
Submitted 18 January, 2013;
originally announced January 2013.
-
Reducing the effect of Metropolization on mixing times in molecular dynamics simulations
Authors:
Jason A. Wagoner,
Vijay S. Pande
Abstract:
Molecular dynamics algorithms are subject to some amount of error dependent on the size of the time step that is used. This error can be corrected by periodically updating the system with a Metropolis criterion, where the integration step is treated as a selection probability for candidate state generation. Such a method, closely related to generalized hybrid Monte Carlo (GHMC), satisfies the balance condition by imposing a reversal of momenta upon candidate rejection. In the present study, we demonstrate that such momentum reversals can have a significant impact on molecular kinetics and extend the time required for system decorrelation, resulting in an order of magnitude increase in the integrated autocorrelation times of molecular variables for the worst cases. We present a simple method, referred to as reduced-flipping GHMC, that uses the information of the previous, current, and candidate states to reduce the probability of momentum flipping following candidate rejection while rigorously satisfying the balance condition. This method is a simple modification to traditional, automatic-flipping, GHMC methods and significantly mitigates the impact of such algorithms on molecular kinetics and simulation mixing times.
Submitted 24 September, 2012;
originally announced September 2012.
-
A robust approach to estimating rates from time-correlation functions
Authors:
John D. Chodera,
Phillip J. Elms,
William C. Swope,
Jan-Hendrik Prinz,
Susan Marqusee,
Carlos Bustamante,
Frank Noé,
Vijay S. Pande
Abstract:
While seemingly straightforward in principle, the reliable estimation of rate constants is seldom easy in practice. Numerous issues, such as the complication of poor reaction coordinates, cause obvious approaches to yield unreliable estimates. When a reliable order parameter is available, the reactive flux theory of Chandler allows the rate constant to be extracted from the plateau region of an appropriate reactive flux function. However, when applied to real data from single-molecule experiments or molecular dynamics simulations, the rate can sometimes be difficult to extract due to the numerical differentiation of a noisy empirical correlation function or difficulty in locating the plateau region at low sampling frequencies. We present a modified version of this theory which does not require numerical derivatives, allowing rate constants to be robustly estimated from the time-correlation function directly. We compare these approaches using single-molecule force spectroscopy measurements of an RNA hairpin.
Submitted 10 August, 2011;
originally announced August 2011.
-
Splitting probabilities as a test of reaction coordinate choice in single-molecule experiments
Authors:
John D. Chodera,
Vijay S. Pande
Abstract:
To explain the observed dynamics in equilibrium single-molecule measurements of biomolecules, the experimental observable is often chosen as a putative reaction coordinate along which kinetic behavior is presumed to be governed by diffusive dynamics. Here, we invoke the splitting probability as a test of the suitability of such a proposed reaction coordinate. Comparison of the observed splitting probability with that computed from the kinetic model provides a simple test to reject poor reaction coordinates. We demonstrate this test for a force spectroscopy measurement of a DNA hairpin.
Submitted 13 July, 2011; v1 submitted 3 May, 2011;
originally announced May 2011.
-
A simple theory of protein folding kinetics
Authors:
Vijay S. Pande
Abstract:
We present a simple model of protein folding dynamics that captures key qualitative elements recently seen in all-atom simulations. The goals of this theory are to serve as a simple formalism for gaining deeper insight into the physical properties seen in detailed simulations as well as to serve as a model to easily compare why these simulations suggest a different kinetic mechanism than previous simple models. Specifically, we find that non-native contacts play a key role in determining the mechanism, which can shift dramatically as the energetic strength of non-native interactions is changed. For protein-like non-native interactions, our model finds that the native state is a kinetic hub, connecting the strength of relevant interactions directly to the nature of folding kinetics.
Submitted 2 July, 2010;
originally announced July 2010.
-
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
Authors:
Imran S. Haque,
Vijay S. Pande
Abstract:
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific simulation contexts because of their huge performance advantages relative to conventional CPUs. However, the reliability of GPUs in error-intolerant applications is largely unproven. In particular, a lack of error checking and correcting (ECC) capability in the memory subsystems of graphics cards has been cited as a hindrance to the acceptance of GPUs as high-performance coprocessors, but the impact of this design has not been previously quantified.
In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200-architecture-based graphics cards. Furthermore, we present the results of a large-scale assessment of GPU error rate, conducted by running MemtestG80 on over 20,000 hosts on the Folding@home distributed computing network. Our control experiments on consumer-grade and dedicated-GPGPU hardware in a controlled environment found no errors. However, our survey over cards on Folding@home finds that, in their installed environments, two-thirds of tested GPUs exhibit a detectable, pattern-sensitive rate of memory soft errors. We demonstrate that these errors persist after controlling for overclocking and environmental proxies for temperature, but depend strongly on board architecture.
Submitted 13 November, 2009; v1 submitted 2 October, 2009;
originally announced October 2009.
-
Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology
Authors:
Stefan M. Larson,
Christopher D. Snow,
Michael Shirts,
Vijay S. Pande
Abstract:
For decades, researchers have been applying computer simulation to address problems in biology. However, many of these "grand challenges" in computational biology, such as simulating how proteins fold, remained unsolved due to their great complexity. Indeed, even to simulate the fastest folding protein would require decades on the fastest modern CPUs. Here, we review novel methods to fundamentally speed up such previously intractable problems using a new computational paradigm: distributed computing. By efficiently harnessing tens of thousands of computers throughout the world, we have been able to break previous computational barriers. However, distributed computing brings new challenges, such as how to efficiently divide a complex calculation among many PCs that are connected by relatively slow networking. Moreover, even if the challenge of accurately reproducing reality can be conquered, a new challenge emerges: how can we take the results of these simulations (typically tens to hundreds of gigabytes of raw data) and gain some insight into the questions at hand? This challenge of analyzing the sea of data resulting from large-scale simulation will likely remain for decades to come.
Submitted 7 January, 2009;
originally announced January 2009.
-
Potential for modulation of the hydrophobic effect inside chaperonins
Authors:
Jeremy L. England,
Vijay S. Pande
Abstract:
Despite the spontaneity of some in vitro protein folding reactions, native folding in vivo often requires the participation of barrel-shaped multimeric complexes known as chaperonins. Although it has long been known that chaperonin substrates fold upon sequestration inside the chaperonin barrel, the precise mechanism by which confinement within this space facilitates folding remains unknown. In this study, we examine the possibility that the chaperonin mediates a favorable reorganization of the solvent for the folding reaction. We begin by discussing the effect of electrostatic charge on solvent-mediated hydrophobic forces in an aqueous environment. Based on these initial physical arguments, we construct a simple, phenomenological theory for the thermodynamics of density and hydrogen bond order fluctuations in liquid water. Within the framework of this model, we investigate the effect of confinement within a chaperonin-like cavity on the configurational free energy of water by calculating solvent free energies for cavities corresponding to the different conformational states in the ATP-driven catalytic cycle of the prokaryotic chaperonin GroEL. Our findings suggest that one function of chaperonins may be to trap unfolded proteins and subsequently expose them to a micro-environment in which the hydrophobic effect, a crucial thermodynamic driving force for folding, is enhanced.
Submitted 4 February, 2008;
originally announced February 2008.
-
Freezing Transition of Compact Polyampholytes
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Chris Joerg,
Mehran Kardar,
Toyoichi Tanaka
Abstract:
Polyampholytes (PAs) are heteropolymers with long range Coulomb interactions. Unlike polymers with short range forces, PA energy levels have non-vanishing correlations and are thus very different from the Random Energy Model (REM). Nevertheless, if charges in the PA globule are screened as in a regular plasma, PAs freeze in REM fashion. Our results shed light on the potential role of Coulomb interactions in folding and evolution of proteins, which are weakly charged PAs, in particular making connection with the finding that sequences of charged amino acids in proteins are not random.
Submitted 5 September, 1996;
originally announced September 1996.
-
Is Heteropolymer Freezing Well Described by the Random Energy Model?
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Chris Joerg,
Toyoichi Tanaka
Abstract:
It is widely held that the Random Energy Model (REM) describes the freezing transition of a variety of types of heteropolymers. We demonstrate that the hallmark property of REM, statistical independence of the energies of states over disorder, is violated in different ways for models commonly employed in heteropolymer freezing studies. The implications for proteins are also discussed.
Submitted 23 April, 1996;
originally announced April 1996.
-
How Accurate Must Potentials Be for Successful Modeling of Protein Folding?
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Toyoichi Tanaka
Abstract:
Protein sequences are believed to have been selected to provide the stability of, and reliable renaturation to, an encoded unique spatial fold. In recently proposed theoretical schemes, this selection is modeled as "minimal frustration," or "optimal energy" of the desirable target conformation over all possible sequences, such that the "design" of the sequence is governed by the interactions between monomers. With replica mean field theory, we examine the possibility of reconstructing the renaturation, or freezing transition, of the "designed" heteropolymer given the inevitable errors in the determination of interaction energies, that is, the difference between sets (matrices) of interactions governing chain design and conformations, respectively. We find that the possibility of folding to the designed conformation is controlled by the correlations of the elements of the design and renaturation interaction matrices; unlike random heteropolymers, the ground state of designed heteropolymers is sufficiently stable, such that even a substantial error in the interaction energy should still yield correct renaturation.
Submitted 20 October, 1995;
originally announced October 1995.
-
Freezing Transition of Random Heteropolymers Consisting of an Arbitrary Set of Monomers
Authors:
Vijay S. Pande,
Alexander Yu. Grosberg,
Toyoichi Tanaka
Abstract:
Mean field replica theory is employed to analyze the freezing transition of random heteropolymers comprised of an arbitrary number ($q$) of types of monomers. Our formalism assumes that interactions are short range and heterogeneity comes only from pairwise interactions, which are defined by an arbitrary $q \times q$ matrix. We show that, in general, there exists a freezing transition from a random globule, in which the thermodynamic equilibrium is comprised of an essentially infinite number of polymer conformations, to a frozen globule, in which the equilibrium ensemble is dominated by one or very few conformations. We also examine some special cases of interaction matrices to analyze the relationship between the freezing transition and the nature of interactions involved.
Submitted 1 December, 1994;
originally announced December 1994.