-
HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations
Authors:
Derek Jones,
Jonathan E. Allen,
Xiaohua Zhang,
Behnam Khaleghi,
Jaeyoung Kang,
Weihong Xu,
Niema Moshiri,
Tajana S. Rosing
Abstract:
Publicly available collections of drug-like molecules have grown to comprise tens of billions of possibilities in recent years due to advances in chemical synthesis. Traditional methods for identifying "hit" molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. A major drawback of these approaches is that they require exceptional computing capabilities even for relatively small collections of molecules.
Hyperdimensional Computing (HDC) is a recently proposed learning paradigm that leverages low-precision binary vector arithmetic to build efficient representations of the data, without the gradient-based optimization required by many conventional machine learning and deep learning approaches. This algorithmic simplicity allows for hardware acceleration, which has previously been demonstrated for a range of application areas. We consider existing HDC approaches for molecular property classification and introduce two novel encoding algorithms that leverage the extended connectivity fingerprint (ECFP) algorithm.
We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods and achieve an acceleration of nearly 9 orders of magnitude as compared to inference with molecular docking. We demonstrate multiple approaches for the encoding of molecular data for HDC and examine their relative performance on a range of challenging molecular property prediction and drug-protein binding classification tasks. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools.
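The general HDC idea described in this abstract can be illustrated with a minimal sketch. It is not the encoding proposed in the paper: it assumes each molecule is already given as a set of "on" fingerprint bit indices (for example from an ECFP-style fingerprint), and the dimensionality `DIM`, the helper names, and the toy data are all hypothetical. Each bit index is mapped to a pseudo-random binary hypervector, vectors are bundled by majority vote, and classification picks the class prototype with the smallest Hamming distance, so no gradient-based training is involved.

```python
import random

DIM = 2048  # hypervector width; an assumed value, not taken from the paper


def item_vector(bit_index, dim=DIM):
    """Pseudo-random binary hypervector tied to one fingerprint bit index."""
    rng = random.Random(bit_index)  # seeded so the mapping is reproducible
    return [rng.getrandbits(1) for _ in range(dim)]


def bundle(vectors):
    """Element-wise majority vote over binary vectors (ties resolve to 0)."""
    half = len(vectors) / 2
    return [1 if sum(column) > half else 0 for column in zip(*vectors)]


def encode(on_bits, dim=DIM):
    """Encode a molecule, given as a set of 'on' fingerprint bit indices."""
    return bundle([item_vector(b, dim) for b in sorted(on_bits)])


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def train(examples):
    """Build one prototype per class by bundling its encoded molecules."""
    by_label = {}
    for on_bits, label in examples:
        by_label.setdefault(label, []).append(encode(on_bits))
    return {label: bundle(vectors) for label, vectors in by_label.items()}


def predict(prototypes, on_bits):
    """Return the label whose prototype is nearest in Hamming distance."""
    query = encode(on_bits)
    return min(prototypes, key=lambda label: hamming(prototypes[label], query))


# Toy, made-up fingerprints: the two classes use disjoint bit indices.
examples = [
    ({1, 2, 3, 4, 5}, "active"),
    ({1, 2, 3, 4, 6}, "active"),
    ({1, 2, 3, 5, 6}, "active"),
    ({100, 101, 102, 103, 104}, "inactive"),
    ({100, 101, 102, 103, 105}, "inactive"),
    ({100, 101, 102, 104, 105}, "inactive"),
]
prototypes = train(examples)
print(predict(prototypes, {1, 2, 3, 4, 7}))
```

Training here is a single pass of counting and thresholding, which is the property that makes this family of methods amenable to low-precision hardware.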
Submitted 27 March, 2023;
originally announced March 2023.
-
High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models
Authors:
Garrett A. Stevenson,
Derek Jones,
Hyojin Kim,
W. F. Drew Bennett,
Brian J. Bennion,
Monica Borucki,
Feliza Bourguet,
Aidan Epstein,
Magdalena Franco,
Brooke Harmon,
Stewart He,
Max P. Katz,
Daniel Kirshner,
Victoria Lao,
Edmond Y. Lau,
Jacky Lo,
Kevin McLoughlin,
Richard Mosesso,
Deepa K. Murugesh,
Oscar A. Negrete,
Edwin A. Saada,
Brent Segelke,
Maxwell Stefan,
Marisa W. Torres,
Dina Weilhammer, et al. (7 additional authors not shown)
Abstract:
Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhancements to Deep Fusion were made in order to evaluate more than 5 billion docked poses on SARS-CoV-2 protein targets. First, the Deep Fusion concept was refined by formulating the architecture as a single, coherently backpropagated model (Coherent Fusion) to improve binding-affinity prediction accuracy. Second, the model was trained using a distributed, genetic hyper-parameter optimization. Finally, a scalable, high-throughput screening capability was developed to maximize the number of ligands evaluated and expedite the path to experimental evaluation. In this work, we present both the methods developed for machine learning-based high-throughput screening and the results from using our computational pipeline to find SARS-CoV-2 inhibitors.
Submitted 31 May, 2021; v1 submitted 9 April, 2021;
originally announced April 2021.
-
Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference
Authors:
Derek Jones,
Hyojin Kim,
Xiaohua Zhang,
Adam Zemla,
Garrett Stevenson,
William D. Bennett,
Dan Kirshner,
Sergio Wong,
Felice Lightstone,
Jonathan E. Allen
Abstract:
Accurately predicting protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite recent advances in deep convolutional and graph neural network-based approaches, model performance depends on the input data representation, and each representation suffers from distinct limitations. It is natural to combine complementary features and their inference from the individual models for better predictions. We present fusion models that benefit from the different feature representations of two neural network models to improve binding affinity prediction. We demonstrate the effectiveness of the proposed approach by performing experiments with the PDBBind 2016 dataset and its docking pose complexes. The results show that the proposed approach improves the overall prediction compared to the individual neural network models, with greater computational efficiency than related biophysics-based energy scoring functions. We also discuss the benefit of the proposed fusion inference with several example complexes. The software is made available as open source at https://github.com/llnl/fast.
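The intuition behind combining the inference of two models can be shown with a very small sketch. This is not the fusion architecture from the paper: it swaps in a plain weighted average of two predictors' outputs (simple late fusion), and the function names, the weight, and all numbers are made up for illustration. When the two models err in different directions, the averaged prediction can have a lower error than either model alone.

```python
def late_fusion(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions (simple late fusion)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def mean_absolute_error(y_true, y_pred):
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up binding affinities and predictions from two hypothetical models.
y_true = [5.0, 6.0, 7.0]
model_a = [5.4, 5.8, 7.6]  # tends to miss in one direction
model_b = [4.8, 6.4, 6.6]  # tends to miss in the other
fused = late_fusion(model_a, model_b)

for name, preds in [("A", model_a), ("B", model_b), ("fused", fused)]:
    print(name, round(mean_absolute_error(y_true, preds), 3))
```

A fixed average is the simplest possible choice; a learned fusion layer, as the abstract describes, can instead weight intermediate features from each model.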
Submitted 17 May, 2020;
originally announced May 2020.
-
Machine Learning Models to Predict Inhibition of the Bile Salt Export Pump
Authors:
Kevin S. McLoughlin,
Claire G. Jeong,
Thomas D. Sweitzer,
Amanda J. Minnich,
Margaret J. Tse,
Brian J. Bennion,
Jonathan E. Allen,
Stacie Calad-Thomson,
Thomas S. Rush,
James M. Brase
Abstract:
Drug-induced liver injury (DILI) is the most common cause of acute liver failure and a frequent reason for withdrawal of candidate drugs during preclinical and clinical testing. An important type of DILI is cholestatic liver injury, caused by buildup of bile salts within hepatocytes; it is frequently associated with inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients. Unfortunately, models published to date have been insufficiently accurate to encourage wide adoption. We report our development of classification and regression models for BSEP inhibition with substantially improved performance over previously published models. Our model development leveraged the ATOM Modeling PipeLine (AMPL) developed by the ATOM Consortium, which enabled us to train and evaluate thousands of candidate models. In the course of model development, we assessed a variety of schemes for chemical featurization, dataset partitioning and class labeling, and identified those producing models that generalized best to novel chemical entities. Our best performing classification model was a neural network with ROC AUC = 0.88 on our internal test dataset and 0.89 on an independent external compound set. Our best regression model, the first ever reported for predicting BSEP IC50s, yielded a test set $R^2 = 0.56$ and mean absolute error 0.37, corresponding to a mean 2.3-fold error in predicted IC50s, comparable to experimental variation. These models will thus be useful as inputs to mechanistic predictions of DILI and as part of computational pipelines for drug discovery.
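The link between the reported mean absolute error and the "2.3-fold" figure can be checked with a one-line calculation. This sketch assumes the error is measured on a log10 scale of IC50 (the abstract does not state the units explicitly); the function name is hypothetical.

```python
def fold_error(mae_log10):
    """Convert a mean absolute error in log10 units into a fold error."""
    return 10 ** mae_log10


print(round(fold_error(0.37), 1))  # 2.3
```

Under that assumption, an error of 0.37 log units means predictions are off by a factor of about 2.3 on average, in either direction.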
Submitted 27 February, 2020;
originally announced February 2020.
-
AMPL: A Data-Driven Modeling Pipeline for Drug Discovery
Authors:
Amanda J. Minnich,
Kevin McLoughlin,
Margaret Tse,
Jason Deng,
Andrew Weber,
Neha Murad,
Benjamin D. Madej,
Bharath Ramsundar,
Tom Rush,
Stacie Calad-Thomson,
Jim Brase,
Jonathan E. Allen
Abstract:
One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of machine learning and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical datasets covering a wide range of parameters. As a result of these comprehensive experiments, we have found that physicochemical descriptors and deep learning-based graph representations significantly outperform traditional fingerprints in the characterization of molecular features. We have also found that dataset size is directly correlated with prediction performance, and that single-task deep learning models only outperform shallow learners if there is sufficient data. Likewise, dataset size has a direct impact on model predictivity, independent of comprehensive hyperparameter model tuning. Our findings point to the need for public dataset integration or multi-task/transfer learning approaches. Lastly, we found that uncertainty quantification (UQ) analysis may help identify model error; however, the efficacy of UQ for filtering predictions varies considerably between datasets and featurization/model types. AMPL is open source and available for download at http://github.com/ATOMconsortium/AMPL.
Submitted 13 November, 2019; v1 submitted 12 November, 2019;
originally announced November 2019.