-
An evaluation of unconditional 3D molecular generation methods
Authors:
Martin Buttenschoen,
Yael Ziv,
Garrett M. Morris,
Charlotte M. Deane
Abstract:
Unconditional molecular generation is a stepping stone for conditional molecular generation, which is important in de novo drug design. Recent unconditional 3D molecular generation methods report saturated benchmarks, suggesting it is time to re-evaluate our benchmarks and compare the latest models. We assess five recent high-performing 3D molecular generation methods (EQGAT-diff, FlowMol, GCDM, GeoLDM, and SemlaFlow), in terms of both standard benchmarks and chemical and physical validity. Overall, the best method, SemlaFlow, has a success rate of 87% in generating valid, unique, and novel molecules without post-processing and 92.4% with post-processing.
Submitted 1 May, 2025;
originally announced May 2025.
-
Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design
Authors:
Leo Klarner,
Tim G. J. Rudner,
Garrett M. Morris,
Charlotte M. Deane,
Yee Whye Teh
Abstract:
Generative models have the potential to accelerate key steps in the discovery of novel molecular therapeutics and materials. Diffusion models have recently emerged as a powerful approach, excelling at unconditional sample generation and, with data-driven guidance, conditional generation within their training domain. Reliably sampling from high-value regions beyond the training data, however, remains an open challenge -- with current methods predominantly focusing on modifying the diffusion process itself. In this paper, we develop context-guided diffusion (CGD), a simple plug-and-play method that leverages unlabeled data and smoothness constraints to improve the out-of-distribution generalization of guided diffusion models. We demonstrate that this approach leads to substantial performance gains across various settings, including continuous, discrete, and graph-structured diffusion processes with applications across drug discovery, materials science, and protein design.
Submitted 16 July, 2024;
originally announced July 2024.
-
Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints
Authors:
Markus Dablander,
Thierry Hanser,
Renaud Lambiotte,
Garrett M. Morris
Abstract:
Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure-selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
Submitted 10 March, 2024;
originally announced March 2024.
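The Sort & Slice procedure described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each compound has already been reduced to a set of integer ECFP substructure identifiers (for example by a cheminformatics toolkit), and the function name `fit_sort_and_slice`, the tie-breaking rule, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def fit_sort_and_slice(
    train_substructures: Iterable[Set[int]], length: int
) -> Callable[[Set[int]], List[int]]:
    """Return a featuriser mapping a set of substructure IDs to `length` bits."""
    # Count in how many training compounds each substructure occurs.
    counts: Counter = Counter()
    for subs in train_substructures:
        counts.update(set(subs))

    # Sort: most prevalent first. Ties are broken by identifier, which is
    # an assumption made here only to keep the result deterministic.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))

    # Slice: keep only the `length` most frequent substructures, one bit each,
    # so no two retained substructures can collide on the same bit.
    index = {s: i for i, s in enumerate(ranked[:length])}

    def featurise(subs: Set[int]) -> List[int]:
        bits = [0] * length
        for s in subs:
            i = index.get(s)
            if i is not None:  # substructures outside the top `length` are dropped
                bits[i] = 1
        return bits

    return featurise


# Toy training set: three compounds with hypothetical substructure IDs.
train = [{10, 20, 30}, {10, 20}, {10, 40}]
featurise = fit_sort_and_slice(train, length=2)
print(featurise({20, 40, 99}))
```

Because every retained substructure gets its own bit, the fingerprint is free of bit collisions by construction, whereas hash-based folding can map several substructures to the same position. The price is that the mapping depends on the training set, so it has to be fitted before any compound is featurised.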
-
PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences
Authors:
Martin Buttenschoen,
Garrett M. Morris,
Charlotte M. Deane
Abstract:
The last few years have seen the development of numerous deep learning-based protein-ligand docking methods. They offer huge promise in terms of speed and accuracy. However, despite claims of state-of-the-art performance in terms of crystallographic root-mean-square deviation (RMSD), upon closer inspection, it has become apparent that they often produce physically implausible molecular structures. It is therefore not sufficient to evaluate these methods solely by RMSD to a native binding mode. It is vital, particularly for deep learning-based methods, that they are also evaluated on steric and energetic criteria. We present PoseBusters, a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. Only methods that both pass these checks and predict native-like binding modes should be classed as having "state-of-the-art" performance. We use PoseBusters to compare five deep learning-based docking methods (DeepDock, DiffDock, EquiBind, TankBind, and Uni-Mol) and two well-established standard docking methods (AutoDock Vina and CCDC Gold) with and without an additional post-prediction energy minimisation step using a molecular mechanics force field. We show that both in terms of physical plausibility and the ability to generalise to examples that are distinct from the training data, no deep learning-based method yet outperforms classical docking tools. In addition, we find that molecular mechanics force fields contain docking-relevant physics missing from deep-learning methods. PoseBusters allows practitioners to assess docking and molecular generation methods and may inspire new inductive biases still required to improve deep learning-based methods, which will help drive the development of more accurate and more realistic predictions.
Submitted 28 November, 2023; v1 submitted 10 August, 2023;
originally announced August 2023.
-
Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions
Authors:
Leo Klarner,
Tim G. J. Rudner,
Michael Reutlinger,
Torsten Schindler,
Garrett M. Morris,
Charlotte Deane,
Yee Whye Teh
Abstract:
Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift, a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.
Submitted 14 July, 2023;
originally announced July 2023.
-
Exploring QSAR Models for Activity-Cliff Prediction
Authors:
Markus Dablander,
Thierry Hanser,
Renaud Lambiotte,
Garrett M. Morris
Abstract:
Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
Submitted 31 January, 2023;
originally announced January 2023.
-
The prospects of quantum computing in computational molecular biology
Authors:
Carlos Outeiral,
Martin Strahm,
Jiye Shi,
Garrett M. Morris,
Simon C. Benjamin,
Charlotte M. Deane
Abstract:
Quantum computers can in principle solve certain problems exponentially more quickly than their classical counterparts. We have not yet reached the advent of useful quantum computation, but when we do, it will affect nearly all scientific disciplines. In this review, we examine how current quantum algorithms could revolutionize computational biology and bioinformatics. There are potential benefits across the entire field, from the ability to process vast amounts of information and run machine learning algorithms far more efficiently, to algorithms for quantum simulation that are poised to improve computational calculations in drug discovery, to quantum algorithms for optimization that may advance fields from protein structure prediction to network analysis. However, these exciting prospects are susceptible to "hype", and it is also important to recognize the caveats and challenges in this new technology. Our aim is to introduce the promise and limitations of emerging quantum computing technologies in the areas of computational molecular biology and bioinformatics.
Submitted 26 May, 2020;
originally announced May 2020.
-
Investigating the potential for a limited quantum speedup on protein lattice problems
Authors:
Carlos Outeiral,
Garrett M. Morris,
Jiye Shi,
Martin Strahm,
Simon C. Benjamin,
Charlotte M. Deane
Abstract:
Protein folding is a central challenge in computational biology, with important applications in molecular biology, drug discovery and catalyst design. As a hard combinatorial optimisation problem, it has been studied as a potential target problem for quantum annealing. Although several experimental implementations have been discussed in the literature, the computational scaling of these approaches has not been elucidated. In this article, we present a numerical study of quantum annealing applied to a large number of small peptide folding problems, aiming to infer useful insights for near-term applications. We present two conclusions: that even naive quantum annealing, when applied to protein lattice folding, has the potential to outperform classical approaches, and that careful engineering of the Hamiltonians and schedules involved can deliver notable relative improvements for this problem. Overall, our results suggest that quantum algorithms may well offer improvements for problems in the protein folding and structure prediction realm.
Submitted 18 May, 2021; v1 submitted 2 April, 2020;
originally announced April 2020.
-
Nature of weak magnetism in SrTiO3/LaAlO3 multilayers
Authors:
Z. Salman,
O. Ofer,
M. Radovic,
H. Hao,
K. H. Chow,
M. D. Hossain,
C. D. P. Levy,
W. A. MacFarlane,
G. M. Morris,
L. Patthey,
M. R. Pearson,
H. Saadaoui,
T. Schmitt,
D. Wang,
R. F. Kiefl
Abstract:
We report the observation of weak magnetism in superlattices of LaAlO3/SrTiO3 using beta-detected nuclear magnetic resonance. The spin-lattice relaxation rate of ⁸Li in superlattices with spacer layers of 8 and 6 unit cells of LaAlO3 exhibits a strong peak near ~35 K, whereas no such peak is observed in a superlattice with a spacer layer thickness of 3 unit cells. We attribute the observed temperature dependence to slowing down of weakly coupled electronic moments at the LaAlO3/SrTiO3 interface. These results show that the magnetism at the interface depends strongly on the thickness of the spacer layer, and that a minimal thickness of ~4-6 unit cells is required for the appearance of magnetism. A simple model is used to determine that the observed relaxation is due to small fluctuating moments (~0.002 μB) in the two samples with a larger LaAlO3 spacer thickness.
Submitted 21 November, 2012;
originally announced November 2012.