-
Generalists vs. Specialists: Evaluating LLMs on Highly-Constrained Biophysical Sequence Optimization Tasks
Authors:
Angelica Chen,
Samuel D. Stanton,
Frances Ding,
Robert G. Alberstein,
Andrew M. Watkins,
Richard Bonneau,
Vladimir Gligorijević,
Kyunghyun Cho,
Nathan C. Frey
Abstract:
Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand, specialized solvers like LaMBO-2 offer efficiency and fine-grained control but require more domain expertise. Comparing these approaches is challenging due to expensive laboratory validation and inadequate synthetic benchmarks. We address this by introducing Ehrlich functions, a synthetic test suite that captures the geometric structure of biophysical sequence optimization problems. With prompting alone, off-the-shelf LLMs struggle to optimize Ehrlich functions. In response, we propose LLOME (Language Model Optimization with Margin Expectation), a bilevel optimization routine for online black-box optimization. When combined with a novel preference learning loss, we find LLOME can not only learn to solve some Ehrlich functions, but can even outperform LaMBO-2 on moderately difficult Ehrlich variants. However, LLOME is comparable to LaMBO-2 on very easy or difficult variants, exhibits some likelihood-reward miscalibration, and struggles without explicit rewards. Our results indicate LLMs can provide significant benefits in some cases, but specialized solvers are still competitive and incur less overhead.
Submitted 2 April, 2025; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design
Authors:
Nataša Tagasovska,
Ji Won Park,
Matthieu Kirchmeyer,
Nathan C. Frey,
Andrew Martin Watkins,
Aya Abdelsalam Ismail,
Arian Rokkum Jamasb,
Edith Lee,
Tyler Bryson,
Stephen Ra,
Kyunghyun Cho
Abstract:
Machine learning (ML) has demonstrated significant promise in accelerating drug design. Active ML-guided optimization of therapeutic molecules typically relies on a surrogate model predicting the target property of interest. The model predictions are used to determine which designs to evaluate in the lab, and the model is updated on the new measurements to inform the next cycle of decisions. A key challenge is that the experimental feedback from each cycle inspires changes in the candidate proposal or experimental protocol for the next cycle, which lead to distribution shifts. To promote robustness to these shifts, we must account for them explicitly in the model training. We apply domain generalization (DG) methods to classify the stability of interactions between an antibody and antigen across five domains defined by design cycles. Our results suggest that foundational models and ensembling improve predictive performance on out-of-distribution domains. We publicly release our codebase extending the DG benchmark "DomainBed," and the associated dataset of antibody sequences and structures emulating distribution shifts across design cycles.
Submitted 15 July, 2024;
originally announced July 2024.
-
NEBULA: Neural Empirical Bayes Under Latent Representations for Efficient and Controllable Design of Molecular Libraries
Authors:
Ewa M. Nowara,
Pedro O. Pinheiro,
Sai Pooja Mahajan,
Omar Mahmood,
Andrew Martin Watkins,
Saeed Saremi,
Michael Maser
Abstract:
We present NEBULA, the first latent 3D generative model for scalable generation of large molecular libraries around a seed compound of interest. Such libraries are crucial for scientific discovery, but it remains challenging to generate large numbers of high quality samples efficiently. 3D-voxel-based methods have recently shown great promise for generating high quality samples de novo from random noise (Pinheiro et al., 2023). However, sampling in 3D-voxel space is computationally expensive and use in library generation is prohibitively slow. Here, we instead perform neural empirical Bayes sampling (Saremi & Hyvarinen, 2019) in the learned latent space of a vector-quantized variational autoencoder. NEBULA generates large molecular libraries nearly an order of magnitude faster than existing methods without sacrificing sample quality. Moreover, NEBULA generalizes better to unseen drug-like molecules, as demonstrated on two public datasets and multiple recently released drugs. We expect the approach herein to be highly enabling for machine learning-based drug discovery. The code is available at https://github.com/prescient-design/nebula
Submitted 3 July, 2024;
originally announced July 2024.
-
OpenProteinSet: Training data for structural biology at scale
Authors:
Gustaf Ahdritz,
Nazim Bouatta,
Sachin Kadyan,
Lukas Jarosch,
Daniel Berenberg,
Ian Fisk,
Andrew M. Watkins,
Stephen Ra,
Richard Bonneau,
Mohammed AlQuraishi
Abstract:
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Submitted 10 August, 2023;
originally announced August 2023.
-
MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models
Authors:
Michael Maser,
Natasa Tagasovska,
Jae Hyeon Lee,
Andrew Watkins
Abstract:
Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach to mitigate the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainties that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call MoleCLUEs, or (molecular) counterfactual latent uncertainty explanations (Antorán et al., 2020). We assess our algorithm for the task of predicting drug properties from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimizations, which provide insight into the sources of uncertainty in SBML.
Submitted 6 November, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
3D molecule generation by denoising voxel grids
Authors:
Pedro O. Pinheiro,
Joshua Rackers,
Joseph Kleinhenz,
Michael Maser,
Omar Mahmood,
Andrew Martin Watkins,
Stephen Ra,
Vishnu Sresht,
Saeed Saremi
Abstract:
We propose a new score-based approach to generate 3D molecules represented as atomic densities on regular grids. First, we train a denoising neural network that learns to map from a smooth distribution of noisy molecules to the distribution of real molecules. Then, we follow the neural empirical Bayes framework (Saremi and Hyvarinen, 2019) and generate molecules in two steps: (i) sample noisy density grids from a smooth distribution via underdamped Langevin Markov chain Monte Carlo, and (ii) recover the "clean" molecule by denoising the noisy grid with a single step. Our method, VoxMol, generates molecules in a fundamentally different way than the current state of the art (i.e., diffusion models applied to atom point clouds). It differs in terms of the data representation, the noise model, the network architecture and the generative modeling algorithm. Our experiments show that VoxMol captures the distribution of drug-like molecules better than the state of the art, while being faster to generate samples.
Submitted 8 March, 2024; v1 submitted 12 June, 2023;
originally announced June 2023.
-
PropertyDAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design
Authors:
Ji Won Park,
Samuel Stanton,
Saeed Saremi,
Andrew Watkins,
Henri Dwyer,
Vladimir Gligorijevic,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified partial ordering in the objectives. For instance, when designing antibodies, we would like to maximize the binding affinity to a target antigen only if it can be expressed in live cell culture -- modeling the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each property is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of the traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression → affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, a toy numerical problem, and a real-world antibody design task.
Submitted 8 October, 2022;
originally announced October 2022.
-
Multi-segment preserving sampling for deep manifold sampler
Authors:
Daniel Berenberg,
Jae Hyeon Lee,
Simon Kelow,
Ji Won Park,
Andrew Watkins,
Vladimir Gligorijević,
Richard Bonneau,
Stephen Ra,
Kyunghyun Cho
Abstract:
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
Submitted 9 May, 2022;
originally announced May 2022.
-
Deep learning models for predicting RNA degradation via dual crowdsourcing
Authors:
Hannah K. Wayment-Steele,
Wipapat Kladwang,
Andrew M. Watkins,
Do Soon Kim,
Bojan Tunguz,
Walter Reade,
Maggie Demkin,
Jonathan Romano,
Roger Wellington-Oguri,
John J. Nicol,
Jiayang Gao,
Kazuki Onodera,
Kazuki Fujikawa,
Hanfei Mao,
Gilles Vandewiele,
Michele Tinti,
Bram Steenwinckel,
Takuya Ito,
Taiga Noumi,
Shujun He,
Keiichiro Ishi,
Youhan Lee,
Fatih Öztürk,
Anthony Chiu,
Emin Öztürk
, et al. (4 additional authors not shown)
Abstract:
Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been constrained by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 diverse RNA constructs of 102-130 nucleotides that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
Submitted 22 April, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.