Showing 1–2 of 2 results for author: Clark, J D
-
Two for the Price of One: Integrating Large Language Models to Learn Biophysical Interactions
Authors:
Joseph D. Clark,
Tanner J. Dean,
Diwakar Shukla
Abstract:
Deep learning models have become fundamental tools in drug design. In particular, large language models trained on biochemical sequences learn feature vectors that guide drug discovery through virtual screening. However, such models do not capture the molecular interactions important for binding affinity and specificity. Therefore, there is a need to 'compose' representations from distinct biological modalities to effectively represent molecular complexes. We present an overview of the methods to combine molecular representations and propose that future work should balance computational efficiency and expressiveness. Specifically, we argue that improvements in both speed and accuracy are possible by learning to merge the representations from internal layers of domain-specific biological language models. We demonstrate that 'composing' biochemical language models performs similarly to or better than standard methods for representing molecular interactions despite having significantly fewer features. Finally, we discuss recent methods, and their shortcomings, for interpreting and democratizing large language models that could aid the development of interaction-aware foundation models for biology.
Submitted 26 March, 2025;
originally announced March 2025.
-
Substrate Prediction for RiPP Biosynthetic Enzymes via Masked Language Modeling and Transfer Learning
Authors:
Joseph D. Clark,
Xuenan Mi,
Douglas A. Mitchell,
Diwakar Shukla
Abstract:
Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting such peptide fitness landscapes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream classification models of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved the performance of classification models of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. Our transfer learning method improved performance and data efficiency in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
Submitted 23 February, 2024;
originally announced February 2024.