-
Foundation Models for AI-Enabled Biological Design
Authors:
Asher Moldwin,
Amarda Shehu
Abstract:
This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next steps to improve the quality of biological sequence generation.
Submitted 16 May, 2025;
originally announced May 2025.
-
Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks
Authors:
Anowarul Kabir,
Amarda Shehu
Abstract:
The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-only representations, and that they yield better performance on superfamily membership across various metrics.
Submitted 17 June, 2022;
originally announced June 2022.
-
Decoy Selection for Protein Structure Prediction Via Extreme Gradient Boosting and Ranking
Authors:
Nasrin Akhter,
Gopinath Chennupati,
Hristo Djidjev,
Amarda Shehu
Abstract:
Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, a lack of generalization over varied test cases remains a bottleneck for these methods. We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting mostly of low-quality decoys. ML-Select is a useful method for decoy selection. This work suggests further research into more effective ways to adopt machine learning frameworks for robust decoy selection in template-free protein structure prediction.
Submitted 3 October, 2020;
originally announced October 2020.
-
Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder
Authors:
Xiaojie Guo,
Yuanqi Du,
Sivani Tadepalli,
Liang Zhao,
Amarda Shehu
Abstract:
Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine functionally-relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization with algorithms that optimize a scoring function. Research repeatedly shows that current scoring functions, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally-relevant three-dimensional structures of a protein. Though deep generative models typically struggle with highly-structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.
Submitted 16 June, 2021; v1 submitted 8 April, 2020;
originally announced April 2020.
-
ROMEO: A Plug-and-play Software Platform of Robotics-inspired Algorithms for Modeling Biomolecular Structures and Motions
Authors:
Kevin Molloy,
Erion Plaku,
Amarda Shehu
Abstract:
Motivation: Due to the central role of protein structure in molecular recognition, great computational efforts are devoted to modeling protein structures and the motions that mediate structural rearrangements. The size, dimensionality, and non-linearity of the protein structure space present outstanding challenges. Such challenges also arise in robot motion planning, and robotics-inspired treatments of protein structure and motion are increasingly showing high exploration capability. Encouraged by such findings, we debut here ROMEO, which stands for Robotics prOtein Motion ExplOration framework. ROMEO is an open-source, object-oriented platform that gives researchers access to published robotics-inspired algorithms for modeling protein structures and motions, supports their reproducibility, and facilitates novel algorithmic design via its plug-and-play architecture.
Availability and implementation: ROMEO is written in C++ and is available in GitLab (https://github.com/). This software is freely available under the Creative Commons license (Attribution and Non-Commercial).
Contact: [email protected]
Submitted 20 May, 2019;
originally announced May 2019.