Foundation Models for AI-Enabled Biological Design

Abstract: This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next steps to improve the quality of biological sequence generation.

Submitted 16 May, 2025; originally announced May 2025.

Comments: Published as part of the workshop proceedings at AAAI 2025 in the workshop "Foundation Models for Biological Discoveries"

arXiv:2206.11057 [pdf, other]

Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks

Authors: Anowarul Kabir, Amarda Shehu

Abstract: The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-based representations only, and they yield better performance on superfamily membership across various metrics.

Submitted 17 June, 2022; originally announced June 2022.

Comments: 8 pages, 4 figures, 3 tables

arXiv:2010.01441 [pdf, other]

Decoy Selection for Protein Structure Prediction Via Extreme Gradient Boosting and Ranking

Authors: Nasrin Akhter, Gopinath Chennupati, Hristo Djidjev, Amarda Shehu

Abstract: Identifying one or more biologically active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, all the while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting of mostly low-quality decoys. ML-Select is a useful method for decoy selection. This work suggests further research in finding more effective ways to adopt machine learning frameworks in achieving robust performance for decoy selection in template-free protein structure prediction.

Submitted 3 October, 2020; originally announced October 2020.

Comments: Accepted for BMC Bioinformatics

arXiv:2004.07119 [pdf, other]

Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder

Authors: Xiaojie Guo, Yuanqi Du, Sivani Tadepalli, Liang Zhao, Amarda Shehu

Abstract: Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine functionally relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization with algorithms that optimize a scoring function. Research repeatedly shows that current scoring functions, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally relevant three-dimensional structures of a protein. Though typically deep generative models struggle with highly structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.

Submitted 16 June, 2021; v1 submitted 8 April, 2020; originally announced April 2020.

arXiv:1905.08331 [pdf, other]

ROMEO: A Plug-and-play Software Platform of Robotics-inspired Algorithms for Modeling Biomolecular Structures and Motions

Authors: Kevin Molloy, Erion Plaku, Amarda Shehu

Abstract: Motivation: Due to the central role of protein structure in molecular recognition, great computational efforts are devoted to modeling protein structures and motions that mediate structural rearrangements. The size, dimensionality, and non-linearity of the protein structure space present outstanding challenges. Such challenges also arise in robot motion planning, and robotics-inspired treatments of protein structure and motion are increasingly showing high exploration capability. Encouraged by such findings, we debut here ROMEO, which stands for Robotics prOtein Motion ExplOration framework. ROMEO is an open-source, object-oriented platform that gives researchers access to, and enables reproduction of, published robotics-inspired algorithms for modeling protein structures and motions, and that facilitates novel algorithmic design via its plug-and-play architecture. Availability and implementation: ROMEO is written in C++ and is available in GitLab (https://github.com/). This software is freely available under the Creative Commons license (Attribution and Non-Commercial). Contact: [email protected]

Submitted 20 May, 2019; originally announced May 2019.

Comments: 6 pages, 5 figures

Showing 1–5 of 5 results for author: Shehu, A