-
Graph Neural Networks in Modern AI-aided Drug Discovery
Authors:
Odin Zhang,
Haitao Lin,
Xujun Zhang,
Xiaorui Wang,
Zhenxing Wu,
Qing Ye,
Weibo Zhao,
Jike Wang,
Kejun Ying,
Yu Kang,
Chang-yu Hsieh,
Tingjun Hou
Abstract:
Graph neural networks (GNNs), as topology/structure-aware models within deep learning, have emerged as powerful tools for AI-aided drug discovery (AIDD). By directly operating on molecular graphs, GNNs offer an intuitive and expressive framework for learning the complex topological and geometric features of drug-like molecules, cementing their role in modern molecular modeling. This review provides a comprehensive overview of the methodological foundations and representative applications of GNNs in drug discovery, spanning tasks such as molecular property prediction, virtual screening, molecular generation, biomedical knowledge graph construction, and synthesis planning. Particular attention is given to recent methodological advances, including geometric GNNs, interpretable models, uncertainty quantification, scalable graph architectures, and graph generative frameworks. We also discuss how these models integrate with modern deep learning approaches, such as self-supervised learning, multi-task learning, meta-learning and pre-training. Throughout this review, we highlight the practical challenges and methodological bottlenecks encountered when applying GNNs to real-world drug discovery pipelines, and conclude with a discussion on future directions.
Submitted 7 June, 2025;
originally announced June 2025.
-
AutoLoop: a novel autoregressive deep learning method for protein loop prediction with high accuracy
Authors:
Tianyue Wang,
Xujun Zhang,
Langcheng Wang,
Odin Zhang,
Jike Wang,
Ercheng Wang,
Jialu Wu,
Renling Hu,
Jingxuan Ge,
Shimeng Li,
Qun Su,
Jiajun Yu,
Chang-Yu Hsieh,
Tingjun Hou,
Yu Kang
Abstract:
Protein structure prediction is a critical and longstanding challenge in biology, garnering widespread interest due to its significance in understanding biological processes. A particular area of focus is the prediction of missing loops in proteins, which are vital in determining protein function and activity. To address this challenge, we propose AutoLoop, a novel computational model designed to automatically generate accurate loop backbone conformations that closely resemble their natural structures. AutoLoop employs a bidirectional training approach while merging atom- and residue-level embedding, thus improving robustness and precision. We compared AutoLoop with twelve established methods, including FREAD, NGK, AlphaFold2, and AlphaFold3. AutoLoop consistently outperforms other methods, achieving a median RMSD of 1.12 Angstrom and a 2-Angstrom success rate of 73.23% on the CASP15 dataset, while maintaining strong performance on the HOMSTRAD dataset. It demonstrates the best performance across nearly all loop lengths and secondary structural types. Beyond accuracy, AutoLoop is computationally efficient, requiring only 0.10 s per generation. A post-processing module for side-chain packing and energy minimization further improves results slightly, confirming the reliability of the predicted backbone. A case study also highlights AutoLoop's potential for precise predictions based on dominant loop conformations. These advances hold promise for protein engineering and drug discovery.
Submitted 5 May, 2025;
originally announced May 2025.
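The two headline metrics in the AutoLoop abstract above, median RMSD and the 2-Angstrom success rate, can be illustrated with a minimal sketch. The function names `rmsd` and `summarize`, the strict `<` comparison against the threshold, and the assumption that predicted and reference loop coordinates are already in a common frame are illustrative choices, not details taken from the paper.

```python
import numpy as np

def rmsd(pred, ref):
    """RMSD between two (N, 3) coordinate arrays, in the units of the input.

    Assumes both arrays list the same atoms in the same order and are
    already expressed in a common frame (no superposition is done here).
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Mean over atoms of the squared per-atom displacement, then square root.
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def summarize(rmsds, threshold=2.0):
    """Return (median RMSD, fraction of loops with RMSD below threshold)."""
    rmsds = np.asarray(rmsds, dtype=float)
    return float(np.median(rmsds)), float(np.mean(rmsds < threshold))

# Toy data (hypothetical): every atom displaced by 1 Angstrom along x.
ref = np.zeros((4, 3))
pred = ref + np.array([1.0, 0.0, 0.0])
loop_rmsd = rmsd(pred, ref)

# Hypothetical per-loop RMSD values for three loops.
median_rmsd, success_rate = summarize([0.5, 1.0, 3.0])
```

Whether a benchmark counts a loop at exactly the threshold as a success, and which atoms enter the RMSD, varies between studies, so both should be checked against the paper before comparing numbers.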
-
Elucidating the Design Space of Multimodal Protein Language Models
Authors:
Cheng-Yen Hsieh,
Xinyou Wang,
Daiheng Zhang,
Dongyu Xue,
Fei Ye,
Shujian Huang,
Zaixiang Zheng,
Quanquan Gu
Abstract:
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding ability of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, outperforming 3B baselines and performing on par with specialized folding models. Project page and code: https://bytedance.github.io/dplm/dplm-2.1/.
Submitted 11 June, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings
Authors:
Zitai Kong,
Yiheng Zhu,
Yinlong Xu,
Hanjing Zhou,
Mingzhe Yin,
Jialu Wu,
Hongxia Xu,
Chang-Yu Hsieh,
Tingjun Hou,
Jian Wu
Abstract:
The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from the semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
Submitted 15 April, 2025;
originally announced April 2025.
-
S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning
Authors:
Mingze Yin,
Hanjing Zhou,
Jialu Wu,
Yiheng Zhu,
Yuxuan Zhan,
Zitai Kong,
Hongxia Xu,
Chang-Yu Hsieh,
Jintai Chen,
Tingjun Hou,
Jian Wu
Abstract:
Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown great potential to interpret complex biological structures and functions. However, existing antibody-specific models have a notable limitation: they lack explicit consideration of antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes the Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporating two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody-specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions it to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.
Submitted 20 November, 2024;
originally announced November 2024.
-
How to quantify interaction strengths? A critical rethinking of the interaction Jacobian and evaluation methods for non-parametric inference in time series analysis
Authors:
Takeshi Miki,
Chun-Wei Chang,
Po-Ju Ke,
Arndt Telschow,
Cheng-Han Tsai,
Masayuki Ushio,
Chih-hao Hsieh
Abstract:
Quantifying interaction strengths between state variables in dynamical systems is essential for understanding ecological networks. Within the empirical dynamic modeling approach, multivariate S-map infers the interaction Jacobian from time series data without assuming specific dynamical models. This approach enables the non-parametric statistical inference of interspecific interactions through state space reconstruction. However, deviations in the biological interpretation and numerical implementation of the interaction Jacobian from its mathematical definition pose challenges. We mathematically reintroduce the interaction Jacobian using differential quotients, uncovering two problems: (1) the mismatch between the interaction Jacobian and its biological meaning complicates comparisons between interspecific and intraspecific interactions; (2) the interaction Jacobian is not fully implemented in the parametric Jacobian numerically derived from given parametric models, especially using ordinary differential equations. As a result, model-based evaluations of S-map methods become inappropriate. To address these problems, (1) we propose adjusting the diagonal elements of the interaction Jacobian by subtracting 1 to resolve the comparability problem between inter- and intraspecific interaction strengths. Simulations of population dynamics showed that this adjustment prevents overestimation of intraspecific interaction strengths. (2) We introduce an alternative parametric Jacobian and then the cumulative interaction strength (CIS), providing a more rigorous benchmark for evaluating S-map methods. Furthermore, we demonstrate that the numerical gap between CIS and the existing parametric Jacobian is substantial in realistic scenarios, suggesting CIS as the preferred benchmark. These solutions offer a clearer framework for developing non-parametric approaches in ecological time series analysis.
Submitted 13 November, 2024;
originally announced November 2024.
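The diagonal adjustment proposed in the interaction-Jacobian abstract above (subtracting 1 from the diagonal elements) amounts to subtracting the identity matrix. The sketch below illustrates that single step only. The function name `adjust_interaction_jacobian` and the example matrix are hypothetical, and the comment on why the diagonal carries an extra 1 is one possible reading for a discrete-time map, not a statement from the paper.

```python
import numpy as np

def adjust_interaction_jacobian(jacobian):
    """Subtract 1 from each diagonal element of a square interaction Jacobian.

    Possible reading (an assumption): for a discrete-time map
    x(t+1) = F(x(t)), the diagonal entry dx_i(t+1)/dx_i(t) includes the
    persistence of x_i itself, so removing 1 leaves the net self-effect.
    """
    jacobian = np.asarray(jacobian, dtype=float)
    if jacobian.ndim != 2 or jacobian.shape[0] != jacobian.shape[1]:
        raise ValueError("interaction Jacobian must be a square matrix")
    # Off-diagonal (interspecific) entries are left unchanged.
    return jacobian - np.eye(jacobian.shape[0])

# Hypothetical two-species Jacobian, e.g. as estimated at one time point.
J = np.array([[0.8, -0.2],
              [0.1,  0.9]])
J_adj = adjust_interaction_jacobian(J)
# Diagonal entries become roughly -0.2 and -0.1; off-diagonals are unchanged.
```

In this toy case the unadjusted diagonal entries (0.8 and 0.9) are much larger in magnitude than the off-diagonal ones, while the adjusted values are of comparable size, which is the kind of comparability between intra- and interspecific strengths the abstract describes.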
-
DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries
Authors:
Hanqun Cao,
Mutian He,
Ning Ma,
Chang-yu Hsieh,
Chunbin Gu,
Pheng-Ann Heng
Abstract:
DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL-Ranking, a novel distribution-correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self-training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi-dimensional molecular representations, protein-ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI-driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero-shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.
Submitted 4 December, 2024; v1 submitted 18 October, 2024;
originally announced October 2024.
-
Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries
Authors:
Chunbin Gu,
Mutian He,
Hanqun Cao,
Guangyong Chen,
Chang-yu Hsieh,
Pheng Ann Heng
Abstract:
In the realm of drug discovery, DNA-encoded library (DEL) screening technology has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL libraries have been employed to extract compound features, aiming to denoise the data and uncover potential binders to the desired therapeutic target. Nevertheless, the inherent structure of DEL, constrained by the limited diversity of building blocks, impacts the performance of compound encoders. Moreover, existing methods only capture compound features at a single level, further limiting the effectiveness of the denoising strategy. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capabilities through pretraining and integrates compound features across various scales. We develop pretraining tasks applying contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders' ability to acquire generic features. Furthermore, we propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels, as captured by various compound encoders. The synergy of these innovations equips MPDF with enriched, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF demonstrates superior performance in data processing and analysis for validation tasks. Notably, MPDF offers novel insights into identifying high-affinity molecules, paving the way for improved DEL utility in drug discovery.
Submitted 7 September, 2024;
originally announced September 2024.
-
Discovery of novel antimicrobial peptides with notable antibacterial potency by a LLM-based foundation model
Authors:
Jike Wang,
Jianwen Feng,
Yu Kang,
Peichen Pan,
Jingxuan Ge,
Yan Wang,
Mingyang Wang,
Zhenxing Wu,
Xingcai Zhang,
Jiameng Yu,
Xujun Zhang,
Tianyue Wang,
Lirong Wen,
Guangning Yan,
Yafeng Deng,
Hui Shi,
Chang-Yu Hsieh,
Zhihui Jiang,
Tingjun Hou
Abstract:
Large language models (LLMs) have shown remarkable advancements in chemistry and biomedical research, acting as versatile foundation models for various tasks. We introduce AMP-Designer, an LLM-based approach for swiftly designing novel antimicrobial peptides (AMPs) with desired properties. Within 11 days, AMP-Designer achieved the de novo design of 18 AMPs with broad-spectrum activity against Gram-negative bacteria. In vitro validation revealed a 94.4% success rate, with two candidates demonstrating exceptional antibacterial efficacy, minimal hemotoxicity, stability in human plasma, and low potential to induce resistance, as evidenced by significant bacterial load reduction in murine lung infection experiments. The entire process, from design to validation, concluded in 48 days. AMP-Designer excels in creating AMPs targeting specific strains despite limited data availability, with a top candidate displaying a minimum inhibitory concentration of 2.0 μg/ml against Propionibacterium acnes. Integrating advanced machine learning techniques, AMP-Designer demonstrates remarkable efficiency, paving the way for innovative solutions to antibiotic resistance.
Submitted 2 March, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Token-Mol 1.0: Tokenized drug design with large language model
Authors:
Jike Wang,
Rui Qin,
Mingyang Wang,
Meijing Fang,
Yangyang Zhang,
Yuchen Zhu,
Qun Su,
Qiaolin Gou,
Chao Shen,
Odin Zhang,
Zhenxing Wu,
Dejun Jiang,
Xujun Zhang,
Huifeng Zhao,
Xiaozhe Wan,
Zhourui Wu,
Liwei Liu,
Yu Kang,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.
Submitted 19 August, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Human-level molecular optimization driven by mol-gene evolution
Authors:
Jiebin Fang,
Churu Mao,
Yuchen Zhu,
Xiaoming Chen,
Chang-Yu Hsieh,
Zhongjun Ma
Abstract:
De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.
Submitted 12 June, 2024;
originally announced June 2024.
-
Deep Lead Optimization: Leveraging Generative AI for Structural Modification
Authors:
Odin Zhang,
Haitao Lin,
Hui Zhang,
Huifeng Zhao,
Yufei Huang,
Yuansheng Huang,
Dejun Jiang,
Chang-yu Hsieh,
Peichen Pan,
Tingjun Hou
Abstract:
The idea of using deep-learning-based molecular generation to accelerate discovery of drug candidates has attracted extraordinary attention, and many deep generative models have been developed for automated drug design, termed molecular generation. In general, molecular generation encompasses two main strategies: de novo design, which generates novel molecular structures from scratch, and lead optimization, which refines existing molecules into drug candidates. Among them, lead optimization plays an important role in real-world drug design. For example, it can enable the development of me-better drugs that are chemically distinct yet more effective than the original drugs. It can also facilitate fragment-based drug design, transforming virtual-screened small ligands with low affinity into first-in-class medicines. Despite its importance, automated lead optimization remains underexplored compared to the well-established de novo generative models, due to its reliance on complex biological and chemical knowledge. To bridge this gap, we conduct a systematic review of traditional computational methods for lead optimization, organizing these strategies into four principal sub-tasks with defined inputs and outputs. This review delves into the basic concepts, goals, conventional CADD techniques, and recent advancements in AIDD. Additionally, we introduce a unified perspective based on constrained subgraph generation to harmonize the methodologies of de novo design and lead optimization. Through this lens, de novo design can incorporate strategies from lead optimization to address the challenge of generating hard-to-synthesize molecules; inversely, lead optimization can benefit from the innovations in de novo design by approaching it as a task of generating molecules conditioned on certain substructures.
Submitted 29 April, 2024;
originally announced April 2024.
-
AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation
Authors:
Lijun Liu,
Jiali Yang,
Jianfei Song,
Xinglin Yang,
Lele Niu,
Zeqi Cai,
Hui Shi,
Tingjun Hou,
Chang-yu Hsieh,
Weiran Shen,
Yafeng Deng
Abstract:
Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested to the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. We transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.
Submitted 17 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Deep Geometry Handling and Fragment-wise Molecular 3D Graph Generation
Authors:
Odin Zhang,
Yufei Huang,
Shichen Cheng,
Mengyao Yu,
Xujun Zhang,
Haitao Lin,
Yundian Zeng,
Mingyang Wang,
Zhenxing Wu,
Huifeng Zhao,
Zaixi Zhang,
Chenqing Hua,
Yu Kang,
Sunliang Cui,
Peichen Pan,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.
Submitted 15 March, 2024;
originally announced April 2024.
-
Generative AI for Controllable Protein Sequence Design: A Survey
Authors:
Yiheng Zhu,
Zitai Kong,
Jialu Wu,
Weize Liu,
Yuqiang Han,
Mingze Yin,
Hongxia Xu,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
The design of novel protein sequences with targeted functionalities underpins a central theme in protein engineering, impacting diverse fields such as drug discovery and enzymatic engineering. However, navigating this vast combinatorial search space remains a severe challenge due to time and financial constraints. This scenario is rapidly evolving as the transformative advancements in AI, particularly in the realm of generative models and optimization algorithms, have been propelling the protein design field towards an unprecedented revolution. In this survey, we systematically review recent advances in generative AI for controllable protein sequence design. To set the stage, we first outline the foundational tasks in protein sequence design in terms of the constraints involved and present key generative models and optimization algorithms. We then offer in-depth reviews of each design task and discuss the pertinent applications. Finally, we identify the unresolved challenges and highlight research opportunities that merit deeper exploration.
Submitted 16 February, 2024;
originally announced February 2024.
-
Multi-channel learning for integrating structural hierarchies into context-dependent molecular representation
Authors:
Yue Wan,
Jialu Wu,
Tingjun Hou,
Chang-Yu Hsieh,
Xiaowei Jia
Abstract:
Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, the data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties and conventional molecular featurization schemes, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet, existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationship. This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds them through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.
Submitted 12 January, 2025; v1 submitted 5 November, 2023;
originally announced November 2023.
-
Delete: Deep Lead Optimization Enveloped in Protein Pocket through Unified Deleting Strategies and a Structure-aware Network
Authors:
Haotian Zhang,
Huifeng Zhao,
Xujun Zhang,
Qun Su,
Hongyan Du,
Chao Shen,
Zhe Wang,
Dan Li,
Peichen Pan,
Guangyong Chen,
Yu Kang,
Chang-yu Hsieh,
Tingjun Hou
Abstract:
Drug discovery is a highly complicated process, and it is unfeasible to fully commit it to the recently developed molecular generation methods. Deep learning-based lead optimization takes expert knowledge as a starting point, learning from numerous historical cases about how to modify the structure for better drug-forming properties. However, compared with the more established de novo generation schemes, lead optimization is still an area that requires further exploration. Previously developed models are often limited to resolving one (or few) certain subtask(s) of lead optimization, and most of them can only generate the two-dimensional structures of molecules while disregarding the vital protein-ligand interactions based on the three-dimensional binding poses. To address these challenges, we present a novel tool for lead optimization, named Delete (Deep lead optimization enveloped in protein pocket). Our model can handle all subtasks of lead optimization involving fragment growing, linking, and replacement through a unified deleting (masking) strategy, and is aware of the intricate pocket-ligand interactions through the geometric design of networks. Statistical evaluations and case studies conducted on individual subtasks demonstrate that Delete has a significant ability to produce molecules with superior binding affinities to protein targets and reasonable drug-likeness from given fragments or atoms. This feature may assist medicinal chemists in developing not only me-too/me-better products from existing drugs but also hit-to-lead for first-in-class drugs in a highly efficient manner.
Submitted 4 August, 2023;
originally announced August 2023.
-
Highly accurate and efficient deep learning paradigm for full-atom protein loop modeling with KarmaLoop
Authors:
Tianyue Wang,
Xujun Zhang,
Odin Zhang,
Peichen Pan,
Guangyong Chen,
Yu Kang,
Chang-Yu Hsieh,
Tingjun Hou
Abstract:
Protein loop modeling is the most challenging yet highly non-trivial task in protein structure prediction. Despite recent progress, existing methods including knowledge-based, ab initio, hybrid and deep learning (DL) methods fall significantly short of either atomic accuracy or computational efficiency. Moreover, an overarching focus on backbone atoms has resulted in a dearth of attention given to side-chain conformation, a critical aspect in a host of downstream applications including ligand docking, molecular dynamics simulation and drug design. To overcome these limitations, we present KarmaLoop, a novel paradigm that distinguishes itself as the first DL method centered on full-atom (encompassing both backbone and side-chain heavy atoms) protein loop modeling. Our results demonstrate that KarmaLoop considerably outperforms conventional and DL-based methods of loop modeling in terms of both accuracy and efficiency, with the average RMSD improved by over two-fold compared to the second-best baseline method across different tasks, and manifests at least two orders of magnitude speedup in general. Consequently, our comprehensive evaluations indicate that KarmaLoop provides a state-of-the-art DL solution for protein loop modeling, with the potential to hasten the advancement of protein engineering, antibody-antigen recognition, and drug design.
Submitted 22 June, 2023;
originally announced June 2023.
-
An Equivariant Generative Framework for Molecular Graph-Structure Co-Design
Authors:
Zaixi Zhang,
Qi Liu,
Chee-Kong Lee,
Chang-Yu Hsieh,
Enhong Chen
Abstract:
Designing molecules with desirable physicochemical properties and functionalities is a long-standing challenge in chemistry, material science, and drug discovery. Recently, machine learning-based generative models have emerged as promising approaches for de novo molecule design. However, further refinement of methodology is highly desired as most existing methods lack unified modeling of 2D topology and 3D geometry information and fail to effectively learn the structure-property relationship for molecule design. Here we present MolCode, a roto-translation equivariant generative framework for Molecular graph-structure Co-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure. Extensive experimental results show that MolCode outperforms previous methods on a series of challenging tasks including de novo molecule design, targeted molecule discovery, and structure-based drug design. Particularly, MolCode not only consistently generates valid (99.95% Validity) and diverse (98.75% Uniqueness) molecular graphs/structures with desirable properties, but also generates drug-like molecules with high affinity to target proteins (61.8% high-affinity ratio), which demonstrates MolCode's potential applications in material design and drug discovery. Our extensive investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design, and provides new insights into machine learning-based molecule representation and generation.
Submitted 12 April, 2023;
originally announced April 2023.
-
ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution
Authors:
Lixue Cheng,
Ziyi Yang,
Changyu Hsieh,
Benben Liao,
Shengyu Zhang
Abstract:
Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential of these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and to inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which combines a novel low-dimensional protein encoding strategy with Bayesian optimization enhanced by search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental cost and time cost of directed evolution, and to generalize further as a powerful tool for adaptive experimental design in a broader context.
Submitted 1 May, 2024; v1 submitted 19 May, 2022;
originally announced May 2022.
-
ExBrainable: An Open-Source GUI for CNN-based EEG Decoding and Model Interpretation
Authors:
Ya-Lin Huang,
Chia-Ying Hsieh,
Jian-Xue Huang,
Chun-Shu Wei
Abstract:
We have developed a graphic user interface (GUI), ExBrainable, dedicated to convolutional neural networks (CNN) model training and visualization in electroencephalography (EEG) decoding. Available functions include model training, evaluation, and parameter visualization in terms of temporal and spatial representations. We demonstrate these functions using a well-studied public dataset of motor-imagery EEG and compare the results with existing knowledge of neuroscience. The primary objective of ExBrainable is to provide a fast, simplified, and user-friendly solution of EEG decoding for investigators across disciplines to leverage cutting-edge methods in brain/neuroscience research.
Submitted 10 January, 2022;
originally announced January 2022.
-
SPLDExtraTrees: Robust machine learning approach for predicting kinase inhibitor resistance
Authors:
Ziyi Yang,
Zhaofeng Ye,
Yijia Xiao,
Changyu Hsieh,
Shengyu Zhang
Abstract:
Drug resistance is a major threat to global health and a significant concern throughout the clinical treatment of diseases and drug development. Mutations in proteins that are related to drug binding are a common cause of adaptive drug resistance. Therefore, quantitative estimations of how mutations would affect the interaction between a drug and the target protein would be of vital significance for drug development and clinical practice. Computational methods that rely on molecular dynamics simulations, Rosetta protocols, as well as machine learning methods have been proven to be capable of predicting ligand affinity changes upon protein mutation. However, overfitting and generalization issues induced by the severely limited sample size and heavy noise have impeded wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. In particular, the proposed method ranks training data following a specific scheme that starts with easy-to-learn samples and gradually incorporates harder and diverse samples into the training, and then iterates between sample weight recalculations and model updates. In addition, we calculate additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for this data-limited predictive task. The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios, and show that it achieves predictive accuracy comparable to that of molecular dynamics and Rosetta methods at much lower computational cost.
Submitted 14 January, 2022; v1 submitted 15 November, 2021;
originally announced November 2021.
-
Modeling Protein Using Large-scale Pretrain Language Model
Authors:
Yijia Xiao,
Jiezhong Qiu,
Ziang Li,
Chang-Yu Hsieh,
Jie Tang
Abstract:
Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural network for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in representation. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
Submitted 7 December, 2021; v1 submitted 17 August, 2021;
originally announced August 2021.
-
Reconstructing large networks with time-varying interactions
Authors:
Chun-Wei Chang,
Takeshi Miki,
Masayuki Ushio,
Hsiao-Pei Lu,
Fuh-Kwo Shiah,
Chih-hao Hsieh
Abstract:
Reconstructing interactions from observational data is a critical need for investigating natural biological networks, wherein network dimensionality (i.e. number of interacting components) is usually high and interactions are time-varying. These pose a challenge to existing methods that can quantify only small interaction networks or assume static interactions under steady state. Here, we proposed a novel approach to reconstruct high-dimensional, time-varying interaction networks using empirical time series. This method, named "multiview distance regularized S-map", generalized the state space reconstruction to accommodate high dimensionality and overcome difficulties in quantifying massive interactions with limited data. When we evaluated this method using the time series generated from a large theoretical model involving hundreds of interacting species, estimated interaction strengths were in good agreement with theoretical expectations. As a result, reconstructed networks preserved important topological properties, such as centrality, strength distribution and derived stability measures. Moreover, our method effectively forecasted the dynamic behavior of network nodes. Applying this method to a natural bacterial community helped identify keystone species from the interaction network and revealed the mechanisms governing the dynamical stability of bacterial community. Our method overcame the challenge of high dimensionality and disentangled complex time-varying interactions in large natural dynamical systems.
Submitted 8 February, 2021;
originally announced February 2021.
-
Radius evolution for bubbles with elastic shells
Authors:
S. C. Mancas,
H. C. Rosu,
C. -C. Hsieh
Abstract:
We present an analysis of an extended Rayleigh-Plesset (RP) equation for a three dimensional cell of microorganisms such as bacteria or viruses in some liquid, where the cell membrane in bacteria or the envelope (capsid) in viruses possess elastic properties. To account for rapid changes in the shape configuration of such microorganisms, the bubble membrane/envelope must be rigid to resist large pressures while being flexible to adapt to growth or decay. Such properties are embedded in the RP equation by including a pressure bending term that is proportional to the square of the curvature of the elastic wall. Analytical solutions to this extended equation are obtained in terms of elliptic functions.
Submitted 27 August, 2021; v1 submitted 24 December, 2020;
originally announced December 2020.
-
On Control of Epidemics with Application to COVID-19
Authors:
Chung-Han Hsieh
Abstract:
At the time of writing, the ongoing COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), had already resulted in more than thirty-two million cases infected and more than one million deaths worldwide.
Given that the pandemic is still threatening health and safety, it is urgent to understand the COVID-19 contagion process and know how it might be controlled. With this motivation in mind, in this paper, we consider a version of a stochastic discrete-time Susceptible-Infected-Recovered-Death (SIRD)-based epidemiological model with two uncertainties: the uncertain rate of infected cases which are undetected or asymptomatic, and the uncertain effectiveness rate of control. Our aim is to study the effect of an epidemic control policy on the uncertain model in a control-theoretic framework. We begin by providing the closed-form solutions of states in the modified SIRD-based model such as infected cases, susceptible cases, recovered cases, and deceased cases. Then, the corresponding expected states and the technical lower and upper bounds for those states are provided as well. Subsequently, we consider two epidemic control problems to be addressed: one is the almost sure epidemic control problem and the other is the average epidemic control problem. Having defined the two problems, our main results are a set of sufficient conditions on a class of linear control policies which assure that the epidemic is "well-controlled"; i.e., both the infected cases and the deceased cases are upper bounded uniformly and the number of infected cases converges to zero asymptotically. Our numerical studies, using the historical COVID-19 contagion data in the United States, suggest that our appealingly simple model and control framework can provide a reasonable epidemic control performance compared to the ongoing pandemic situation.
Submitted 2 November, 2020;
originally announced November 2020.
-
Ecosystem-level stabilizing effects of biodiversity via nutrient-diversity feedbacks in multitrophic systems
Authors:
Chun-Wei Chang,
Chih-hao Hsieh,
Takeshi Miki
Abstract:
Statistical averaging and asynchronous population dynamics as portfolio mechanisms are considered the most important processes through which biodiversity contributes to ecosystem stability. However, portfolio theories usually regard biodiversity as a fixed property and overlook the dynamics of biodiversity altered by other ecosystem components. Here, we proposed a new mechanistic food chain model with nutrient-diversity feedback to investigate how the dynamics of phytoplankton species diversity determine ecosystem stability. Our model focuses on nutrient, community biomass of phytoplankton and zooplankton, and phytoplankton species richness. The model assumes diversity effects of phytoplankton on trophic interaction strength along the plankton food chain: phytoplankton diversity influences nutrient uptake by phytoplankton and zooplankton grazing on phytoplankton, which subsequently affects nutrient level and community biomass of phytoplankton and zooplankton. The nutrient level in turn affects phytoplankton diversity. These processes collectively form feedbacks between phytoplankton diversity and the dynamics of plankton and nutrient. More importantly, nutrient-diversity feedback introduced additional temporal variabilities in community biomass, which apparently implies a destabilizing effect of phytoplankton diversity on the ecosystem. However, the variabilities made ecosystems more robust against extinction of plankton, because increasing phytoplankton diversity facilitates resource consumption when consumers are prone to extinction, while reducing diversity weakens destabilizing dynamics caused by over-growth. Our results suggest the presence of a novel stabilizing effect of biodiversity acting through nutrient-diversity feedback, independent of portfolio mechanisms.
Submitted 12 September, 2019;
originally announced September 2019.
-
Minimal model for genome evolution and growth
Authors:
L. C. Hsieh,
L. F. Luo,
F. M. Ji,
H. C. Lee
Abstract:
Textual analysis of typical microbial genomes reveals that they have the statistical characteristics of a DNA sequence of a much shorter length. This peculiar property supports an evolutionary model in which a genome evolves by random mutation but primarily grows by random segmental self-copying. That genomes grew mostly by self-copying is consistent with the observation that repeat sequences in all genomes are widespread and that intragenomic and intergenomic homologous genes are preponderant across all life forms. The model predicts the coexistence of the two competing modes of evolution: the gradual changes of classical Darwinism and the stochastic spurts envisioned in "punctuated equilibrium".
Submitted 11 June, 2002;
originally announced June 2002.
-
Geometric and Statistical Properties of the Mean-Field HP Model, the LS Model and Real Protein Sequences
Authors:
C. T. Shih,
Z. Y. Su,
J. F. Gwan,
B. L. Hao,
C. H. Hsieh,
J. L. Lo,
H. C. Lee
Abstract:
Lattice models, for their coarse-grained nature, are best suited for the study of the "designability problem", the phenomenon in which most of the about 16,000 proteins of known structure have their native conformations concentrated in a relatively small number of about 500 topological classes of conformations. Here it is shown that on a lattice the most highly designable simulated protein structures are those that have the largest number of surface-core switchbacks. A combination of physical, mathematical and biological reasons that causes the phenomenon is given. By comparing the most foldable model peptides with protein sequences in the Protein Data Bank, it is shown that whereas different models may yield similar designabilities, predicted foldable peptides will simulate natural proteins only when the model incorporates the correct physics and biology, in this case if the main folding force arises from the differing hydrophobicity of the residues, but does not originate, say, from the steric hindrance effect caused by the differing sizes of the residues.
Submitted 27 December, 2001; v1 submitted 3 April, 2001;
originally announced April 2001.
-
Mean-Field HP Model, Designability and Alpha-Helices in Protein Structures
Authors:
C. T. Shih,
Z. Y. Su,
J. F. Gwan,
H. C. Lee,
B. L. Hao,
C. H. Hsieh
Abstract:
Analysis of the geometric properties of a mean-field HP model on a square lattice for protein structure shows that structures with a large number of switchbacks between surface and core sites are chosen favorably by peptides as unique ground states. Global comparison of model (binary) peptide sequences with concatenated (binary) protein sequences listed in the Protein Data Bank and the Dali Domain Dictionary indicates that the highest correlation occurs between model peptides choosing the favored structures and those portions of protein sequences containing alpha-helices.
Submitted 16 November, 1999; v1 submitted 14 December, 1998;
originally announced December 1998.