-
Higher-order evolutionary dynamics with game transitions
Authors:
Yi-Duo Chen,
Zhi-Xi Wu,
Jian-Yue Guan
Abstract:
Higher-order interactions are prevalent in real-world complex systems and exert unique influences on system evolution that cannot be captured by pairwise interactions. We incorporate game transitions into the higher-order prisoner's dilemma game model, where these transitions consistently promote cooperation. Moreover, in systems with game transitions, the fraction of higher-order interactions has a dual impact, either enhancing the emergence and persistence of cooperation or facilitating invasions that promote defection within an otherwise cooperative system.
Submitted 16 April, 2025;
originally announced April 2025.
-
ViralQC: A Tool for Assessing Completeness and Contamination of Predicted Viral Contigs
Authors:
Cheng Peng,
Jiayu Shang,
Jiaojiao Guan,
Yanni Sun
Abstract:
Motivation: Viruses represent the most abundant biological entities on the planet and play vital roles in diverse ecosystems. Cataloging viruses across various environments is essential for understanding their properties and functions. Metagenomic sequencing has emerged as the most comprehensive method for virus discovery, enabling the sequencing of all genetic materials, including viruses, from host or environmental samples. However, distinguishing viral sequences from the vast background of cellular organism-derived reads in metagenomic data remains a significant challenge. While several learning-based tools, such as VirSorter2 and geNomad, have shown promise in identifying viral contigs, they often experience varying degrees of false positive rates due to noise in sequencing and assembly, shared genes between viruses and their hosts, and the formation of proviruses within host genomes. This highlights the urgent need for an accurate and efficient method to evaluate the quality of viral contigs. Results: To address these challenges, we introduce ViralQC, a tool designed to assess the quality of reported viral contigs or bins. ViralQC identifies contamination regions within putative viral sequences using foundation models trained on viral and cellular genomes and estimates viral completeness through protein organization alignment. We evaluate ViralQC on multiple datasets and compare its performance against CheckV, the state-of-the-art in virus quality assessment. Notably, ViralQC correctly identifies 38% more contamination than CheckV, while maintaining a median absolute error of only 3%. In addition, ViralQC delivers more accurate results for medium- to high-quality (>50% completeness) contigs, demonstrating its superior performance in completeness estimation.
Submitted 8 April, 2025;
originally announced April 2025.
-
Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows
Authors:
Xiangxin Zhou,
Yi Xiao,
Haowei Lin,
Xinheng He,
Jiaqi Guan,
Yang Wang,
Qiang Liu,
Feng Zhou,
Liang Wang,
Jianzhu Ma
Abstract:
The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
Submitted 5 March, 2025;
originally announced March 2025.
-
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search
Authors:
Fuchuan Qu,
Cheng Peng,
Jiaojiao Guan,
Donglin Wang,
Yanni Sun,
Jiayu Shang
Abstract:
Motivation: Nucleocytoplasmic large DNA viruses (NCLDVs) are notable for their large genomes and extensive gene repertoires, which contribute to their widespread environmental presence and critical roles in processes such as host metabolic reprogramming and nutrient cycling. Metagenomic sequencing has emerged as a powerful tool for uncovering novel NCLDVs in environmental samples. However, identifying NCLDV sequences in metagenomic data remains challenging due to their high genomic diversity, limited reference genomes, and shared regions with other microbes. Existing alignment-based and machine learning methods struggle to achieve optimal trade-offs between sensitivity and precision. Results: In this work, we present GiantHunter, a reinforcement learning-based tool for identifying NCLDVs from metagenomic data. By employing a Monte Carlo tree search strategy, GiantHunter dynamically selects representative non-NCLDV sequences as the negative training data, enabling the model to establish a robust decision boundary. Benchmarking on rigorously designed experiments shows that GiantHunter achieves high precision while maintaining competitive sensitivity, improving the F1-score by 10% and reducing computational cost by 90% compared to the second-best method. To demonstrate its real-world utility, we applied GiantHunter to 60 metagenomic datasets collected from six cities along the Yangtze River, located both upstream and downstream of the Three Gorges Dam. The results reveal significant differences in NCLDV diversity correlated with proximity to the dam, likely influenced by reduced flow velocity caused by the dam. These findings highlight the potential of GiantHunter to advance our understanding of NCLDVs and their ecological roles in diverse environments.
Submitted 26 January, 2025;
originally announced January 2025.
-
Group Ligands Docking to Protein Pockets
Authors:
Jiaqi Guan,
Jiahan Li,
Xiangxin Zhou,
Xingang Peng,
Sheng Wang,
Yunan Luo,
Jian Peng,
Jianzhu Ma
Abstract:
Molecular docking is a key task in computational biology that has attracted increasing interest from the machine learning community. While existing methods have achieved success, they generally treat each protein-ligand pair in isolation. Inspired by the biochemical observation that ligands binding to the same target protein tend to adopt similar poses, we propose GroupBind, a novel molecular docking framework that simultaneously considers multiple ligands docking to a protein. This is achieved by introducing an interaction layer for the group of ligands and a triangle attention module for embedding protein-ligand and group-ligand pairs. By integrating our approach with a diffusion-based docking model, we set a new state-of-the-art performance on the PDBBind blind docking benchmark, demonstrating the effectiveness of our proposed molecular docking paradigm.
Submitted 24 January, 2025;
originally announced January 2025.
-
M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery
Authors:
Siyuan Guo,
Lexuan Wang,
Chang Jin,
Jinxian Wang,
Han Peng,
Huayang Shi,
Wengen Li,
Jihong Guan,
Shuigeng Zhou
Abstract:
This paper introduces M$^{3}$-20M, a large-scale Multi-Modal Molecule dataset that contains over 20 million molecules, with the data mainly integrated from existing databases and partially generated by large language models. Designed to support AI-driven drug design and discovery, M$^{3}$-20M is 71 times larger in number of molecules than the largest existing dataset, providing an unprecedented scale that can greatly benefit the training or fine-tuning of models, including large language models, for drug design and discovery tasks. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M$^{3}$-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M$^{3}$-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than existing single-modal datasets, which validates the value and potential of M$^{3}$-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.
Submitted 16 March, 2025; v1 submitted 7 December, 2024;
originally announced December 2024.
-
Hotspot-Driven Peptide Design via Multi-Fragment Autoregressive Extension
Authors:
Jiahan Li,
Tong Chen,
Shitong Luo,
Chaoran Cheng,
Jiaqi Guan,
Ruihan Guo,
Sheng Wang,
Ge Liu,
Jian Peng,
Jianzhu Ma
Abstract:
Peptides, short chains of amino acids, interact with target proteins, making them a unique class of protein-based therapeutics for treating human diseases. Recently, deep generative models have shown great promise in peptide generation. However, several challenges remain in designing effective peptide binders. First, not all residues contribute equally to peptide-target interactions. Second, the generated peptides must adopt valid geometries due to the constraints of peptide bonds. Third, realistic tasks for peptide drug development are still lacking. To address these challenges, we introduce PepHAR, a hot-spot-driven autoregressive generative model for designing peptides targeting specific proteins. Building on the observation that certain hot spot residues have higher interaction potentials, we first use an energy-based density model to fit and sample these key residues. Next, to ensure proper peptide geometry, we autoregressively extend peptide fragments by estimating dihedral angles between residue frames. Finally, we apply an optimization process to iteratively refine fragment assembly, ensuring correct peptide structures. By combining hot spot sampling with fragment-based extension, our approach enables de novo peptide design tailored to a target protein and allows the incorporation of key hot spot residues into peptide scaffolds. Extensive experiments, including peptide design and peptide scaffold generation, demonstrate the strong potential of PepHAR in computational peptide binder design. Source code will be available at https://github.com/Ced3-han/PepHAR.
Submitted 20 May, 2025; v1 submitted 26 November, 2024;
originally announced November 2024.
-
Reprogramming Pretrained Target-Specific Diffusion Models for Dual-Target Drug Design
Authors:
Xiangxin Zhou,
Jiaqi Guan,
Yijia Zhang,
Xingang Peng,
Liang Wang,
Jianzhu Ma
Abstract:
Dual-target therapeutic strategies have become a compelling approach and attracted significant attention due to various benefits, such as their potential in overcoming drug resistance in cancer therapy. Considering the tremendous success that deep generative models have achieved in structure-based drug design in recent years, we formulate dual-target drug design as a generative task and curate a novel dataset of potential target pairs based on synergistic drug combinations. We propose to design dual-target drugs with diffusion models that are trained on single-target protein-ligand complex pairs. Specifically, we align two pockets in 3D space with protein-ligand binding priors and build two complex graphs with shared ligand nodes for SE(3)-equivariant composed message passing, based on which we derive a composed drift in both 3D and categorical probability space in the generative process. Our algorithm can well transfer the knowledge gained in single-target pretraining to dual-target scenarios in a zero-shot manner. We also repurpose linker design methods as strong baselines for this task. Extensive experiments demonstrate the effectiveness of our method compared with various baselines.
Submitted 26 November, 2024; v1 submitted 27 October, 2024;
originally announced October 2024.
-
PhaGO: Protein function annotation for bacteriophages by integrating the genomic context
Authors:
Jiaojiao Guan,
Yongxin Ji,
Cheng Peng,
Wei Zou,
Xubo Tang,
Jiayu Shang,
Yanni Sun
Abstract:
Bacteriophages are viruses that target bacteria, playing a crucial role in microbial ecology. Phage proteins are important in understanding phage biology, such as virus infection, replication, and evolution. Although a large number of new phages have been identified via metagenomic sequencing, many of them have limited protein function annotation. Accurate function annotation of phage proteins presents several challenges, including their inherent diversity and the scarcity of annotated ones. Existing tools have yet to fully leverage the unique properties of phages in annotating protein functions. In this work, we propose a new protein function annotation tool for phages by leveraging the modular genomic structure of phage genomes. By employing embeddings from the latest protein foundation models and Transformer to capture contextual information between proteins in phage genomes, PhaGO surpasses state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions by 6.78% and 13.05% improvement, respectively. PhaGO can annotate proteins lacking homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of PhaGO by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. The results show the potential of PhaGO to extend our understanding of newly discovered phages.
Submitted 17 August, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
-
Coevolutionary game dynamics with localized environmental resource feedback
Authors:
Yi-Duo Chen,
Jian-Yue Guan,
Zhi-Xi Wu
Abstract:
Dynamic environments shape diverse dynamics in evolutionary game systems. We introduce spatial heterogeneity of resources into the prisoner's dilemma game model to explore coevolutionary game dynamics with environmental feedback. The availability of resources significantly affects the survival competitiveness of surrounding individuals. Feedback between individuals' strategies and the resources they can use leads to the oscillating dynamic known as the "oscillatory tragedy of the commons". Our findings indicate that when the influence of individuals' strategies on the update rate of resources is significantly high in systems characterized by environmental heterogeneity, they can attain an equilibrium state that avoids the oscillatory tragedy. In contrast to the numerical results obtained in well-mixed structures, self-organized clustered patterns emerge in simulations utilizing square lattices, further enhancing the stability of the system. We discuss critical phenomena in detail, demonstrating that the aforementioned transition is robust across various system parameters, including the strength of cooperators in restoring the environment, initial distributions of cooperators, system size and structures, and noise.
Submitted 14 February, 2025; v1 submitted 25 July, 2024;
originally announced July 2024.
-
Accurate and efficient protein embedding using multi-teacher distillation learning
Authors:
Jiayu Shang,
Cheng Peng,
Yongxin Ji,
Jiaojiao Guan,
Dehan Cai,
Xubo Tang,
Yanni Sun
Abstract:
Motivation: Protein embedding, which represents proteins as numerical vectors, is a crucial step in various learning-based protein annotation/classification problems, including gene ontology prediction, protein-protein interaction prediction, and protein structure prediction. However, existing protein embedding methods are often computationally expensive due to their large number of parameters, which can reach millions or even billions. The growing availability of large-scale protein datasets and the need for efficient analysis tools have created a pressing demand for efficient protein embedding methods.
Results: We propose a novel protein embedding approach based on multi-teacher distillation learning, which leverages the knowledge of multiple pre-trained protein embedding models to learn a compact and informative representation of proteins. Our method achieves comparable performance to state-of-the-art methods while significantly reducing computational costs and resource requirements. Specifically, our approach reduces computational time by ~70% and maintains almost the same accuracy as the original large models. This makes our method well-suited for large-scale protein analysis and enables the bioinformatics community to perform protein embedding tasks more efficiently.
Submitted 19 May, 2024;
originally announced May 2024.
-
DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design
Authors:
Jiaqi Guan,
Xiangxin Zhou,
Yuwei Yang,
Yu Bao,
Jian Peng,
Jianzhu Ma,
Qiang Liu,
Liang Wang,
Quanquan Gu
Abstract:
Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structure-based drug design methods treat all ligand atoms equally, which ignores the different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff
Submitted 26 February, 2024;
originally announced March 2024.
-
Molecular Property Prediction Based on Graph Structure Learning
Authors:
Bangyi Zhao,
Weixia Xu,
Jihong Guan,
Shuigeng Zhou
Abstract:
Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. A growing number of recent works employ different graph-based models for MPP and have made considerable progress in improving prediction performance. However, current models often ignore relationships between molecules, which could also be helpful for MPP. To this end, in this paper we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply a graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecular similarity graph (MSG). Following that, we conduct graph structure learning on the MSG (i.e., molecule-level graph structure learning) to get the final molecular embeddings, which result from fusing the GNN-encoded molecular representations with the relationships among molecules, i.e., combining both intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on seven benchmark datasets show that our method can achieve state-of-the-art performance in most cases, especially on classification tasks. Further visualization studies also demonstrate the good molecular representations of our method.
Submitted 28 December, 2023;
originally announced December 2023.
-
scRNA-seq Data Clustering by Cluster-aware Iterative Contrastive Learning
Authors:
Weikang Jiang,
Jinxian Wang,
Jihong Guan,
Shuigeng Zhou
Abstract:
Single-cell RNA sequencing (scRNA-seq) enables researchers to analyze gene expression at the single-cell level. One important task in scRNA-seq data analysis is unsupervised clustering, which helps identify distinct cell types, laying the foundation for other downstream analysis tasks. In this paper, we propose a novel method called Cluster-aware Iterative Contrastive Learning (CICL for short) for scRNA-seq data clustering, which utilizes an iterative representation learning and clustering framework to progressively learn the clustering structure of scRNA-seq data with a cluster-aware contrastive loss. CICL consists of a Transformer encoder, a clustering head, a projection head and a contrastive loss module. First, CICL extracts the feature vectors of the original and augmented data by the Transformer encoder. Then, it computes the clustering centroids by K-means and employs the student t-distribution to assign pseudo-labels to all cells in the clustering head. The projection head uses a Multi-Layer Perceptron (MLP) to obtain projections of the augmented data. Finally, both pseudo-labels and projections are used in the contrastive loss to guide the model training. This process is repeated iteratively so that the clustering result progressively improves. Extensive experiments on 25 real-world scRNA-seq datasets show that CICL outperforms the SOTA methods. Concretely, CICL surpasses the existing methods by 14% to 280% in ARI and by 5% to 133% in NMI on average.
Submitted 27 December, 2023;
originally announced December 2023.
-
Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties
Authors:
Siyuan Guo,
Jihong Guan,
Shuigeng Zhou
Abstract:
In the past decade, Artificial Intelligence driven drug design and discovery has been a hot research topic, where an important branch is molecule generation by generative models, from GAN-based models and VAE-based models to the latest diffusion-based models. However, most existing models pursue only basic properties such as validity and uniqueness of the generated molecules, and only a few go further to explicitly optimize a single important molecular property (e.g. QED or PlogP), which makes most generated molecules of little use in practice. In this paper, we present a novel approach to generating molecules with desirable properties, which expands the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that the structures of molecules are complex and diverse, and molecular properties are usually determined by some substructures (e.g. pharmacophores), we propose to perform diffusion on two structural levels, molecules and molecular fragments, with which a mixed Gaussian distribution is obtained for the reverse diffusion process. To get desirable molecular fragments, we develop a novel electronic effect based fragmentation method. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity by an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments with two benchmark datasets QM9 and ZINC250k show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models.
Submitted 5 October, 2023;
originally announced October 2023.
-
MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation
Authors:
Xingang Peng,
Jiaqi Guan,
Qiang Liu,
Jianzhu Ma
Abstract:
Deep generative models have recently achieved superior performance in 3D molecule generation. Most of them first generate atoms and then add chemical bonds based on the generated atoms in a post-processing manner. However, there might be no corresponding bond solution for the temporally generated atoms as their locations are generated without considering potential bonds. We define this problem as the atom-bond inconsistency problem and claim it is the main reason for current approaches to generating unrealistic 3D molecules. To overcome this problem, we propose a new diffusion model called MolDiff which can generate atoms and bonds simultaneously while still maintaining their consistency by explicitly modeling the dependence between their relationships. We evaluated the generation ability of our proposed model and the quality of the generated molecules using criteria related to both geometry and chemical properties. The empirical studies showed that our model outperforms previous approaches, achieving a three-fold improvement in success rate and generating molecules with significantly better quality.
Submitted 11 May, 2023;
originally announced May 2023.
-
Molecular Property Prediction by Semantic-invariant Contrastive Learning
Authors:
Ziqiao Zhang,
Ailin Xie,
Jihong Guan,
Shuigeng Zhou
Abstract:
Contrastive learning has been widely used as a pretext task for self-supervised pre-trained molecular representation learning models in AI-aided drug design and discovery. However, existing methods that generate molecular views by noise-adding operations for contrastive learning may face the semantic inconsistency problem, which leads to false positive pairs and consequently poor prediction performance. To address this problem, in this paper we first propose a semantic-invariant view generation method by properly breaking molecular graphs into fragment pairs. Then, we develop a Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) model based on this view generation method for molecular property prediction. The FraSICL model consists of two branches that generate representations of views for contrastive learning, while a multi-view fusion and an auxiliary similarity loss are introduced to make better use of the information contained in different fragment-pair views. Extensive experiments on various benchmark datasets show that, with the least number of pre-training samples, FraSICL can achieve state-of-the-art performance compared with major existing counterpart models.
Submitted 13 March, 2023;
originally announced March 2023.
-
Intercellular competitive growth dynamics with microenvironmental feedback
Authors:
De-Ming Liu,
Zhi-Xi Wu,
Jian-Yue Guan
Abstract:
Normal life activities between cells rely crucially on the homeostasis of the cellular microenvironment, but aging and cancer will upset this balance. In this paper, we introduce the microenvironmental feedback mechanism to the growth dynamics of multicellular organisms, which changes the cellular competitive ability, and thereby regulates the growth of multicellular organisms. We show that the presence of microenvironmental feedback can effectively delay aging, but cancer cells may grow uncontrollably due to the emergence of the tumor microenvironment (TME). We study the effect of the fraction of cancer cells relative to that of senescent cells on the feedback rate of the microenvironment on the lifespan of multicellular organisms, and find that the average reduction in lifespan is close to the data for non-Hodgkin lymphoma in Canada from 1980 to 2015. We also investigate how the competitive ability of cancer cells affects the lifespan of multicellular organisms, and reveal that there is an optimal value of the competitive ability of cancer cells that allows the organism to survive longest. Interestingly, the proposed microenvironmental feedback mechanism can give rise to the phenomenon of Parrondo's paradox: when the competitive ability of cancer cells switches between a too high and a too low value, multicellular organisms are able to live longer than in each case individually. Our results may provide helpful clues for targeted therapies aimed at the TME.
Submitted 8 May, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction
Authors:
Jiaqi Guan,
Wesley Wei Qian,
Xingang Peng,
Yufeng Su,
Jian Peng,
Jianzhu Ma
Abstract:
Rich data and powerful machine learning models allow us to design drugs for a specific protein target in silico. Recently, the inclusion of 3D structures during targeted drug design has shown superior performance to other target-free models, as the atomic interaction in 3D space is explicitly modeled. However, current 3D target-aware models rely on either voxelized atom densities or an autoregressive sampling process, which are not equivariant to rotation or easily violate geometric constraints, resulting in unrealistic structures. In this work, we develop a 3D equivariant diffusion model to solve the above challenges. To achieve target-aware molecule design, our method learns a joint generative process of both continuous atom coordinates and categorical atom types with an SE(3)-equivariant network. Moreover, we show that our model can serve as an unsupervised feature extractor to estimate the binding affinity under proper parameterization, which provides an effective way for drug screening. To evaluate our model, we propose a comprehensive framework to evaluate the quality of sampled molecules from different dimensions. Empirical studies show our model could generate molecules with more realistic 3D structures and better affinities towards the protein targets, and improve binding affinity ranking and prediction without retraining.
Submitted 6 March, 2023;
originally announced March 2023.
-
Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets
Authors:
Xingang Peng,
Shitong Luo,
Jiaqi Guan,
Qi Xie,
Jian Peng,
Jianzhu Ma
Abstract:
Deep generative models have achieved tremendous success in designing novel drug molecules in recent years. A new line of work has shown great potential for advancing the specificity and success rate of in silico drug design by considering the structure of protein pockets. This setting poses fundamental computational challenges in sampling new chemical compounds that could satisfy multiple geometrical constraints imposed by pockets. Previous sampling algorithms either sample in the graph space or only consider the 3D coordinates of atoms while ignoring other detailed chemical structures such as bond types and functional groups. To address the challenge, we develop Pocket2Mol, an E(3)-equivariant generative network composed of two modules: 1) a new graph neural network capturing both spatial and bonding relationships between atoms of the binding pockets and 2) a new efficient algorithm which samples new drug candidates conditioned on the pocket representations from a tractable distribution without relying on MCMC. Experimental results demonstrate that molecules sampled from Pocket2Mol achieve significantly better binding affinity and other drug properties such as druglikeness and synthetic accessibility.
Submitted 15 May, 2022;
originally announced May 2022.
-
A 3D Generative Model for Structure-Based Drug Design
Authors:
Shitong Luo,
Jiaqi Guan,
Jianzhu Ma,
Jian Peng
Abstract:
We study a fundamental problem in structure-based drug design -- generating molecules that bind to specific protein binding sites. While we have witnessed the great success of deep generative models in drug design, the existing methods are mostly string-based or graph-based. They are limited by the lack of spatial information and thus cannot be applied to structure-based design tasks. In particular, such models have little or no knowledge of exactly how molecules interact with their target proteins in 3D space. In this paper, we propose a 3D generative model that generates molecules given a designated 3D protein binding site. Specifically, given a binding site as the 3D context, our model estimates the probability density of atom occurrences in 3D space -- positions that are more likely to have atoms will be assigned higher probability. To generate 3D molecules, we propose an auto-regressive sampling scheme -- atoms are sampled sequentially from the learned distribution until there is no room for new atoms. Combined with this sampling scheme, our model can generate valid and diverse molecules, which could be applicable to various structure-based molecular design tasks such as molecule sampling and linker design. Experimental results demonstrate that molecules sampled from our model exhibit high binding affinity to specific targets and good drug properties such as drug-likeness even if the model is not explicitly optimized for them.
Submitted 12 November, 2022; v1 submitted 19 March, 2022;
originally announced March 2022.
-
Orientation-Aware Graph Neural Networks for Protein Structure Representation Learning
Authors:
Jiahan Li,
Shitong Luo,
Congyue Deng,
Chaoran Cheng,
Jiaqi Guan,
Leonidas Guibas,
Jian Peng,
Jianzhu Ma
Abstract:
By folding into particular 3D structures, proteins play a key role in living beings. To learn meaningful representations from a protein structure for downstream tasks, not only the global backbone topology but also the local fine-grained orientational relations between amino acids should be considered. In this work, we propose the Orientation-Aware Graph Neural Networks (OAGNNs) to better sense the geometric characteristics in protein structure (e.g. inner-residue torsion angles, inter-residue orientations). Extending a single weight from a scalar to a 3D vector, we construct a rich set of geometrically meaningful operations to process both the classical and SO(3) representations of a given structure. To plug our designed perceptron unit into existing Graph Neural Networks, we further introduce an equivariant message passing paradigm, showing superior versatility in maintaining SO(3)-equivariance at the global scale. Experiments have shown that our OAGNNs have a remarkable ability to sense geometric orientational features compared to classical networks. OAGNNs have also achieved state-of-the-art performance on various computational biology applications related to protein 3D structures. The code is available at https://github.com/Ced3-han/OAGNN/tree/main.
Submitted 4 February, 2025; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Game-environment feedback dynamics for voluntary prisoner's dilemma games
Authors:
Bin-Quan Li,
Cong Liu,
Zhi-Xi Wu,
Jian-Yue Guan
Abstract:
Recently, eco-evolutionary game theory, which describes the coupled dynamics of strategies and environment, has attracted great attention. At the same time, most of the current work is focused on the classic two-player two-strategy game. In this work, we study multi-strategy eco-evolutionary game theory, which is an extension of the framework. For simplicity, we focus on the voluntary participation prisoner's dilemma game. For the general class of payoff-dependent feedback dynamics, we use the replicator dynamics to show the conditions for the existence and stability of internal equilibria. These internal equilibria include two-strategy coexistence states, three-strategy coexistence states, persistent oscillation states, and interior saddle points. These states are determined by the relative feedback strength and payoff matrix, and are independent of the relative feedback speed and initial state. In particular, the three-strategy coexistence provides a new mechanism for maintaining biodiversity in biology, ecology, and sociology. Besides, we find that this three-strategy model returns to the persistent oscillation state of the two-strategy model when there is no defective strategy at the initial moment.
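As a rough illustration of the replicator dynamics mentioned in this abstract, the sketch below integrates the standard replicator equation dx_i/dt = x_i (f_i - f_avg) for three strategies (cooperate, defect, loner). It is not the authors' model: the environmental feedback is left out entirely, and the payoff matrix, the values of `b` and `sigma`, and the function names are hypothetical choices made only for this example.

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One forward-Euler step of the replicator equation on the simplex."""
    fitness = payoff @ x            # expected payoff of each strategy
    mean_fitness = x @ fitness      # population-average payoff
    x_new = x + dt * x * (fitness - mean_fitness)
    x_new = np.clip(x_new, 0.0, None)   # guard against tiny negative values
    return x_new / x_new.sum()          # renormalise onto the simplex

def simulate(x0, payoff, steps=1000, dt=0.01):
    """Iterate the Euler step and return the final strategy frequencies."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x

# Hypothetical payoffs (rows/columns: cooperator, defector, loner).
b, sigma = 1.5, 0.3                 # assumed temptation and loner payoff
PAYOFF = np.array([
    [1.0,   0.0,   sigma],
    [b,     0.0,   sigma],
    [sigma, sigma, sigma],
])

if __name__ == "__main__":
    print(simulate([0.4, 0.3, 0.3], PAYOFF))
```

Because each growth term is proportional to the current frequency, a strategy that starts at zero stays at zero, which is the structural reason a three-strategy replicator model reduces to a two-strategy one when defectors are initially absent.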
Submitted 18 November, 2021;
originally announced November 2021.
-
Behavior of susceptible-vaccinated-infected-recovered epidemics with diversity in the infection rate of the individuals
Authors:
Chao-Ran Cai,
Zhi-Xi Wu,
Jian-Yue Guan
Abstract:
We study a susceptible-vaccinated-infected-recovered (SVIR) epidemic-spreading model with diversity in the infection rate of the individuals. By means of analytical arguments as well as extensive computer simulations, we demonstrate that the heterogeneity in infection rate can either impede or accelerate the epidemic spreading, depending on the number of vaccinated individuals introduced in the population as well as the contact pattern among the individuals. Remarkably, as long as individuals with different capabilities of acquiring the disease interact with unequal frequency, there always exists a crossover point for the fraction of vaccinated individuals, below which the diversity of infection rate hinders the epidemic spreading and above which it expedites it. The overall results are robust to the SVIR dynamics defined on different population models; the possible applications of the results are discussed.
Submitted 3 December, 2013; v1 submitted 2 December, 2013;
originally announced December 2013.
-
Diagnosing Heterogeneous Dynamics in Single Molecule/Particle Trajectories with Multiscale Wavelets
Authors:
Kejia Chen,
Bo Wang,
Juan Guan,
Steve Granick
Abstract:
We describe a simple automated method to extract and quantify transient heterogeneous dynamical changes from large datasets generated in single molecule/particle tracking experiments. Based on wavelet transform, the method transforms raw data to locally match dynamics of interest. This is accomplished using statistically adaptive universal thresholding, whose advantage is to avoid a single arbitrary threshold that might conceal individual variability across populations. How to implement this multiscale method is described, focusing on local confined diffusion separated by transient transport periods or hopping events, with 3 specific examples: in cell biology, biotechnology, and glassy colloid dynamics. This computationally-efficient method can run routinely on hundreds of millions of data points analyzed within an hour on a desktop personal computer.
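To make the idea of wavelet-based universal thresholding concrete, here is a minimal sketch, not the authors' implementation. It assumes a single-level Haar transform of a 1-D position series, a noise scale estimated from the median absolute deviation of the detail coefficients, and the textbook universal threshold sigma * sqrt(2 ln n). The function names and the synthetic trajectory are hypothetical.

```python
import numpy as np

def haar_details(signal):
    """Single-level Haar detail coefficients of a 1-D series."""
    s = np.asarray(signal, dtype=float)
    if len(s) % 2:                  # drop the last sample if length is odd
        s = s[:-1]
    return (s[0::2] - s[1::2]) / np.sqrt(2.0)

def universal_threshold(details, n):
    """Universal threshold with a MAD-based noise estimate."""
    sigma = np.median(np.abs(details)) / 0.6745   # robust noise scale
    return sigma * np.sqrt(2.0 * np.log(n))

def flag_events(signal):
    """Boolean mask over sample pairs whose detail exceeds the threshold."""
    details = haar_details(signal)
    threshold = universal_threshold(details, len(signal))
    return np.abs(details) > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    track = rng.normal(0.0, 0.1, size=200)   # synthetic confined motion
    track[101:] += 10.0                       # synthetic hopping event
    print(np.flatnonzero(flag_events(track)))
```

Because the threshold is derived from each series' own noise level, every trajectory gets its own cutoff rather than one arbitrary value shared across the population, which is the advantage the abstract attributes to the approach.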
Submitted 3 June, 2013;
originally announced June 2013.
-
Epidemic spreading with nonlinear infectivity in weighted scale-free networks
Authors:
Xiangwei Chu,
Zhongzhi Zhang,
Jihong Guan,
Shuigeng Zhou
Abstract:
In this paper, we investigate epidemic spreading for the SIR model in weighted scale-free networks with nonlinear infectivity, where the transmission rate in our analytical model is weighted. Concretely, we introduce the infectivity exponent $α$ and the weight exponent $β$ into the analytical SIR model, then examine the combined effects of $α$ and $β$ on the epidemic threshold and phase transition. We show that one can adjust the values of $α$ and $β$ to restore the epidemic threshold to a finite value, and it is observed that the steady epidemic prevalence $R$ grows in an exponential form in the early stage, then follows hierarchical dynamics. Furthermore, we find that the epidemic threshold and epidemic prevalence are more sensitive to $α$ than to $β$, which might deliver some useful information or new insights into epidemic spreading and the corresponding immunization schemes.
Submitted 5 March, 2009;
originally announced March 2009.