-
A Symbolic and Statistical Learning Framework to Discover Bioprocessing Regulatory Mechanism: Cell Culture Example
Authors:
Keilung Choy,
Wei Xie,
Keqi Wang
Abstract:
Bioprocess mechanistic modeling is essential for advancing intelligent digital twin representation of biomanufacturing, yet challenges persist due to complex intracellular regulation, stochastic system behavior, and limited experimental data. This paper introduces a symbolic and statistical learning framework to identify key regulatory mechanisms and quantify model uncertainty. Bioprocess dynamics are formulated with stochastic differential equations characterizing intrinsic process variability, with a predefined set of candidate regulatory mechanisms constructed from biological knowledge. A Bayesian learning approach is developed that jointly learns kinetic parameters and regulatory structure through a mixture model formulation. To enhance computational efficiency, a Metropolis-adjusted Langevin algorithm with adjoint sensitivity analysis is developed for posterior exploration. Compared to state-of-the-art Bayesian inference approaches, the proposed framework achieves improved sample efficiency and robust model selection. An empirical study demonstrates its ability to recover missing regulatory mechanisms and improve model fidelity under data-limited conditions.
Submitted 6 May, 2025;
originally announced May 2025.
-
PharmAgents: Building a Virtual Pharma with Large Language Model Agents
Authors:
Bowen Gao,
Yanwen Huang,
Yiqiao Liu,
Wenxuan Xie,
Wei-Ying Ma,
Ya-Qin Zhang,
Yanyan Lan
Abstract:
The discovery of novel small molecule drugs remains a critical scientific challenge with far-reaching implications for treating diseases and advancing human health. Traditional drug development--especially for small molecule therapeutics--is a highly complex, resource-intensive, and time-consuming process that requires multidisciplinary collaboration. Recent breakthroughs in artificial intelligence (AI), particularly the rise of large language models (LLMs), present a transformative opportunity to streamline and accelerate this process. In this paper, we introduce PharmAgents, a virtual pharmaceutical ecosystem driven by LLM-based multi-agent collaboration. PharmAgents simulates the full drug discovery workflow--from target discovery to preclinical evaluation--by integrating explainable, LLM-driven agents equipped with specialized machine learning models and computational tools. Through structured knowledge exchange and automated optimization, PharmAgents identifies potential therapeutic targets, discovers promising lead compounds, enhances binding affinity and key molecular properties, and performs in silico analyses of toxicity and synthetic feasibility. Additionally, the system supports interpretability, agent interaction, and self-evolvement, enabling it to refine future drug designs based on prior experience. By showcasing the potential of LLM-powered multi-agent systems in drug discovery, this work establishes a new paradigm for autonomous, explainable, and scalable pharmaceutical research, with future extensions toward comprehensive drug lifecycle management.
Submitted 31 March, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Pushing the boundaries of Structure-Based Drug Design through Collaboration with Large Language Models
Authors:
Bowen Gao,
Yanwen Huang,
Yiqiao Liu,
Wenxuan Xie,
Wei-Ying Ma,
Ya-Qin Zhang,
Yanyan Lan
Abstract:
Structure-Based Drug Design (SBDD) has revolutionized drug discovery by enabling the rational design of molecules for specific protein targets. Despite significant advancements in improving docking scores, advanced 3D-SBDD generative models still face challenges in producing drug-like candidates that meet medicinal chemistry standards and pharmacokinetic requirements. These limitations arise from their inherent focus on molecular interactions, often neglecting critical aspects of drug-likeness. To address these shortcomings, we introduce the Collaborative Intelligence Drug Design (CIDD) framework, which combines the structural precision of 3D-SBDD models with the chemical reasoning capabilities of large language models (LLMs). CIDD begins by generating supporting molecules with 3D-SBDD models and then refines these molecules through LLM-supported modules to enhance drug-likeness and structural reasonability. When evaluated on the CrossDocked2020 dataset, CIDD achieved a remarkable success ratio of 37.94%, significantly outperforming the previous state-of-the-art benchmark of 15.72%. Although improving molecular interactions and drug-likeness is often seen as a trade-off, CIDD uniquely achieves a balanced improvement in both by leveraging the complementary strengths of different models, offering a robust and innovative pathway for designing therapeutically promising drug candidates.
Submitted 3 March, 2025;
originally announced March 2025.
-
Multi-Scale Kinetics Modeling for Cell Culture Process with Metabolic State Transition
Authors:
Keqi Wang,
Sarah W. Harcum,
Wei Xie
Abstract:
To advance the understanding of cellular metabolism and control batch-to-batch variations in cell culture processes, a multi-scale mechanistic model with a bottom-up and top-down structure was developed to simulate the dynamics of cell culture processes undergoing metabolic state transitions. This model integrates interactions at the molecular, cellular, and macro-kinetic levels, accounting for inherent variations in metabolic state transitions of individual cells. By incorporating both online (e.g., oxygen uptake, pH) and offline measurements (e.g., viable cell density, metabolite concentrations), the proposed mechanistic model enables accurate long-term prediction of cell culture trajectories and provides reliable prediction intervals quantifying batch-to-batch variations. This work can guide optimal design of experiments and robust process control to improve yield and production stability. Additionally, the proposed multi-scale model has a modular design that enables flexible in silico simulations and extrapolation across diverse conditions, providing a robust prediction framework for scalable and flexible biomanufacturing applications.
Submitted 5 December, 2024;
originally announced December 2024.
-
Supervised Learning without Backpropagation using Spike-Timing-Dependent Plasticity for Image Recognition
Authors:
Wei Xie
Abstract:
This study introduces a novel supervised learning approach for spiking neural networks that does not rely on traditional backpropagation. Instead, it employs spike-timing-dependent plasticity (STDP) within a supervised framework for image recognition tasks. The effectiveness of this method is demonstrated using the MNIST dataset. The model achieves approximately 40% learning accuracy with just 10 training stimuli, where each category is exposed to the model only once during training (one-shot learning). With larger training samples, the accuracy increases up to 87%, maintaining negligible ambiguity. Notably, with only 10 hidden neurons, the model reaches 89% accuracy with around 10% ambiguity. This proposed method offers a robust and efficient alternative to traditional backpropagation-based supervised learning techniques.
Submitted 11 February, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
AutoRG-Brain: Grounded Report Generation for Brain MRI
Authors:
Jiayu Lei,
Xiaoman Zhang,
Chaoyi Wu,
Lisong Dai,
Ya Zhang,
Yanyong Zhang,
Yanfeng Wang,
Weidi Xie,
Yuehua Li
Abstract:
Radiologists are tasked with interpreting a large number of images on a daily basis, with the responsibility of generating corresponding reports. This demanding workload elevates the risk of human error, potentially leading to treatment delays, increased healthcare costs, revenue loss, and operational inefficiencies. To address these challenges, we initiate a series of works on grounded Automatic Report Generation (AutoRG), starting from the brain MRI interpretation system, which supports the delineation of brain structures, the localization of anomalies, and the generation of well-organized findings. We make contributions in the following aspects. First, on dataset construction, we release a comprehensive dataset encompassing segmentation masks of anomaly regions and manually authored reports, termed RadGenome-Brain MRI. This data resource is intended to catalyze ongoing research and development in the field of AI-assisted report generation systems. Second, on system design, we propose AutoRG-Brain, the first brain MRI report generation system with pixel-level grounded visual clues. Third, for evaluation, we conduct quantitative assessments and human evaluations of brain structure segmentation, anomaly localization, and report generation tasks to provide evidence of its reliability and accuracy. This system has been integrated into real clinical scenarios, where radiologists were instructed to write reports based on our generated findings and anomaly segmentation masks. The results demonstrate that our system enhances the report-writing skills of junior doctors, aligning their performance more closely with senior doctors, thereby boosting overall productivity.
Submitted 29 July, 2024; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Adjoint Sensitivity Analysis on Multi-Scale Bioprocess Stochastic Reaction Network
Authors:
Keilung Choy,
Wei Xie
Abstract:
Motivated by the pressing challenges in digital twin development for biomanufacturing systems, we introduce an adjoint sensitivity analysis (SA) approach to expedite the learning of mechanistic model parameters. In this paper, we consider enzymatic stochastic reaction networks representing a multi-scale bioprocess mechanistic model that allows us to integrate disparate data from diverse production processes and leverage the information from existing macro-kinetic and genome-scale models. To support forward prediction and backward reasoning, we develop a convergent adjoint SA algorithm studying how the perturbations of model parameters and inputs (e.g., initial state) propagate through enzymatic reaction networks and affect output trajectory predictions. This SA can provide a sample-efficient and interpretable way to assess the sensitivities between inputs and outputs, accounting for their causal dependencies. Our empirical study underscores the resilience of these sensitivities and, through them, provides a deeper understanding of the regulatory mechanisms behind the bioprocess.
Submitted 28 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Digital Twin Calibration for Biological System-of-Systems: Cell Culture Manufacturing Process
Authors:
Fuqiang Cheng,
Wei Xie,
Hua Zheng
Abstract:
Biomanufacturing innovation relies on an efficient Design of Experiments (DoEs) to optimize processes and product quality. Traditional DoE methods, ignoring the underlying bioprocessing mechanisms, often suffer from a lack of interpretability and sample efficiency. This limitation motivates us to create a new optimal learning approach for digital twin model calibration. In this study, we consider the cell culture process multi-scale mechanistic model, also known as Biological System-of-Systems (Bio-SoS). This model with a modular design, composed of sub-models, allows us to integrate data across various production processes. To calibrate the Bio-SoS digital twin, we evaluate the mean squared error of model prediction and develop a computational approach to quantify the impact of parameter estimation error of individual sub-models on the prediction accuracy of digital twin, which can guide sample-efficient and interpretable DoEs.
Submitted 28 June, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Accelerating Discovery of Novel and Bioactive Ligands With Pharmacophore-Informed Generative Models
Authors:
Weixin Xie,
Jianhang Zhang,
Qin Xie,
Chaojun Gong,
Youjun Xu,
Luhua Lai,
Jianfeng Pei
Abstract:
Deep generative models have gained significant advancements to accelerate drug discovery by generating bioactive chemicals against desired targets. Nevertheless, most generated compounds that have been validated for potent bioactivity often exhibit structural novelty levels that fall short of satisfaction, thereby providing limited inspiration to human medicinal chemists. The challenge faced by generative models lies in their ability to produce compounds that are both bioactive and novel, rather than merely making minor modifications to known actives present in the training set. Recognizing the utility of pharmacophores in facilitating scaffold hopping, we developed TransPharmer, an innovative generative model that integrates ligand-based interpretable pharmacophore fingerprints with generative pre-training transformer (GPT) for de novo molecule generation. TransPharmer demonstrates superior performance across tasks involving unconditioned distribution learning, de novo generation and scaffold elaboration under pharmacophoric constraints. Its distinct exploration mode within the local chemical space renders it particularly useful for scaffold hopping, producing compounds that are structurally novel while pharmaceutically related. The efficacy of TransPharmer is validated through two case studies involving the dopamine receptor D2 (DRD2) and polo-like kinase 1 (PLK1). Notably in the case of PLK1, three out of four synthesized designed compounds exhibit submicromolar activities, with the most potent one, IIP0943, demonstrating a potency of 5.1 nM. Featuring a new scaffold of 4-(benzo[b]thiophen-7-yloxy)pyrimidine, IIP0943 also exhibits high selectivity for PLK1. It was demonstrated that TransPharmer is a powerful tool for discovery of novel and bioactive ligands.
Submitted 2 January, 2024;
originally announced January 2024.
-
DiffDTM: A conditional structure-free framework for bioactive molecules generation targeted for dual proteins
Authors:
Lei Huang,
Zheng Yuan,
Huihui Yan,
Rong Sheng,
Linjing Liu,
Fuzhou Wang,
Weidun Xie,
Nanjun Chen,
Fei Huang,
Songfang Huang,
Ka-Chun Wong,
Yaoyun Zhang
Abstract:
Advances in deep generative models shed light on de novo molecule generation with desired properties. However, molecule generation targeted for dual protein targets still faces formidable challenges, including the need for protein 3D structure data for model training, auto-regressive sampling, and model generalization to unseen targets. Here, we propose DiffDTM, a novel conditional structure-free deep generative model based on a diffusion model for dual-target molecule generation to address the above issues. Specifically, DiffDTM receives protein sequences and molecular graphs as inputs instead of protein and molecular conformations and incorporates an information fusion module to achieve conditional generation in a one-shot manner. We have conducted comprehensive multi-view experiments to demonstrate that DiffDTM can generate drug-like, synthesis-accessible, novel, and high-binding affinity molecules targeting specific dual proteins, outperforming the state-of-the-art (SOTA) models in terms of multiple evaluation metrics. Furthermore, we utilized DiffDTM to generate molecules towards dopamine receptor D2 and 5-hydroxytryptamine receptor 1A as new antipsychotics. The experimental results indicate that DiffDTM can be easily plugged into unseen dual targets to generate bioactive molecules, addressing the issue of insufficient active molecule data for training as well as the need to retrain when encountering new targets.
Submitted 24 June, 2023;
originally announced June 2023.
-
Stochastic Biological System-of-Systems Modelling for iPSC Culture
Authors:
Hua Zheng,
Sarah W. Harcum,
Jinxiang Pei,
Wei Xie
Abstract:
Large-scale manufacturing of induced pluripotent stem cells (iPSCs) is essential for cell therapies and regenerative medicines. Yet, iPSCs form large cell aggregates in suspension bioreactors, resulting in insufficient nutrient supply and extra metabolic waste build-up for the cells located at the core. Since subtle changes in the micro-environment can lead to a heterogeneous cell population, a novel Biological System-of-Systems (Bio-SoS) framework is proposed to model cell-to-cell interactions, spatial and metabolic heterogeneity, and cell response to micro-environmental variation. Building on a stochastic metabolic reaction network, aggregation kinetics, and reaction-diffusion mechanisms, the Bio-SoS model characterizes causal interdependencies at the individual cell, aggregate, and cell population levels. It has a modular design that enables data integration and improves predictions for different monolayer and aggregate culture processes. In addition, a variance decomposition analysis is derived to quantify the impact of factors (i.e., aggregate size) on cell product health and quality heterogeneity.
Submitted 11 October, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
Stochastic Molecular Reaction Queueing Network Modeling for In Vitro Transcription Process
Authors:
Keqi Wang,
Wei Xie,
Hua Zheng
Abstract:
To facilitate a rapid response to pandemic threats, this paper focuses on developing a mechanistic simulation model for the in vitro transcription (IVT) process, a crucial step in mRNA vaccine manufacturing. To enhance production and support Industry 4.0, this model is proposed to improve the prediction and analysis of the IVT enzymatic reaction network. It incorporates a novel stochastic molecular reaction queueing network with a regulatory kinetic model characterizing the effect of bioprocess state variables on reaction rates. The empirical study demonstrates that the proposed model performs well under different production conditions and could offer potential improvements in mRNA product quality and yield.
Submitted 21 June, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Structure-Function Dynamics Hybrid Modeling: RNA Degradation
Authors:
Hua Zheng,
Wei Xie,
Paul Whitford,
Ailun Wang,
Chunsheng Fang,
Wandi Xu
Abstract:
RNA structure and functional dynamics play fundamental roles in controlling biological systems. Molecular dynamics simulation, which can characterize interactions at an atomistic level, can advance the understanding of new drug discovery, manufacturing, and delivery mechanisms. However, it is too computationally expensive to support the development of a digital twin for enzymatic reaction network mechanism learning and end-to-end bioprocess design and control. Thus, we create a hybrid ("mechanistic + machine learning") model characterizing the interdependence of RNA structure and functional dynamics from atomistic to macroscopic levels. To assess the proposed modeling strategy, in this paper, we consider RNA degradation, which is a critical process in cellular biology that affects gene expression. The empirical study on RNA lifetime prediction demonstrates the promising performance of the proposed multi-scale bioprocess hybrid modeling strategy.
Submitted 17 June, 2023; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Metabolic Regulatory Network Kinetic Modeling with Multiple Isotopic Tracers for iPSCs
Authors:
Keqi Wang,
Wei Xie,
Sarah W. Harcum
Abstract:
The rapidly expanding market for regenerative medicines and cell therapies highlights the need to advance the understanding of cellular metabolisms and improve the prediction of cultivation production process for human induced pluripotent stem cells (iPSCs). In this paper, a metabolic kinetic model was developed to characterize underlying mechanisms of iPSC culture process, which can predict cell response to environmental perturbation and support process control. This model focuses on the central carbon metabolic network, including glycolysis, pentose phosphate pathway (PPP), tricarboxylic acid (TCA) cycle, and amino acid metabolism, which plays a crucial role to support iPSC proliferation. Heterogeneous measures of extracellular metabolites and multiple isotopic tracers collected under multiple conditions were used to learn metabolic regulatory mechanisms. Systematic cross-validation confirmed the model's performance in terms of providing reliable predictions on cellular metabolism and culture process dynamics under various culture conditions. Thus, the developed mechanistic kinetic model can support process control strategies to strategically select optimal cell culture conditions at different times, ensure cell product functionality, and facilitate large-scale manufacturing of regenerative medicines and cell therapies.
Submitted 25 October, 2023; v1 submitted 29 April, 2023;
originally announced May 2023.
-
From Discovery to Production: Challenges and Novel Methodologies for Next Generation Biomanufacturing
Authors:
Wei Xie,
Giulia Pedrielli
Abstract:
The increasingly pressing demand for novel drugs (e.g., gene therapies for personalized cancer care, ever-evolving vaccines) with unprecedented levels of personalization has put remarkable pressure on the traditionally long time required by pharma R&D and manufacturing to go from design to production of new products. The revolution has already brought important changes in the technologies used within the industry. In fact, practitioners are increasingly moving away from the classical paradigm of large-scale batch production to continuous biomanufacturing with flexible and modular design, which is further supported by recent technological advances in single-use equipment. In contrast to long design processes, low product variability (one-fits-all), and highly rigid systems, modern pharma players are answering the question: can we bring design and process control up to the speed that novel production technologies give us to quickly set up a flexible production run?
In this tutorial, we present key challenges and potential solutions from the world of operations research that can support answering this question. We first present technical challenges and novel methods for the design of next generation drugs, followed by the process modeling and control approaches to successfully and efficiently manufacture them.
Submitted 28 June, 2022; v1 submitted 8 May, 2022;
originally announced May 2022.
-
Self-Supervised Graph Transformer on Large-Scale Molecular Data
Authors:
Yu Rong,
Yatao Bian,
Tingyang Xu,
Weiyang Xie,
Ying Wei,
Wenbing Huang,
Junzhou Huang
Abstract:
Obtaining informative representations of molecules is a crucial prerequisite in AI-driven drug design and discovery. Recent research abstracts molecules as graphs and employs Graph Neural Networks (GNNs) for molecular representation learning. Nevertheless, two issues impede the usage of GNNs in real scenarios: (1) insufficient labeled molecules for supervised training; (2) poor generalization capability to newly synthesized molecules. To address them both, we propose a novel framework, GROVER, which stands for Graph Representation frOm self-superVised mEssage passing tRansformer. With carefully designed self-supervised tasks at the node, edge, and graph levels, GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. To encode such complex information, GROVER integrates Message Passing Networks into the Transformer-style architecture to deliver a class of more expressive encoders of molecules. The flexibility of GROVER allows it to be trained efficiently on large-scale molecular datasets without requiring any supervision, thus making it immune to the two issues mentioned above. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning. We then leverage the pre-trained GROVER for molecular property prediction followed by task-specific fine-tuning, where we observe a huge improvement (more than 6% on average) over current state-of-the-art methods on 11 challenging benchmarks. The insights we gained are that well-designed self-supervision losses and highly expressive pre-trained models have significant potential for boosting performance.
Submitted 28 October, 2020; v1 submitted 18 June, 2020;
originally announced July 2020.
-
Multi-View Graph Neural Networks for Molecular Property Prediction
Authors:
Hehuan Ma,
Yatao Bian,
Yu Rong,
Wenbing Huang,
Tingyang Xu,
Weiyang Xie,
Geyan Ye,
Junzhou Huang
Abstract:
The crux of molecular property prediction is to generate meaningful representations of the molecules. One promising route is to exploit the molecular graph structure through Graph Neural Networks (GNNs). It is well known that both atoms and bonds significantly affect the chemical properties of a molecule, so an expressive model should be able to exploit both node (atom) and edge (bond) information simultaneously. Guided by this observation, we present Multi-View Graph Neural Network (MV-GNN), a multi-view message passing architecture to enable more accurate predictions of molecular properties. In MV-GNN, we introduce a shared self-attentive readout component and disagreement loss to stabilize the training process. This readout component also renders the whole architecture interpretable. We further boost the expressive power of MV-GNN by proposing a cross-dependent message passing scheme that enhances information communication between the two views, which results in the MV-GNN^cross variant. Lastly, we theoretically justify the expressiveness of the two proposed models in terms of distinguishing non-isomorphic graphs. Extensive experiments demonstrate that MV-GNN models achieve remarkably superior performance over the state-of-the-art models on a variety of challenging benchmarks. Meanwhile, visualization results of the node importance are consistent with prior knowledge, which confirms the interpretability of MV-GNN models.
Submitted 12 June, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
-
Supporting Regularized Logistic Regression Privately and Efficiently
Authors:
Wenfa Li,
Hongzhe Liu,
Peng Yang,
Wei Xie
Abstract:
As one of the most popular statistical and machine learning models, logistic regression with regularization has found wide adoption in biomedicine, social sciences, information technology, and so on. These domains often involve data of human subjects that are subject to strict privacy regulations. Increasing concerns over data privacy make it more and more difficult to coordinate and conduct large-scale collaborative studies, which typically rely on cross-institution data sharing and joint analysis. Our work here focuses on safeguarding regularized logistic regression, a machine learning model that is widely used in various disciplines yet has not been investigated from a data security and privacy perspective. We consider a common use scenario of multi-institution collaborative studies, such as in the form of research consortia or networks as widely seen in genetics, epidemiology, social sciences, etc. To make our privacy-enhancing solution practical, we demonstrate a non-conventional and computationally efficient method leveraging distributed computing and strong cryptography to provide comprehensive protection over individual-level and summary data. Extensive empirical evaluation on several studies validated the privacy guarantees, efficiency and scalability of our proposal. We also discuss the practical implications of our solution for large-scale studies and applications from various disciplines, including genetic and biomedical studies, smart grid, network analysis, etc.
Submitted 30 September, 2015;
originally announced October 2015.