-
Automating Exploratory Proteomics Research via Language Models
Authors:
Ning Ding,
Shang Qu,
Linhai Xie,
Yifei Li,
Zaoqu Liu,
Kaiyan Zhang,
Yibai Xiong,
Yuxin Zuo,
Zhangren Chen,
Ermo Hua,
Xingtai Lv,
Youbang Sun,
Yang Li,
Dong Li,
Fuchu He,
Bowen Zhou
Abstract:
With the development of artificial intelligence, its contribution to science is evolving from simulating a complex problem to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper, we present PROTEUS, a fully automated system for scientific discovery from raw proteomics data. PROTEUS uses large language models (LLMs) to perform hierarchical planning, execute specialized bioinformatics tools, and iteratively refine analysis workflows to generate high-quality scientific hypotheses. The system takes proteomics datasets as input and produces a comprehensive set of research objectives, analysis results, and novel biological hypotheses without human intervention. We evaluated PROTEUS on 12 proteomics datasets collected from various biological samples (e.g. immune cells, tumors) and different sample types (single-cell and bulk), generating 191 scientific hypotheses. These were assessed using both automatic LLM-based scoring on 5 metrics and detailed reviews from human experts. Results demonstrate that PROTEUS consistently produces reliable, logically coherent results that align well with existing literature while also proposing novel, evaluable hypotheses. The system's flexible architecture facilitates seamless integration of diverse analysis tools and adaptation to different proteomics data types. By automating complex proteomics analysis workflows and hypothesis generation, PROTEUS has the potential to considerably accelerate the pace of scientific discovery in proteomics research, enabling researchers to efficiently explore large-scale datasets and uncover biological insights.
Submitted 6 November, 2024;
originally announced November 2024.
-
eDOC: Explainable Decoding Out-of-domain Cell Types with Evidential Learning
Authors:
Chaochen Wu,
Meiyun Zuo,
Lei Xie
Abstract:
Single-cell RNA-seq (scRNA-seq) technology is a powerful tool for unraveling the complexity of biological systems. One of the essential and fundamental tasks in scRNA-seq data analysis is Cell Type Annotation (CTA). In spite of tremendous efforts in developing machine learning methods for this problem, several challenges remain. They include identifying Out-of-Domain (OOD) cell types, quantifying the uncertainty of unseen cell type annotations, and determining interpretable cell type-specific gene drivers for an OOD case. OOD cell types are often associated with therapeutic responses and disease origins, making them critical for precision medicine and early disease diagnosis. Additionally, scRNA-seq data contains tens of thousands of gene expressions. Pinpointing gene drivers underlying CTA can provide deep insight into gene regulatory mechanisms and serve as disease biomarkers. In this study, we develop a new method, eDOC, to address the aforementioned challenges. eDOC leverages a transformer architecture with evidential learning to annotate In-Domain (IND) and OOD cell types as well as to highlight genes that contribute to both IND cells and OOD cells at single-cell resolution. Rigorous experiments demonstrate that eDOC significantly improves the efficiency and effectiveness of OOD cell type and gene driver identification compared to other state-of-the-art methods. Our findings suggest that eDOC may provide new insights into single-cell biology.
Submitted 30 October, 2024;
originally announced November 2024.
-
Dumpling GNN: Hybrid GNN Enables Better ADC Payload Activity Prediction Based on Chemical Structure
Authors:
Shengjie Xu,
Lingxi Xie
Abstract:
Antibody-drug conjugates (ADCs) have emerged as a promising class of targeted cancer therapeutics, but the design and optimization of their cytotoxic payloads remain challenging. This study introduces DumplingGNN, a novel hybrid Graph Neural Network architecture specifically designed for predicting ADC payload activity based on chemical structure. By integrating Message Passing Neural Networks (MPNN), Graph Attention Networks (GAT), and GraphSAGE layers, DumplingGNN effectively captures multi-scale molecular features and leverages both 2D topological and 3D structural information. We evaluate DumplingGNN on a comprehensive ADC payload dataset focusing on DNA Topoisomerase I inhibitors, as well as on multiple public benchmarks from MoleculeNet. DumplingGNN achieves state-of-the-art performance across several datasets, including BBBP (96.4% ROC-AUC), ToxCast (78.2% ROC-AUC), and PCBA (88.87% ROC-AUC). On our specialized ADC payload dataset, it demonstrates exceptional accuracy (91.48%), sensitivity (95.08%), and specificity (97.54%). Ablation studies confirm the synergistic effects of the hybrid architecture and the critical role of 3D structural information in enhancing predictive accuracy. The model's strong interpretability, enabled by attention mechanisms, provides valuable insights into structure-activity relationships. DumplingGNN represents a significant advancement in molecular property prediction, with particular promise for accelerating the design and optimization of ADC payloads in targeted cancer therapy development.
Submitted 23 September, 2024;
originally announced October 2024.
-
Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction
Authors:
Xiaohua Lu,
Liangxu Xie,
Lei Xu,
Rongzhi Mao,
Shan Chang,
Xiaojun Xu
Abstract:
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, the inherent limitation of mono-modal learning arises from relying solely on one modality of molecular representation, which restricts a comprehensive understanding of drug molecules and hampers their resilience against data noise. To overcome these limitations, we construct multimodal deep learning models to cover different molecular representations. We convert drug molecules into three molecular representations: SMILES-encoded vectors, ECFP fingerprints, and molecular graphs. To process the modal information, a Transformer-Encoder, bi-directional gated recurrent units (BiGRU), and a graph convolutional network (GCN) are utilized for feature learning, respectively, which can enhance the model's capability to acquire complementary and naturally occurring bioinformatics information. We evaluated our triple-modal model on six molecule datasets. Different from bi-modal learning models, we adopt five fusion methods to capture the specific features and better leverage the contribution of each modality. Compared with mono-modal models, our multimodal fused deep learning (MMFDL) models outperform single models in accuracy, reliability, and noise resistance. Moreover, we demonstrate its generalization ability in the prediction of binding constants for protein-ligand complex molecules in the refined set of PDBbind. The advantage of the multimodal model lies in its ability to process diverse sources of data using proper models and suitable fusion methods, which would enhance the noise resistance of the model while obtaining data diversity.
Submitted 12 September, 2024; v1 submitted 29 December, 2023;
originally announced December 2023.
-
Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence
Authors:
Shuo Zhang,
Lei Xie
Abstract:
Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling over the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins currently have reliable structure information, LaMPSite will provide new opportunities for drug discovery.
Submitted 4 December, 2023;
originally announced December 2023.
-
A Universal Framework for Accurate and Efficient Geometric Deep Learning of Molecular Systems
Authors:
Shuo Zhang,
Yang Liu,
Lei Xie
Abstract:
Molecular sciences address a wide range of problems involving molecules of different types and sizes and their complexes. Recently, geometric deep learning, especially Graph Neural Networks, has shown promising performance in molecular science applications. However, most existing works impose inductive biases targeted to a specific molecular system, and are inefficient when applied to macromolecules or large-scale tasks, thereby limiting their applications to many real-world problems. To address these challenges, we present PAMNet, a universal framework for accurately and efficiently learning the representations of three-dimensional (3D) molecules of varying sizes and types in any molecular system. Inspired by molecular mechanics, PAMNet induces a physics-informed bias to explicitly model local and non-local interactions and their combined effects. As a result, PAMNet can reduce expensive operations, making it time- and memory-efficient. In extensive benchmark studies, PAMNet outperforms state-of-the-art baselines regarding both accuracy and efficiency in three diverse learning tasks: small molecule properties, RNA 3D structures, and protein-ligand binding affinities. Our results highlight the potential for PAMNet in a broad range of molecular science applications.
Submitted 18 November, 2023;
originally announced November 2023.
-
Learning Universal and Robust 3D Molecular Representations with Graph Convolutional Networks
Authors:
Shuo Zhang,
Yang Liu,
Li Xie,
Lei Xie
Abstract:
To learn accurate representations of molecules, it is essential to consider both chemical and geometric features. To encode geometric information, many descriptors have been proposed in constrained circumstances for specific types of molecules and do not have the properties to be "robust": 1. Invariant to rotations and translations; 2. Injective when embedding molecular structures. In this work, we propose a universal and robust Directional Node Pair (DNP) descriptor based on the graph representations of 3D molecules. Our DNP descriptor is robust compared to previous ones and can be applied to multiple molecular types. To combine the DNP descriptor and chemical features in molecules, we construct the Robust Molecular Graph Convolutional Network (RoM-GCN), which is capable of taking both node and edge features into consideration when generating molecule representations. We evaluate our model on protein and small molecule datasets. Our results validate the superiority of the DNP descriptor in incorporating 3D geometric information of molecules. RoM-GCN outperforms all compared baselines.
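The first "robust" property named in the abstract, invariance to rotations and translations, can be illustrated with a toy descriptor. The sketch below is not the DNP descriptor: it uses the sorted list of interatomic distances as a stand-in, and the function names, coordinates, angle, and shift are all hypothetical choices made for illustration.

```python
import math

def pairwise_distances(coords):
    """Toy geometric descriptor (an assumption, not DNP): the sorted
    list of all interatomic distances in a 3D point set."""
    n = len(coords)
    return sorted(
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    )

def rotate_z_and_translate(coords, angle, shift):
    """Rigid motion: rotate about the z-axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]

# Hypothetical three-atom geometry (arbitrary units).
molecule = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
moved = rotate_z_and_translate(molecule, angle=1.1, shift=(3.0, -2.0, 5.0))

before = pairwise_distances(molecule)
after = pairwise_distances(moved)

# The descriptor is unchanged (up to floating-point error) by the rigid motion.
invariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
print(invariant)
```

A sorted distance list is invariant but not injective in general: mirror images, for example, share the same distances. That gap is the second property the abstract asks of a robust descriptor.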
Submitted 23 July, 2023;
originally announced July 2023.
-
Regional Deep Atrophy: a Self-Supervised Learning Method to Automatically Identify Regions Associated With Alzheimer's Disease Progression From Longitudinal MRI
Authors:
Mengjin Dong,
Long Xie,
Sandhitsu R. Das,
Jiancong Wang,
Laura E. M. Wisse,
Robin deFlores,
David A. Wolk,
Paul A. Yushkevich
Abstract:
Longitudinal assessment of brain atrophy, particularly in the hippocampus, is a well-studied biomarker for neurodegenerative diseases, such as Alzheimer's disease (AD). In clinical trials, estimation of brain progressive rates can be applied to track therapeutic efficacy of disease-modifying treatments. However, most state-of-the-art measurements calculate changes directly by segmentation and/or deformable registration of MRI images, and may misreport head motion or MRI artifacts as neurodegeneration, impacting their accuracy. In our previous study, we developed a deep learning method, DeepAtrophy, that uses a convolutional neural network to quantify differences between longitudinal MRI scan pairs that are associated with time. DeepAtrophy has high accuracy in inferring temporal information from longitudinal MRI scans, such as temporal order or relative inter-scan interval. DeepAtrophy also provides an overall atrophy score that was shown to perform well as a potential biomarker of disease progression and treatment efficacy. However, DeepAtrophy is not interpretable, and it is unclear what changes in the MRI contribute to progression measurements. In this paper, we propose Regional Deep Atrophy (RDA), which combines the temporal inference approach from DeepAtrophy with a deformable registration neural network and an attention mechanism that highlights regions in the MRI image where longitudinal changes are contributing to temporal inference. RDA has prediction accuracy similar to that of DeepAtrophy, but its additional interpretability makes it more acceptable for use in clinical settings, and may lead to more sensitive biomarkers for disease monitoring in clinical trials of early AD.
Submitted 10 April, 2023;
originally announced April 2023.
-
A Tribute to Phil Bourne -- Scientist and Human
Authors:
Cameron Mura,
Emma Candelier,
Lei Xie
Abstract:
This Special Issue of Biomolecules, commissioned in honor of Dr. Philip E. Bourne, focuses on a new field of biomolecular data science. In this brief retrospective, we consider the arc of Bourne's 40-year scientific and professional career, particularly as it relates to the origins of this new field.
Submitted 8 December, 2022;
originally announced December 2022.
-
Physics-aware Graph Neural Network for Accurate RNA 3D Structure Prediction
Authors:
Shuo Zhang,
Yang Liu,
Lei Xie
Abstract:
Biological functions of RNAs are determined by their three-dimensional (3D) structures. Thus, given the limited number of experimentally determined RNA structures, the prediction of RNA structures will facilitate elucidating RNA functions and RNA-targeted drug discovery, but remains a challenging task. In this work, we propose a Graph Neural Network (GNN)-based scoring function trained only with the atomic types and coordinates on limited solved RNA 3D structures for distinguishing accurate structural models. The proposed Physics-aware Multiplex Graph Neural Network (PaxNet) separately models the local and non-local interactions inspired by molecular mechanics. Furthermore, PaxNet contains an attention-based fusion module that learns the individual contribution of each interaction type for the final prediction. We rigorously evaluate the performance of PaxNet on two benchmarks and compare it with several state-of-the-art baselines. The results show that PaxNet significantly outperforms all the baselines overall, and demonstrate the potential of PaxNet for improving the 3D structure modeling of RNA and other macromolecules. Our code is available at https://github.com/zetayue/Physics-aware-Multiplex-GNN.
Submitted 23 July, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Efficient and Accurate Physics-aware Multiplex Graph Neural Networks for 3D Small Molecules and Macromolecule Complexes
Authors:
Shuo Zhang,
Yang Liu,
Lei Xie
Abstract:
Recent advances in applying Graph Neural Networks (GNNs) to molecular science have showcased the power of learning three-dimensional (3D) structure representations with GNNs. However, most existing GNNs suffer from the limitations of insufficient modeling of diverse interactions, computationally expensive operations, and ignorance of vectorial values. Here, we tackle these limitations by proposing a novel GNN model, Physics-aware Multiplex Graph Neural Network (PaxNet), to efficiently and accurately learn the representations of 3D molecules for both small organic compounds and macromolecule complexes. PaxNet separates the modeling of local and non-local interactions inspired by molecular mechanics, and reduces the expensive angle-related computations. Besides scalar properties, PaxNet can also predict vectorial properties by learning an associated vector for each atom. To evaluate the performance of PaxNet, we compare it with state-of-the-art baselines in two tasks. On a small-molecule dataset for predicting quantum chemical properties, PaxNet reduces the prediction error by 15% and uses 73% less memory than the best baseline. On a macromolecule dataset for predicting protein-ligand binding affinities, PaxNet outperforms the best baseline while reducing the memory consumption by 33% and the inference time by 85%. Thus, PaxNet provides a universal, robust and accurate method for large-scale machine learning of molecules. Our code is available at https://github.com/zetayue/Physics-aware-Multiplex-GNN.
Submitted 18 November, 2023; v1 submitted 5 June, 2022;
originally announced June 2022.
-
DePS: An improved deep learning model for de novo peptide sequencing
Authors:
Cheng Ge,
Yi Lu,
Jia Qu,
Liangxu Xie,
Feng Wang,
Hong Zhang,
Ren Kong,
Shan Chang
Abstract:
De novo peptide sequencing from mass spectrometry data is an important method for protein identification. Recently, various deep learning approaches were applied for de novo peptide sequencing, and DeepNovoV2 is one of the representative models. In this study, we proposed an enhanced model, DePS, which can improve the accuracy of de novo peptide sequencing even with missing signal peaks or a large number of noisy peaks in tandem mass spectrometry data. We show that, on the same test set as DeepNovoV2, the DePS model achieved excellent results of 74.22%, 74.21% and 41.68% for amino acid recall, amino acid precision and peptide recall, respectively. Furthermore, the results suggested that DePS outperforms DeepNovoV2 on the cross-species dataset.
Submitted 16 March, 2022;
originally announced March 2022.
-
Reinforcement Learning for Personalized Drug Discovery and Design for Complex Diseases: A Systems Pharmacology Perspective
Authors:
Ryan K. Tan,
Yang Liu,
Lei Xie
Abstract:
Many multi-genic systemic diseases such as neurological disorders, inflammatory diseases, and the majority of cancers do not have effective treatments yet. Reinforcement learning-powered systems pharmacology is a potentially effective approach to design personalized therapies for untreatable complex diseases. In this survey, state-of-the-art reinforcement learning methods and their latest applications to drug design are reviewed. The challenges of harnessing reinforcement learning for systems pharmacology and personalized medicine are discussed. Potential solutions to overcome the challenges are proposed. Despite the successful application of advanced reinforcement learning techniques to target-based drug discovery, new reinforcement learning strategies are needed to address systems pharmacology-oriented personalized de novo drug design.
Submitted 23 February, 2022; v1 submitted 21 January, 2022;
originally announced January 2022.
-
Mathematical Properties of Incremental Effect Additivity and Other Synergy Theories
Authors:
Leonid Hanin,
Liyang Xie,
Rainer Sachs
Abstract:
Synergy theories for multi-component agent mixtures use 1-agent dose-effect relations, assumed known from analyzing previous 1-agent experiments, to calculate baseline Neither-Synergy-Nor-Antagonism mixture dose-effect relations. The most commonly used synergy theory, Simple Effect Additivity, is not self-consistent mathematically. Many nonlinear alternatives have been suggested, almost all of which require an assumption that effects increase monotonically as dose increases. We here emphasize the recently introduced Incremental Effect Additivity synergy theory and briefly discuss Loewe Additivity. By utilizing the fact that, when dose increments approach zero, dose-effect relations approach linearity, Incremental Effect Additivity theory to some extent circumvents the non-linearity of dose-effect relations that plagues Simple Effect Additivity calculations. We study mathematical properties of Incremental Effect Additivity that are relevant to practical implementation of this synergy theory and hold whatever particular area of biology, medicine, toxicology or pharmacology is involved. However, as yet Incremental Effect Additivity synergy theory has only been applied to mixture experiments simulating the toxic galactic cosmic ray mixture encountered during voyages in interplanetary space. Our main results are theorems, propositions, examples and counterexamples revealing various properties of Incremental Effect Additivity synergy theory, including whether or not Neither-Synergy-Nor-Antagonism dose-effect relations lie between 1-agent dose-effect relations. These results are amply illustrated with figures.
Submitted 24 December, 2021;
originally announced December 2021.
-
Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology
Authors:
Tian Cai,
Li Xie,
Muge Chen,
Yang Liu,
Di He,
Shuo Zhang,
Cameron Mura,
Philip E. Bourne,
Lei Xie
Abstract:
Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones -- a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods, thereby allowing us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.
Submitted 23 November, 2021;
originally announced November 2021.
-
CODE-AE: A Coherent De-confounding Autoencoder for Predicting Patient-Specific Drug Response From Cell Line Transcriptomics
Authors:
Di He,
Lei Xie
Abstract:
Accurate and robust prediction of a patient's response to drug treatments is critical for developing precision medicine. However, it is often difficult to obtain a sufficient amount of coherent drug response data from patients directly for training a generalized machine learning model. Although the utilization of rich cell line data provides an alternative solution, it is challenging to transfer the knowledge obtained from cell lines to patients due to various confounding factors. Few existing transfer learning methods can reliably disentangle common intrinsic biological signals from confounding factors in the cell line and patient data. In this paper, we develop a Coherent Deconfounding Autoencoder (CODE-AE) that can extract both common biological signals shared by incoherent samples and private representations unique to each data set, transfer knowledge learned from cell line data to tissue data, and separate confounding factors from them. Extensive studies on multiple data sets demonstrate that CODE-AE significantly improves the accuracy and robustness over state-of-the-art methods in both predicting patient drug response and de-confounding biological signals. Thus, CODE-AE provides a useful framework to take advantage of in vitro omics data for developing generalized patient predictive models. The source code is available at https://github.com/XieResearchGroup/CODE-AE.
Submitted 31 January, 2021;
originally announced February 2021.
-
Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures
Authors:
Shuo Zhang,
Yang Liu,
Lei Xie
Abstract:
The prediction of physicochemical properties from molecular structures is a crucial task for artificial intelligence aided molecular design. A growing number of Graph Neural Networks (GNNs) have been proposed to address this challenge. These models improve their expressive power by incorporating auxiliary information in molecules while inevitably increasing their computational complexity. In this work, we aim to design a GNN which is both powerful and efficient for molecular structures. To achieve this goal, we propose a molecular mechanics-driven approach by first representing each molecule as a two-layer multiplex graph, where one layer contains only local connections that mainly capture the covalent interactions and another layer contains global connections that can simulate non-covalent interactions. Then for each layer, a corresponding message passing module is proposed to balance the trade-off between expressive power and computational complexity. Based on these two modules, we build the Multiplex Molecular Graph Neural Network (MXMNet). When validated on the QM9 dataset for small molecules and the PDBBind dataset for large protein-ligand complexes, MXMNet achieves superior results to the existing state-of-the-art models under restricted resources.
Submitted 15 November, 2020;
originally announced November 2020.
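The two-layer multiplex graph described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the global connections are chosen; the distance cutoff used here, the function name `build_multiplex_graph`, and the toy coordinates are all assumptions made for illustration, not the authors' implementation.

```python
import math
from itertools import combinations


def build_multiplex_graph(coords, bonds, cutoff=5.0):
    """Return (local_edges, global_edges) for one molecule.

    coords: list of (x, y, z) atom positions.
    bonds:  iterable of (i, j) index pairs for covalently bonded atoms.
    cutoff: distance threshold for the global layer (an assumed rule).
    """
    # Local layer: covalent bonds only, stored as sorted index pairs.
    local_edges = {tuple(sorted(b)) for b in bonds}
    # Global layer: every atom pair within the cutoff distance,
    # standing in for possible non-covalent interactions.
    global_edges = {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    }
    return local_edges, global_edges


# Toy three-atom chain; the coordinates are illustrative only.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bonds = [(0, 1), (1, 2)]
local_edges, global_edges = build_multiplex_graph(coords, bonds, cutoff=2.5)
```

In this toy case the global layer contains the pair (0, 2), which has no covalent bond, while the local layer holds only the two bonded pairs. A message passing module would then be run per layer.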
-
DeepAtrophy: Teaching a Neural Network to Differentiate Progressive Changes from Noise on Longitudinal MRI in Alzheimer's Disease
Authors:
Mengjin Dong,
Long Xie,
Sandhitsu R. Das,
Jiancong Wang,
Laura E. M. Wisse,
Robin deFlores,
David A. Wolk,
Paul Yushkevich
Abstract:
Volume change measures derived from longitudinal MRI (e.g. hippocampal atrophy) are a well-studied biomarker of disease progression in Alzheimer's Disease (AD) and are used in clinical trials to track the therapeutic efficacy of disease-modifying treatments. However, longitudinal MRI change measures can be confounded by non-biological factors, such as different degrees of head motion and susceptibility artifact between pairs of MRI scans. We hypothesize that deep learning methods applied directly to pairs of longitudinal MRI scans can be trained to differentiate between biological changes and non-biological factors better than conventional approaches based on deformable image registration. To achieve this, we make a simplifying assumption that biological factors are associated with time (i.e. the hippocampus shrinks over time in the aging population) whereas non-biological factors are independent of time. We then formulate deep learning networks to infer the temporal order of same-subject MRI scans input to the network in arbitrary order, as well as to infer ratios between interscan intervals for two pairs of same-subject MRI scans. In the test dataset, these networks perform better in tasks of temporal ordering (89.3%) and interscan interval inference (86.1%) than a state-of-the-art deformation-based morphometry method ALOHA (76.6% and 76.1% respectively) (Das et al., 2012). Furthermore, we derive a disease progression score from the network that is able to detect a group difference between 58 preclinical AD and 75 beta-amyloid-negative cognitively normal individuals within one year, compared to two years for ALOHA. This suggests that deep learning can be trained to differentiate MRI changes due to biological factors (tissue loss) from changes due to non-biological factors, leading to novel biomarkers that are more sensitive to longitudinal changes at the earliest stages of AD.
Submitted 24 October, 2020;
originally announced October 2020.
-
A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine
Authors:
Di He,
Lei Xie
Abstract:
An unsolved fundamental problem in biology and ecology is to predict observable traits (phenotypes) from a new genetic constitution (genotype) of an organism under environmental perturbations (e.g., drug treatment). The emergence of multiple omics data provides new opportunities but imposes great challenges in the predictive modeling of genotype-phenotype associations. Firstly, the high dimensionality of genomics data and the lack of labeled data often make the existing supervised learning techniques less successful. Secondly, it is a challenging task to integrate heterogeneous omics data from different resources. Finally, the information transmission from DNA to phenotype involves multiple intermediate levels of RNA, protein, metabolite, etc. The higher-level features (e.g., gene expression) usually have stronger discriminative power than the lower-level features (e.g., somatic mutation). To address the above issues, we propose a novel Cross-LEvel Information Transmission network (CLEIT) framework. CLEIT aims to explicitly model the asymmetrical multi-level organization of the biological system. Inspired by domain adaptation, CLEIT first learns the latent representation of the high-level domain, then uses it as a ground-truth embedding to improve the representation learning of the low-level domain in the form of a contrastive loss. In addition, we adopt a pre-training-fine-tuning approach to leverage the unlabeled heterogeneous omics data to improve the generalizability of CLEIT. We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations via the assistance of gene expressions when compared with state-of-the-art methods.
Submitted 9 October, 2020;
originally announced October 2020.
-
Structural insights into characterizing binding sites in EGFR kinase mutants
Authors:
Zheng Zhao,
Lei Xie,
Philip E. Bourne
Abstract:
Over the last two decades epidermal growth factor receptor (EGFR) kinase has become an important target to treat non-small cell lung cancer (NSCLC). Currently, three generations of EGFR kinase-targeted small molecule drugs have been FDA approved. They nominally produce a response at the start of treatment and lead to a substantial survival benefit for patients. However, long-term treatment results in acquired drug resistance and further vulnerability to NSCLC. Therefore, novel EGFR kinase inhibitors that specifically overcome acquired mutations are urgently needed. To this end, we carried out a comprehensive study of different EGFR kinase mutants using a structural systems pharmacology strategy. Our analysis shows that both wild-type and mutated structures exhibit multiple conformational states that have not been observed in solved crystal structures. We show that this conformational flexibility accommodates diverse types of ligands with multiple types of binding modes. These results provide insights for designing a new generation of EGFR kinase inhibitors that combat acquired drug-resistant mutations through a multi-conformation-based drug design strategy.
Submitted 26 December, 2018;
originally announced December 2018.
-
The Activation Entropy Change in Enzymatic Reaction Catalyzed by Isochorismate-Pyruvate Lyase of Pseudomonas Aeruginosa PchB
Authors:
Liangxu Xie,
Zhe-Ning Chen,
Mingjun Yang
Abstract:
The entropic contribution to enzyme catalysis has been debated for decades. The recent experimental measurement of the activation enthalpy and entropy for the chorismate rearrangement reaction in PchB raises the hotly debated question of whether the chorismate mutase-catalyzed reaction is entropy-driven. Extensive configurational sampling combined with quantum mechanics/molecular mechanics molecular dynamics (QM/MM MD) provides an approach to calculate the entropic contribution in condensed-phase reactions. The complete reaction pathway is explored by QM/MM MD simulations at the DFT and SCC-DFTB levels. The overall entropy change calculated from QM/MM MD simulations at the SCC-DFTB level is in close agreement with the experimental value. Conformational analysis indicates that the self-ordering of chorismate in the active site of PchB also contributes to the total entropy change. This entropy penalty, which includes the conformational transformation entropy and the activation entropy, cannot be intuitively inferred from the crystal structure, which represents only a stationary state along the reaction pathway of the PchB-catalyzed reaction. This is the first use of QM/MM MD simulations to calculate the activation entropy from the temperature dependence of reliable free energy profiles with extensive simulation time. The resulting insight into the enthalpy/entropy scheme clarifies the detailed entropy change and provides a quantitative tool for interpreting the contradictory experimental results.
Submitted 5 November, 2017;
originally announced November 2017.
-
Automated deconvolution of structured mixtures from bulk tumor genomic data
Authors:
Theodore Roman,
Lu Xie,
Russell Schwartz
Abstract:
Motivation: As cancer researchers have come to appreciate the importance of intratumor heterogeneity, much attention has focused on the challenges of accurately profiling heterogeneity in individual patients. Experimental technologies for directly profiling genomes of single cells are rapidly improving, but they are still impractical for large-scale sampling. Bulk genomic assays remain the standard for population-scale studies, but conflate the influences of mixtures of genetically distinct tumor, stromal, and infiltrating immune cells. Many computational approaches have been developed to deconvolute these mixed samples and reconstruct the genomics of genetically homogeneous clonal subpopulations. All such methods, however, are limited to reconstructing only coarse approximations to a few major subpopulations. In prior work, we showed that one can improve deconvolution of genomic data by leveraging substructure in cellular mixtures through a strategy called simplicial complex inference. This strategy, however, is also limited by the difficulty of inferring mixture structure from sparse, noisy assays. Results: We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We show that these improvements lead to more accurate inference of cell populations and mixture proportions in simulated scenarios. We further demonstrate their effectiveness in identifying mixture substructure in real tumor CNV data. Availability: Source code is available at http://www.cs.cmu.edu/~russells/software/WSCUnmix.zip
Submitted 8 April, 2016;
originally announced April 2016.
-
Derivative-free optimization of rate parameters of capsid assembly models from bulk in vitro data
Authors:
Lu Xie,
Gregory R. Smith,
Russell Schwartz
Abstract:
The assembly of virus capsids from free coat proteins proceeds by a complicated cascade of association and dissociation steps, the great majority of which cannot be directly experimentally observed. This has made capsid assembly a rich field for computational models to attempt to fill the gaps in what is experimentally observable. Nonetheless, accurate simulation predictions depend on accurate models and there are substantial obstacles to model inference for such systems. Here, we describe progress in learning parameters for capsid assembly systems, particularly kinetic rate constants of coat-coat interactions, by computationally fitting simulations to experimental data. We previously developed an approach to learn rate parameters of coat-coat interactions by minimizing the deviation between real and simulated light scattering data monitoring bulk capsid assembly in vitro. This is a difficult data-fitting problem, however, because of the high computational cost of simulating assembly trajectories, the stochastic noise inherent to the models, and the limited and noisy data available for fitting. Here we show that a newer class of methods, based on derivative-free optimization (DFO), can more quickly and precisely learn physical parameters from static light scattering data. We further explore how the advantages of the approaches might be affected by alternative data sources through simulation of a model of time-resolved mass spectrometry data, an alternative technology for monitoring bulk capsid assembly that can be expected to provide much richer data. The results show that advances in both the data and the algorithms can improve model inference, with rich data leading to high-quality fits for all methods, but DFO methods showing substantial advantages over less informative data sources better representative of the current experimental practice.
Submitted 7 July, 2015;
originally announced July 2015.
-
Avoid Internal Loops in Steady State Flux Space Sampling
Authors:
Lu Xie
Abstract:
A widely used method in metabolic network studies, Monte Carlo sampling of the steady-state flux space is known for its flexibility: it can serve different purposes simply by altering constraints or objective functions, or by appending post-processing steps. Recently, a non-linear constraint based on the second law of thermodynamics, known as the "Loop Law", has challenged current sampling algorithms, which inevitably give rise to internal loops. A generalized method is proposed here to eliminate the possibility of internal loops appearing during the sampling process. Based on the Artificial Centered Hit and Run (ACHR) method, each step of the new sampling process avoids entering "loop-forming" subspaces. This method has been applied to the metabolic network of Helicobacter pylori with three different objective functions: uniform sampling, optimizing biomass synthesis, and optimizing biomass synthesis efficiency over resources ingested. Comparison between results from the new method and the conventional ACHR method shows effective elimination of loop fluxes without affecting non-loop fluxes.
Submitted 18 October, 2012;
originally announced October 2012.
-
Implications of 3-step swimming patterns in bacterial chemotaxis
Authors:
Tuba Altindal,
Li Xie,
Xiao-Lun Wu
Abstract:
We recently found that marine bacteria Vibrio alginolyticus execute a cyclic 3-step (run-reverse-flick) motility pattern that is distinctively different from the 2-step (run-tumble) pattern of Escherichia coli. How this novel swimming pattern is regulated by cells of V. alginolyticus is not currently known, but its significance for bacterial chemotaxis is self-evident and will be delineated herein. Using an approach introduced by de Gennes, we calculated the migration speed of a cell executing the 3-step pattern in a linear chemical gradient, and found that a biphasic chemotactic response arises naturally. The implication of such a response for the cells to adapt to ocean environments and its possible connection to E. coli's response are also discussed.
Submitted 14 November, 2010;
originally announced November 2010.
-
Imposition of Different Optimizing Object with Non-Linear Constraints on Flux Sampling and Elimination of Free Futile Pathways
Authors:
Lu Xie,
Yi Zhang
Abstract:
Constraint-based modeling has been widely used in metabolic network analysis, for tasks such as biosynthetic prediction and flux optimization. Linear constraints, such as the mass conservation constraint, the reversibility constraint, and the biological capacity constraint, can be imposed in linear algorithms. Recently, however, a non-linear constraint based on the second law of thermodynamics, known as the "loop law", has emerged and challenged the existing algorithms. Although it has proven infeasible to impose with linear methods, this non-linear constraint has been successfully imposed on the sampling process. Here, Monte Carlo sampling with the Metropolis criterion and simulated annealing is introduced to optimize the biomass synthesis of the genome-scale metabolic network of Helicobacter pylori (iIT341 GSM / GPR) under the mass conservation constraint, the biological capacity constraint, and thermodynamic constraints including reversibility and the "loop law". The sampling method has also been employed to optimize a non-linear objective function, the biomass synthesis rate normalized by the total number of incoming reducible electrons. To verify whether a sample contains internal loops, an automatic procedure has been developed based on solving a set of inequalities. In addition, a new type of pathway is proposed here, the Futile Pathway, which has three properties: 1) its mass flow can be self-balanced; 2) it has exchange reactions; 3) it is independent of the biomass synthesis. To eliminate the fluxes of the Futile Pathways from the sampling results, a linear programming based method is suggested, and the results show improved correlations among the reaction fluxes in the pathways related to biomass synthesis.
Submitted 28 November, 2009; v1 submitted 7 November, 2007;
originally announced November 2007.
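The combination of the Metropolis criterion with simulated annealing mentioned in this abstract can be sketched generically as follows. This is only an illustration of the acceptance rule under assumed choices: the geometric cooling schedule, the function names, and the one-dimensional toy objective are hypothetical, and none of the flux constraints (mass conservation, capacity, "loop law") are modeled here.

```python
import math
import random


def metropolis_accept(delta, temperature, rng):
    """Metropolis criterion for maximization: always accept an improvement,
    accept a worse move with probability exp(delta / temperature)."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def anneal(objective, propose, x0, t_start=1.0, t_end=1e-3, steps=5000, seed=0):
    """Maximize `objective` by simulated annealing; returns (best_x, best_value)."""
    rng = random.Random(seed)
    x, fx = x0, objective(x0)
    best_x, best_fx = x, fx
    for k in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        cand = propose(x, rng)
        fc = objective(cand)
        if metropolis_accept(fc - fx, t, rng):
            x, fx = cand, fc
            if fx > best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx


def objective(x):
    # Toy stand-in for a biomass-like objective, maximal at x = 2.
    return -(x - 2.0) ** 2


def propose(x, rng):
    # Random step clipped to [0, 4]; in a real flux sampler, constraint
    # handling would live in this proposal step.
    return min(4.0, max(0.0, x + rng.uniform(-0.5, 0.5)))


best_x, best_fx = anneal(objective, propose, x0=0.0)
```

At high temperature the sampler accepts many worsening moves and explores broadly; as the temperature falls it behaves increasingly like greedy ascent, which is what lets the same sampling machinery double as an optimizer.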