-
An LLM-Driven Multi-Agent Debate System for Mendelian Diseases
Authors:
Xinyang Zhou,
Yongyong Ren,
Qianqian Zhao,
Daoyi Huang,
Xinbo Wang,
Tingting Zhao,
Zhixing Zhu,
Wenyuan He,
Shuyuan Li,
Yan Xu,
Yu Sun,
Yongguo Yu,
Shengnan Wu,
Jian Wang,
Guangjun Yu,
Dake He,
Bo Ban,
Hui Lu
Abstract:
Accurate diagnosis of Mendelian diseases is crucial for precision therapy and assistance in preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-Driven multi-agent debate system (MD2GPS) with natural language explanations of the diagnostic results. It uses a language model to transform results from data-driven and knowledge-driven agents into natural language, and then fosters a debate between these two specialized agents. This system has been tested on 1,185 samples across four independent datasets, enhancing the TOP1 accuracy from 42.9% to 66% on average. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients, reducing the diagnostic time by 90%. The methods within each module of this multi-agent debate system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.
Submitted 11 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Bridging Geometric States via Geometric Diffusion Bridge
Authors:
Shengjie Luo,
Yixian Xu,
Di He,
Shuxin Zheng,
Tie-Yan Liu,
Liwei Wang
Abstract:
The accurate prediction of geometric state evolution in complex systems is critical for advancing scientific domains such as quantum chemistry and material modeling. Traditional experimental and computational methods face challenges in terms of environmental constraints and computational demands, while current deep learning approaches still fall short in terms of precision and generality. In this work, we introduce the Geometric Diffusion Bridge (GDB), a novel generative modeling framework that accurately bridges initial and target geometric states. GDB leverages a probabilistic approach to evolve geometric state distributions, employing an equivariant diffusion bridge derived by a modified version of Doob's $h$-transform for connecting geometric states. This tailored diffusion process is anchored by initial and target geometric states as fixed endpoints and governed by equivariant transition kernels. Moreover, trajectory data can be seamlessly leveraged in our GDB framework by using a chain of equivariant diffusion bridges, providing a more detailed and accurate characterization of evolution dynamics. Theoretically, we conduct a thorough examination to confirm our framework's ability to preserve joint distributions of geometric states and capability to completely model the underlying dynamics inducing trajectory distributions with negligible error. Experimental evaluations across various real-world scenarios show that GDB surpasses existing state-of-the-art approaches, opening up a new pathway for accurately bridging geometric states and tackling crucial scientific challenges with improved accuracy and applicability.
Submitted 31 October, 2024;
originally announced October 2024.
-
UnPaSt: unsupervised patient stratification by differentially expressed biclusters in omics data
Authors:
Michael Hartung,
Andreas Maier,
Fernando Delgado-Chaves,
Yuliya Burankova,
Olga I. Isaeva,
Fábio Malta de Sá Patroni,
Daniel He,
Casey Shannon,
Katharina Kaufmann,
Jens Lohmann,
Alexey Savchik,
Anne Hartebrodt,
Zoe Chervontseva,
Farzaneh Firoozbakht,
Niklas Probul,
Evgenia Zotova,
Olga Tsoy,
David B. Blumenthal,
Martin Ester,
Tanja Laske,
Jan Baumbach,
Olga Zolotareva
Abstract:
Most complex diseases, including cancer and non-malignant diseases like asthma, have distinct molecular subtypes that require distinct clinical approaches. However, existing computational patient stratification methods have been benchmarked almost exclusively on cancer omics data and only perform well when mutually exclusive subtypes can be characterized by many biomarkers. Here, we contribute a massive evaluation, quantitatively exploring the power of 22 unsupervised patient stratification methods using both simulated and real transcriptome data. Based on this evaluation, we developed UnPaSt (https://apps.cosy.bio/unpast/), which optimizes unsupervised patient stratification and works even with only a limited number of subtype-predictive biomarkers. We evaluated all 23 methods on real-world breast cancer and asthma transcriptomics data. Although many methods reliably detected major breast cancer subtypes, only a few identified Th2-high asthma, and UnPaSt significantly outperformed its closest competitors in both test datasets. Essentially, we showed that UnPaSt can detect many biologically insightful and reproducible patterns in omics datasets.
Submitted 31 July, 2024;
originally announced August 2024.
-
GeoMFormer: A General Architecture for Geometric Molecular Representation Learning
Authors:
Tianlang Chen,
Shengjie Luo,
Di He,
Shuxin Zheng,
Tie-Yan Liu,
Liwei Wang
Abstract:
Molecular modeling, a central topic in quantum mechanics, aims to accurately calculate the properties and simulate the behaviors of molecular systems. The molecular model is governed by physical laws, which impose geometric constraints such as invariance and equivariance to coordinate rotation and translation. While numerous deep learning approaches have been developed to learn molecular representations under these constraints, most of them are built upon heuristic and costly modules. We argue that there is a strong need for a general and flexible framework for learning both invariant and equivariant features. In this work, we introduce a novel Transformer-based molecular model called GeoMFormer to achieve this goal. Using the standard Transformer modules, two separate streams are developed to maintain and learn invariant and equivariant representations. Carefully designed cross-attention modules bridge the two streams, allowing information fusion and enhancing geometric modeling in each stream. As a general and flexible architecture, we show that many previous architectures can be viewed as special instantiations of GeoMFormer. Extensive experiments are conducted to demonstrate the power of GeoMFormer. All empirical results show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales. Code and models will be made publicly available at https://github.com/c-tl/GeoMFormer.
Submitted 24 June, 2024;
originally announced June 2024.
-
RiboDiffusion: Tertiary Structure-based RNA Inverse Folding with Generative Diffusion Models
Authors:
Han Huang,
Ziqian Lin,
Dongchen He,
Liang Hong,
Yu Li
Abstract:
RNA design shows growing applications in synthetic biology and therapeutics, driven by the crucial role of RNA in various biological processes. A fundamental challenge is to find functional RNA sequences that satisfy given structural constraints, known as the inverse folding problem. Computational approaches have emerged to address this problem based on secondary structures. However, designing RNA sequences directly from 3D structures is still challenging, due to the scarcity of data, the non-unique structure-sequence mapping, and the flexibility of RNA conformation. In this study, we propose RiboDiffusion, a generative diffusion model for RNA inverse folding that can learn the conditional distribution of RNA sequences given 3D backbone structures. Our model consists of a graph neural network-based structure module and a Transformer-based sequence module, which iteratively transforms random sequences into desired sequences. By tuning the sampling weight, our model allows for a trade-off between sequence recovery and diversity to explore more candidates. We split test sets based on RNA clustering with different cut-offs for sequence or structure similarity. Our model outperforms baselines in sequence recovery, with an average relative improvement of 11% for sequence similarity splits and 16% for structure similarity splits. Moreover, RiboDiffusion performs consistently well across various RNA length categories and RNA types. We also apply in-silico folding to validate whether the generated sequences can fold into the given 3D RNA backbones. Our method could be a powerful tool for RNA design that explores the vast sequence space and finds novel solutions to 3D structural constraints.
Submitted 17 April, 2024;
originally announced April 2024.
-
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models
Authors:
Lihang Liu,
Shanzhuo Zhang,
Donglong He,
Xianbin Ye,
Jingbo Zhou,
Xiaonan Zhang,
Yaoyao Jiang,
Weiming Diao,
Hang Yin,
Hua Chai,
Fan Wang,
Jingzhou He,
Liang Zheng,
Yonghui Li,
Xiaomin Fang
Abstract:
Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, which raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating its exceptional precision and robust transferability in predicting binding conformations. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and the volume of pre-training data. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility. HelixDock demonstrates outstanding capabilities on both cross-docking and structure-based virtual screening benchmarks.
Submitted 22 May, 2024; v1 submitted 21 October, 2023;
originally announced October 2023.
-
Energy-positive soaring using transient turbulent fluctuations
Authors:
Danyun He,
Gautam Reddy,
Chris H. Rycroft
Abstract:
Soaring birds gain energy from stable ascending currents or shear. However, it remains unclear whether energy loss due to drag can be overcome by extracting work from transient turbulent fluctuations. We designed numerical simulations of gliders navigating in a kinematic model that captures the spatio-temporal correlations of atmospheric turbulence. Energy extraction is enabled by an adaptive algorithm based on Monte Carlo tree search that dynamically filters acquired information about the flow to plan future paths. We show that net energy gain is feasible under realistic constraints. Glider paths reflect patterns of foraging, where exploration of the flow is interspersed with bouts of energy extraction through localized spirals.
Submitted 9 January, 2024; v1 submitted 12 April, 2023;
originally announced April 2023.
-
3D Molecular Generation via Virtual Dynamics
Authors:
Shuqi Lu,
Lin Yao,
Xi Chen,
Hang Zheng,
Di He,
Guolin Ke
Abstract:
Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching a large molecular database, which is inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a promising new way to address this issue. Herein, we propose VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen consists of several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket as in early attempts, in VD-Gen we first randomly initialize many virtual particles in the pocket, then iteratively move these virtual particles so that their distribution approximates the distribution of molecular atoms. After the virtual particles have stabilized in 3D space, we extract a 3D molecule from them. Finally, we further refine the atoms in the extracted molecule by iterative movement again to obtain a high-quality 3D molecule, and predict a confidence score for it. Extensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules to fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.
Submitted 11 February, 2023;
originally announced February 2023.
-
One Transformer Can Understand Both 2D & 3D Molecular Data
Authors:
Shengjie Luo,
Tianlang Chen,
Yixian Xu,
Shuxin Zheng,
Tie-Yan Liu,
Liwei Wang,
Di He
Abstract:
Unlike vision and language data which usually has a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in a 3D space. For molecular representation learning, most previous works designed neural networks only for a particular data format, making the learned models likely to fail for other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separated channels to encode 2D and 3D structural information and incorporate them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel will be activated, and the other will be disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.
Submitted 27 March, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
GEM-2: Next Generation Molecular Property Prediction Network by Modeling Full-range Many-body Interactions
Authors:
Lihang Liu,
Donglong He,
Xiaomin Fang,
Shanzhuo Zhang,
Fan Wang,
Jingzhou He,
Hua Wu
Abstract:
Molecular property prediction is a fundamental task in the drug and material industries. Physically, the properties of a molecule are determined by its own electronic structure, which is a quantum many-body system and can be exactly described by the Schrödinger equation. Full-range many-body interactions between electrons have been proven effective in obtaining an accurate solution of the Schrödinger equation by classical computational chemistry methods, although modeling such interactions incurs a high computational cost. Meanwhile, deep learning methods have also demonstrated their competence in molecular property prediction tasks. Inspired by the classical computational chemistry methods, we design a novel method, namely GEM-2, which comprehensively considers full-range many-body interactions in molecules. Multiple tracks are utilized to model the full-range interactions between the many-bodies with different orders, and a novel axial attention mechanism is designed to approximate the full-range interaction modeling with much lower computational cost. Extensive experiments demonstrate the overwhelming superiority of GEM-2 over multiple baseline methods in quantum chemistry and drug discovery tasks. The ablation studies also verify the effectiveness of the full-range many-body interactions.
Submitted 20 October, 2022; v1 submitted 11 August, 2022;
originally announced August 2022.
-
HelixADMET: a robust and endpoint extensible ADMET system incorporating self-supervised knowledge transfer
Authors:
Shanzhuo Zhang,
Zhiyuan Yan,
Yueyang Huang,
Lihang Liu,
Donglong He,
Wei Wang,
Xiaomin Fang,
Xiaonan Zhang,
Fan Wang,
Hua Wu,
Haifeng Wang
Abstract:
Accurate ADMET (an abbreviation for "absorption, distribution, metabolism, excretion, and toxicity") predictions can efficiently screen out undesirable drug candidates in the early stage of drug discovery. In recent years, multiple comprehensive ADMET systems that adopt advanced machine learning models have been developed, providing services to estimate multiple endpoints. However, those ADMET systems usually suffer from weak extrapolation ability. First, due to the lack of labelled data for each endpoint, typical machine learning models perform poorly on molecules with unobserved scaffolds. Second, most systems only provide fixed built-in endpoints and cannot be customised to satisfy various research requirements. To this end, we develop a robust and endpoint extensible ADMET system, HelixADMET (H-ADMET). H-ADMET incorporates the concept of self-supervised learning to produce a robust pre-trained model. The model is then fine-tuned with a multi-task and multi-stage framework to transfer knowledge between ADMET endpoints, auxiliary tasks, and self-supervised tasks. Our results demonstrate that H-ADMET achieves an overall improvement of 4% compared with existing ADMET systems on comparable endpoints. Additionally, the pre-trained model provided by H-ADMET can be fine-tuned to generate new and customised ADMET endpoints, meeting the various demands of drug research and development.
Submitted 16 May, 2022;
originally announced May 2022.
-
Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology
Authors:
Tian Cai,
Li Xie,
Muge Chen,
Yang Liu,
Di He,
Shuo Zhang,
Cameron Mura,
Philip E. Bourne,
Lei Xie
Abstract:
Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones, a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods, thereby allowing us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.
Submitted 23 November, 2021;
originally announced November 2021.
-
ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction
Authors:
Xiaomin Fang,
Lihang Liu,
Jieqiong Lei,
Donglong He,
Shanzhuo Zhang,
Jingbo Zhou,
Fan Wang,
Hua Wu,
Haifeng Wang
Abstract:
Effective molecular representation learning is of great importance to facilitate molecular property prediction, which is a fundamental task for the drug and material industry. Recent advances in graph neural networks (GNNs) have shown great promise in applying GNNs for molecular representation learning. Moreover, a few recent studies have also demonstrated successful applications of self-supervised learning methods to pre-train the GNNs to overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing the molecular geometry information, even though the three-dimensional (3D) spatial structure of a molecule, a.k.a. molecular geometry, is one of the most critical factors for determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). First, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. Specifically, we devise double graphs for a molecule: the first one encodes the atom-bond relations; the second one encodes bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and show that ChemRL-GEM can significantly outperform all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of 8.8% on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.
Submitted 22 February, 2022; v1 submitted 10 June, 2021;
originally announced June 2021.
-
CODE-AE: A Coherent De-confounding Autoencoder for Predicting Patient-Specific Drug Response From Cell Line Transcriptomics
Authors:
Di He,
Lei Xie
Abstract:
Accurate and robust prediction of patient's response to drug treatments is critical for developing precision medicine. However, it is often difficult to obtain a sufficient amount of coherent drug response data from patients directly for training a generalized machine learning model. Although the utilization of rich cell line data provides an alternative solution, it is challenging to transfer the knowledge obtained from cell lines to patients due to various confounding factors. Few existing transfer learning methods can reliably disentangle common intrinsic biological signals from confounding factors in the cell line and patient data. In this paper, we develop a Coherent Deconfounding Autoencoder (CODE-AE) that can extract both common biological signals shared by incoherent samples and private representations unique to each data set, transfer knowledge learned from cell line data to tissue data, and separate confounding factors from them. Extensive studies on multiple data sets demonstrate that CODE-AE significantly improves the accuracy and robustness over state-of-the-art methods in both predicting patient drug response and de-confounding biological signals. Thus, CODE-AE provides a useful framework to take advantage of in vitro omics data for developing generalized patient predictive models. The source code is available at https://github.com/XieResearchGroup/CODE-AE.
Submitted 31 January, 2021;
originally announced February 2021.
-
A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine
Authors:
Di He,
Lei Xie
Abstract:
An unsolved fundamental problem in biology and ecology is to predict observable traits (phenotypes) from a new genetic constitution (genotype) of an organism under environmental perturbations (e.g., drug treatment). The emergence of multiple omics data provides new opportunities but imposes great challenges in the predictive modeling of genotype-phenotype associations. Firstly, the high dimensionality of genomics data and the lack of labeled data often make the existing supervised learning techniques less successful. Secondly, it is a challenging task to integrate heterogeneous omics data from different resources. Finally, the information transmission from DNA to phenotype involves multiple intermediate levels of RNA, protein, metabolite, etc. The higher-level features (e.g., gene expression) usually have stronger discriminative power than the lower-level features (e.g., somatic mutation). To address the above issues, we propose a novel Cross-LEvel Information Transmission network (CLEIT) framework. CLEIT aims to explicitly model the asymmetrical multi-level organization of the biological system. Inspired by domain adaptation, CLEIT first learns the latent representation of the high-level domain and then uses it as a ground-truth embedding to improve the representation learning of the low-level domain in the form of a contrastive loss. In addition, we adopt a pre-training and fine-tuning approach that leverages the unlabeled heterogeneous omics data to improve the generalizability of CLEIT. We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations with the assistance of gene expression when compared with state-of-the-art methods.
Submitted 9 October, 2020;
originally announced October 2020.
-
Four-tier response system and spatial propagation of COVID-19 in China by a network model
Authors:
Jing Ge,
Daihai He,
Zhigui Lin,
Huaiping Zhu,
Zian Zhuang
Abstract:
To investigate the effectiveness of lockdown and social distancing restrictions, which have been widely adopted as a policy choice to curb the ongoing COVID-19 pandemic around the world, we formulate and discuss a staged and weighted networked system based on a classical SEAIR epidemiological model. Five stages are considered according to the four-tier response to Public Health Crisis, which comes from the National Contingency Plan in China. We derive a staggered basic reproduction number and evaluate the effectiveness of lockdown and social distancing policies under different scenarios among 19 cities/regions in mainland China. Further, we estimate the infection risk associated with the sequential release based on population mobility between cities and the intensity of some non-pharmaceutical interventions. Our results reveal that a Level I public health emergency response is necessary for high-risk cities and can flatten the COVID-19 curve effectively and quickly. Moreover, properly designed staggered-release policies are extremely significant for the prevention and control of COVID-19 and are also beneficial to economic activities and to social stability and development.
Submitted 16 August, 2020;
originally announced August 2020.
-
Religious Festivals and Influenza
Authors:
Alice P. Y. Chiu,
Qianying Lin,
Daihai He
Abstract:
Objectives: Influenza outbreaks have been widely studied. However, the patterns between influenza and religious festivals remained unexplored. This study examined the patterns of influenza and Hanukkah in Israel, and those of influenza and Hajj in Bahrain, Egypt, Iraq, Jordan, Oman and Qatar. Method: Influenza surveillance data of these seven countries from 2009 to 2017 were downloaded from the FluNet of the World Health Organization. Secondary data were collected for the countries' populations and the dates of Hajj and Hanukkah. We aggregated the weekly influenza A and B laboratory confirmations for each country over the study period. Weekly influenza A patterns and religious festival dates were further explored across the study period. Results: We found that influenza A peaks closely followed Hanukkah in Israel in six out of seven years from 2010 to 2017. Aggregated influenza A peaks of the other six Middle East countries also occurred right after Hajj every year during the study period. Conclusions: We predict that unless a new influenza strain emerges, such influenza patterns are likely to persist in future years. Our results suggest that the optimal timing of mass influenza vaccination should take into account the dates of these religious festivals.
Submitted 24 October, 2017;
originally announced October 2017.
-
Patterns of Influenza Vaccination Coverage in the United States from 2009 to 2015
Authors:
Alice P. Y. Chiu,
Duo Yu,
Jonathan Dushoff,
Daihai He
Abstract:
Background: Globally, influenza is a major cause of morbidity, hospitalization and mortality. Influenza vaccination has shown substantial protective effectiveness in the United States. We investigated state-level patterns of coverage rates of seasonal and pandemic influenza vaccination, among the overall population in the U.S. and specifically among children and the elderly, from 2009/10 to 2014/15, and associations with ecological factors.
Methods and Findings: We obtained state-level influenza vaccination coverage rates from national surveys, and state-level socio-demographic and health data from a variety of sources. We employed a retrospective ecological study design, and used mixed-model regression to determine the levels of ecological association of the state-level vaccination rates with these factors, both with and without region as a factor, for the three populations. We found that health-care access is positively and significantly associated with mean influenza vaccination coverage rates across all populations and models. We also found that the prevalence of asthma in adults is negatively and significantly associated with mean influenza vaccination coverage rates in the elderly population.
Conclusions: Health-care access has a robust, positive association with state-level vaccination rates across different populations. This highlights a potential population-level advantage of expanding health-care access.
Submitted 13 March, 2017;
originally announced March 2017.
-
Increasing Trends of Guillain-Barré Syndrome (GBS) and Dengue in Hong Kong
Authors:
Xiujuan Tang,
Shi Zhao,
Alice P. Y. Chiu,
Xin Wang,
Lin Yang,
Daihai He
Abstract:
Background: Guillain-Barré Syndrome (GBS) is a common type of severe acute paralytic neuropathy and is associated with other virus infections such as dengue fever and Zika. This study investigates the relationship between GBS, dengue, local meteorological factors in Hong Kong and global climatic factors from January 2000 to June 2016.
Methods: The correlations between GBS, dengue, Multivariate El Nino Southern Oscillation Index (MEI) and local meteorological data were explored by the Spearman Rank correlations and cross-correlations between these time series. Poisson regression models were fitted to identify nonlinear associations between MEI and dengue. Cross wavelet analysis was applied to infer potential non-stationary oscillating associations among MEI, dengue and GBS.
Findings: An increasing trend was found for both GBS cases and imported dengue cases in Hong Kong. We found a weak but statistically significant negative correlation between GBS and local meteorological factors. MEI explained over 12% of dengue's variations in the Poisson regression models. Wavelet analyses showed a possible non-stationary oscillating association between dengue and GBS from 2005 to 2015 in Hong Kong. Our study has led to an improved understanding of the timing and relationship between GBS, dengue and MEI.
Submitted 13 March, 2017;
originally announced March 2017.
-
Effects of Reactive Social Distancing on the 1918 Influenza Pandemic
Authors:
Duo Yu,
Qianying Lin,
Alice PY Chiu,
Daihai He
Abstract:
The 1918 influenza pandemic was characterized by multiple epidemic waves. We investigated reactive social distancing, a form of behavioral response, and its effect on the multiple influenza waves in the United Kingdom. Two forms of reactive social distancing have been used in previous studies: the Power function, which is a function of the proportion of recent influenza mortality in a population, and the Hill function, which is a function of the actual number of recent influenza deaths. Using a simple epidemic model with a Power function and one common set of parameters, we obtained a good model fit for the observed multiple epidemic waves in London boroughs, Birmingham and Liverpool. Our approach differs from previous studies, where separate models are fitted to each city. We then applied the model parameters obtained from fitting these three cities to all 334 administrative units in England and Wales, including the population sizes of individual administrative units. We computed the Pearson's correlation between the observed and simulated data for each administrative unit. We achieved a median correlation of 0.636, indicating that our model predictions perform reasonably well. Our modelling approach, which requires a reduced number of parameters, resulted in a computational efficiency gain without over-fitting the model. Our work has both scientific and public health significance.
Submitted 12 March, 2017;
originally announced March 2017.
-
Prevention and control of Zika fever as a mosquito-borne and sexually transmitted disease
Authors:
Daozhou Gao,
Yijun Lou,
Daihai He,
Travis C. Porco,
Yang Kuang,
Gerardo Chowell,
Shigui Ruan
Abstract:
The ongoing Zika virus (ZIKV) epidemic poses a major global public health emergency. It is known that ZIKV is spread by Aedes mosquitoes; recent studies show that ZIKV can also be transmitted via sexual contact, and cases of sexually transmitted ZIKV have been confirmed in the U.S., France, and Italy. How sexual transmission affects the spread and control of ZIKV infection is not well understood. We presented a mathematical model to investigate the impact of mosquito-borne and sexual transmission on the spread and control of ZIKV and used the model to fit the ZIKV data in Brazil, Colombia, and El Salvador. Based on the estimated parameter values, we calculated the median and confidence interval of the basic reproduction number R0=2.055 (95% CI: 0.523-6.300), in which the distribution of the percentage of contribution by sexual transmission is 3.044 (95% CI: 0.123-45.73). Our study indicates that R0 is most sensitive to the biting rate and mortality rate of mosquitoes, while sexual transmission increases the risk of infection and epidemic size and prolongs the outbreak. In order to prevent and control the transmission of ZIKV, it must be treated as not only a mosquito-borne disease but also a sexually transmitted disease.
Submitted 13 April, 2016;
originally announced April 2016.
-
Spatio-temporal patterns of influenza B proportions
Authors:
Daihai He,
Alice PY Chiu,
Qianying Lin,
Duo Yu
Abstract:
We study the spatio-temporal patterns of the proportion of influenza B out of laboratory confirmations of both influenza A and B, with data from 139 countries and regions downloaded from the FluNet compiled by the World Health Organization, from January 2006 to October 2015, excluding 2009. We restricted our analysis to 34 countries that reported more than 2000 confirmations for each of types A and B over the study period. We find that Pearson's correlation is 0.669 between effective distance from Mexico and influenza B proportion among the countries from January 2006 to October 2015. In the United States, influenza B proportion in the pre-pandemic period (2003-2008) negatively correlated with that in the post-pandemic era (2010-2015) at the regional level. Our study limitations are the country-level variations in both surveillance methods and testing policies. Influenza B proportion displayed wide variations over the study period. Our findings suggest that even after excluding 2009's data, the influenza pandemic still has an evident impact on the relative burden of the two influenza types. Future studies could examine whether there are other additional factors. This study has potential implications in prioritizing public health control measures.
Submitted 26 January, 2016;
originally announced January 2016.
-
IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees
Authors:
Dan He,
Zhanyong Wang,
Laxmi Parida,
Eleazar Eskin
Abstract:
Reconstruction of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. The problem is known to be NP-hard even for datasets known to only contain siblings. Some recent methods have been developed to accurately and efficiently reconstruct pedigrees. These methods, however, still consider relatively simple pedigrees; for example, they are not able to handle half-sibling situations where a pair of individuals share only one parent. In this work, we propose an efficient method, IPED2, based on our previous work, which specifically targets reconstruction of complicated pedigrees that include half-siblings. We note that the presence of half-siblings makes the reconstruction problem significantly more challenging, which is why previous methods exclude the possibility of half-siblings. We propose a novel model as well as an efficient graph algorithm, and experiments show that our algorithm achieves relatively accurate reconstruction. To our knowledge, this is the first method that is able to handle pedigree reconstruction based on genotype data only when half-siblings exist in any generation of the pedigree.
Submitted 23 August, 2014;
originally announced August 2014.
-
Global Spatio-temporal Patterns of Influenza in the Post-pandemic Era
Authors:
Daihai He,
Roger Lui,
Lin Wang,
Chi Kong Tse,
Lin Yang,
Lewi Stone
Abstract:
We study the global spatio-temporal patterns of influenza dynamics. This is achieved by analysing and modelling weekly laboratory confirmed cases of influenza A and B from 138 countries between January 2006 and May 2014. The data were obtained from FluNet, the surveillance network compiled by the World Health Organization. We report a pattern of skip-and-resurgence behavior between the years 2011 and 2013 for influenza H1N1/09, the strain responsible for the 2009 pandemic, in Europe and Eastern Asia. In particular, the expected H1N1/09 epidemic outbreak in 2011 failed to occur (or "skipped") in many countries across the globe, although an outbreak occurred in the following year. We also report a pattern of a well-synchronized 2010 winter wave of H1N1/09 in the Northern Hemisphere countries, and a pattern of replacement of strain H1N1/77 by H1N1/09 between the 2009 and 2012 influenza seasons. Using both a statistical and a mechanistic mathematical model, and through fitting the data of 108 countries (108 countries in a statistical model and 10 large populations with a mechanistic model), we discuss the mechanisms that are likely to generate these events, taking into account the role of multi-strain dynamics. A basic understanding of these patterns has important public health implications and scientific significance.
Submitted 8 December, 2014; v1 submitted 21 July, 2014;
originally announced July 2014.
-
A mathematical model of the metabolic and perfusion effects on cortical spreading depression
Authors:
Joshua C. Chang,
K. C. Brennan,
Dongdong He,
Huaxiong Huang,
Robert M. Miura,
Phillip L. Wilson,
Jonathan J. Wylie
Abstract:
Cortical spreading depression (CSD) is a slow-moving ionic and metabolic disturbance that propagates in cortical brain tissue. In addition to massive cellular depolarization, CSD also involves significant changes in perfusion and metabolism -- aspects of CSD that had not been modeled and are important to traumatic brain injury, subarachnoid hemorrhage, stroke, and migraine.
In this study, we develop a mathematical model for CSD where we focus on modeling the features essential to understanding the implications of neurovascular coupling during CSD. In our model, the sodium-potassium--ATPase, mainly responsible for ionic homeostasis and active during CSD, operates at a rate that is dependent on the supply of oxygen. The supply of oxygen is determined by modeling blood flow through a lumped vascular tree with an effective local vessel radius that is controlled by the extracellular potassium concentration. We show that during CSD, the metabolic demands of the cortex exceed the physiological limits placed on oxygen delivery, regardless of vascular constriction or dilation. However, vasoconstriction and vasodilation play important roles in the propagation of CSD and its recovery. Our model replicates the qualitative and quantitative behavior of CSD -- vasoconstriction, oxygen depletion, extracellular potassium elevation, prolonged depolarization -- found in experimental studies.
We predict faster, longer duration CSD in vivo than in vitro due to the contribution of the vasculature. Our results also help explain some of the variability of CSD between species and even within the same animal. These results have clinical and translational implications, as they allow for more precise in vitro, in vivo, and in silico exploration of a phenomenon broadly relevant to neurological disease.
Submitted 15 June, 2013; v1 submitted 15 July, 2012;
originally announced July 2012.
-
A Collaboration Network Model Of Cytokine-Protein Network
Authors:
Sheng-Rong Zou,
Ta Zhou,
Yu-Jing Peng,
Zhong-Wei Guo,
Chang-gui Gu,
Da-Ren He
Abstract:
Complex networks provide a new view for investigating immune systems. In this paper we collect data from the STRING database and present a model based on cooperation network theory. The cytokine-protein network model we consider consists of two kinds of nodes: immune cytokine types, which serve as acts, and protein types, which serve as actors. From the act degree distribution, which is well described by typical shifted power law (SPL) functions, we find that HRAS, TNFRSF13C, S100A8, S100A1, MAPK8, S100A7, LIF, CCL4 and CXCL13 collaborate highly with other proteins. This reveals that these mediators are important in the cytokine-protein network for regulating immune activity. The dyad act degree distribution is another important property of a generalized collaboration network. A dyad is a pair of proteins that appear together in one cytokine collaboration relationship. The dyad act degree distribution can also be well described by typical SPL functions. The average shortest path length is 1.29. These results show that this model describes cytokine-protein collaboration well.
Submitted 5 December, 2007;
originally announced December 2007.
-
An Empirical Study of Immune System Based On Bipartite Network
Authors:
Sheng-Rong Zou,
Yu-Jing Peng,
Zhong-Wei Guo,
Ta Zhou,
Chang-gui Gu,
Da-Ren He
Abstract:
The immune system is the most important defense system for resisting human pathogens. In this paper we present an immune model based on bipartite graph theory. We collect data from the COPE database and construct an immune cell-mediator network. The act degree distribution of this network is shown to follow a power law with an index of 1.8. From our analysis, we found that some mediators with high degree are very important in the process of regulating immune activity, such as TNF-alpha, IL-8, TNF-alpha receptors, CCL5, IL-6, IL-2 receptors, TNF-beta receptors, TNF-beta, IL-4 receptors, IL-1 beta, CD54 and so on. These mediators are important for regulating the activity of the immune system. We also found that the assortativity of the immune system is -0.27, which reveals that the immune system is a non-social network. Finally, we found that the similarity of the network is 0.13: any two cells are similar only to a small extent, which reveals that many cells have unique features. The results show that this model can describe the immune system comprehensively.
Submitted 5 December, 2007;
originally announced December 2007.
-
A Brand-new Research Method of Neuroendocrine System
Authors:
Sheng-Rong Zou,
Zhong-Wei Guo,
Yu-Jing Peng,
Ta Zhou,
Chang-Gui Gu,
Da-Ren He
Abstract:
In this paper, we present the results of an empirical investigation of the neuroendocrine system using bipartite graphs. This neuroendocrine network model can describe the structural characteristics of the neuroendocrine system. The act degree distribution and cumulative act degree distribution show so-called shifted power law (SPL) function forms. In the neuroendocrine network, the act degree stands for the number of cells that secrete a single mediator; bFGF (basic fibroblast growth factor), an important mitogenic cytokine, has the largest act degree, followed by TGF-beta, IL-6, IL1-beta, VEGF, IGF-1 and so on. They are critical in the neuroendocrine system for maintaining bodily health, emotional stability and endocrine harmony. The average act degree of the neuroendocrine network is h = 3.01, which means each mediator is secreted by three cells on average. The similarity, which stands for the average probability that neuroendocrine cells secrete the same mediators, is s = 0.14. Our results may be used in research on the medical treatment of neuroendocrine diseases.
Submitted 2 December, 2007;
originally announced December 2007.