-
An 11,000-Study Open-Access Dataset of Longitudinal Magnetic Resonance Images of Brain Metastases
Authors:
Saahil Chadha,
David Weiss,
Anastasia Janas,
Divya Ramakrishnan,
Thomas Hager,
Klara Osenberg,
Klara Willms,
Joshua Zhu,
Veronica Chiang,
Spyridon Bakas,
Nazanin Maleki,
Durga V. Sritharan,
Sven Schoenherr,
Malte Westerhoff,
Matthew Zawalich,
Melissa Davis,
Ajay Malhotra,
Khaled Bousabarah,
Cornelius Deuschl,
MingDe Lin,
Sanjay Aneja,
Mariam S. Aboian
Abstract:
Brain metastases are a common complication of systemic cancer, affecting over 20% of patients with primary malignancies. Longitudinal magnetic resonance imaging (MRI) is essential for diagnosing patients, tracking disease progression, assessing therapeutic response, and guiding treatment selection. However, the manual review of longitudinal imaging is time-intensive, especially for patients with m…
▽ More
Brain metastases are a common complication of systemic cancer, affecting over 20% of patients with primary malignancies. Longitudinal magnetic resonance imaging (MRI) is essential for diagnosing patients, tracking disease progression, assessing therapeutic response, and guiding treatment selection. However, the manual review of longitudinal imaging is time-intensive, especially for patients with multifocal disease. Artificial intelligence (AI) offers opportunities to streamline image evaluation, but developing robust AI models requires comprehensive training data representative of real-world imaging studies. Thus, there is an urgent necessity for a large dataset with heterogeneity in imaging protocols and disease presentation. To address this, we present an open-access dataset of 11,884 longitudinal brain MRI studies from 1,430 patients with clinically confirmed brain metastases, paired with clinical and image metadata. The provided dataset will facilitate the development of AI models to assist in the long-term management of patients with brain metastasis.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model
Authors:
Zhao Yang,
Jiwei Zhu,
Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic pr…
▽ More
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
GenDMR: A dynamic multimodal role-swapping network for identifying risk gene phenotypes
Authors:
Lina Qin,
Cheng Zhu,
Chuqi Zhou,
Yukun Huang,
Jiayi Zhu,
Ping Liang,
Jinju Wang,
Yixing Huang,
Cheng Luo,
Dezhong Yao,
Ying Tan
Abstract:
Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer's disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic infor…
▽ More
Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer's disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic information. Secondly, due to the significantly superior classification value of AD imaging features compared to genetic features, many studies in multimodal fusion emphasize the strengths of imaging features, actively mitigating the influence of weaker features, thereby diminishing the learning of the unique value of genetic features. To address this issue, this study proposes the dynamic multimodal role-swapping network (GenDMR). In GenDMR, we develop a novel approach to encode the spatial organization of single nucleotide polymorphisms (SNPs), enhancing the representation of their genomic context. Additionally, to adaptively quantify the disease risk of SNPs and brain region, we propose a multi-instance attention module to enhance model interpretability. Furthermore, we introduce a dominant modality selection module and a contrastive self-distillation module, combining them to achieve a dynamic teacher-student role exchange mechanism based on dominant and auxiliary modalities for bidirectional co-updating of different modal data. Finally, GenDMR achieves state-of-the-art performance on the ADNI public dataset and visualizes attention to different SNPs, focusing on confirming 12 potential high-risk genes related to AD, including the most classic APOE and recently highlighted significant risk genes. This demonstrates GenDMR's interpretable analytical capability in exploring AD genetic features, providing new insights and perspectives for the development of multimodal data fusion techniques.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Inductive-Associative Meta-learning Pipeline with Human Cognitive Patterns for Unseen Drug-Target Interaction Prediction
Authors:
Xiaoqing Lian,
Jie Zhu,
Tianxu Lv,
Shiyun Nie,
Hang Fan,
Guosheng Wu,
Yunjun Ge,
Lihua Li,
Xiangxiang Zeng,
Xiang Pan
Abstract:
Significant differences in protein structures hinder the generalization of existing drug-target interaction (DTI) models, which often rely heavily on pre-learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive-Associative pipeline inspired by the workflow of scientists who base their accumulated expertise on drawing insights into novel drug-target pairs from…
▽ More
Significant differences in protein structures hinder the generalization of existing drug-target interaction (DTI) models, which often rely heavily on pre-learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive-Associative pipeline inspired by the workflow of scientists who base their accumulated expertise on drawing insights into novel drug-target pairs from weakly related references. BioBridge predicts novel drug-target interactions using limited sequence data, incorporating multi-level encoders with adversarial training to accumulate transferable binding principles. On these principles basis, BioBridge employs a dynamic prototype meta-learning framework to associate insights from weakly related annotations, enabling robust predictions for previously unseen drug-target pairs. Extensive experiments demonstrate that BioBridge surpasses existing models, especially for unseen proteins. Notably, when only homologous protein binding data is available, BioBridge proves effective for virtual screening of the epidermal growth factor receptor and adenosine receptor, underscoring its potential in drug discovery.
△ Less
Submitted 27 March, 2025; v1 submitted 26 January, 2025;
originally announced January 2025.
-
Easing Seasickness through Attention Redirection with a Mindfulness-Based Brain--Computer Interface
Authors:
Xiaoyu Bao,
Kailin Xu,
Jiawei Zhu,
Haiyun Huang,
Kangning Li,
Qiyun Huang,
Yuanqing Li
Abstract:
Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associate…
▽ More
Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associated with maritime travel. In this study, we propose a mindfulness brain-computer interface (BCI), specifically designed to redirect attention with the aim of mitigating seasickness symptoms in real-world settings. Our system utilizes a single-channel headband to capture prefrontal EEG signals, which are then wirelessly transmitted to computing devices for the assessment of mindfulness states. The results are transferred into real-time feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in attentional focus from physiological discomfort to mindfulness practices. A total of 43 individuals participated in a real-world maritime experiment consisted of three sessions: a real-feedback mindfulness session, a resting session, and a pseudofeedback mindfulness session. Notably, 81.39% of participants reported that the mindfulness BCI intervention was effective, and there was a significant reduction in the severity of seasickness, as measured by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in the theta/beta ratio, corresponding with the alleviation of seasickness symptoms. A decrease in overall EEG band power during the real-feedback mindfulness session suggests that the mindfulness BCI fosters a more tranquil and downregulated state of brain activity. Together, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, with the potential to enhance the cruising experience for both passengers and crews.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
DeltaDock: A Unified Framework for Accurate, Efficient, and Physically Reliable Molecular Docking
Authors:
Jiaxian Yan,
Zaixi Zhang,
Jintao Zhu,
Kai Zhang,
Jianfeng Pei,
Qi Liu
Abstract:
Molecular docking, a technique for predicting ligand binding poses, is crucial in structure-based drug design for understanding protein-ligand interactions. Recent advancements in docking methods, particularly those leveraging geometric deep learning (GDL), have demonstrated significant efficiency and accuracy advantages over traditional sampling methods. Despite these advancements, current method…
▽ More
Molecular docking, a technique for predicting ligand binding poses, is crucial in structure-based drug design for understanding protein-ligand interactions. Recent advancements in docking methods, particularly those leveraging geometric deep learning (GDL), have demonstrated significant efficiency and accuracy advantages over traditional sampling methods. Despite these advancements, current methods are often tailored for specific docking settings, and limitations such as the neglect of protein side-chain structures, difficulties in handling large binding pockets, and challenges in predicting physically valid structures exist. To accommodate various docking settings and achieve accurate, efficient, and physically reliable docking, we propose a novel two-stage docking framework, DeltaDock, consisting of pocket prediction and site-specific docking. We innovatively reframe the pocket prediction task as a pocket-ligand alignment problem rather than direct prediction in the first stage. Then we follow a bi-level coarse-to-fine iterative refinement process to perform site-specific docking. Comprehensive experiments demonstrate the superior performance of DeltaDock. Notably, in the blind docking setting, DeltaDock achieves a 31\% relative improvement over the docking success rate compared with the previous state-of-the-art GDL model. With the consideration of physical validity, this improvement increases to about 300\%.
△ Less
Submitted 16 October, 2024; v1 submitted 14 October, 2024;
originally announced October 2024.
-
Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors
Authors:
Qizhi Pei,
Lijun Wu,
Zhenyu He,
Jinhua Zhu,
Yingce Xia,
Shufang Xie,
Rui Yan
Abstract:
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery. Despite the application of deep learning methods to DTA prediction, the achieved accuracy remain suboptimal. In this work, inspired by the recent success of retrieval methods, we propose $k$NN-DTA, a non-parametric embedding-based retrieval method adopted on a pre-trained DTA prediction model, which can extend the power…
▽ More
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery. Despite the application of deep learning methods to DTA prediction, the achieved accuracy remain suboptimal. In this work, inspired by the recent success of retrieval methods, we propose $k$NN-DTA, a non-parametric embedding-based retrieval method adopted on a pre-trained DTA prediction model, which can extend the power of the DTA model with no or negligible cost. Different from existing methods, we introduce two neighbor aggregation ways from both embedding space and label space that are integrated into a unified framework. Specifically, we propose a \emph{label aggregation} with \emph{pair-wise retrieval} and a \emph{representation aggregation} with \emph{point-wise retrieval} of the nearest neighbors. This method executes in the inference phase and can efficiently boost the DTA prediction performance with no training cost. In addition, we propose an extension, Ada-$k$NN-DTA, an instance-wise and adaptive aggregation with lightweight learning. Results on four benchmark datasets show that $k$NN-DTA brings significant improvements, outperforming previous state-of-the-art (SOTA) results, e.g, on BindingDB IC$_{50}$ and $K_i$ testbeds, $k$NN-DTA obtains new records of RMSE $\bf{0.684}$ and $\bf{0.750}$. The extended Ada-$k$NN-DTA further improves the performance to be $\bf{0.675}$ and $\bf{0.735}$ RMSE. These results strongly prove the effectiveness of our method. Results in other settings and comprehensive studies/analyses also show the great potential of our $k$NN-DTA approach.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
An Adaptively Weighted Averaging Method for Regional Time Series Extraction of fMRI-based Brain Decoding
Authors:
Jianfei Zhu,
Baichun Wei,
Jiaru Tian,
Feng Jiang,
Chunzhi Yi
Abstract:
Brain decoding that classifies cognitive states using the functional fluctuations of the brain can provide insightful information for understanding the brain mechanisms of cognitive functions. Among the common procedures of decoding the brain cognitive states with functional magnetic resonance imaging (fMRI), extracting the time series of each brain region after brain parcellation traditionally av…
▽ More
Brain decoding that classifies cognitive states using the functional fluctuations of the brain can provide insightful information for understanding the brain mechanisms of cognitive functions. Among the common procedures of decoding the brain cognitive states with functional magnetic resonance imaging (fMRI), extracting the time series of each brain region after brain parcellation traditionally averages across the voxels within a brain region. This neglects the spatial information among the voxels and the requirement of extracting information for the downstream tasks. In this study, we propose to use a fully connected neural network that is jointly trained with the brain decoder to perform an adaptively weighted average across the voxels within each brain region. We perform extensive evaluations by cognitive state decoding, manifold learning, and interpretability analysis on the Human Connectome Project (HCP) dataset. The performance comparison of the cognitive state decoding presents an accuracy increase of up to 5\% and stable accuracy improvement under different time window sizes, resampling sizes, and training data sizes. The results of manifold learning show that our method presents a considerable separability among cognitive states and basically excludes subject-specific information. The interpretability analysis shows that our method can identify reasonable brain regions corresponding to each cognitive state. Our study would aid the improvement of the basic pipeline of fMRI processing.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Systematic evaluation of the isolated effect of tissue environment on the transcriptome using a single-cell RNA-seq atlas dataset
Authors:
Daigo Okada,
Jianshen Zhu,
Kan Shota,
Yuuki Nishimura,
Kazuya Haraguchi
Abstract:
Background: Understanding cellular diversity throughout the body is essential for elucidating the complex functions of biological systems. Recently, large-scale single-cell omics datasets, known as omics atlases, have become available. These atlases encompass data from diverse tissues and cell-types, providing insights into the landscape of cell-type-specific gene expression. However, the isolated…
▽ More
Background: Understanding cellular diversity throughout the body is essential for elucidating the complex functions of biological systems. Recently, large-scale single-cell omics datasets, known as omics atlases, have become available. These atlases encompass data from diverse tissues and cell-types, providing insights into the landscape of cell-type-specific gene expression. However, the isolated effect of the tissue environment has not been thoroughly investigated. Evaluating this isolated effect is challenging due to statistical confounding with cell-type effects, arising from significant biases in the combinations of tissues and cell-types within the body. Results: This study introduces a novel data analysis framework, named the Combinatorial Sub-dataset Extraction for Confounding Reduction (COSER), which addresses statistical confounding by using graph theory to enumerate appropriate sub-datasets. COSER enables the assessment of isolated effects of discrete variables in single cells. Applying COSER to the Tabula Muris Senis single-cell transcriptome atlas, we characterized the isolated impact of tissue environments. Our findings demonstrate that some of genes are markedly affected by the tissue environment, particularly in modulating intercellular diversity in immune responses and their age-related changes. Conclusion: COSER provides a robust, general-purpose framework for evaluating the isolated effects of discrete variables from large-scale data mining. This approach reveals critical insights into the interplay between tissue environments and gene expression.
△ Less
Submitted 19 December, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
3D-MolT5: Leveraging Discrete Structural Information for Molecule-Text Modeling
Authors:
Qizhi Pei,
Rui Yan,
Kaiyuan Gao,
Jinhua Zhu,
Lijun Wu
Abstract:
The integration of molecular and natural language representations has emerged as a focal point in molecular science, with recent advancements in Language Models (LMs) demonstrating significant potential for comprehensive modeling of both domains. However, existing approaches face notable limitations, particularly in their neglect of three-dimensional (3D) information, which is crucial for understa…
▽ More
The integration of molecular and natural language representations has emerged as a focal point in molecular science, with recent advancements in Language Models (LMs) demonstrating significant potential for comprehensive modeling of both domains. However, existing approaches face notable limitations, particularly in their neglect of three-dimensional (3D) information, which is crucial for understanding molecular structures and functions. While some efforts have been made to incorporate 3D molecular information into LMs using external structure encoding modules, significant difficulties remain, such as insufficient interaction across modalities in pre-training and challenges in modality alignment. To address the limitations, we propose \textbf{3D-MolT5}, a unified framework designed to model molecule in both sequence and 3D structure spaces. The key innovation of our approach lies in mapping fine-grained 3D substructure representations into a specialized 3D token vocabulary. This methodology facilitates the seamless integration of sequence and structure representations in a tokenized format, enabling 3D-MolT5 to encode molecular sequences, molecular structures, and text sequences within a unified architecture. Leveraging this tokenized input strategy, we build a foundation model that unifies the sequence and structure data formats. We then conduct joint pre-training with multi-task objectives to enhance the model's comprehension of these diverse modalities within a shared representation space. Thus, our approach significantly improves cross-modal interaction and alignment, addressing key challenges in previous work. Further instruction tuning demonstrated that our 3D-MolT5 has strong generalization ability and surpasses existing methods with superior performance in multiple downstream tasks. Our code is available at https://github.com/QizhiPei/3D-MolT5.
△ Less
Submitted 18 March, 2025; v1 submitted 9 June, 2024;
originally announced June 2024.
-
3D MR Fingerprinting for Dynamic Contrast-Enhanced Imaging of Whole Mouse Brain
Authors:
Yuran Zhu,
Guanhua Wang,
Yuning Gu,
Walter Zhao,
Jiahao Lu,
Junqing Zhu,
Christina J. MacAskill,
Andrew Dupuis,
Mark A. Griswold,
Dan Ma,
Chris A. Flask,
Xin Yu
Abstract:
Quantitative MRI enables direct quantification of contrast agent concentrations in contrast-enhanced scans. However, the lengthy scan times required by conventional methods are inadequate for tracking contrast agent transport dynamically in mouse brain. We developed a 3D MR fingerprinting (MRF) method for simultaneous T1 and T2 mapping across the whole mouse brain with 4.3-min temporal resolution.…
▽ More
Quantitative MRI enables direct quantification of contrast agent concentrations in contrast-enhanced scans. However, the lengthy scan times required by conventional methods are inadequate for tracking contrast agent transport dynamically in mouse brain. We developed a 3D MR fingerprinting (MRF) method for simultaneous T1 and T2 mapping across the whole mouse brain with 4.3-min temporal resolution. We designed a 3D MRF sequence with variable acquisition segment lengths and magnetization preparations on a 9.4T preclinical MRI scanner. Model-based reconstruction approaches were employed to improve the accuracy and speed of MRF acquisition. The method's accuracy for T1 and T2 measurements was validated in vitro, while its repeatability of T1 and T2 measurements was evaluated in vivo (n=3). The utility of the 3D MRF sequence for dynamic tracking of intracisternally infused Gd-DTPA in the whole mouse brain was demonstrated (n=5). Phantom studies confirmed accurate T1 and T2 measurements by 3D MRF with an undersampling factor up to 48. Dynamic contrast-enhanced (DCE) MRF scans achieved a spatial resolution of 192 x 192 x 500 um3 and a temporal resolution of 4.3 min, allowing for the analysis and comparison of dynamic changes in concentration and transport kinetics of intracisternally infused Gd-DTPA across brain regions. The sequence also enabled highly repeatable, high-resolution T1 and T2 mapping of the whole mouse brain (192 x 192 x 250 um3) in 30 min. We present the first dynamic and multi-parametric approach for quantitatively tracking contrast agent transport in the mouse brain using 3D MRF.
△ Less
Submitted 5 August, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation
Authors:
Kaiyuan Gao,
Qizhi Pei,
Gongbo Zhang,
Jinhua Zhu,
Kun He,
Lijun Wu
Abstract:
Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with…
▽ More
Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that largely boosts the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is in https://github.com/QizhiPei/FABind.
△ Less
Submitted 24 February, 2025; v1 submitted 29 March, 2024;
originally announced March 2024.
-
A unified framework for coarse grained molecular dynamics of proteins with high-fidelity reconstruction
Authors:
Jinzhen Zhu
Abstract:
Simulating large proteins using traditional molecular dynamics (MD) is computationally demanding. To address this challenge, we propose a novel tree-structured coarse-grained model that efficiently captures protein dynamics. By leveraging a hierarchical protein representation, our model accurately reconstructs high-resolution protein structures, with sub-angstrom precision achieved for a 168-amino…
▽ More
Simulating large proteins using traditional molecular dynamics (MD) is computationally demanding. To address this challenge, we propose a novel tree-structured coarse-grained model that efficiently captures protein dynamics. By leveraging a hierarchical protein representation, our model accurately reconstructs high-resolution protein structures, with sub-angstrom precision achieved for a 168-amino acid protein. We combine this coarse-grained model with a deep learning framework based on stochastic differential equations (SDEs). A neural network is trained to model the drift force, while a RealNVP-based noise generator approximates the stochastic component. This approach enables a significant speedup of over 20,000 times compared to traditional MD, allowing for the generation of microsecond-long trajectories within a few minutes and providing valuable insights into protein behavior. Our method demonstrates high accuracy, achieving sub-angstrom reconstruction for short (25 ns) trajectories and maintaining statistical consistency across multiple independent simulations.
△ Less
Submitted 9 December, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Jinhua Zhu,
Yue Wang,
Zun Wang,
Tao Qin,
Rui Yan
Abstract:
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomol…
▽ More
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are updating in \url{https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling}.
△ Less
Submitted 5 March, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
Authors:
Qizhi Pei,
Lijun Wu,
Kaiyuan Gao,
Xiaozhuan Liang,
Yin Fang,
Jinhua Zhu,
Shufang Xie,
Tao Qin,
Rui Yan
Abstract:
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper intro…
▽ More
Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including \emph{3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets}, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at \url{https://github.com/QizhiPei/BioT5}.
△ Less
Submitted 31 May, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data
Authors:
Haoyang Liu,
Yijiang Li,
Jinglin Jian,
Yuxuan Cheng,
Jianrong Lu,
Shuyi Guo,
Jinglei Zhu,
Mianchen Zhang,
Miantong Zhang,
Haohan Wang
Abstract:
Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise…
▽ More
Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.
△ Less
Submitted 20 February, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Retrosynthesis Prediction via Search in (Hyper) Graph
Authors:
Zixun Lan,
Binjie Hong,
Jiajun Zhu,
Zuo Zeng,
Zhenfu Liu,
Limin Yu,
Fei Ma
Abstract:
Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multip…
▽ More
Predicting reactants from a specified core product stands as a fundamental challenge within organic synthesis, termed retrosynthesis prediction. Recently, semi-template-based methods and graph-edits-based methods have achieved good performance in terms of both interpretability and accuracy. However, due to their mechanisms these methods cannot predict complex reactions, e.g., reactions with multiple reaction center or attaching the same leaving group to more than one atom. In this study we propose a semi-template-based method, the \textbf{Retro}synthesis via \textbf{S}earch \textbf{i}n (Hyper) \textbf{G}raph (RetroSiG) framework to alleviate these limitations. In the proposed method, we turn the reaction center identification and the leaving group completion tasks as tasks of searching in the product molecular graph and leaving group hypergraph respectively. As a semi-template-based method RetroSiG has several advantages. First, RetroSiG is able to handle the complex reactions mentioned above by its novel search mechanism. Second, RetroSiG naturally exploits the hypergraph to model the implicit dependencies between leaving groups. Third, RetroSiG makes full use of the prior, i.e., one-hop constraint. It reduces the search space and enhances overall performance. Comprehensive experiments demonstrated that RetroSiG achieved competitive results. Furthermore, we conducted experiments to show the capability of RetroSiG in predicting complex reactions. Ablation experiments verified the efficacy of specific elements, such as the one-hop constraint and the leaving group hypergraph.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
DeepRLI: A Multi-objective Framework for Universal Protein--Ligand Interaction Prediction
Authors:
Haoyu Lin,
Shiwei Wang,
Jintao Zhu,
Yibo Li,
Jianfeng Pei,
Luhua Lai
Abstract:
Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a for…
▽ More
Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a formidable challenge. Most contemporary methods are tailored for specific tasks, such as binding affinity prediction, binding pose prediction, or virtual screening, often failing to encompass all aspects. In this study, we put forward DeepRLI, a novel protein--ligand interaction prediction architecture. It encodes each protein--ligand complex into a fully connected graph, retaining the integrity of the topological and spatial structure, and leverages the improved graph transformer layers with cosine envelope as the central module of the neural network, thus exhibiting superior scoring power. In order to equip the model to generalize to conformations beyond the confines of crystal structures and to adapt to molecular docking and virtual screening tasks, we propose a multi-objective strategy, that is, the model outputs three scores for scoring and ranking, docking, and screening, and the training process optimizes these three objectives simultaneously. For the latter two objectives, we augment the dataset through a docking procedure, incorporate suitable physics-informed blocks and employ an effective contrastive learning approach. Eventually, our model manifests a balanced performance across scoring, ranking, docking, and screening, thereby demonstrating its ability to handle a range of tasks. Overall, this research contributes a multi-objective framework for universal protein--ligand interaction prediction, augmenting the landscape of structure-based drug design.
△ Less
Submitted 19 January, 2024;
originally announced January 2024.
-
DiffBindFR: An SE(3) Equivariant Network for Flexible Protein-Ligand Docking
Authors:
Jintao Zhu,
Zhonghui Gu,
Jianfeng Pei,
Luhua Lai
Abstract:
Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with known protein binding pocket conformation in the complex state. However, in real-world dockin…
▽ More
Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with known protein binding pocket conformation in the complex state. However, in real-world docking scenario without knowing the protein binding conformation for a new ligand, accurately modeling the binding complex structure remains challenging as flexible docking is computationally expensive and inaccurate. Typical deep learning-based docking methods do not explicitly consider protein side chain conformations and fail to ensure the physical plausibility and detailed atomic interactions. In this study, we present DiffBindFR, a full-atom diffusion-based flexible docking model that operates over the product space of ligand overall movements and flexibility and pocket side chain torsion changes. We show that DiffBindFR has higher accuracy in producing native-like binding structures with physically plausible and detailed interactions than available docking methods. Furthermore, in the Apo and AlphaFold2 modeled structures, DiffBindFR demonstrates superior advantages in accurate ligand binding pose and protein binding conformation prediction, making it suitable for Apo and AlphaFold2 structure-based drug design. DiffBindFR provides a powerful flexible docking tool for modeling accurate protein-ligand binding structures.
△ Less
Submitted 19 December, 2023; v1 submitted 26 November, 2023;
originally announced November 2023.
-
EpiGeoPop: A Tool for Developing Spatially Accurate Country-level Epidemiological Models
Authors:
Lara Herriott,
Henriette L. Capel,
Isaac Ellmen,
Nathan Schofield,
Jiayuan Zhu,
Ben Lambert,
David Gavaghan,
Ioana Bouros,
Richard Creswell,
Kit Gallagher
Abstract:
Mathematical models play a crucial role in understanding the spread of infectious disease outbreaks and influencing policy decisions. These models aid pandemic preparedness by predicting outcomes under hypothetical scenarios and identifying weaknesses in existing frameworks. However, their accuracy, utility, and comparability are being scrutinized. Agent-based models (ABMs) have emerged as a valua…
▽ More
Mathematical models play a crucial role in understanding the spread of infectious disease outbreaks and influencing policy decisions. These models aid pandemic preparedness by predicting outcomes under hypothetical scenarios and identifying weaknesses in existing frameworks. However, their accuracy, utility, and comparability are being scrutinized. Agent-based models (ABMs) have emerged as a valuable tool, capturing population heterogeneity and spatial effects, particularly when assessing intervention strategies. Here we present EpiGeoPop, a user-friendly tool for rapidly preparing spatially accurate population configurations of entire countries. EpiGeoPop helps to address the problem of complex and time-consuming model set up in ABMs, specifically improving the integration of spatial detail. We subsequently demonstrate the importance of accurate spatial detail in ABM simulations of disease outbreaks using Epiabm, an ABM based on Imperial College London's CovidSim with improved modularity, documentation and testing. Our investigation involves the interplay between population density, the implementation of spatial transmission, and realistic interventions implemented in Epiabm.
△ Less
Submitted 20 October, 2023;
originally announced October 2023.
-
BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
Authors:
Qizhi Pei,
Wei Zhang,
Jinhua Zhu,
Kehan Wu,
Kaiyuan Gao,
Lijun Wu,
Yingce Xia,
Rui Yan
Abstract:
Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose…
▽ More
Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. $\mathbf{BioT5}$ utilizes SELFIES for $100%$ robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, $\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at $\href{https://github.com/QizhiPei/BioT5}{Github}$.
△ Less
Submitted 28 January, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
FABind: Fast and Accurate Protein-Ligand Binding
Authors:
Qizhi Pei,
Kaiyuan Gao,
Lijun Wu,
Jinhua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Kun He,
Tie-Yan Liu,
Rui Yan
Abstract:
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based meth…
▽ More
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at https://github.com/QizhiPei/FABind
△ Less
Submitted 8 January, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Revive, Restore, Revitalize: An Eco-economic Methodology for Maasai Mara
Authors:
Yipeng Xu,
He Sun,
Junfeng Zhu
Abstract:
The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource typ…
▽ More
The Maasai Mara in Kenya, renowned for its biodiversity, is witnessing ecosystem degradation and species endangerment due to intensified human activities. Addressing this, we introduce a dynamic system harmonizing ecological and human priorities. Our agent-based model replicates the Maasai Mara savanna ecosystem, incorporating 71 animal species, 10 human classifications, and 2 natural resource types. The model employs the metabolic rate-mass relationship for animal energy dynamics, logistic curves for animal growth, individual interactions for food web simulation, and human intervention impacts. Algorithms like fitness proportional selection and particle swarm mimic organism preferences for resources. To guide preservation activities, we formulated 21 management strategies encompassing tourism, transportation, taxation, environmental conservation, research, diplomacy, and poaching, employing a game-theoretic framework. Using the TOPSIS method, we prioritized four key developmental indicators: environmental health, research advancement, economic growth, and security. The interplay of 16 factors determines these indicators, each influenced by our policies to varying degrees. By evaluating the policies' repercussions, we aim to mitigate adverse animal-human interactions and equitably address human concerns. We classified the policy impacts into three categories: Environmental Preservation, Economic Prosperity, and Holistic Development. By applying these policy groupings to our ecosystem model, we tracked the effects on the intricate animal-human-resource dynamics. Utilizing the entropy weight method, we assessed the efficacy of these policy clusters over a decade, identifying the optimal blend emphasizing both environmental conservation and economic progression.
△ Less
Submitted 11 September, 2023;
originally announced September 2023.
-
A Study on the Performance of Generative Pre-trained Transformer (GPT) in Simulating Depressed Individuals on the Standardized Depressive Symptom Scale
Authors:
Sijin Cai,
Nanfeng Zhang,
Jiaying Zhu,
Yanjie Liu,
Yongjin Zhou
Abstract:
Background: Depression is a common mental disorder with societal and economic burden. Current diagnosis relies on self-reports and assessment scales, which have reliability issues. Objective approaches are needed for diagnosing depression. Objective: Evaluate the potential of GPT technology in diagnosing depression. Assess its ability to simulate individuals with depression and investigate the inf…
▽ More
Background: Depression is a common mental disorder with societal and economic burden. Current diagnosis relies on self-reports and assessment scales, which have reliability issues. Objective approaches are needed for diagnosing depression. Objective: Evaluate the potential of GPT technology in diagnosing depression. Assess its ability to simulate individuals with depression and investigate the influence of depression scales. Methods: Three depression-related assessment tools (HAMD-17, SDS, GDS-15) were used. Two experiments simulated GPT responses to normal individuals and individuals with depression. Compare GPT's responses with expected results, assess its understanding of depressive symptoms, and performance differences under different conditions. Results: GPT's performance in depression assessment was evaluated. It aligned with scoring criteria for both individuals with depression and normal individuals. Some performance differences were observed based on depression severity. GPT performed better on scales with higher sensitivity. Conclusion: GPT accurately simulates individuals with depression and normal individuals during depression-related assessments. Deviations occur when simulating different degrees of depression, limiting understanding of mild and moderate cases. GPT performs better on scales with higher sensitivity, indicating potential for developing more effective depression scales. GPT has important potential in depression assessment, supporting clinicians and patients.
△ Less
Submitted 17 July, 2023;
originally announced July 2023.
-
Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning
Authors:
Shuxin Zheng,
Jiyan He,
Chang Liu,
Yu Shi,
Ziheng Lu,
Weitao Feng,
Fusong Ju,
Jiaxi Wang,
Jianwei Zhu,
Yaosen Min,
He Zhang,
Shidi Tang,
Hongxia Hao,
Peiran Jin,
Chi Chen,
Frank Noé,
Haiguang Liu,
Tie-Yan Liu
Abstract:
Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computation…
▽ More
Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
Temporal Dynamic Synchronous Functional Brain Network for Schizophrenia Diagnosis and Lateralization Analysis
Authors:
Cheng Zhu,
Ying Tan,
Shuqi Yang,
Jiaqing Miao,
Jiayi Zhu,
Huan Huang,
Dezhong Yao,
Cheng Luo
Abstract:
The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia(SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal br…
▽ More
The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia(SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal brain category graph convolutional network (Temporal-BCGCN) was employed. Firstly, a unique dynamic brain network analysis module, DSF-BrainNet, was designed to construct dynamic synchronization features. Subsequently, a revolutionary graph convolution method, TemporalConv, was proposed, based on the synchronous temporal properties of feature. Finally, the first modular abnormal hemispherical lateralization test tool in deep learning based on rs-fMRI data, named CategoryPool, was proposed. This study was validated on COBRE and UCLA datasets and achieved 83.62% and 89.71% average accuracies, respectively, outperforming the baseline model and other state-of-the-art methods. The ablation results also demonstrate the advantages of TemporalConv over the traditional edge feature graph convolution approach and the improvement of CategoryPool over the classical graph pooling approach. Interestingly, this study showed that the lower order perceptual system and higher order network regions in the left hemisphere are more severely dysfunctional than in the right hemisphere in SZ and reaffirms the importance of the left medial superior frontal gyrus in SZ. Our core code is available at: https://github.com/swfen/Temporal-BCGCN.
△ Less
Submitted 11 September, 2023; v1 submitted 30 March, 2023;
originally announced April 2023.
-
Incorporating Pre-training Paradigm for Antibody Sequence-Structure Co-design
Authors:
Kaiyuan Gao,
Lijun Wu,
Jinhua Zhu,
Tianbo Peng,
Yingce Xia,
Liang He,
Shufang Xie,
Tao Qin,
Haiguang Liu,
Kun He,
Tie-Yan Liu
Abstract:
Antibodies are versatile proteins that can bind to pathogens and provide effective protection for human body. Recently, deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. However, the computational methods heavily rely on high-quality antibody structure data…
▽ More
Antibodies are versatile proteins that can bind to pathogens and provide effective protection for human body. Recently, deep learning-based computational antibody design has attracted popular attention since it automatically mines the antibody patterns from data that could be complementary to human experiences. However, the computational methods heavily rely on high-quality antibody structure data, which is quite limited. Besides, the complementarity-determining region (CDR), which is the key component of an antibody that determines the specificity and binding affinity, is highly variable and hard to predict. Therefore, the data limitation issue further raises the difficulty of CDR generation for antibodies. Fortunately, there exists a large amount of sequence data of antibodies that can help model the CDR and alleviate the reliance on structure data. By witnessing the success of pre-training models for protein modeling, in this paper, we develop the antibody pre-training language model and incorporate it into the (antigen-specific) antibody design model in a systemic way. Specifically, we first pre-train an antibody language model based on the sequence data, then propose a one-shot way for sequence and structure generation of CDR to avoid the heavy cost and error propagation from an autoregressive manner, and finally leverage the pre-trained antibody model for the antigen-specific antibody generation model with some carefully designed modules. Through various experiments, we show that our method achieves superior performances over previous baselines on different tasks, such as sequence and structure generation and antigen-binding CDR-H3 design.
△ Less
Submitted 17 November, 2022; v1 submitted 26 October, 2022;
originally announced November 2022.
-
Equivariant Energy-Guided SDE for Inverse Molecular Design
Authors:
Fan Bao,
Min Zhao,
Zhongkai Hao,
Peiyao Li,
Chongxuan Li,
Jun Zhu
Abstract:
Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE…
▽ More
Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry in 3D molecular conformation, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves the baseline on QM9, in inverse molecular design targeted to quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by combining the corresponding energy functions linearly.
△ Less
Submitted 28 February, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Molecular Design Based on Integer Programming and Quadratic Descriptors in a Two-layered Model
Authors:
Jianshen Zhu,
Naveed Ahmed Azam,
Shengjuan Cao,
Ryota Ido,
Kazuya Haraguchi,
Liang Zhao,
Hiroshi Nagamochi,
Tatsuya Akutsu
Abstract:
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a t…
▽ More
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. A set of graph theoretical descriptors in the feature function plays a key role to derive a compact formulation of such an MILP. To improve the learning performance of prediction functions in the framework maintaining the compactness of the MILP, this paper utilizes the product of two of those descriptors as a new descriptor and then designs a method of reducing the number of descriptors. The results of our computational experiments suggest that the proposed method improved the learning performance for many chemical properties and can infer a chemical structure with up to 50 non-hydrogen atoms.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
Can Brain Signals Reveal Inner Alignment with Human Languages?
Authors:
William Han,
Jielin Qiu,
Jiacheng Zhu,
Mengdi Xu,
Douglas Weber,
Bo Li,
Ding Zhao
Abstract:
Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \…
▽ More
Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \textbf{A}lignment \textbf{M}odel, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transfigure features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on Zuco datasets for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distribution shows the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at \url{https://github.com/Jason-Qiu/EEG_Language_Alignment}.
△ Less
Submitted 4 May, 2024; v1 submitted 10 August, 2022;
originally announced August 2022.
-
SSM-DTA: Breaking the Barriers of Data Scarcity in Drug-Target Affinity Prediction
Authors:
Qizhi Pei,
Lijun Wu,
Jinhua Zhu,
Yingce Xia,
Shufang Xie,
Tao Qin,
Haiguang Liu,
Tie-Yan Liu,
Rui Yan
Abstract:
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep…
▽ More
Accurate prediction of Drug-Target Affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the SSM-DTA framework, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling (MLM) using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS, and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations, and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes. Our code is available at $\href{https://github.com/QizhiPei/SSM-DTA}{Github}$.
△ Less
Submitted 17 October, 2023; v1 submitted 20 June, 2022;
originally announced June 2022.
-
MolMiner: You only look once for chemical structure recognition
Authors:
Youjun Xu,
Jinchuan Xiao,
Chia-Han Chou,
Jianhang Zhang,
Jintao Zhu,
Qiwan Hu,
Hemin Li,
Ningsheng Han,
Bingyu Liu,
Shuaipeng Zhang,
Jinyu Han,
Zhen Zhang,
Shuhao Zhang,
Weilin Zhang,
Luhua Lai,
Jianfeng Pei
Abstract:
Molecular structures are always depicted as 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of these printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recog…
▽ More
Molecular structures are always depicted as 2D printed form in scientific documents like journal papers and patents. However, these 2D depictions are not machine-readable. Due to a backlog of decades and an increasing amount of these printed literature, there is a high demand for the translation of printed depictions into machine-readable formats, which is known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades follow a rule-based approach where the key step of vectorization of the depiction is based on the interpretation of vectors and nodes as bonds and atoms. Here, we present a practical software MolMiner, which is primarily built up using deep neural networks originally developed for semantic segmentation and object detection to recognize atom and bond elements from documents. These recognized elements can be easily connected as a molecular graph with distance-based construction algorithm. We carefully evaluate our software on four benchmark datasets with the state-of-the-art performance. Various real application scenarios are also tested, yielding satisfactory outcomes. The free download links of Mac and Windows versions are available: Mac: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/mac/PharmaMind-mac-latest-setup.dmg and Windows: https://molminer-cdn.iipharma.cn/pharma-mind/artifact/latest/win/PharmaMind-win-latest-setup.exe
△ Less
Submitted 22 May, 2022;
originally announced May 2022.
-
Dynamic Ensemble Bayesian Filter for Robust Control of a Human Brain-machine Interface
Authors:
Yu Qi,
Xinyun Zhu,
Kedi Xu,
Feixiao Ren,
Hongjie Jiang,
Junming Zhu,
Jianmin Zhang,
Gang Pan,
Yueming Wang
Abstract:
Objective: Brain-machine interfaces (BMIs) aim to provide direct brain control of devices such as prostheses and computer cursors, which have demonstrated great potential for mobility restoration. One major limitation of current BMIs lies in the unstable performance in online control due to the variability of neural signals, which seriously hinders the clinical availability of BMIs. Method: To dea…
▽ More
Objective: Brain-machine interfaces (BMIs) aim to provide direct brain control of devices such as prostheses and computer cursors, which have demonstrated great potential for mobility restoration. One major limitation of current BMIs lies in the unstable performance in online control due to the variability of neural signals, which seriously hinders the clinical availability of BMIs. Method: To deal with the neural variability in online BMI control, we propose a dynamic ensemble Bayesian filter (DyEnsemble). DyEnsemble extends Bayesian filters with a dynamic measurement model, which adjusts its parameters in time adaptively with neural changes. This is achieved by learning a pool of candidate functions and dynamically weighting and assembling them according to neural signals. In this way, DyEnsemble copes with variability in signals and improves the robustness of online control. Results: Online BMI experiments with a human participant demonstrate that, compared with the velocity Kalman filter, DyEnsemble significantly improves the control accuracy (increases the success rate by 13.9% and reduces the reach time by 13.5% in the random target pursuit task) and robustness (performs more stably over different experiment days). Conclusion: Our results demonstrate the superiority of DyEnsemble in online BMI control. Significance: DyEnsemble frames a novel and flexible framework for robust neural decoding, which is beneficial to different neural decoding applications.
△ Less
Submitted 22 April, 2022;
originally announced April 2022.
-
An Inverse QSAR Method Based on Linear Regression and Integer Programming
Authors:
Jianshen Zhu,
Naveed Ahmed Azam,
Kazuya Haraguchi,
Liang Zhao,
Hiroshi Nagamochi,
Tatsuya Akutsu
Abstract:
Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $Ï€$ to $C$. After this, we formu…
▽ More
Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $π$ to $C$. After this, we formulate an MILP that simulates the computation process of $f(C)$ from $C$ and that of $η(x)$ from $x$. Given a target value $y^*$ of the chemical property $π$, we infer a chemical graph $C^\dagger$ such that $η(f(C^\dagger))=y^*$ by solving the MILP. In this paper, we use linear regression to construct a prediction function $η$ instead of ANNs. For this, we derive an MILP formulation that simulates the computation process of a prediction function by linear regression. The results of computational experiments suggest our method can infer chemical graphs with around up to 50 non-hydrogen atoms.
△ Less
Submitted 23 August, 2021; v1 submitted 6 July, 2021;
originally announced July 2021.
-
Dual-view Molecule Pre-training
Authors:
Jinhua Zhu,
Yingce Xia,
Tao Qin,
Wengang Zhou,
Houqiang Li,
Tie-Yan Liu
Abstract:
Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing wo…
▽ More
Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing works on molecule pre-training use either graph representations only or SMILES representations only. In this work, we propose to leverage both the representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DMP), that can effectively combine the strengths of both types of molecule representations. The model of DMP consists of two branches: a Transformer branch that takes the SMILES sequence of a molecule as input, and a GNN branch that takes a molecular graph as input. The training of DMP contains three tasks: (1) predicting masked tokens in a SMILES sequence by the Transformer branch, (2) predicting masked atoms in a molecular graph by the GNN branch, and (3) maximizing the consistency between the two high-level representations output by the Transformer and GNN branches separately. After pre-training, we can use either the Transformer branch (this one is recommended according to empirical results), the GNN branch, or both for downstream tasks. DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performances on seven of them. Furthermore, we test DMP on three retrosynthesis tasks and achieve state-of-the-art results on them.
△ Less
Submitted 12 October, 2021; v1 submitted 16 June, 2021;
originally announced June 2021.
-
Causal inference for observational longitudinal studies using deep survival models
Authors:
Jie Zhu,
Blanca Gallego
Abstract:
Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an…
▽ More
Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history and time-dependent covariates. To tackle this longitudinal treatment effect estimation problem, we have developed a time-variant causal survival (TCS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of time-dependent covariates and treatments. Using simulated survival datasets, the TCS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In a large clinical cohort study, TCS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. TCS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions. The use of a propensity score layer and potential outcome subnetworks helps correcting for selection bias. However, the proposed model is limited in its ability to correct the bias from unmeasured confounding, and more extensive testing of TCS under extreme scenarios such as low overlapping and the presence of unmeasured confounders is desired and left for future work.
△ Less
Submitted 8 June, 2022; v1 submitted 26 January, 2021;
originally announced January 2021.
-
RRScell method for automated single-cell profiling of multiplexed immunofluorescence cancer tissue
Authors:
Alvason Zhenhua Li,
Karsten Eichholz,
Anton Sholukh,
Daniel Stone,
Michelle A. Loprieno,
Keith R. Jerome,
Khamsone Phasouk,
Kurt Diem,
Jia Zhu,
Lawrence Corey
Abstract:
Multiplexed immuno-fluorescence tissue imaging, allowing simultaneous detection of molecular properties of cells, is an essential tool for characterizing the complex cellular mechanisms in translational research and clinical practice. New image analysis approaches are needed because tissue section stained with a mixture of protein, DNA and RNA biomarkers are introducing various complexities, inclu…
▽ More
Multiplexed immuno-fluorescence tissue imaging, allowing simultaneous detection of molecular properties of cells, is an essential tool for characterizing the complex cellular mechanisms in translational research and clinical practice. New image analysis approaches are needed because tissue section stained with a mixture of protein, DNA and RNA biomarkers are introducing various complexities, including spurious edges due to fluorescent staining artifacts between touching or overlapping cells. We have developed the RRScell method harnessing the stochastic random-reaction-seed (RRS) algorithm and deep neural learning U-net to extract single-cell resolution profiling-map of gene expression over a million cells tissue section accurately and automatically. Furthermore, with the use of manifold learning technique UMAP for cell phenotype cluster analysis, the AI-driven RRScell has equipped with a marker-based image cytometry analysis tool (markerUMAP) in quantifying spatial distribution of cell phenotypes from tissue images with a mixture of biomarkers. The results achieved in this study suggest that RRScell provides a robust enough way for extracting cytometric single cell morphology as well as biomarker content in various tissue types, while the build-in markerUMAP tool secures the efficiency of dimension reduction, making it viable as a general tool in the spatial analysis of high dimensional tissue image.
△ Less
Submitted 18 March, 2021; v1 submitted 30 October, 2020;
originally announced November 2020.
-
Brain-inspired global-local learning incorporated with neuromorphic computing
Authors:
Yujie Wu,
Rong Zhao,
Jun Zhu,
Feng Chen,
Mingkun Xu,
Guoqi Li,
Sen Song,
Lei Deng,
Guanrui Wang,
Hao Zheng,
Jing Pei,
Youhui Zhang,
Mingguo Zhao,
Luping Shi
Abstract:
Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploi…
▽ More
Two main routes of learning methods exist at present including error-driven global learning and neuroscience-oriented local learning. Integrating them into one network may provide complementary learning capabilities for versatile learning scenarios. At the same time, neuromorphic computing holds great promise, but still needs plenty of useful algorithms and algorithm-hardware co-designs for exploiting the advantages. Here, we report a neuromorphic hybrid learning model by introducing a brain-inspired meta-learning paradigm and a differentiable spiking model incorporating neuronal dynamics and synaptic plasticity. It can meta-learn local plasticity and receive top-down supervision information for multiscale synergic learning. We demonstrate the advantages of this model in multiple different tasks, including few-shot learning, continual learning, and fault-tolerance learning in neuromorphic vision sensors. It achieves significantly higher performance than single-learning methods, and shows promise in empowering neuromorphic applications revolution. We further implemented the hybrid model in the Tianjic neuromorphic platform by exploiting algorithm-hardware co-designs and proved that the model can fully utilize neuromorphic many-core architecture to develop hybrid computation paradigm.
△ Less
Submitted 21 June, 2021; v1 submitted 5 June, 2020;
originally announced June 2020.
-
Noisy Pooled PCR for Virus Testing
Authors:
Junan Zhu,
Kristina Rivera,
Dror Baron
Abstract:
Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, wher…
▽ More
Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) pandemic. Despite their accuracy for single sample analysis, infectious diseases diagnostic tools, like RT-PCR, require substantial resources to test large populations. We develop a scalable approach for determining the viral status of pooled patient samples. Our approach converts group testing to a linear inverse problem, where false positives and negatives are interpreted as generated by a noisy communication channel, and a message passing algorithm estimates the illness status of patients. Numerical results reveal that our approach estimates patient illness using fewer pooled measurements than existing noisy group testing algorithms. Our approach can easily be extended to various applications, including where false negatives must be minimized. Finally, in a Utopian world we would have collaborated with RT-PCR experts; it is difficult to form such connections during a pandemic. We welcome new collaborators to reach out and help improve this work!
△ Less
Submitted 6 April, 2020;
originally announced April 2020.
-
MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis
Authors:
Hanshu Cai,
Yiwen Gao,
Shuting Sun,
Na Li,
Fuze Tian,
Han Xiao,
Jianxiu Li,
Zhengwu Yang,
Xiaowei Li,
Qinglin Zhao,
Zhenyu Liu,
Zhijun Yao,
Minqiang Yang,
Hong Peng,
Jing Zhu,
Xiaowei Zhang,
Guoping Gao,
Fang Zheng,
Rui Li,
Zhihua Guo,
Rong Ma,
Jing Yang,
Lan Zhang,
Xiping Hu,
Yumin Li
, et al. (1 additional authors not shown)
Abstract:
According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important…
▽ More
According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is due to the lack of physiological indicators for mental disorders. With the rising of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using traditional 128-electrodes mounted elastic cap, but also a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrodes EEG signals of 53 subjects were recorded as both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description. We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.
△ Less
Submitted 4 March, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Targeted Estimation of Heterogeneous Treatment Effect in Observational Survival Analysis
Authors:
Jie Zhu,
Blanca Gallego
Abstract:
The aim of clinical effectiveness research using repositories of electronic health records is to identify what health interventions 'work best' in real-world settings. Since there are several reasons why the net benefit of intervention may differ across patients, current comparative effectiveness literature focuses on investigating heterogeneous treatment effect and predicting whether an individua…
▽ More
The aim of clinical effectiveness research using repositories of electronic health records is to identify what health interventions 'work best' in real-world settings. Since there are several reasons why the net benefit of intervention may differ across patients, current comparative effectiveness literature focuses on investigating heterogeneous treatment effect and predicting whether an individual might benefit from an intervention. The majority of this literature has concentrated on the estimation of the effect of treatment on binary outcomes. However, many medical interventions are evaluated in terms of their effect on future events, which are subject to loss to follow-up. In this study, we describe a framework for the estimation of heterogeneous treatment effect in terms of differences in time-to-event (survival) probabilities. We divide the problem into three phases: (1) estimation of treatment effect conditioned on unique sets of the covariate vector; (2) identification of features important for heterogeneity using an ensemble of non-parametric variable importance methods; and (3) estimation of treatment effect on the reference classes defined by the previously selected features, using one-step Targeted Maximum Likelihood Estimation. We conducted a series of simulation studies and found that this method performs well when either sample size or event rate is high enough and the number of covariates contributing to the effect heterogeneity is moderate. An application of this method to a clinical case study was conducted by estimating the effect of oral anticoagulants on newly diagnosed non-valvular atrial fibrillation patients using data from the UK Clinical Practice Research Datalink.
△ Less
Submitted 22 October, 2019; v1 submitted 19 October, 2019;
originally announced October 2019.
-
Seq-SetNet: Exploring Sequence Sets for Inferring Structures
Authors:
Fusong Ju,
Jianwei Zhu,
Guozheng Wei,
Qi Zhang,
Shiwei Sun,
Dongbo Bu
Abstract:
Sequence set is a widely-used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes an multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represen…
▽ More
Sequence set is a widely-used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes an multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represent the distribution of amino acid types at each column. PSSM could capture column-wise characteristics of MSA, however, the column-wise characteristics embedded in each individual component sequence were nearly totally neglected.
The drawback of PSSM is rooted in the fact that an MSA is essentially an unordered sequence set rather than a matrix. Specifically, the interchange of any two sequences will not affect the whole MSA. In contrast, the pixels in an image essentially form a matrix since any two rows of pixels cannot be interchanged. Therefore, the traditional deep neural networks designed for image processing cannot be directly applied on sequence sets. Here, we proposed a novel deep neural network framework (called Seq-SetNet) for sequence set processing. By employing a {\it symmetric function} module to integrate features calculated from preceding layers, Seq-SetNet are immune to the order of sequences in the input MSA. This advantage enables us to directly and fully exploit MSAs by considering each component protein individually. We evaluated Seq-SetNet by using it to extract structural information from MSA for protein secondary structure prediction. Experimental results on popular benchmark sets suggests that Seq-SetNet outperforms the state-of-the-art approaches by 3.6% in precision. These results clearly suggest the advantages of Seq-SetNet in sequence set processing and it can be readily used in a wide range of fields, say natural language processing.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
A statistical normalization method and differential expression analysis for RNA-seq data between different species
Authors:
Yan Zhou,
Jiadi Zhu,
Tiejun Tong,
Junhui Wang,
Bingqing Lin,
Jun Zhang
Abstract:
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-p…
▽ More
Background: High-throughput techniques bring novel tools but also statistical challenges to genomic research. Identifying genes with differential expression between different species is an effective way to discover evolutionarily conserved transcriptional responses. To remove systematic variation between different species for a fair comparison, the normalization procedure serves as a crucial pre-processing step that adjusts for the varying sample sequencing depths and other confounding technical effects.
Results: In this paper, we propose a scale based normalization (SCBN) method by taking into account the available knowledge of conserved orthologous genes and hypothesis testing framework. Considering the different gene lengths and unmapped genes between different species, we formulate the problem from the perspective of hypothesis testing and search for the optimal scaling factor that minimizes the deviation between the empirical and nominal type I errors. Conclusions: Simulation studies show that the proposed method performs significantly better than the existing competitor in a wide range of settings. An RNA-seq dataset of different species is also analyzed and it coincides with the conclusion that the proposed method outperforms the existing method. For practical applications, we have also developed an R package named "SCBN" and the software is available at http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html.
△ Less
Submitted 3 October, 2018;
originally announced October 2018.
-
Prediction of Coronary Heart Disease Using Routine Blood Tests
Authors:
Ning Meng,
Peng Zhang,
Junfeng Li,
Jun He,
Jin Zhu
Abstract:
Background --The objective of this study was to examine the association of routine blood test results with coronary heart disease (CHD) risk, to incorporate them into coronary prediction models and to compare the discrimination properties of this approach with other prediction functions. Methods and Results --This work was designed as a retrospective, single-center study of a hospital-based cohort…
▽ More
Background --The objective of this study was to examine the association of routine blood test results with coronary heart disease (CHD) risk, to incorporate them into coronary prediction models and to compare the discrimination properties of this approach with other prediction functions. Methods and Results --This work was designed as a retrospective, single-center study of a hospital-based cohort. The 5060 CHD patients (2365 men and 2695 women) were 1 to 97 years old at baseline with 8 years (2009-2017) of medical records, 5051 health check-ups and 5075 cases of other diseases. We developed a two-layer Gradient Boosting Decision Tree(GBDT) model based on routine blood data to predict the risk of coronary heart disease, which could identify 86% of people with coronary heart disease. We built a dataset with 15,000 routine blood tests results. Using this dataset, we trained the two-layer GBDT model to classify healthy status, coronary heart disease and other diseases. As a result of the classification after machine learning, we found that the sensitivity of detecting the health data was approximately 93% for all data, and the sensitivity of detecting CHD was 93% for disease data that included coronary heart disease. On this basis, we further visualized the correlation between routine blood results and related data items, and there was an obvious pattern in health and coronary heart disease in all data presentations, which can be used for clinical reference. Finally, we briefly analyzed the results above from the perspective of pathophysiology. Conclusions --Routine blood data provides more information about CHD than what we already know through the correlation between test results and related data items. A simple coronary disease prediction model was developed using a GBDT algorithm, which will allow physicians to predict CHD risk in patients without overt CHD.
△ Less
Submitted 11 September, 2018;
originally announced September 2018.
-
Predicting protein inter-residue contacts using composite likelihood maximization and deep learning
Authors:
Haicang Zhang,
Qi Zhang,
Fusong Ju,
Jianwei Zhu,
Shiwei Sun,
Yujuan Gao,
Ziwei Xie,
Minghua Deng,
Shiwei Sun,
Wei-Mou Zheng,
Dongbo Bu
Abstract:
Accurate prediction of inter-residue contacts of a protein is important to calcu- lating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective to inferring inter-residue contacts. The Markov ran- dom field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is acc…
▽ More
Accurate prediction of inter-residue contacts of a protein is important to calcu- lating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective to inferring inter-residue contacts. The Markov ran- dom field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate, in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccu- rate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite- likelihood, i.e., the product of conditional probability of all residue pairs. Com- posite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, includ- ing PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction ac- curacy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset.
Accessibility: The software clmDCA and a server are publicly accessible through http://protein.ict.ac.cn/clmDCA/.
△ Less
Submitted 31 August, 2018;
originally announced September 2018.
-
Advances in Computational Methods for Phylogenetic Networks in the Presence of Hybridization
Authors:
R. A. L. Elworth,
H. A. Ogilvie,
J. Zhu,
L. Nakhleh
Abstract:
Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogeneti…
▽ More
Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogenetic network inference focused exclusively on reticulations and sought networks with the fewest number of reticulations to fit the data. As processes such as incomplete lineage sorting (ILS) could be at play concurrently with hybridization, work in the last decade has shifted to computational approaches for phylogenetic network inference in the presence of ILS. In such a short period, significant advances have been made on developing and implementing such computational approaches. In particular, parsimony, likelihood, and Bayesian methods have been devised for estimating phylogenetic networks and associated parameters using estimated gene trees as data. Use of those inference methods has been augmented with statistical tests for specific hypotheses of hybridization, like the D-statistic. Most recently, Bayesian approaches for inferring phylogenetic networks directly from sequence data were developed and implemented. In this chapter, we survey such advances and discuss model assumptions as well as methods' strengths and limitations. We also discuss parallel efforts in the population genetics community aimed at inferring similar structures. Finally, we highlight major directions for future research in this area.
△ Less
Submitted 26 August, 2018;
originally announced August 2018.
-
Network Enhancement: a general method to denoise weighted biological networks
Authors:
Bo Wang,
Armin Pourshafeie,
Marinka Zitnik,
Junjie Zhu,
Carlos D. Bustamante,
Serafim Batzoglou,
Jure Leskovec
Abstract:
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise rati…
▽ More
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise ratio of undirected, weighted networks. NE uses a doubly stochastic matrix operator that induces sparsity and provides a closed-form solution that increases spectral eigengap of the input network. As a result, NE removes weak edges, enhances real connections, and leads to better downstream performance. Experiments show that NE improves gene function prediction by denoising tissue-specific interaction networks, alleviates interpretation of noisy Hi-C contact maps from the human genome, and boosts fine-grained identification accuracy of species. Our results indicate that NE is widely applicable for denoising biological networks.
△ Less
Submitted 1 June, 2018; v1 submitted 8 May, 2018;
originally announced May 2018.
-
Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks
Authors:
Yujie Wu,
Lei Deng,
Guoqi Li,
Jun Zhu,
Luping Shi
Abstract:
Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising to explore the brain-like behaviors since the spikes could encode more spatio-temporal information. Although pre-training from ANN or direct training based on backpropagation (BP) makes the supervised training of SNNs possible, these methods only exploit the networks' spatial domain information which lead…
▽ More
Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising to explore the brain-like behaviors since the spikes could encode more spatio-temporal information. Although pre-training from ANN or direct training based on backpropagation (BP) makes the supervised training of SNNs possible, these methods only exploit the networks' spatial domain information which leads to the performance bottleneck and requires many complicated training skills. Another fundamental issue is that the spike activity is naturally non-differentiable which causes great difficulties in training SNNs. To this end, we build an iterative LIF model that is more friendly for gradient descent training. By simultaneously considering the layer-by-layer spatial domain (SD) and the timing-dependent temporal domain (TD) in the training phase, as well as an approximated derivative for the spike activity, we propose a spatio-temporal backpropagation (STBP) training framework without using any complicated technology. We achieve the best performance of multi-layered perceptron (MLP) compared with existing state-of-the-art algorithms over the static MNIST and the dynamic N-MNIST dataset as well as a custom object detection dataset. This work provides a new perspective to explore the high-performance SNNs for future brain-like computing paradigm with rich spatio-temporal dynamics.
△ Less
Submitted 12 September, 2017; v1 submitted 8 June, 2017;
originally announced June 2017.
-
SIMLR: A Tool for Large-Scale Genomic Analyses by Multi-Kernel Learning
Authors:
Bo Wang,
Daniele Ramazzotti,
Luca De Sano,
Junjie Zhu,
Emma Pierson,
Serafim Batzoglou
Abstract:
We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmar…
▽ More
We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization. Availability and Implementation
SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.
△ Less
Submitted 18 January, 2018; v1 submitted 21 March, 2017;
originally announced March 2017.
-
SeDMiD for Confusion Detection: Uncovering Mind State from Time Series Brain Wave Data
Authors:
Jingkang Yang,
Haohan Wang,
Jun Zhu,
Eric P. Xing
Abstract:
Understanding how brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and developing advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series…
▽ More
Understanding how brain functions has been an intriguing topic for years. With the recent progress on collecting massive data and developing advanced technology, people have become interested in addressing the challenge of decoding brain wave data into meaningful mind states, with many machine learning models and algorithms being revisited and developed, especially the ones that handle time series data because of the nature of brain waves. However, many of these time series models, like HMM with hidden state in discrete space or State Space Model with hidden state in continuous space, only work with one source of data and cannot handle different sources of information simultaneously. In this paper, we propose an extension of State Space Model to work with different sources of information together with its learning and inference algorithms. We apply this model to decode the mind state of students during lectures based on their brain waves and reach a significant better results compared to traditional methods.
△ Less
Submitted 29 November, 2016;
originally announced November 2016.