-
MODA: A Unified 3D Diffusion Framework for Multi-Task Target-Aware Molecular Generation
Authors:
Dong Xu,
Zhangfan Yang,
Sisi Yuan,
Jenna Xinyi Yao,
Jiangqiang Li,
Junkai Ji
Abstract:
Three-dimensional molecular generators based on diffusion models can now reach near-crystallographic accuracy, yet they remain fragmented across tasks. SMILES-only inputs, two-stage pretrain-finetune pipelines, and one-task-one-model practices hinder stereochemical fidelity, task alignment, and zero-shot transfer. We introduce MODA, a diffusion framework that unifies fragment growing, linker design, scaffold hopping, and side-chain decoration with a Bayesian mask scheduler. During training, a contiguous spatial fragment is masked and then denoised in one pass, enabling the model to learn shared geometric and chemical priors across tasks. Multi-task training yields a universal backbone that surpasses six diffusion baselines and three training paradigms on substructure, chemical property, interaction, and geometry. Model-C reduces ligand-protein clashes and substructure divergences while maintaining Lipinski compliance, whereas Model-B preserves similarity but trails in novelty and binding affinity. Zero-shot de novo design and lead-optimisation tests confirm stable negative Vina scores and high improvement rates without force-field refinement. These results demonstrate that a single-stage multi-task diffusion routine can replace two-stage workflows for structure-based molecular design.
Submitted 9 July, 2025;
originally announced July 2025.
-
ADPv2: A Hierarchical Histological Tissue Type-Annotated Dataset for Potential Biomarker Discovery of Colorectal Disease
Authors:
Zhiyuan Yang,
Kai Li,
Sophia Ghamoshi Ramandi,
Patricia Brassard,
Hakim Khellaf,
Vincent Quoc-Huy Trinh,
Jennifer Zhang,
Lina Chen,
Corwyn Rowsell,
Sonal Varma,
Kostas Plataniotis,
Mahdi S. Hosseini
Abstract:
Computational pathology (CoPath) leverages histopathology images to enhance diagnostic precision and reproducibility in clinical pathology. However, publicly available datasets for CoPath that are annotated with extensive histological tissue type (HTT) taxonomies at a granular level remain scarce due to the significant expertise and high annotation costs required. Existing datasets, such as the Atlas of Digital Pathology (ADP), address this by offering diverse HTT annotations generalized to multiple organs, but limit the capability for in-depth studies on specific organ diseases. Building upon this foundation, we introduce ADPv2, a novel dataset focused on gastrointestinal histopathology. Our dataset comprises 20,004 image patches derived from healthy colon biopsy slides, annotated according to a hierarchical taxonomy of 32 distinct HTTs across 3 levels. Furthermore, we train a multilabel representation learning model following a two-stage training procedure on our ADPv2 dataset. We leverage the VMamba architecture and achieve a mean average precision (mAP) of 0.88 in multilabel classification of colon HTTs. Finally, we show that our dataset is capable of an organ-specific in-depth study for potential biomarker discovery by analyzing the model's prediction behavior on tissues affected by different colon diseases, which reveals statistical patterns that confirm the two pathological pathways of colon cancer development. Our dataset is publicly available at https://zenodo.org/records/15307021
Submitted 9 July, 2025; v1 submitted 8 July, 2025;
originally announced July 2025.
-
Reimagining Target-Aware Molecular Generation through Retrieval-Enhanced Aligned Diffusion
Authors:
Dong Xu,
Zhangfan Yang,
Ka-chun Wong,
Zexuan Zhu,
Jiangqiang Li,
Junkai Ji
Abstract:
Breakthroughs in high-accuracy protein structure prediction, such as AlphaFold, have established receptor-based molecule design as a critical driver for rapid early-phase drug discovery. However, most approaches still struggle to balance pocket-specific geometric fit with strict valence and synthetic constraints. To resolve this trade-off, a Retrieval-Enhanced Aligned Diffusion termed READ is introduced, which is the first to merge molecular Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model. Specifically, a contrastively pre-trained encoder aligns atom-level representations during training, then retrieves graph embeddings of pocket-matched scaffolds to guide each reverse-diffusion step at inference. This single mechanism can inject real-world chemical priors exactly where needed, producing valid, diverse, and shape-complementary ligands. Experimental results demonstrate that READ can achieve very competitive performance in CBGBench, surpassing state-of-the-art generative models and even native ligands. That suggests retrieval and diffusion can be co-optimized for faster, more reliable structure-based drug design.
Submitted 17 June, 2025;
originally announced June 2025.
-
SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model
Authors:
Zhao Yang,
Jiwei Zhu,
Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at https://github.com/ZhuJiwei111/SPACE.
Submitted 2 June, 2025;
originally announced June 2025.
-
High-throughput Screening of the Mechanical Properties of Peptide Assemblies
Authors:
Sarah K. Yorke,
Zhenze Yang,
Aviad Levin,
Alice Ray,
Jeremy Owusu Boamah,
Tuomas P. J. Knowles,
Markus J. Buehler
Abstract:
Peptides are recognized for their varied self-assembly behaviors, forming a wide array of structures and geometries, such as spheres, fibers, and hydrogels, each presenting a unique set of material properties. The functionalities of these materials hold exceptional interest for applications in biology, medicine, photonics, nanotechnology and the food industry. Specifically, the ability to exploit peptides as viable and sustainable mechanical materials requires sequence design that enables superior performance, notably a high Young's modulus. As the peptide sequence space is vast, however, even a slight increase in sequence length leads to an exponential increase in the number of potential peptide sequences to be characterized. Here, we combine coarse-grained molecular dynamics simulations, atomic force microscopy experiments and machine learning models to correlate the sequence length and composition with the mechanical properties of self-assembled peptides. We calculate the Young's modulus for all possible amino acid sequences of di- and tripeptides using high-throughput coarse-grained methods, and validate these calculations through in-situ mechanical characterization. For pentapeptides, we select and calculate properties for a subset of sequences to train a machine learning model, which allows us to predict the modulus for other sequences. The combined workflow not only identifies promising peptide candidates with exceptional mechanical performance, but also extends current understanding of the sequence-to-function relationships of peptide materials for specific applications.
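The exponential growth mentioned in the abstract can be made concrete with a short sketch. It assumes an alphabet of the 20 canonical amino acids, which the abstract does not state explicitly; the function name `sequence_space` is illustrative only and is not taken from the paper.

```python
# Assumption: 20 canonical amino acids; the abstract does not specify the alphabet.
N_AMINO_ACIDS = 20


def sequence_space(length: int, alphabet_size: int = N_AMINO_ACIDS) -> int:
    """Count the distinct linear peptide sequences of a given length."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    return alphabet_size ** length


# Di-, tri-, and pentapeptides, the three lengths named in the abstract.
counts = {n: sequence_space(n) for n in (2, 3, 5)}
for n, count in counts.items():
    print(f"length {n}: {count:,} sequences")
```

Under this assumption, di- and tripeptides give 400 and 8,000 sequences, small enough to enumerate exhaustively, while pentapeptides give 3,200,000, which is consistent with the abstract's choice to simulate only a subset and predict the rest with a learned model.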
Submitted 13 May, 2025;
originally announced May 2025.
-
The Dynamics of Inducible Genetic Circuits
Authors:
Zitao Yang,
Rebecca J. Rousseau,
Sara D. Mahdavi,
Hernan G. Garcia,
Rob Phillips
Abstract:
Genes are connected in complex networks of interactions where often the product of one gene is a transcription factor that alters the expression of another. Many of these networks are based on a few fundamental motifs leading to switches and oscillators of various kinds. And yet, there is more to the story than which transcription factors control these various circuits. These transcription factors are often themselves under the control of effector molecules that bind them and alter their level of activity. Traditionally, much beautiful work has shown how to think about the stability of the different states achieved by these fundamental regulatory architectures by examining how parameters such as transcription rates, degradation rates and dissociation constants tune the circuit, giving rise to behavior such as bistability. However, such studies explore dynamics without asking how these quantities are altered in real time in living cells as opposed to at the fingertips of the synthetic biologist's pipette or on the computational biologist's computer screen. In this paper, we make a departure from the conventional dynamical systems view of these regulatory motifs by using statistical mechanical models to focus on endogenous signaling knobs such as effector concentrations rather than on the convenient but more experimentally remote knobs such as dissociation constants, transcription rates and degradation rates that are often considered. We also contrast the traditional use of Hill functions to describe transcription factor binding with more detailed thermodynamic models. This approach provides insights into how biological parameters are tuned to control the stability of regulatory motifs in living cells, sometimes revealing quite a different picture than is found by using Hill functions and tuning circuit parameters by hand.
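For readers unfamiliar with the Hill functions that the abstract contrasts with thermodynamic models, the following is a minimal sketch of the standard activating Hill form. The parameter names (`k_half`, `n`) and the example values are illustrative assumptions and are not taken from the paper; the paper's thermodynamic models are not reproduced here.

```python
def hill_activation(conc: float, k_half: float, n: float) -> float:
    """Standard activating Hill function: (c/K)^n / (1 + (c/K)^n).

    conc   -- transcription factor (or effector) concentration
    k_half -- concentration giving half-maximal response
    n      -- Hill coefficient (steepness of the response)
    """
    if conc < 0 or k_half <= 0:
        raise ValueError("conc must be >= 0 and k_half must be > 0")
    ratio = (conc / k_half) ** n
    return ratio / (1.0 + ratio)


# Illustrative values only: a steeper coefficient sharpens the switch around k_half.
for n in (1, 2, 4):
    low = hill_activation(0.5, 1.0, n)
    high = hill_activation(2.0, 1.0, n)
    print(f"n={n}: response at 0.5*K = {low:.3f}, at 2*K = {high:.3f}")
```

In this form the circuit is tuned by changing `k_half` or `n` directly, which is the kind of "by hand" parameter tuning the abstract contrasts with tuning through effector concentrations.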
Submitted 11 May, 2025;
originally announced May 2025.
-
DeepGDel: Deep Learning-based Gene Deletion Prediction Framework for Growth-Coupled Production in Genome-Scale Metabolic Models
Authors:
Ziwei Yang,
Takeyuki Tamura
Abstract:
In genome-scale constraint-based metabolic models, gene deletion strategies are crucial for achieving growth-coupled production, where cell growth and target metabolite production are simultaneously achieved. While computational methods for calculating gene deletions have been widely explored and contribute to developing gene deletion strategy databases, current approaches are limited in leveraging new data-driven paradigms, such as machine learning, for more efficient strain design. Therefore, it is necessary to propose a fundamental framework for this objective. In this study, we first formulate the problem of gene deletion strategy prediction and then propose a framework for predicting gene deletion strategies for growth-coupled production in genome-scale metabolic models. The proposed framework leverages deep learning algorithms to learn and integrate sequential gene and metabolite data representation, enabling the automatic gene deletion strategy prediction. Computational experiment results demonstrate the feasibility of the proposed framework, showing substantial improvements over baseline methods. Specifically, the proposed framework achieves a 14.69%, 22.52%, and 13.03% increase in overall accuracy across three metabolic models of different scales under study, while maintaining balanced precision and recall in predicting gene deletion statuses. The source code and examples for the framework are publicly available at https://github.com/MetNetComp/DeepGDel.
Submitted 19 June, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
Regulatory DNA sequence Design with Reinforcement Learning
Authors:
Zhao Yang,
Bing Su,
Chuan Cao,
Ji-Rong Wen
Abstract:
Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at https://github.com/yangzhao1230/TACO.
Submitted 10 March, 2025;
originally announced March 2025.
-
Enhancing Alzheimer's Diagnosis: Leveraging Anatomical Landmarks in Graph Convolutional Neural Networks on Tetrahedral Meshes
Authors:
Yanxi Chen,
Mohammad Farazi,
Zhangsihao Yang,
Yonghui Fan,
Nicholas Ashton,
Eric M Reiman,
Yi Su,
Yalin Wang
Abstract:
Alzheimer's disease (AD) is a major neurodegenerative condition that affects millions around the world. As one of the main biomarkers in the AD diagnosis procedure, brain amyloid positivity is typically identified by positron emission tomography (PET), which is costly and invasive. Brain structural magnetic resonance imaging (sMRI) may provide a safer and more convenient solution for AD diagnosis. Recent advances in geometric deep learning have facilitated sMRI analysis and early diagnosis of AD. However, determining AD pathology, such as brain amyloid deposition, in the preclinical stage remains challenging, as less significant morphological changes can be observed. As a result, few AD classification models are generalizable to the brain amyloid positivity classification task. Blood-based biomarkers (BBBMs), on the other hand, have recently achieved remarkable success in predicting brain amyloid positivity and identifying individuals with a high risk of being brain amyloid positive. However, individuals in the medium-risk group still require gold-standard tests such as amyloid PET for further evaluation. Inspired by the recent success of transformer architectures, we propose a transformer-based geometric deep learning model that is both scalable and robust to variations in input volumetric mesh size. Our work introduces a novel tokenization scheme for tetrahedral meshes, incorporating anatomical landmarks generated by a pre-trained Gaussian process model. Our model achieved superior classification performance in the AD classification task. In addition, we showed that the model also generalizes to brain amyloid positivity prediction for individuals in the medium-risk class, where BBBMs alone cannot achieve a clear classification. Our work may enrich geometric deep learning research and improve AD diagnosis accuracy without using expensive and invasive PET scans.
Submitted 6 March, 2025;
originally announced March 2025.
-
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model
Authors:
Mingqian Ma,
Guoqing Liu,
Chuan Cao,
Pan Deng,
Tri Dao,
Albert Gu,
Peiran Jin,
Zhao Yang,
Yingce Xia,
Renqian Luo,
Pipi Hu,
Zun Wang,
Yuan-Jyue Chen,
Haiguang Liu,
Tao Qin
Abstract:
Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".
Submitted 17 February, 2025; v1 submitted 15 February, 2025;
originally announced February 2025.
-
DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
Authors:
Xuefeng Liu,
Songhao Jiang,
Siyu Chen,
Zhuoran Yang,
Yuxin Chen,
Ian Foster,
Rick Stevens
Abstract:
Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduces a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives while retaining the beneficial chemical properties of the original drug. This work comprises two primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. It includes an LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from the SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
Submitted 10 February, 2025;
originally announced February 2025.
-
A large population of cell-specific action potential models replicating fluorescence recordings of voltage in rabbit ventricular myocytes
Authors:
Radostin D. Simitev,
Rebecca J. Gilchrist,
Zhechao Yang,
Rachel Myles,
Francis Burton,
Godfrey L. Smith
Abstract:
Recent high-throughput experiments unveil substantial electrophysiological diversity among uncoupled healthy myocytes under identical conditions. To quantify inter-cell variability, the values of a subset of the parameters in a well-regarded mathematical model of the action potential of rabbit ventricular myocytes are estimated from fluorescence voltage measurements of a large number of cells. Statistical inference yields a population of nearly 1200 cell-specific model variants that, at the population level, replicate experimentally measured biomarker ranges and distributions and, in contrast to earlier studies, also match experimental biomarker values on a cell-by-cell basis. This model population may be regarded as a random sample from the phenotype of healthy rabbit ventricular myocytes. Univariate and bivariate joint marginal distributions of the estimated parameters are presented, and the parameter dependencies of several commonly utilised electrophysiological biomarkers are revealed. Parameter values are weakly correlated, while summary metrics such as the action potential duration are not strongly dependent on any single electrophysiological characteristic of the myocyte. Our results demonstrate the feasibility of accurately and efficiently fitting entire action potential waveforms at scale.
Keywords: cellular excitability, rabbit ventricular myocytes, fluorescence voltage measurements, action potential waveform, parameter estimation in differential equations, noisy time series
Submitted 13 January, 2025;
originally announced January 2025.
-
Large Language Models for Bioinformatics
Authors:
Wei Ruan,
Yanjun Lyu,
Jing Zhang,
Jiazhang Cai,
Peng Shu,
Yang Ge,
Yao Lu,
Shang Gao,
Yue Wang,
Peilong Wang,
Lin Zhao,
Tao Wang,
Yufang Liu,
Luyang Fang,
Ziyu Liu,
Zhengliang Liu,
Yiwei Li,
Zihao Wu,
Junhao Chen,
Hanqi Jiang,
Yi Pan,
Zhenyuan Yang,
Jingyuan Chen,
Shizhe Liang,
Wei Zhang
, et al. (30 additional authors not shown)
Abstract:
With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
Submitted 9 January, 2025;
originally announced January 2025.
-
Interpretable Enzyme Function Prediction via Residue-Level Detection
Authors:
Zhao Yang,
Bing Su,
Jiahao Chen,
Ji-Rong Wen
Abstract:
Predicting multiple functions labeled with Enzyme Commission (EC) numbers from the enzyme sequence is of great significance but remains a challenge due to its sparse multi-label classification nature, i.e., each enzyme is typically associated with only a few labels out of more than 6000 possible EC numbers. However, existing machine learning algorithms generally learn a fixed global representation for each enzyme to classify all functions, so they lack interpretability, and the fine-grained information of some function-specific local residue fragments may be overwhelmed. Here we present an attention-based framework, namely ProtDETR (Protein Detection Transformer), by casting enzyme function prediction as a detection problem. It uses a set of learnable functional queries to adaptively extract different local representations from the sequence of residue-level features for predicting different EC numbers. ProtDETR not only significantly outperforms existing deep learning-based enzyme function prediction methods, but also provides a new interpretable perspective on automatically detecting different local regions for identifying different functions through cross-attentions between queries and residue-level features. Code is available at https://github.com/yangzhao1230/ProtDETR.
Submitted 5 June, 2025; v1 submitted 9 January, 2025;
originally announced January 2025.
-
DBgDel: Database-Enhanced Gene Deletion Framework for Growth-Coupled Production in Genome-Scale Metabolic Models
Authors:
Ziwei Yang,
Takeyuki Tamura
Abstract:
When simulating metabolite production with genome-scale constraint-based metabolic models, gene deletion strategies are necessary to achieve growth-coupled production, which means cell growth and target metabolite production occur simultaneously. Since obtaining gene deletion strategies for large genome-scale models requires significant computational time, it is necessary to develop methods to mitigate this computational burden. In this study, we introduce a novel framework for computing gene deletion strategies. The proposed framework first mines related databases to extract prior information about gene deletions for growth-coupled production. It then integrates the extracted information with downstream algorithms to narrow down the algorithmic search space, resulting in highly efficient calculations on genome-scale models. Computational experiment results demonstrated that our framework can compute stoichiometrically feasible gene deletion strategies for numerous target metabolites, showcasing a noteworthy improvement in computational efficiency. Specifically, our framework achieves an average 6.1-fold acceleration in computational speed compared to existing methods while maintaining a respectable success rate. The source code of DBgDel, with examples, is available at https://github.com/MetNetComp/DBgDel.
Submitted 26 March, 2025; v1 submitted 11 November, 2024;
originally announced November 2024.
-
Can Large Language Models Replace Data Scientists in Biomedical Research?
Authors:
Zifeng Wang,
Benjamin Danek,
Ziwei Yang,
Zheng Chen,
Jimeng Sun
Abstract:
Data science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, existing evaluations fail to assess their capability in biomedical data science, particularly in handling diverse data types such as genomics and clinical datasets. To address this gap, we developed a benchmark of data science coding tasks derived from the analyses of 39 published studies. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. Our findings reveal that the vanilla prompting of LLMs yields suboptimal performance due to drawbacks in following input instructions, understanding target data, and adhering to standard analysis practices. Next, we benchmarked six cutting-edge LLMs and advanced adaptation methods, finding two methods to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis and led to a code accuracy improvement of about 21 percentage points (56.6% versus 35.3%); and self-reflection, which enables LLMs to refine buggy code iteratively and yielded a code accuracy improvement of about 11 percentage points (45.5% versus 34.3%). Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical professionals, we found that while LLMs cannot fully automate programming tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs to enhance data science efficiency in biomedical research when integrated into expert workflows.
Submitted 8 April, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Atrial Fibrillation Detection System via Acoustic Sensing for Mobile Phones
Authors:
Xuanyu Liu,
Jiao Li,
Haoxian Liu,
Zongqi Yang,
Yi Huang,
Jin Zhang
Abstract:
Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of AF, early and timely monitoring is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution; however, these systems have various applicability issues, such as being easily affected by environmental factors and requiring significant user effort. To overcome these limitations, we present MobileAF, a novel smartphone-based AF detection system using speakers and microphones. In order to capture minute cardiac activities, we propose a multi-channel pulse wave probing method. In addition, we enhance the signal quality by introducing a three-stage pulse wave purification pipeline. Furthermore, a ResNet-based network model is built to implement accurate and reliable AF detection. We collect data from 23 participants utilizing our data collection application on the smartphone. Extensive experimental results demonstrate the superior performance of our system, with 97.9% accuracy, 96.8% precision, 97.2% recall, 98.3% specificity, and 97.0% F1 score.
Submitted 28 October, 2024;
originally announced October 2024.
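Note on the reported metrics: the F1 score quoted in the MobileAF abstract can be cross-checked against the quoted precision and recall, since F1 is their harmonic mean. The snippet below is only a consistency check on the published figures, not the authors' evaluation code; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, converted from percentages to fractions.
reported_precision = 0.968
reported_recall = 0.972

# Recompute F1 and express it as a percentage rounded to one decimal place.
f1_percent = round(100 * f1_score(reported_precision, reported_recall), 1)
print(f1_percent)
```

The recomputed value agrees with the 97.0% F1 score stated in the abstract.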
-
Design of Ligand-Binding Proteins with Atomic Flow Matching
Authors:
Junqi Liu,
Shaoning Li,
Chence Shi,
Zhi Yang,
Jian Tang
Abstract:
Designing novel proteins that bind to small molecules is a long-standing challenge in computational biology, with applications in developing catalysts, biosensors, and more. Current computational methods rely on the assumption that the binding pose of the target molecule is known, which is not always feasible, as conformations of novel targets are often unknown and tend to change upon binding. In this work, we formulate proteins and molecules as unified biotokens, and present AtomFlow, a novel deep generative model under the flow-matching framework for the design of ligand-binding proteins from the 2D target molecular graph alone. Operating on representative atoms of biotokens, AtomFlow captures the flexibility of ligands and generates ligand conformations and protein backbone structures iteratively. We consider the multi-scale nature of biotokens and demonstrate that AtomFlow can be effectively trained on a subset of structures from the Protein Data Bank, by matching the flow vector field using an SE(3)-equivariant structure prediction network. Experimental results show that our method can generate high-fidelity ligand-binding proteins and achieve performance comparable to the state-of-the-art model RFDiffusionAA, while not requiring bound ligand structures. As a general framework, AtomFlow holds the potential to be applied to various biomolecule generation tasks in the future.
Submitted 18 September, 2024;
originally announced September 2024.
-
MLOmics: Cancer Multi-Omics Database for Machine Learning
Authors:
Ziwei Yang,
Rikuto Kotoge,
Xihao Piao,
Zheng Chen,
Lingwei Zhu,
Peng Gao,
Yasuko Matsubara,
Yasushi Sakurai,
Jimeng Sun
Abstract:
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. In this paper, we introduce MLOmics, an open cancer multi-omics database aiming to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
Submitted 16 June, 2025; v1 submitted 2 September, 2024;
originally announced September 2024.
-
Generative models of MRI-derived neuroimaging features and associated dataset of 18,000 samples
Authors:
Sai Spandana Chintapalli,
Rongguang Wang,
Zhijian Yang,
Vasiliki Tassopoulou,
Fanyang Yu,
Vishnu Bashyam,
Guray Erus,
Pratik Chaudhari,
Haochang Shou,
Christos Davatzikos
Abstract:
Availability of large and diverse medical datasets is often challenged by privacy and data sharing restrictions. For successful application of machine learning techniques for disease diagnosis, prognosis, and precision medicine, large amounts of data are necessary for model building and optimization. To help overcome such limitations in the context of brain MRI, we present GenMIND: a collection of generative models of normative regional volumetric features derived from structural brain imaging. GenMIND models are trained on real brain imaging regional volumetric measures from the iSTAGING consortium, which encompasses over 40,000 MRI scans across 13 studies, incorporating covariates such as age, sex, and race. Leveraging GenMIND, we produce and offer 18,000 synthetic samples spanning the adult lifespan (ages 22-90 years), alongside the model's capability to generate unlimited data. Experimental results indicate that samples generated from GenMIND agree with the distributions obtained from real data. Most importantly, the generated normative data significantly enhance the accuracy of downstream machine learning models on tasks such as disease classification. Data and models are available at: https://huggingface.co/spaces/rongguangw/GenMIND.
Submitted 1 October, 2024; v1 submitted 17 July, 2024;
originally announced July 2024.
-
MolTC: Towards Molecular Relational Modeling In Language Models
Authors:
Junfeng Fang,
Shuai Zhang,
Chang Wu,
Zhengyi Yang,
Zhiyuan Liu,
Sihang Li,
Kun Wang,
Wenjie Du,
Xiang Wang
Abstract:
Molecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising way for efficient and effective MRL. Despite their potential, these methods predominantly rely on textual data, thus not fully harnessing the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates the issue of information underutilization, as it hinders the sharing of interaction mechanisms learned across diverse datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrates the graph information of the two molecules in a pair. To train MolTC efficiently, we introduce a Multi-hierarchical CoT concept to refine its training paradigm, and construct a comprehensive Molecular Interactive Instructions dataset for the development of biochemical LLMs involving MRL. Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC.
Submitted 10 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Dimensional Neuroimaging Endophenotypes: Neurobiological Representations of Disease Heterogeneity Through Machine Learning
Authors:
Junhao Wen,
Mathilde Antoniades,
Zhijian Yang,
Gyujoon Hwang,
Ioanna Skampardoni,
Rongguang Wang,
Christos Davatzikos
Abstract:
Machine learning has been increasingly used to obtain individualized neuroimaging signatures for disease diagnosis, prognosis, and response to treatment in neuropsychiatric and neurodegenerative disorders. Therefore, it has contributed to a better understanding of disease heterogeneity by identifying disease subtypes that present significant differences in various brain phenotypic measures. In this review, we first present a systematic literature overview of studies using machine learning and multimodal MRI to unravel disease heterogeneity in various neuropsychiatric and neurodegenerative disorders, including Alzheimer disease, schizophrenia, major depressive disorder, autism spectrum disorder, multiple sclerosis, as well as their potential in transdiagnostic settings. Subsequently, we summarize relevant machine learning methodologies and discuss an emerging paradigm which we call dimensional neuroimaging endophenotype (DNE). DNE dissects the neurobiological heterogeneity of neuropsychiatric and neurodegenerative disorders into a low dimensional yet informative, quantitative brain phenotypic representation, serving as a robust intermediate phenotype (i.e., endophenotype) largely reflecting underlying genetics and etiology. Finally, we discuss the potential clinical implications of the current findings and envision future research avenues.
Submitted 17 January, 2024;
originally announced January 2024.
-
Identifying Semantic Component for Robust Molecular Property Prediction
Authors:
Zijian Li,
Zunhong Xu,
Ruichu Cai,
Zhenhui Yang,
Yuguang Yan,
Zhifeng Hao,
Guangyi Chen,
Kun Zhang
Abstract:
Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Different from existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-components identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified into semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Subsequently, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the component-wise identifiability of SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.
Submitted 8 November, 2023;
originally announced November 2023.
-
MoCLIM: Towards Accurate Cancer Subtyping via Multi-Omics Contrastive Learning with Omics-Inference Modeling
Authors:
Ziwei Yang,
Zheng Chen,
Yasuko Matsubara,
Yasushi Sakurai
Abstract:
Precision medicine fundamentally aims to establish causality between dysregulated biochemical mechanisms and cancer subtypes. Omics-based cancer subtyping has emerged as a revolutionary approach, as different levels of omics record the biochemical products of multistep processes in cancers. This paper focuses on fully exploiting the potential of multi-omics data to improve cancer subtyping outcomes, and hence develops MoCLIM, a representation learning framework. MoCLIM independently extracts the informative features from distinct omics modalities. Using a unified representation informed by contrastive learning of different omics modalities, we can cluster the subtypes of a given cancer well in a lower-dimensional latent space. This contrast can be interpreted as a projection of inter-omics inference observed in biological networks. Experimental results on six cancer datasets demonstrate that our approach significantly improves data fit and subtyping performance in fewer high-dimensional cancer instances. Moreover, our framework incorporates various medical evaluations as the final component, providing high interpretability in medical analysis.
Submitted 24 August, 2023; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Digital twin brain: a bridge between biological intelligence and artificial intelligence
Authors:
Hui Xiong,
Congying Chu,
Lingzhong Fan,
Ming Song,
Jiaqi Zhang,
Yawei Ma,
Ruonan Zheng,
Junyang Zhang,
Zhengyi Yang,
Tianzi Jiang
Abstract:
In recent years, advances in neuroscience and artificial intelligence have paved the way for unprecedented opportunities for understanding the complexity of the brain and its emulation by computational systems. Cutting-edge advancements in neuroscience research have revealed the intricate relationship between brain structure and function, while the success of artificial neural networks highlights the importance of network architecture. Now is the time to bring them together to better unravel how intelligence emerges from the brain's multiscale repositories. In this review, we propose the Digital Twin Brain (DTB) as a transformative platform that bridges the gap between biological and artificial intelligence. It consists of three core elements: the brain structure that is fundamental to the twinning process, bottom-layer models to generate brain functions, and its wide spectrum of applications. Crucially, brain atlases provide a vital constraint, preserving the brain's network organization within the DTB. Furthermore, we highlight open questions that invite joint efforts from interdisciplinary fields and emphasize the far-reaching implications of the DTB. The DTB can offer unprecedented insights into the emergence of intelligence and neurological disorders, which holds tremendous promise for advancing our understanding of both biological and artificial intelligence, and ultimately propelling the development of artificial general intelligence and facilitating precision mental healthcare.
Submitted 2 August, 2023;
originally announced August 2023.
-
Microbiome-derived bile acids contribute to elevated antigenic response and bone erosion in rheumatoid arthritis
Authors:
Xiuli Su,
Xiaona Li,
Yanqin Bian,
Qing Ren,
Leiguang Li,
Xiaohao Wu,
Hemi Luan,
Bing He,
Xiaojuan He,
Hui Feng,
Xingye Cheng,
Pan-Jun Kim,
Leihan Tang,
Aiping Lu,
Lianbo Xiao,
Liang Tian,
Zhu Yang,
Zongwei Cai
Abstract:
Rheumatoid arthritis (RA) is a chronic, disabling and incurable autoimmune disease. It has been widely recognized that gut microbial dysbiosis is an important contributor to the pathogenesis of RA, although distinct alterations in microbiota have been associated with this disease. Yet, the metabolites that mediate the impacts of the gut microbiome on RA are less well understood. Here, with microbial profiling and non-targeted metabolomics, we revealed profound yet diverse perturbation of the gut microbiome and metabolome in RA patients in a discovery set. In the Bacteroides-dominated RA patients, differentiation of gut microbiome resulted in distinct bile acid profiles compared to healthy subjects. Predominant Bacteroides species expressing BSH and 7a-HSDH increased, leading to elevated secondary bile acid production in this subgroup of RA patients. Reduced serum fibroblast growth factor-19 and dysregulated bile acids were evidence of impaired farnesoid X receptor-mediated signaling in the patients. This gut microbiota-bile acid axis was correlated with ACPA. The patients from the validation sets demonstrated that ACPA-positive patients have more abundant bacteria expressing BSH and 7a-HSDH but less Clostridium scindens expressing 7a-dehydroxylation enzymes, together with dysregulated microbial bile acid metabolism and more severe bone erosion than ACPA-negative ones. Mediation analyses revealed putative causal relationships between the gut microbiome, bile acids, and ACPA-positive RA, supporting a potential causal effect of Bacteroides species in increasing levels of ACPA and bone erosion mediated via disturbing bile acid metabolism. These results provide insights into the role of gut dysbiosis in RA in a manifestation-specific manner, as well as the functions of bile acids in this gut-joint axis, which may be a potential intervention target for precisely controlling RA conditions.
Submitted 14 July, 2023;
originally announced July 2023.
-
Can Large Language Models Empower Molecular Property Prediction?
Authors:
Chen Qian,
Huayi Tang,
Zhirui Yang,
Hong Liang,
Yong Liu
Abstract:
Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule can be represented either as graph-structured data or as a SMILES text. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to utilize LLMs to assist in understanding molecules represented by SMILES, the exploration of how LLMs will impact molecular property prediction is still in its early stage. In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules. To be specific, we first prompt LLMs to do in-context molecular classification and evaluate their performance. After that, we employ LLMs to generate semantically enriched explanations for the original SMILES and then leverage them to fine-tune a small-scale LM for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets, and confirm the immense potential of LLMs in molecular property prediction tasks. Code is available at https://github.com/ChnQ/LLM4Mol.
Submitted 14 July, 2023;
originally announced July 2023.
-
Fusing Structural and Functional Connectivities using Disentangled VAE for Detecting MCI
Authors:
Qiankun Zuo,
Yanfei Zhu,
Libin Lu,
Zhi Yang,
Yuhui Li,
Ning Zhang
Abstract:
Brain network analysis is a useful approach to studying human brain disorders because it can distinguish patients from healthy people by detecting abnormal connections. Due to the complementary information from multiple modal neuroimages, multimodal fusion technology has a lot of potential for improving prediction performance. However, effective fusion of multimodal medical images to achieve complementarity is still a challenging problem. In this paper, a novel hierarchical structural-functional connectivity fusing (HSCF) model is proposed to construct brain structural-functional connectivity matrices and predict abnormal brain connections based on functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). Specifically, prior knowledge is incorporated into the separators, which disentangle the information of each modality using graph convolutional networks (GCN). A disentangled cosine distance loss is also devised to ensure the disentanglement's effectiveness. Moreover, the hierarchical representation fusion module is designed to effectively maximize the combination of relevant and effective features between modalities, which makes the generated structural-functional connectivity more robust and discriminative in the cognitive disease analysis. Results from a wide range of tests performed on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) database show that the proposed model performs better than competing approaches in terms of classification evaluation. In general, the proposed HSCF model is a promising model for generating brain structural-functional connectivities and identifying abnormal brain connections as cognitive disease progresses.
Submitted 21 August, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity Prediction
Authors:
Jiaxian Yan,
Zhaofeng Ye,
Ziyi Yang,
Chengqiang Lu,
Shengyu Zhang,
Qi Liu,
Jiezhong Qiu
Abstract:
Protein-ligand binding affinity (PLBA) prediction is the fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by incorporating the three-dimensional structure of protein-ligand complexes as input and have achieved astounding progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. In addition, different bioassays use varying affinity measurement labels (i.e., IC50, Ki, Kd), and different experimental conditions inevitably introduce systematic noise, which poses a significant challenge to constructing high-precision affinity prediction models. To address these issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; (2) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked three-dimensional structures. By introducing multi-task pre-training to treat the prediction of different affinity labels as different tasks and classifying relative rankings between samples from the same bioassay, MBP learns robust and transferable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the capability of MBP as a general framework that can improve and be tailored to mainstream structure-based PLBA prediction tasks. To the best of our knowledge, MBP is the first affinity pre-training model and shows great potential for future development.
Submitted 20 December, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
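Note on the ranking objective: the MBP abstract mentions "classifying relative rankings between samples from the same bioassay". One plausible reading is that training pairs are formed only within a bioassay, so labels measured under different conditions are never compared directly. The sketch below illustrates that pairing step under this assumption; the record fields (`id`, `assay`, `value`) and the function name are hypothetical and are not taken from the MBP code.

```python
from collections import defaultdict
from itertools import combinations


def ranking_pairs(samples):
    """Build (id_a, id_b, label) pairs restricted to samples of the same bioassay.

    label is 1 if sample a has the higher measured value than sample b, else 0.
    Ties are skipped because they carry no ranking information.
    """
    # Group samples by their (hypothetical) bioassay identifier.
    by_assay = defaultdict(list)
    for sample in samples:
        by_assay[sample["assay"]].append(sample)

    pairs = []
    for assay_samples in by_assay.values():
        # Only compare samples that share a bioassay.
        for a, b in combinations(assay_samples, 2):
            if a["value"] == b["value"]:
                continue
            pairs.append((a["id"], b["id"], 1 if a["value"] > b["value"] else 0))
    return pairs


# Toy records: three ligands measured in assay "A", one in assay "B".
samples = [
    {"id": "L1", "assay": "A", "value": 6.2},
    {"id": "L2", "assay": "A", "value": 7.5},
    {"id": "L3", "assay": "B", "value": 5.0},
    {"id": "L4", "assay": "A", "value": 6.2},
]
pairs = ranking_pairs(samples)
print(pairs)
```

In this toy input, the single sample in assay "B" produces no pairs and the tied pair (L1, L4) is dropped, which is the behaviour one would want if cross-assay labels are considered incomparable.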
-
A Radiomics-Incorporated Deep Ensemble Learning Model for Multi-Parametric MRI-based Glioma Segmentation
Authors:
Yang Chen,
Zhenyu Yang,
Jingtong Zhao,
Justus Adamson,
Yang Sheng,
Fang-Fang Yin,
Chunhao Wang
Abstract:
We developed a deep ensemble learning model with a radiomics spatial encoding execution for improved glioma segmentation accuracy using multi-parametric MRI (mp-MRI). This model was developed using 369 glioma patients with a 4-modality mp-MRI protocol: T1, contrast-enhanced T1 (T1-Ce), T2, and FLAIR. In each modality volume, a 3D sliding kernel was implemented across the brain to capture image heterogeneity: fifty-six radiomic features were extracted within the kernel, resulting in a 4th-order tensor. Each radiomic feature can then be encoded as a 3D image volume, namely a radiomic feature map (RFM). PCA was employed for data dimension reduction and the first 4 PCs were selected. Four deep neural networks following the U-Net architecture were trained as sub-models to segment a region-of-interest (ROI): each sub-model utilizes the mp-MRI and 1 of the 4 PCs as a 5-channel input for a 2D execution. The 4 softmax probability results given by the U-Net ensemble were superimposed and binarized by Otsu's method as the segmentation result. Three ensemble models were trained to segment enhancing tumor (ET), tumor core (TC), and whole tumor (WT). The adopted radiomics spatial encoding execution enriches the image heterogeneity information that leads to the successful demonstration of the proposed deep ensemble model, which offers a new tool for mp-MRI based medical image segmentation.
Submitted 18 March, 2023;
originally announced March 2023.
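The fusion step described in the abstract above (superimposing the four sub-model softmax outputs and binarizing the result with Otsu's method) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `otsu_threshold` and `fuse_ensemble`, the 256-bin histogram, and the choice to superimpose by plain summation are all assumptions made for this example.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Return the histogram bin edge that maximizes between-class variance.

    Hypothetical helper: a plain NumPy version of Otsu's method.
    """
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_lo = np.cumsum(hist).astype(float)       # samples in bins 0..k
    weight_hi = weight_lo[-1] - weight_lo           # samples in bins k+1..end
    cum_sum = np.cumsum(hist * centers)
    valid = (weight_lo > 0) & (weight_hi > 0)
    mean_lo = cum_sum / np.maximum(weight_lo, 1.0)
    mean_hi = (cum_sum[-1] - cum_sum) / np.maximum(weight_hi, 1.0)
    between = np.where(valid, weight_lo * weight_hi * (mean_lo - mean_hi) ** 2, -1.0)
    k = int(np.argmax(between))
    return edges[k + 1]                             # upper edge of the low class


def fuse_ensemble(prob_maps):
    """Sum per-model probability maps, then binarize with Otsu's threshold.

    prob_maps: array of shape (n_models, H, W) with values in [0, 1].
    Summation as the 'superimposing' step is an assumption of this sketch.
    """
    summed = np.sum(np.asarray(prob_maps, dtype=float), axis=0)
    return summed >= otsu_threshold(summed.ravel())


# Toy example: four sub-models agree on a 2x2 foreground patch in a 4x4 image.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
fg = [0.90, 0.80, 0.95, 0.85]   # per-model foreground probabilities
bg = [0.05, 0.10, 0.15, 0.10]   # per-model background probabilities
maps = np.stack([np.where(truth, f, b) for f, b in zip(fg, bg)])
mask = fuse_ensemble(maps)
```

Otsu's method picks the threshold from the data, so the same fusion code applies whether the maps are summed or averaged; only the scale of the threshold changes.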
-
Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering
Authors:
Zhijian Yang,
Junhao Wen,
Ahmed Abdulkadir,
Yuhan Cui,
Guray Erus,
Elizabeth Mamourian,
Randa Melhem,
Dhivya Srinivasan,
Sindhuja T. Govindarajan,
Jiong Chen,
Mohamad Habes,
Colin L. Masters,
Paul Maruff,
Jurgen Fripp,
Luigi Ferrucci,
Marilyn S. Albert,
Sterling C. Johnson,
John C. Morris,
Pamela LaMontagne,
Daniel S. Marcus,
Tammie L. S. Benzinger,
David A. Wolk,
Li Shen,
Jingxuan Bao,
Susan M. Resnick
, et al. (3 additional authors not shown)
Abstract:
Disease heterogeneity has been a critical challenge for precision diagnosis and treatment, especially in neurologic and neuropsychiatric diseases. Many diseases can display multiple distinct brain phenotypes across individuals, potentially reflecting disease subtypes that can be captured using MRI and machine learning methods. However, biological interpretability and treatment relevance are limited if the derived subtypes are not associated with genetic drivers or susceptibility factors. Herein, we describe Gene-SGAN - a multi-view, weakly-supervised deep clustering method - which dissects disease heterogeneity by jointly considering phenotypic and genetic data, thereby conferring genetic correlations to the disease subtypes and associated endophenotypic signatures. We first validate the generalizability, interpretability, and robustness of Gene-SGAN in semi-synthetic experiments. We then demonstrate its application to real multi-site datasets from 28,858 individuals, deriving subtypes of Alzheimer's disease and brain endophenotypes associated with hypertension, from MRI and SNP data. Derived brain phenotypes displayed significant differences in neuroanatomical patterns, genetic determinants, biological and clinical biomarkers, indicating potentially distinct underlying neuropathologic processes, genetic drivers, and susceptibility factors. Overall, Gene-SGAN is broadly applicable to disease subtyping and endophenotype discovery, and is herein tested on disease-related, genetically-driven neuroimaging phenotypes.
Submitted 25 January, 2023;
originally announced January 2023.
-
An open unified deep graph learning framework for discovering drug leads
Authors:
Yueming Yin,
Haifeng Hu,
Zhen Yang,
Jitao Yang,
Chun Ye,
Jiansheng Wu,
Wilson Wen Bin Goh
Abstract:
Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.
Submitted 20 January, 2023; v1 submitted 5 December, 2022;
originally announced January 2023.
-
A Neural Active Inference Model of Perceptual-Motor Learning
Authors:
Zhizhuo Yang,
Gabriel J. Diaz,
Brett R. Fajen,
Reynold Bailey,
Alexander Ororbia
Abstract:
The active inference framework (AIF) is a promising new computational framework grounded in contemporary neuroscience that can produce human-like behavior through reward-based learning. In this study, we test the ability for the AIF to capture the role of anticipation in the visual guidance of action in humans through the systematic investigation of a visual-motor task that has been well-explored -- that of intercepting a target moving over a ground plane. Previous research demonstrated that humans performing this task resorted to anticipatory changes in speed intended to compensate for semi-predictable changes in target speed later in the approach. To capture this behavior, our proposed "neural" AIF agent uses artificial neural networks to select actions on the basis of a very short term prediction of the information about the task environment that these actions would reveal along with a long-term estimate of the resulting cumulative expected free energy. Systematic variation revealed that anticipatory behavior emerged only when required by limitations on the agent's movement capabilities, and only when the agent was able to estimate accumulated free energy over sufficiently long durations into the future. In addition, we present a novel formulation of the prior function that maps a multi-dimensional world-state to a uni-dimensional distribution of free-energy. Together, these results demonstrate the use of AIF as a plausible model of anticipatory visually guided behavior in humans.
Submitted 16 November, 2022;
originally announced November 2022.
-
Biologically Plausible Variational Policy Gradient with Spiking Recurrent Winner-Take-All Networks
Authors:
Zhile Yang,
Shangqi Guo,
Ying Fang,
Jian K. Liu
Abstract:
One stream of reinforcement learning research is exploring biologically plausible models and algorithms to simulate biological intelligence and fit neuromorphic hardware. Among them, reward-modulated spike-timing-dependent plasticity (R-STDP) is a recent branch with good potential in energy efficiency. However, current R-STDP methods rely on heuristic designs of local learning rules, thus requiring task-specific expert knowledge. In this paper, we consider a spiking recurrent winner-take-all network, and propose a new R-STDP method, spiking variational policy gradient (SVPG), whose local learning rules are derived from the global policy gradient and thus eliminate the need for heuristic designs. In experiments of MNIST classification and Gym InvertedPendulum, our SVPG achieves good training performance, and also presents better robustness to various kinds of noises than conventional methods.
Submitted 21 October, 2022;
originally announced October 2022.
-
Quantifying U-Net Uncertainty in Multi-Parametric MRI-based Glioma Segmentation by Spherical Image Projection
Authors:
Zhenyu Yang,
Kyle Lafata,
Eugene Vaios,
Zongsheng Hu,
Trey Mullikin,
Fang-Fang Yin,
Chunhao Wang
Abstract:
The projection of planar MRI data onto a spherical surface is equivalent to a nonlinear image transformation that retains global anatomical information. By incorporating this image transformation process in our proposed spherical projection-based U-Net (SPU-Net) segmentation model design, multiple independent segmentation predictions can be obtained from a single MRI. The final segmentation is the average of all available results, and the variation can be visualized as a pixel-wise uncertainty map. An uncertainty score was introduced to evaluate and compare the performance of uncertainty measurements. The proposed SPU-Net model was implemented on the basis of 369 glioma patients with MP-MRI scans (T1, T1-Ce, T2, and FLAIR). Three SPU-Net models were trained to segment enhancing tumor (ET), tumor core (TC), and whole tumor (WT), respectively. The SPU-Net model was compared with (1) the classic U-Net model with test-time augmentation (TTA) and (2) linear scaling-based U-Net (LSU-Net) segmentation models in terms of both segmentation accuracy (Dice coefficient, sensitivity, specificity, and accuracy) and segmentation uncertainty (uncertainty map and uncertainty score). The developed SPU-Net model successfully achieved low uncertainty for correct segmentation predictions (e.g., tumor interior or healthy tissue interior) and high uncertainty for incorrect results (e.g., tumor boundaries). This model could allow the identification of missed tumor targets or segmentation errors in U-Net. Quantitatively, the SPU-Net model achieved the highest uncertainty scores for three segmentation targets (ET/TC/WT): 0.826/0.848/0.936, compared to 0.784/0.643/0.872 using the U-Net with TTA and 0.743/0.702/0.876 with the LSU-Net (scaling factor = 2). The SPU-Net also achieved statistically significantly higher Dice coefficients, underscoring the improved segmentation accuracy.
Submitted 12 August, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
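The aggregation described in the abstract above (averaging several independent predictions of the same image and visualizing their variation as a pixel-wise uncertainty map) can be sketched as follows. This is an illustration under stated assumptions, not the SPU-Net code: the name `aggregate_predictions`, the 0.5 decision threshold, and the use of the per-pixel standard deviation as the measure of variation are all choices made for this example, and the abstract's uncertainty score is not reproduced here.

```python
import numpy as np


def aggregate_predictions(pred_stack, threshold=0.5):
    """Combine repeated predictions into a segmentation and an uncertainty map.

    pred_stack: array of shape (n_predictions, H, W) with values in [0, 1].
    Returns (segmentation, uncertainty_map); the threshold and the use of
    the standard deviation are assumptions of this sketch.
    """
    pred_stack = np.asarray(pred_stack, dtype=float)
    mean_map = pred_stack.mean(axis=0)          # average of all predictions
    uncertainty_map = pred_stack.std(axis=0)    # per-pixel disagreement
    segmentation = mean_map >= threshold
    return segmentation, uncertainty_map


# Toy example: three predictions of a 2x2 image.
preds = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
])
seg, unc = aggregate_predictions(preds)
```

In the toy example the top row is predicted identically every time, so its uncertainty is zero, while the bottom row receives a non-zero value because one prediction disagrees with the other two.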
-
Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study
Authors:
Jianyuan Deng,
Zhibo Yang,
Hehe Wang,
Iwao Ojima,
Dimitris Samaras,
Fusheng Wang
Abstract:
Artificial intelligence (AI) has been widely applied in drug discovery, with molecular property prediction as a major task. Despite booming techniques in molecular representation learning, key elements underlying molecular property prediction remain largely unexplored, which impedes further advancements in this field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioid-related datasets and two additional activity datasets from the literature. To investigate the predictive power in low-data and high-data space, a series of descriptor datasets of varying sizes is also assembled to evaluate the models. In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4,200 models on SMILES sequences and 8,400 models on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction in most datasets. Besides, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore potential causes of why representation learning models can fail and show that dataset size is essential for representation learning models to excel.
Submitted 2 September, 2023; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization
Authors:
Zheng Chen,
Lingwei Zhu,
Ziwei Yang,
Takashi Matsubara
Abstract:
Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial, and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and have complicated dependence, thereby posing a serious challenge to existing subtyping models for outputting sensible clustering. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem-agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels, and, by further medical analysis, this refinement is proven to have a high correlation with cancer survival rates.
Submitted 14 November, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution
Authors:
Lixue Cheng,
Ziyi Yang,
Changyu Hsieh,
Benben Liao,
Shengyu Zhang
Abstract:
Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential of these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and to inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which employs a combination of a novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental cost and time cost of directed evolution, and expect that it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
Submitted 1 May, 2024; v1 submitted 19 May, 2022;
originally announced May 2022.
-
Multi-Tier Platform for Cognizing Massive Electroencephalogram
Authors:
Zheng Chen,
Lingwei Zhu,
Ziwei Yang,
Renyuan Zhang
Abstract:
An end-to-end platform assembling multiple tiers is built for precisely cognizing brain activities. Being fed massive electroencephalogram (EEG) data, the time-frequency spectrograms are conventionally projected into the episode-wise feature matrices (seen as tier-1). A spiking neural network (SNN) based tier is designed to distill the principal information in terms of spike-streams from the rare features, which maintains the temporal implication in the nature of EEGs. The proposed tier-3 transposes time- and space-domain of spike patterns from the SNN; and feeds the transposed pattern-matrices into an artificial neural network (ANN, Transformer specifically) known as tier-4, where a special spanning topology is proposed to match the two-dimensional input form. In this manner, cognition such as classification is conducted with high accuracy. For proof-of-concept, the sleep stage scoring problem is demonstrated by introducing multiple EEG datasets with the largest comprising 42,560 hours recorded from 5,793 subjects. From experiment results, our platform achieves the general cognition overall accuracy of 87% by leveraging EEG alone, which is 2% higher than the state-of-the-art. Moreover, our developed multi-tier methodology offers visible and graphical interpretations of the temporal characteristics of EEG by identifying the critical episodes, which is demanded in neurodynamics but hardly appears in conventional cognition scenarios.
Submitted 20 April, 2022;
originally announced April 2022.
-
Cancer Subtyping via Embedded Unsupervised Learning on Transcriptomics Data
Authors:
Ziwei Yang,
Lingwei Zhu,
Zheng Chen,
Ming Huang,
Naoaki Ono,
MD Altaf-Ul-Amin,
Shigehiko Kanaya
Abstract:
Cancer is one of the deadliest diseases worldwide. Accurate diagnosis and classification of cancer subtypes are indispensable for effective clinical treatment. Promising results on automatic cancer subtyping systems have been published recently with the emergence of various deep learning methods. However, such automatic systems often overfit the data due to the high dimensionality and scarcity. In this paper, we propose to investigate automatic subtyping from an unsupervised learning perspective by directly constructing the underlying data distribution itself, hence sufficient data can be generated to alleviate the issue of overfitting. Specifically, we bypass the strong Gaussianity assumption that typically exists but fails in the unsupervised learning subtyping literature due to small-sized samples by vector quantization. Our proposed method better captures the latent space features and models the cancer subtype manifestation on a molecular basis, as demonstrated by the extensive experimental results.
Submitted 2 April, 2022;
originally announced April 2022.
-
Artificial Intelligence Enables Real-Time and Intuitive Control of Prostheses via Nerve Interface
Authors:
Diu Khue Luu,
Anh Tuan Nguyen,
Ming Jiang,
Markus W. Drealan,
Jian Xu,
Tong Wu,
Wing-kin Tam,
Wenfeng Zhao,
Brian Z. H. Lim,
Cynthia K. Overstreet,
Qi Zhao,
Jonathan Cheng,
Edward W. Keefer,
Zhi Yang
Abstract:
Objective: The next generation prosthetic hand that moves and feels like a real hand requires a robust neural interconnection between human minds and machines. Methods: Here we present a neuroprosthetic system to demonstrate that principle by employing an artificial intelligence (AI) agent to translate the amputee's movement intent through a peripheral nerve interface. The AI agent is designed based on the recurrent neural network (RNN) and could simultaneously decode six degrees of freedom (DOF) from multichannel nerve data in real-time. The decoder's performance is characterized in motor decoding experiments with three human amputees. Results: First, we show the AI agent enables amputees to intuitively control a prosthetic hand with individual finger and wrist movements with up to 97-98% accuracy. Second, we demonstrate the AI agent's real-time performance by measuring the reaction time and information throughput in a hand gesture matching task. Third, we investigate the AI agent's long-term use and show the decoder's robust predictive performance over a 16-month implant duration. Conclusion & significance: Our study demonstrates the potential of AI-enabled nerve technology, underlying the next generation of dexterous and intuitive prosthetic hands.
Submitted 16 March, 2022;
originally announced March 2022.
-
A Neural Ordinary Differential Equation Model for Visualizing Deep Neural Network Behaviors in Multi-Parametric MRI based Glioma Segmentation
Authors:
Zhenyu Yang,
Zongsheng Hu,
Hangjie Ji,
Kyle Lafata,
Scott Floyd,
Fang-Fang Yin,
Chunhao Wang
Abstract:
Purpose: To develop a neural ordinary differential equation (ODE) model for visualizing deep neural network (DNN) behavior during multi-parametric MRI (mp-MRI) based glioma segmentation as a method to enhance deep learning explainability. Methods: By hypothesizing that deep feature extraction can be modeled as a spatiotemporally continuous process, we designed a novel deep learning model, neural ODE, in which deep feature extraction was governed by an ODE without explicit expression. The dynamics of 1) MR images after interactions with DNN and 2) segmentation formation can be visualized after solving ODE. An accumulative contribution curve (ACC) was designed to quantitatively evaluate the utilization of each MRI by DNN towards the final segmentation results. The proposed neural ODE model was demonstrated using 369 glioma patients with a 4-modality mp-MRI protocol: T1, contrast-enhanced T1 (T1-Ce), T2, and FLAIR. Three neural ODE models were trained to segment enhancing tumor (ET), tumor core (TC), and whole tumor (WT). The key MR modalities with significant utilization by DNN were identified based on ACC analysis. Segmentation results by DNN using only the key MR modalities were compared to the ones using all 4 MR modalities. Results: All neural ODE models successfully illustrated image dynamics as expected. ACC analysis identified T1-Ce as the only key modality in ET and TC segmentations, while both FLAIR and T2 were key modalities in WT segmentation. Compared to the U-Net results using all 4 MR modalities, Dice coefficient of ET (0.784->0.775), TC (0.760->0.758), and WT (0.841->0.837) using the key modalities only had minimal differences without significance. Conclusion: The neural ODE model offers a new tool for optimizing the deep learning model inputs with enhanced explainability. The presented methodology can be generalized to other medical image-related deep learning applications.
Submitted 23 March, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
SPLDExtraTrees: Robust machine learning approach for predicting kinase inhibitor resistance
Authors:
Ziyi Yang,
Zhaofeng Ye,
Yijia Xiao,
Changyu Hsieh,
Shengyu Zhang
Abstract:
Drug resistance is a major threat to global health and a significant concern throughout the clinical treatment of diseases and drug development. Mutations in proteins that are related to drug binding are a common cause of adaptive drug resistance. Therefore, quantitative estimates of how mutations affect the interaction between a drug and its target protein are of vital significance for drug development and clinical practice. Computational methods that rely on molecular dynamics simulations, Rosetta protocols, as well as machine learning methods have been proven to be capable of predicting ligand affinity changes upon protein mutation. However, severely limited sample sizes and heavy noise induce overfitting and generalization issues that have impeded wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. Specifically, the proposed method ranks training data following a specific scheme that starts with easy-to-learn samples and gradually incorporates harder and diverse samples into the training, and then iterates between sample weight recalculations and model updates. In addition, we calculate additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for this data-limited predictive task. The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios, and it achieves predictive accuracy comparable to that of molecular dynamics and Rosetta methods at much lower computational cost.
Submitted 14 January, 2022; v1 submitted 15 November, 2021;
originally announced November 2021.
-
Multidimensional representations in late-life depression: convergence in neuroimaging, cognition, clinical symptomatology and genetics
Authors:
Junhao Wen,
Cynthia H. Y. Fu,
Duygu Tosun,
Yogasudha Veturi,
Zhijian Yang,
Ahmed Abdulkadir,
Elizabeth Mamourian,
Dhivya Srinivasan,
Jingxuan Bao,
Guray Erus,
Haochang Shou,
Mohamad Habes,
Jimit Doshi,
Erdem Varol,
Scott R Mackin,
Aristeidis Sotiras,
Yong Fan,
Andrew J. Saykin,
Yvette I. Sheline,
Li Shen,
Marylyn D. Ritchie,
David A. Wolk,
Marilyn Albert,
Susan M. Resnick,
Christos Davatzikos
Abstract:
Late-life depression (LLD) is characterized by considerable heterogeneity in clinical manifestation. Unraveling such heterogeneity would aid in elucidating etiological mechanisms and pave the road to precision and individualized medicine. We sought to delineate, cross-sectionally and longitudinally, disease-related heterogeneity in LLD linked to neuroanatomy, cognitive functioning, clinical symptomatology, and genetic profiles. Multimodal data from a multicentre sample (N=996) were analyzed. A semi-supervised clustering method (HYDRA) was applied to regional grey matter (GM) brain volumes to derive dimensional representations. Two dimensions were identified, which accounted for the LLD-related heterogeneity in voxel-wise GM maps, white matter (WM) fractional anisotropy (FA), neurocognitive functioning, clinical phenotype, and genetics. Dimension one (Dim1) demonstrated relatively preserved brain anatomy without WM disruptions relative to healthy controls. In contrast, dimension two (Dim2) showed widespread brain atrophy and WM integrity disruptions, along with cognitive impairment and higher depression severity. Moreover, one de novo independent genetic variant (rs13120336) was significantly associated with Dim1 but not with Dim2. Notably, the two dimensions demonstrated significant SNP-based heritability of 18-27% within the general population (N=12,518 in UKBB). Lastly, in a subset of individuals having longitudinal measurements, Dim2 demonstrated a more rapid longitudinal decrease in GM and brain age, and was more likely to progress to Alzheimer's disease, compared to Dim1 (N=1,413 participants and 7,225 scans from ADNI, BLSA, and BIOCARD datasets).
Submitted 25 October, 2021; v1 submitted 20 October, 2021;
originally announced October 2021.
-
Disentangling brain heterogeneity via semi-supervised deep-learning and MRI: dimensional representations of Alzheimer's Disease
Authors:
Zhijian Yang,
Ilya M. Nasrallah,
Haochang Shou,
Junhao Wen,
Jimit Doshi,
Mohamad Habes,
Guray Erus,
Ahmed Abdulkadir,
Susan M. Resnick,
David Wolk,
Christos Davatzikos
Abstract:
Heterogeneity of brain diseases is a challenge for precision diagnosis/prognosis. We describe and validate Smile-GAN (SeMI-supervised cLustEring-Generative Adversarial Network), a novel semi-supervised deep-clustering method, which dissects neuroanatomical heterogeneity, enabling identification of disease subtypes via their imaging signatures relative to controls. When applied to MRIs (2 studies; 2,832 participants; 8,146 scans) including cognitively normal individuals and those with cognitive impairment and dementia, Smile-GAN identified 4 neurodegenerative patterns/axes: P1, normal anatomy and highest cognitive performance; P2, mild/diffuse atrophy and more prominent executive dysfunction; P3, focal medial temporal atrophy and relatively greater memory impairment; P4, advanced neurodegeneration. Further application to longitudinal data revealed two distinct progression pathways: P1$\rightarrow$P2$\rightarrow$P4 and P1$\rightarrow$P3$\rightarrow$P4. Baseline expression of these patterns predicted the pathway and rate of future neurodegeneration. Pattern expression offered better yet complementary performance in predicting clinical progression, compared to amyloid/tau. These deep-learning derived biomarkers offer promise for precision diagnostics and targeted clinical trial recruitment.
Submitted 24 February, 2021;
originally announced February 2021.
-
EPIHC: Improving Enhancer-Promoter Interaction Prediction by using Hybrid features and Communicative learning
Authors:
Shuai Liu,
Xinran Xu,
Zhihao Yang,
Xiaohan Zhao,
Wen Zhang
Abstract:
Enhancer-promoter interactions (EPIs) regulate the expression of specific genes in cells, and EPIs are important for understanding gene regulation, cell differentiation and disease mechanisms. EPI identification through wet-lab experiments is costly and time-consuming, and computational methods are in demand. In this paper, we propose EPIHC, a deep neural network-based method that uses sequence-derived features and genomic features for EPI prediction. EPIHC extracts features from enhancer and promoter sequences separately using convolutional neural networks (CNNs), and then uses a communicative learning module to capture the communicative information between enhancer and promoter sequences. EPIHC also takes the genomic features of enhancers and promoters into account. Finally, EPIHC combines sequence-derived features and genomic features to predict EPIs. The computational experiments show that EPIHC outperforms the existing state-of-the-art EPI prediction methods on the benchmark datasets and chromosome-split datasets, and the study reveals that the communicative learning module can bring explicit information about EPIs, which is ignored by the CNNs. Moreover, we consider two strategies to improve the performance of EPIHC in cross-cell-line prediction, and experimental results show that EPIHC constructed on training cell lines exhibits improved performance on the other cell lines.
Submitted 30 December, 2020;
originally announced December 2020.
-
Molcontroller: a VMD Graphical User Interface for Manipulating Molecules
Authors:
ChenChen Wu,
Shengtang Liu,
Shitong Zhang,
Zaixing Yang
Abstract:
Visual Molecular Dynamics (VMD) is one of the most widely used molecular graphics programs in the theoretical simulation community. So far, however, it still lacks a graphical user interface (GUI) for molecular manipulations during modeling tasks. For instance, translating or rotating selected molecules or parts of a molecule can currently only be achieved using Tcl scripts. Here, we use Tcl scripting to develop a user-friendly GUI for VMD, named Molcontroller, which allows users to quickly and conveniently perform various molecular manipulations. This GUI might be helpful for improving the modeling efficiency of VMD users.
Submitted 2 July, 2020;
originally announced July 2020.
-
Smile-GANs: Semi-supervised clustering via GANs for dissecting brain disease heterogeneity from medical images
Authors:
Zhijian Yang,
Junhao Wen,
Christos Davatzikos
Abstract:
Machine learning methods applied to complex biomedical data have enabled the construction of disease signatures of diagnostic/prognostic value. However, less attention has been given to understanding disease heterogeneity. Semi-supervised clustering methods can address this problem by estimating multiple transformations from a (e.g. healthy) control (CN) group to a patient (PT) group, seeking to capture the heterogeneity of underlying pathologic processes. Herein, we propose a novel method, Smile-GANs (SeMi-supervIsed cLustEring via GANs), for semi-supervised clustering, and apply it to brain MRI scans. Smile-GANs first learns multiple distinct mappings by generating PT from CN, with each mapping characterizing one relatively distinct pathological pattern. Moreover, a clustering model is trained interactively with mapping functions to assign PT into corresponding subtype memberships. Using relaxed assumptions on PT/CN data distribution and imposing mapping non-linearity, Smile-GANs captures heterogeneous differences in distribution between the CN and PT domains. We first validate Smile-GANs using simulated data, and subsequently on real data, by demonstrating its potential in characterizing heterogeneity in Alzheimer's Disease (AD) and its prodromal phases. The model was first trained using baseline MRIs from the ADNI2 database and then applied to longitudinal data from ADNI1 and BLSA. Four robust subtypes with distinct neuroanatomical patterns were discovered: 1) normal brain, 2) diffuse atrophy atypical of AD, 3) focal medial temporal lobe atrophy, 4) typical-AD. Further longitudinal analyses discover two distinct progressive pathways from prodromal to full AD: i) subtypes 1 - 2 - 4, and ii) subtypes 1 - 3 - 4. Although demonstrated on an important biomedical problem, Smile-GANs is general and can find application in many biomedical and other domains.
Submitted 26 June, 2020;
originally announced June 2020.
-
A machine learning approach to using Quality-of-Life patient scores in guiding prostate radiation therapy dosing
Authors:
Zhijian Yang,
Daniel Olszewski,
Chujun He,
Giulia Pintea,
Jun Lian,
Tom Chou,
Ronald Chen,
Blerta Shtylla
Abstract:
Thanks to advancements in diagnosis and treatment, prostate cancer patients have high long-term survival rates. Currently, an important goal is to preserve quality-of-life during and after treatment. The relationship between the radiation a patient receives and the subsequent side effects he experiences is complex and difficult to model or predict. Here, we use machine learning algorithms and statistical models to explore the connection between radiation treatment and post-treatment gastro-urinary function. Since only a limited number of patient datasets are currently available, we used image flipping and curvature-based interpolation methods to generate more data in order to leverage transfer learning. Using interpolated and augmented data, we trained a convolutional autoencoder network to obtain near-optimal starting points for the weights. A convolutional neural network then analyzed the relationship between patient-reported quality-of-life and radiation. We also used analysis of variance and logistic regression to explore organ sensitivity to radiation and develop dosage thresholds for each organ region. Our findings show no connection between the bladder and quality-of-life scores. However, we found a connection between radiation applied to posterior and anterior rectal regions and changes in quality-of-life. Finally, we estimated radiation therapy dosage thresholds for each organ. Our analysis connects machine learning methods with organ sensitivity, thus providing a framework for informing cancer patient care using patient-reported quality-of-life metrics.
Submitted 21 May, 2020;
originally announced May 2020.
-
Using single-cell entropy to describe the dynamics of reprogramming and differentiation of induced pluripotent stem cells
Authors:
Yusong Ye,
Zhuoqin Yang,
Jinzhi Lei
Abstract:
Induced pluripotent stem cells (iPSCs) provide a great model to study the process of reprogramming and differentiation of stem cells. Single-cell RNA sequencing (scRNA-seq) enables us to investigate the reprogramming process at the single-cell level. Here, we introduce single-cell entropy (scEntropy) as a macroscopic variable to quantify the cellular transcriptome from scRNA-seq data during reprogramming and differentiation of iPSCs. scEntropy measures the relative order parameter of genomic transcriptions at the single-cell level during the cell fate change process; it increases during differentiation and decreases upon reprogramming. Moreover, based on the scEntropy dynamics, we construct a phenomenological stochastic differential equation model and the corresponding Fokker-Planck equation for cell state transitions during iPSC differentiation, which provide insights for inferring cell fate changes and stem cell differentiation. This study is the first to introduce the concept of scEntropy to the biological processes of iPSCs, and suggests that scEntropy can provide a suitable quantity to describe cell fate transitions in differentiation and reprogramming of stem cells.
Submitted 16 April, 2020;
originally announced April 2020.