Search | arXiv e-print repository

MODA: A Unified 3D Diffusion Framework for Multi-Task Target-Aware Molecular Generation

Authors: Dong Xu, Zhangfan Yang, Sisi Yuan, Jenna Xinyi Yao, Jiangqiang Li, Junkai Ji

Abstract: Three-dimensional molecular generators based on diffusion models can now reach near-crystallographic accuracy, yet they remain fragmented across tasks. SMILES-only inputs, two-stage pretrain-finetune pipelines, and one-task-one-model practices hinder stereochemical fidelity, task alignment, and zero-shot transfer. We introduce MODA, a diffusion framework that unifies fragment growing, linker design, scaffold hopping, and side-chain decoration with a Bayesian mask scheduler. During training, a contiguous spatial fragment is masked and then denoised in one pass, enabling the model to learn shared geometric and chemical priors across tasks. Multi-task training yields a universal backbone that surpasses six diffusion baselines and three training paradigms on substructure, chemical property, interaction, and geometry. Model-C reduces ligand-protein clashes and substructure divergences while maintaining Lipinski compliance, whereas Model-B preserves similarity but trails in novelty and binding affinity. Zero-shot de novo design and lead-optimisation tests confirm stable negative Vina scores and high improvement rates without force-field refinement. These results demonstrate that a single-stage multi-task diffusion routine can replace two-stage workflows for structure-based molecular design.

Submitted 9 July, 2025; originally announced July 2025.

arXiv:2504.00020 [pdf, other]

Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation

Authors: Huan Zhao, Yiming Liu, Jina Yao, Ling Xiong, Zexin Zhou, Zixing Zhang

Abstract: Recent breakthroughs in single-cell technology have opened unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced new obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To address this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two key elements. First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce a Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset, Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.

Submitted 27 March, 2025; originally announced April 2025.

arXiv:2502.15867 [pdf]

Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence

Authors: Yingying Sun, Jun A, Zhiwei Liu, Rui Sun, Liujia Qian, Samuel H. Payne, Wout Bittremieux, Markus Ralser, Chen Li, Yi Chen, Zhen Dong, Yasset Perez-Riverol, Asif Khan, Chris Sander, Ruedi Aebersold, Juan Antonio Vizcaíno, Jonathan R Krieger, Jianhua Yao, Han Wen, Linfeng Zhang, Yunping Zhu, Yue Xuan, Benjamin Boyang Sun, Liang Qiao, Henning Hermjakob, et al. (37 additional authors not shown)

Abstract: Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.

Submitted 21 February, 2025; originally announced February 2025.

Comments: 28 pages, 2 figures, perspective in AI proteomics

arXiv:2502.14934 [pdf, other]

Fast and Accurate Blind Flexible Docking

Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han

Abstract: Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.

Submitted 20 February, 2025; originally announced February 2025.

Comments: 25 pages, Accepted by ICLR 2025

arXiv:2404.16880 [pdf, other]

Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

Authors: Yikun Zhang, Geyan Ye, Chaohao Yuan, Bo Han, Long-Kai Huang, Jianhua Yao, Wei Liu, Yu Rong

Abstract: Molecule-and-text cross-modal representation learning has emerged as a promising direction for enhancing the quality of molecular representation, thereby improving performance in various scientific fields. However, most approaches employ a global alignment approach to learn the knowledge from different modalities that may fail to capture fine-grained information, such as molecule-and-text fragments and stereoisomeric nuances, which is crucial for downstream tasks. Furthermore, it is incapable of modeling such information using a similar global alignment strategy due to the lack of annotations about the fine-grained fragments in the existing dataset. In this paper, we propose Atomas, a hierarchical molecular representation learning framework that jointly learns representations from SMILES strings and text. We design a Hierarchical Adaptive Alignment model to automatically learn the fine-grained fragment correspondence between two modalities and align these representations at three semantic levels. Atomas's end-to-end training framework supports understanding and generating molecules, enabling a wider range of downstream tasks. Atomas achieves superior performance across 12 tasks on 11 datasets, outperforming 11 baseline models, thus highlighting the effectiveness and versatility of our method. Scaling experiments further demonstrate Atomas's robustness and scalability. Moreover, visualization and qualitative analysis, validated by human experts, confirm the chemical relevance of our approach. Code is released at https://github.com/yikunpku/Atomas.

Submitted 3 March, 2025; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.16866 [pdf, other]

Annotation-guided Protein Design with Multi-Level Domain Alignment

Authors: Chaohao Yuan, Songyou Li, Geyan Ye, Yikun Zhang, Long-Kai Huang, Wenbing Huang, Wei Liu, Jianhua Yao, Yu Rong

Abstract: The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions. Current models generate proteins using structural and evolutionary guidance, which provide only indirect conditions concerning functions and properties. However, textual annotations of proteins, especially the annotations for protein domains, which directly describe the protein's high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in the context of protein design tasks. In this paper, we propose Protein-Annotation Alignment Generation, PAAG, a multi-modality protein design framework that integrates the textual annotations extracted from protein databases for controllable generation in sequence space. Specifically, within a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations. Our experimental results underscore the superiority of the aligned protein representations from PAAG over 7 prediction tasks. Furthermore, PAAG demonstrates a significant increase in generation success rate (24.7% vs 4.7% in zinc finger, and 54.3% vs 22.0% in the immunoglobulin domain) in comparison to the existing model. We anticipate that PAAG will broaden the horizons of protein design by leveraging the knowledge shared between textual annotations and proteins.

Submitted 12 December, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

Comments: Accepted by KDD 2025

arXiv:2402.16894 [pdf, other]

Topological Analysis of Mouse Brain Vasculature via 3D Light-sheet Microscopy Images

Authors: Jiachen Yao, Nina Hagemann, Qiaojie Xiong, Jianxu Chen, Dirk M. Hermann, Chao Chen

Abstract: Vascular networks play a crucial role in understanding brain functionalities. Brain integrity and function, neuronal activity and plasticity, which are crucial for learning, are actively modulated by their local environments, specifically vascular networks. With recent developments in high-resolution 3D light-sheet microscopy imaging together with tissue processing techniques, it becomes feasible to obtain and examine large-scale brain vasculature in mice. To establish a structural foundation for functional study, however, we need advanced image analysis and structural modeling methods. Existing works use geometric features such as thickness, tortuosity, etc. However, geometric features cannot fully capture structural characteristics such as the richness of branches, connectivity, etc. In this paper, we study the morphology of brain vasculature through a topological lens. We extract topological features based on the theory of topological data analysis. Comparing these robust and multi-scale topological structural features across different brain anatomical structures and between normal and obese populations sheds light on their promising future in studying neurological diseases.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2311.01276 [pdf, other]

Neural Atoms: Propagating Long-range Interaction in Molecular Graphs through Efficient Communication Channel

Authors: Xuan Li, Zhanke Zhou, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han

Abstract: Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs mainly excel in leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method to abstract the collective information of atomic groups into a few $\textit{Neural Atoms}$ by implicitly projecting the atoms of a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection to the traditional LRI calculation method, Ewald Summation. The Neural Atom can enhance GNNs to capture LRI by approximating the potential LRI of the molecule. We conduct extensive experiments on four long-range graph benchmarks, covering graph-level and link-level tasks on molecular graphs. We achieve up to a 27.32% and 38.27% improvement in the 2D and 3D scenarios, respectively. Empirically, our method can be equipped with an arbitrary GNN to help capture LRI. Code and datasets are publicly available at https://github.com/tmlr-group/NeuralAtom.

Submitted 31 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2307.05628 [pdf, other]

DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks

Authors: Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, Jianhua Yao

Abstract: Pre-trained large language models demonstrate potential in extracting information from DNA sequences, yet adapting to a variety of tasks and data modalities remains a challenge. To address this, we propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data. Our evaluation of genomic signal and region recognition, mRNA abundance regression, and artificial genomes generation tasks demonstrates DNAGPT's superior performance compared to existing models designed for specific downstream tasks, benefiting from pre-training using the newly designed model structure.

Submitted 30 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:1910.06659 [pdf]

doi 10.1109/TBME.2020.3004548

Ballistocardiogram artifact reduction in simultaneous EEG-fMRI using deep learning

Authors: J. R. McIntosh, J. Yao, Linbi Hong, J. Faller, P. Sajda

Abstract: Objective: The concurrent recording of electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) is a technique that has received much attention due to its potential for combined high temporal and spatial resolution. However, the ballistocardiogram (BCG), a large-amplitude artifact caused by cardiac-induced movement, contaminates the EEG during EEG-fMRI recordings. Removal of BCG in software has generally made use of linear decompositions of the corrupted EEG. This is not ideal as the BCG signal is non-stationary and propagates in a manner which is non-linearly dependent on the electrocardiogram (ECG). In this paper, we present a novel method for BCG artifact suppression using recurrent neural networks (RNNs). Methods: EEG signals were recovered by training RNNs on the nonlinear mappings between ECG and the BCG-corrupted EEG. We evaluated our model's performance against the commonly used Optimal Basis Set (OBS) method at the level of individual subjects, and investigated generalization across subjects. Results: We show that our algorithm can generate larger average power reduction of the BCG at critical frequencies, while simultaneously improving task-relevant EEG-based classification. Conclusion: The presented deep learning architecture can be used to reduce BCG-related artifacts in EEG-fMRI recordings. Significance: We present a deep learning approach that can be used to suppress the BCG artifact in EEG-fMRI without the use of additional hardware. This method may have scope to be combined with current hardware methods, operate in real-time and be used for direct modeling of the BCG.

Submitted 15 October, 2019; originally announced October 2019.

arXiv:1601.07533 [pdf]

doi 10.1109/ISBI.2016.7493477

Osteoporotic and Neoplastic Compression Fracture Classification on Longitudinal CT

Authors: Yinong Wang, Jianhua Yao, Joseph E. Burns, Ronald M. Summers

Abstract: Classification of vertebral compression fractures (VCF) having osteoporotic or neoplastic origin is fundamental to the planning of treatment. We developed a fracture classification system by acquiring quantitative morphologic and bone density determinants of fracture progression through the use of automated measurements from longitudinal studies. A total of 250 CT studies were acquired for the task, each having previously identified VCFs with osteoporosis or neoplasm. Thirty-six features for each identified VCF were computed and classified using a committee of support vector machines. Ten-fold cross validation on 695 identified fractured vertebrae showed classification accuracies of 0.812, 0.665, and 0.820 for the measured, longitudinal, and combined feature sets respectively.

Submitted 27 January, 2016; originally announced January 2016.

Comments: Contributed 4-Page Paper to be presented at the 2016 IEEE International Symposium on Biomedical Imaging (ISBI), April 13-16, 2016, Prague, Czech Republic

Showing 1–11 of 11 results for author: Yao, J