Search | arXiv e-print repository

arXiv:2507.02231 [pdf]

Downregulation of aquaporin 3 promotes hyperosmolarity-induced apoptosis of nucleus pulposus cells through PI3K/Akt/mTOR pathway suppression

Authors: Yuan Sang, Huiqing Zhao, Jiajun Wu, Ting Zhang, Wenbin Xu, Hui Yao, Kaihua Liu, Chang Liu, Junbin Zhang, Ping Li, Depeng Wu, Yichun Xu, Jianying Zhang, Gang Hou

Abstract: Hyperosmolarity is a key contributor to nucleus pulposus cell (NPC) apoptosis during intervertebral disc degeneration (IVDD). Aquaporin 3 (AQP3), a membrane channel protein, regulates cellular osmotic balance by transporting water and osmolytes. Although AQP3 downregulation is associated with disc degeneration, its role in apoptosis under hyperosmotic conditions remains unclear. Here, we demonstrate that hyperosmolarity induces AQP3 depletion, suppresses the PI3K/AKT/mTOR signaling pathway, and promotes mitochondrial dysfunction and ROS accumulation in NPCs. Lentiviral overexpression of AQP3 restores this pathway, attenuates oxidative damage, and reduces apoptosis, preserving disc structure in IVDD rat models. In contrast, pharmacological inhibition of AQP3 exacerbates ECM catabolism and NP tissue loss. Our findings reveal that AQP3 deficiency under hyperosmolarity contributes to NPC apoptosis via suppression of PI3K/AKT/mTOR signaling, potentially creating a pathological cycle of disc degeneration. These results highlight AQP3 as a promising therapeutic target for IVDD.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.23075 [pdf, ps, other]

CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding

Authors: Yuchen Zhou, Jiamin Wu, Zichen Ren, Zhouheng Yao, Weiheng Lu, Kunyu Peng, Qihao Zheng, Chunfeng Song, Wanli Ouyang, Chao Gou

Abstract: Understanding and decoding brain activity from electroencephalography (EEG) signals is a fundamental challenge in neuroscience and AI, with applications in cognition, emotion recognition, diagnosis, and brain-computer interfaces. While recent EEG foundation models advance generalized decoding via unified architectures and large-scale pretraining, they adopt a scale-agnostic dense modeling paradigm inherited from NLP and vision. This design neglects a core property of neural activity: cross-scale spatiotemporal structure. EEG task patterns span a wide range of temporal and spatial scales, from short bursts to slow rhythms, and from localized cortical responses to distributed interactions. Ignoring this diversity leads to suboptimal representations and weak generalization. We propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features from localized temporal windows and anatomical brain regions into compact scale-aware tokens; and (ii) Structured Sparse Attention (SSA), which captures cross-window and cross-region dependencies, enhancing scale diversity while removing spurious correlations. CST and SSA are alternately stacked to progressively integrate multi-scale dependencies. Experiments on 11 EEG tasks across 16 datasets show that CSBrain consistently outperforms task-specific and foundation model baselines. These results establish cross-scale modeling as a key inductive bias and position CSBrain as a robust backbone for future brain-AI research.

Submitted 28 June, 2025; originally announced June 2025.

arXiv:2506.17310 [pdf, ps, other]

PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

Authors: Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen

Abstract: While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other works. Besides, it can be generalized to any model and enhance their long-context performance and interpretability without structural overhauls.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.07553 [pdf, ps, other]

GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

Authors: Jingchao Wang, Haote Yang, Jiang Wu, Yifan He, Xingjian Wei, Yinfan Wang, Chengjin Liu, Lingli Ge, Lijun Wu, Bin Wang, Dahua Lin, Conghui He

Abstract: Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at https://github.com/opendatalab/GTR-CoT.

Submitted 9 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.05443 [pdf]

UniPTMs: The First Unified Multi-type PTM Site Prediction Model via Master-Slave Architecture-Based Multi-Stage Fusion Strategy and Hierarchical Contrastive Loss

Authors: Yiyu Lin, Yan Wang, You Zhou, Xinye Ni, Jiahui Wu, Sen Yang

Abstract: As a core mechanism of epigenetic regulation in eukaryotes, protein post-translational modifications (PTMs) require precise prediction to decipher dynamic life activity networks. To address the limitations of existing deep learning models in cross-modal feature fusion, domain generalization, and architectural optimization, this study proposes UniPTMs: the first unified framework for multi-type PTM prediction. The framework innovatively establishes a "Master-Slave" dual-path collaborative architecture: The master path dynamically integrates high-dimensional representations of protein sequences, structures, and evolutionary information through a Bidirectional Gated Cross-Attention (BGCA) module, while the slave path optimizes feature discrepancies and recalibration between structural and traditional features using a Low-Dimensional Fusion Network (LDFN). Complemented by a Multi-scale Adaptive convolutional Pyramid (MACP) for capturing local feature patterns and a Bidirectional Hierarchical Gated Fusion Network (BHGFN) enabling multi-level feature integration across paths, the framework employs a Hierarchical Dynamic Weighting Fusion (HDWF) mechanism to intelligently aggregate multimodal features. Enhanced by a novel Hierarchical Contrastive loss function for feature consistency optimization, UniPTMs demonstrates significant performance improvements (3.2%-11.4% MCC and 4.2%-14.3% AP increases) over state-of-the-art models across five modification types and transcends the Single-Type Prediction Paradigm. To strike a balance between model complexity and performance, we have also developed a lightweight variant named UniPTMs-mini.

Submitted 5 June, 2025; originally announced June 2025.

arXiv:2505.15453 [pdf, other]

A dynamical memory with only one spiking neuron

Authors: Damien Depannemaecker, Adrien d'Hollande, Jiaming Wu, Marcelo J. Rozenberg

Abstract: Common wisdom indicates that to implement a Dynamical Memory with spiking neurons two ingredients are necessary: recurrence and a neuron population. Here we shall show that the second requirement is not needed. We shall demonstrate that under very general assumptions a single recursive spiking neuron can realize a robust model of a dynamical memory. We demonstrate the implementation of a dynamical memory in both software and hardware. In the former case, we introduce trivial extensions of the popular aQIF and AdEx models. In the latter, we show traces obtained in a circuit model with a recently proposed memristive spiking neuron. We show that the bistability of the theoretical models can be understood in terms of a self-consistent problem that can be represented geometrically. Our minimal dynamical memory model provides the simplest implementation of an important neuro-computational primitive, which can be useful in navigation system models based on purely spiking dynamics. A one-neuron dynamical memory may also provide a natural explanation for the surprising recent observation that the excitation bump in Drosophila's ellipsoidal body is made by just a handful of neurons.

Submitted 21 May, 2025; originally announced May 2025.

Comments: 17 pages, 9 figures

arXiv:2505.08581 [pdf, other]

ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking

Authors: Haofeng Liu, Mingqi Gao, Xuxiao Luo, Ziyue Wang, Guanyi Qin, Junde Wu, Yueming Jin

Abstract: Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at https://github.com/jinlab-imvr/ReSurgSAM2.

Submitted 13 May, 2025; originally announced May 2025.

Comments: Early accepted by MICCAI 2025

arXiv:2505.03121 [pdf]

AutoLoop: a novel autoregressive deep learning method for protein loop prediction with high accuracy

Authors: Tianyue Wang, Xujun Zhang, Langcheng Wang, Odin Zhang, Jike Wang, Ercheng Wang, Jialu Wu, Renling Hu, Jingxuan Ge, Shimeng Li, Qun Su, Jiajun Yu, Chang-Yu Hsieh, Tingjun Hou, Yu Kang

Abstract: Protein structure prediction is a critical and longstanding challenge in biology, garnering widespread interest due to its significance in understanding biological processes. A particular area of focus is the prediction of missing loops in proteins, which are vital in determining protein function and activity. To address this challenge, we propose AutoLoop, a novel computational model designed to automatically generate accurate loop backbone conformations that closely resemble their natural structures. AutoLoop employs a bidirectional training approach while merging atom- and residue-level embedding, thus improving robustness and precision. We compared AutoLoop with twelve established methods, including FREAD, NGK, AlphaFold2, and AlphaFold3. AutoLoop consistently outperforms other methods, achieving a median RMSD of 1.12 Angstrom and a 2-Angstrom success rate of 73.23% on the CASP15 dataset, while maintaining strong performance on the HOMSTARD dataset. It demonstrates the best performance across nearly all loop lengths and secondary structural types. Beyond accuracy, AutoLoop is computationally efficient, requiring only 0.10 s per generation. A post-processing module for side-chain packing and energy minimization further improves results slightly, confirming the reliability of the predicted backbone. A case study also highlights AutoLoop's potential for precise predictions based on dominant loop conformations. These advances hold promise for protein engineering and drug discovery.

Submitted 5 May, 2025; originally announced May 2025.

Comments: 34 pages, 7 figures

arXiv:2504.10983 [pdf, other]

ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu

Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10525 [pdf]

BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications

Authors: Zhe Wang, Fangtian Fu, Wei Zhang, Lige Yan, Yan Meng, Jianping Wu, Hui Wu, Gang Xu, Si Chen

Abstract: Automated extraction of chemical structures and their bioactivity data is crucial for accelerating drug discovery and enabling data-driven pharmaceutical research. Existing optical chemical structure recognition (OCSR) tools fail to autonomously associate molecular structures with their bioactivity profiles, creating a critical bottleneck in structure-activity relationship (SAR) analysis. Here, we present BioChemInsight, an open-source pipeline that integrates: (1) DECIMER Segmentation and MolVec for chemical structure recognition, (2) Qwen2.5-VL-32B for compound identifier association, and (3) PaddleOCR with Gemini-2.0-flash for bioactivity extraction and unit normalization. We evaluated the performance of BioChemInsight on 25 patents and 17 articles. BioChemInsight achieved 95% accuracy for tabular patent data (structure/identifier recognition), with lower accuracy in non-tabular patents (~80% structures, ~75% identifiers), plus 92.2% bioactivity extraction accuracy. For articles, it attained >99% identifier accuracy and 78-80% structure accuracy in non-tabular formats, plus 97.4% bioactivity extraction accuracy. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours while enabling applications in high-throughput screening and ML-driven drug design (https://github.com/dahuilangda/BioChemInsight).

Submitted 12 April, 2025; originally announced April 2025.

Comments: 20 pages, 7 figures

arXiv:2503.17738 [pdf]

Tumor-associated CD19$^+$ macrophages induce immunosuppressive microenvironment in hepatocellular carcinoma

Authors: Junli Wang, Wanyue Cao, Jinyan Huang, Yu Zhou, Rujia Zheng, Yu Lou, Jiaqi Yang, Jianghui Tang, Mao Ye, Zhengtao Hong, Jiangchao Wu, Haonan Ding, Yuquan Zhang, Jianpeng Sheng, Xinjiang Lu, Pinglong Xu, Xiongbin Lu, Xueli Bai, Tingbo Liang, Qi Zhang

Abstract: Tumor-associated macrophages are a key component that contributes to the immunosuppressive microenvironment in human cancers. However, therapeutic targeting of macrophages has been a challenge in clinic due to the limited understanding of their heterogeneous subpopulations and distinct functions. Here, we identify a unique and clinically relevant CD19$^+$ subpopulation of macrophages that is enriched in many types of cancer, particularly in hepatocellular carcinoma (HCC). The CD19$^+$ macrophages exhibit increased levels of PD-L1 and CD73, enhanced mitochondrial oxidation, and compromised phagocytosis, indicating their immunosuppressive functions. Targeting CD19$^+$ macrophages with anti-CD19 chimeric antigen receptor T (CAR-T) cells inhibited HCC tumor growth. We identify PAX5 as a primary driver of up-regulated mitochondrial biogenesis in CD19$^+$ macrophages, which depletes cytoplasmic Ca$^{2+}$, leading to lysosomal deficiency and consequent accumulation of CD73 and PD-L1. Inhibiting CD73 or mitochondrial oxidation enhanced the efficacy of immune checkpoint blockade therapy in treating HCC, suggesting great promise for CD19$^+$ macrophage-targeting therapeutics.

Submitted 22 March, 2025; originally announced March 2025.

Comments: 7 figures

arXiv:2503.04362 [pdf, other]

A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery

Authors: Yiheng Zhu, Mingyang Li, Junlong Liu, Kun Fu, Jiansheng Wu, Qiuyi Li, Mingze Yin, Jieping Ye, Jian Wu, Zheng Wang

Abstract: Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre-trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross-domain relationships pivotal to SBDD. To fill this gap, we propose a general-purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture-of-Domain-Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture-of-Structure-Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture-of-experts approach enables BIT to achieve both deep fusion and domain-specific encoding, effectively capturing fine-grained molecular interactions within protein-ligand complexes. Then, we perform cross-domain pre-training on the shared Transformer backbone via several unified self-supervised denoising tasks. Experimental results on various benchmarks demonstrate that BIT achieves exceptional performance in downstream tasks, including binding affinity prediction, structure-based virtual screening, and molecular property prediction.

Submitted 6 March, 2025; originally announced March 2025.

arXiv:2503.03783 [pdf, other]

Passive Heart Rate Monitoring During Smartphone Use in Everyday Life

Authors: Shun Liao, Paolo Di Achille, Jiang Wu, Silviu Borac, Jonathan Wang, Xin Liu, Eric Teasley, Lawrence Cai, Yuzhe Yang, Yun Liu, Daniel McDuff, Hao-Wei Su, Brent Winslow, Anupam Pathak, Shwetak Patel, James A. Taylor, Jameson K. Rogers, Ming-Zher Poh

Abstract: Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.

Submitted 21 March, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

Comments: Updated author list

arXiv:2502.19391 [pdf, other]

Towards More Accurate Full-Atom Antibody Co-Design

Authors: Jiayang Wu, Xingyi Zhang, Xiangyu Dong, Kun Xie, Ziqi Liu, Wensheng Gan, Sibo Wang, Le Song

Abstract: Antibody co-design represents a critical frontier in drug development, where accurate prediction of both 1D sequence and 3D structure of complementarity-determining regions (CDRs) is essential for targeting specific epitopes. Despite recent advances in equivariant graph neural networks for antibody design, current approaches often fall short in capturing the intricate interactions that govern antibody-antigen recognition and binding specificity. In this work, we present Igformer, a novel end-to-end framework that addresses these limitations through innovative modeling of antibody-antigen binding interfaces. Our approach refines the inter-graph representation by integrating personalized propagation with global attention mechanisms, enabling comprehensive capture of the intricate interplay between local chemical interactions and global conformational dependencies that characterize effective antibody-antigen binding. Through extensive validation on epitope-binding CDR design and structure prediction tasks, Igformer demonstrates significant improvements over existing methods, suggesting that explicit modeling of multi-scale residue interactions can substantially advance computational antibody design for therapeutic applications.

Submitted 11 February, 2025; originally announced February 2025.

arXiv:2502.08975 [pdf, other]

Graph-structured Small Molecule Drug Discovery Through Deep Learning: Progress, Challenges, and Opportunities

Authors: Kun Li, Yida Xiong, Hongzhi Zhang, Xiantao Cai, Jia Wu, Bo Du, Wenbin Hu

Abstract: Due to their excellent drug-like and pharmacokinetic properties, small molecule drugs are widely used to treat various diseases, making them a critical component of drug discovery. In recent years, with the rapid development of deep learning (DL) techniques, DL-based small molecule drug discovery methods have achieved excellent performance in prediction accuracy, speed, and complex molecular relationship modeling compared to traditional machine learning approaches. These advancements enhance drug screening efficiency and optimization and provide more precise and effective solutions for various drug discovery tasks. Contributing to this field's development, this paper aims to systematically summarize and generalize the key tasks and representative techniques in graph-structured small molecule drug discovery in recent years. Specifically, we provide an overview of the major tasks in small molecule drug discovery and their interrelationships. Next, we analyze the six core tasks, summarizing the related methods, commonly used datasets, and technological development trends. Finally, we discuss key challenges, such as interpretability and out-of-distribution generalization, and offer our insights into future research directions for small molecule drug discovery.

Submitted 14 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 10 pages, 1 figure, 8 tables

arXiv:2502.07297 [pdf, other]

Generation of Drug-Induced Cardiac Reactions towards Virtual Clinical Trials

Authors: Qian Shao, Bang Du, Zepeng Li, Qiyuan Chen, Hongxia Xu, Jimeng Sun, Jian Wu, Jintai Chen

Abstract: Clinical trials remain critical in cardiac drug development but face high failure rates due to efficacy limitations and safety risks, incurring substantial costs. In-silico trial methodologies, particularly generative models simulating drug-induced electrocardiogram (ECG) alterations, offer a potential solution to mitigate these challenges. While existing models show progress in ECG synthesis, their constrained fidelity and inability to characterize individual-specific pharmacological response patterns fundamentally limit clinical translatability. To address these issues, we propose a novel Drug-Aware Diffusion Model (DADM). Specifically, we construct a set of ordinary differential equations to provide external physical knowledge (EPK) of the realistic ECG morphology. The EPK is used to adaptively constrain the morphology of the generated ECGs through a dynamic cross-attention (DCA) mechanism. Furthermore, we propose an extension of ControlNet to incorporate demographic and drug data, simulating individual drug reactions. Compared to the other eight state-of-the-art (SOTA) ECG generative models: 1) Quantitative and expert evaluation demonstrate that DADM generates ECGs with superior fidelity; 2) Comparative results on two real-world databases covering 8 types of drug regimens verify that DADM can more accurately simulate drug-induced changes in ECGs, improving the accuracy by at least 5.79% and recall by 8%. In addition, the ECGs generated by DADM can also enhance model performance in downstream drug-effect classification tasks.

Submitted 18 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: Under review

arXiv:2502.00934 [pdf]

Optimizing Global Genomic Surveillance for Early Detection of Emerging SARS-CoV-2 Variants

Authors: Haogao Gu, Jifan Li, Wanying Sun, Mengting Li, Kathy Leung, Joseph T. Wu, Hsiang-Yu Yuan, Maggie H. Wang, Bingyi Yang, Matthew R. McKay, Ning Ning, Leo L. M. Poon

Abstract: Background: Global viral threats underscore the need for effective genomic surveillance, but high costs and uneven resource distribution hamper its implementation. Targeting surveillance to international travelers in major travel hubs may offer a more efficient strategy for the early detection of SARS-CoV-2 variants. Methods: We developed and calibrated a multiple-strain metapopulation model of global SARS-CoV-2 transmission using extensive epidemiological, phylogenetic, and high-resolution air travel data. We then compared baseline surveillance with various resource-allocation approaches that prioritize travelers, focusing on Omicron BA.1/BA.2 retrospectively and on hypothetical future variants under different emergence, transmission and vaccine effectiveness scenarios. Findings: Focusing existing surveillance resources on travelers at key global hubs significantly shortened detection delays without increasing total surveillance efforts. In retrospective analyses of Omicron BA.1/BA.2, traveler-targeted approaches consistently outperformed baseline strategies, even when overall resources were reduced. Simulations indicate that focusing surveillance on key travel hubs outperforms baseline practices in detecting future variants, across different possible origins, even with reduced resources. This approach also remains effective in future pandemic scenarios with varying reproductive numbers and vaccine effectiveness. Interpretation: These findings provide a quantitative, cost-effective framework for strengthening global genomic surveillance. By reallocating resources toward international travelers in select travel hubs, early detection of emerging variants can be enhanced, informing rapid public health interventions and bolstering preparedness for future pandemics.

Submitted 13 February, 2025; v1 submitted 2 February, 2025; originally announced February 2025.

arXiv:2501.15799 [pdf, other]

Can Molecular Evolution Mechanism Enhance Molecular Representation?

Authors: Kun Li, Longtao Hu, Xiantao Cai, Jia Wu, Wenbin Hu

Abstract: Molecular evolution is the process of simulating the natural evolution of molecules in chemical space to explore potential molecular structures and properties. The relationships between similar molecules are often described through transformations such as adding, deleting, and modifying atoms and chemical bonds, reflecting specific evolutionary paths. Existing molecular representation methods mainly focus on mining data, such as atomic-level structures and chemical bonds directly from the molecules, often overlooking their evolutionary history. Consequently, we aim to explore the possibility of enhancing molecular representations by simulating the evolutionary process. We extract and analyze the changes in the evolutionary pathway and explore combining it with existing molecular representations. Therefore, this paper proposes the molecular evolutionary network (MEvoN) for molecular representations. First, we construct the MEvoN using molecules with a small number of atoms and generate evolutionary paths utilizing similarity calculations. Then, by modeling the atomic-level changes, MEvoN reveals their impact on molecular properties. Experimental results show that the MEvoN-based molecular property prediction method significantly improves the performance of traditional end-to-end algorithms on several molecular datasets. The code is available at https://anonymous.4open.science/r/MEvoN-7416/.

Submitted 27 January, 2025; originally announced January 2025.

Comments: 9 pages, 6 figures, 5 tables

arXiv:2501.05099 [pdf, other]

doi 10.1103/PhysRevE.111.034309

Recovery of activation propagation and self-sustained oscillation abilities in stroke brain networks

Authors: Yingpeng Liu, Jiao Wu, Kesheng Xu, Muhua Zheng

Abstract: Healthy brain networks usually show highly efficient information communication and self-sustained oscillation abilities. However, how the brain network structure affects these dynamics after an injury (stroke) is not very clear. The recovery of structure and dynamics of stroke brain networks over time is still not known precisely. Based on the analysis of a large number of strokes' brain network data, we show that stroke changes the network properties in connection weights, average degree, clustering, community, etc. Yet, they will recover gradually over time to some extent. We then adopt a simplified reaction-diffusion model to investigate stroke patients' activation propagation and self-sustained oscillation abilities. Our results reveal that the stroke slows the adoption time across different brain scales, indicating a weakened brain's activation propagation ability. In addition, we show that the lifetime of self-sustained oscillatory patterns at three months post-stroke patients' brains significantly departs from the healthy one. Finally, we examine the properties of core networks of self-sustained oscillatory patterns, in which the directed edges denote the main pathways of activation propagation. Our results demonstrate that the lifetime and recovery of self-sustaining patterns are related to the properties of core networks, and the properties in the post-stroke greatly vary from those in the healthy group. Most importantly, the strokes' activation propagation and self-sustained oscillation abilities significantly improve at one year post-stroke, driven by structural connection repair. This work may help us to understand the relationship between structure and function in brain disorders.

Submitted 9 January, 2025; originally announced January 2025.

Comments: 20 pages, 13 figures

arXiv:2412.20014 [pdf, other]

ProtCLIP: Function-Informed Protein Multi-Modal Learning

Authors: Hanjing Zhou, Mingze Yin, Wei Wu, Mingyang Li, Kun Fu, Jintai Chen, Jian Wu, Zheng Wang

Abstract: Multi-modality pre-training paradigm that aligns protein sequences and biological descriptions has learned general protein representations and achieved promising performance in various downstream applications. However, these works were still unable to replicate the extraordinary success of language-supervised visual foundation models due to the ineffective usage of aligned protein-text paired data and the lack of an effective function-informed pre-training paradigm. To address these issues, this paper curates a large-scale protein-text paired dataset called ProtAnno with a property-driven sampling strategy, and introduces a novel function-informed protein pre-training paradigm. Specifically, the sampling strategy determines selecting probability based on the sample confidence and property coverage, balancing the data quality and data quantity in the face of large-scale noisy data. Furthermore, motivated by the significance of the protein specific functional mechanism, the proposed paradigm explicitly models protein static and dynamic functional segments by two segment-wise pre-training objectives, injecting fine-grained information in a function-informed manner. Leveraging all these innovations, we develop ProtCLIP, a multi-modality foundation model that comprehensively represents function-aware protein embeddings. On 22 different protein benchmarks within 5 types, including protein functionality classification, mutation effect prediction, cross-modal transformation, semantic similarity inference and protein-protein interaction prediction, our ProtCLIP consistently achieves SOTA performance, with remarkable improvements of 75% on average in five cross-modal transformation benchmarks, 59.9% in GO-CC and 39.7% in GO-BP protein function prediction. The experimental results verify the extraordinary potential of ProtCLIP serving as the protein multi-modality foundation model.

Submitted 27 December, 2024; originally announced December 2024.

Journal ref: AAAI 2025

arXiv:2411.17331 [pdf, other]

Multiscale Jones Polynomial and Persistent Jones Polynomial for Knot Data Analysis

Authors: Ruzhi Song, Fengling Li, Jie Wu, Fengchun Lei, Guo-Wei Wei

Abstract: Many structures in science, engineering, and art can be viewed as curves in 3-space. The entanglement of these curves plays a crucial role in determining the functionality and physical properties of materials. Many concepts in knot theory provide theoretical tools to explore the complexity and entanglement of curves in 3-space. However, classical knot theory primarily focuses on global topological properties and lacks the consideration of local structural information, which is critical in practical applications. In this work, two localized models based on the Jones polynomial, namely the multiscale Jones polynomial and the persistent Jones polynomial, are proposed. The stability of these models, especially the insensitivity of the multiscale and persistent Jones polynomial models to small perturbations in curve collections, is analyzed, thus ensuring their robustness for real-world applications.

Submitted 26 November, 2024; originally announced November 2024.

Comments: 27 pages, 9 figures

MSC Class: 57K14; 92C10

arXiv:2411.15215 [pdf, other]

S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu

Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Submitted 20 November, 2024; originally announced November 2024.

arXiv:2411.02120 [pdf, other]

Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

Authors: Yiheng Zhu, Jialu Wu, Qiuyi Li, Jiahuan Yan, Mingze Yin, Wei Wu, Mingyang Li, Jieping Ye, Zheng Wang, Jian Wu

Abstract: Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https://github.com/violet-sto/Bridge-IF.

Submitted 4 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024

arXiv:2410.21069 [pdf]

EMOCPD: Efficient Attention-based Models for Computational Protein Design Using Amino Acid Microenvironment

Authors: Xiaoqi Ling, Cheng Cai, Demin Kong, Zhisheng Wei, Jing Wu, Lei Wang, Zhaohong Deng

Abstract: Computational protein design (CPD) refers to the use of computational methods to design proteins. Traditional methods relying on energy functions and heuristic algorithms for sequence design are inefficient and do not meet the demands of the big data era in biomolecules, with their accuracy limited by the energy functions and search algorithms. Existing deep learning methods are constrained by the learning capabilities of the networks, failing to extract effective information from sparse protein structures, which limits the accuracy of protein design. To address these shortcomings, we developed an Efficient attention-based Models for Computational Protein Design using amino acid microenvironment (EMOCPD). It aims to predict the category of each amino acid in a protein by analyzing the three-dimensional atomic environment surrounding the amino acids, and optimize the protein based on the predicted high-probability potential amino acid categories. EMOCPD employs a multi-head attention mechanism to focus on important features in the sparse protein microenvironment and utilizes an inverse residual structure to optimize the network architecture. The proposed EMOCPD achieves over 80% accuracy on the training set and 68.33% and 62.32% accuracy on two independent test sets, respectively, surpassing the best comparative methods by over 10%. In protein design, the thermal stability and protein expression of the predicted mutants from EMOCPD show significant improvements compared to the wild type, effectively validating EMOCPD's potential in designing superior proteins. Furthermore, the predictions of EMOCPD are influenced positively, negatively, or have minimal impact based on the content of the 20 amino acids, categorizing amino acids as positive, negative, or neutral. Research findings indicate that EMOCPD is more suitable for designing proteins with lower contents of negative amino acids.

Submitted 29 October, 2024; v1 submitted 28 October, 2024; originally announced October 2024.

arXiv:2410.06232 [pdf, other]

Range, not Independence, Drives Modularity in Biologically Inspired Representations

Authors: Will Dorrell, Kyle Hsu, Luke Hollingsworth, Jin Hwa Lee, Jiajun Wu, Chelsea Finn, Peter E Latham, Tim EJ Behrens, James CR Whittington

Abstract: Why do biological and artificial neurons sometimes modularise, each encoding a single meaningful variable, and sometimes entangle their representation of many variables? In this work, we develop a theory of when biologically inspired networks -- those that are nonnegative and energy efficient -- modularise their representation of source variables (sources). We derive necessary and sufficient conditions on a sample of sources that determine whether the neurons in an optimal biologically-inspired linear autoencoder modularise. Our theory applies to any dataset, extending far beyond the case of statistical independence studied in previous work. Rather we show that sources modularise if their support is ``sufficiently spread''. From this theory, we extract and validate predictions in a variety of empirical studies on how data distribution affects modularisation in nonlinear feedforward and recurrent neural networks trained on supervised and unsupervised tasks. Furthermore, we apply these ideas to neuroscience data, showing that range independence can be used to understand the mixing or modularising of spatial and reward information in entorhinal recordings in seemingly conflicting experiments. Further, we use these results to suggest alternate origins of mixed-selectivity, beyond the predominant theory of flexible nonlinear classification. In sum, our theory prescribes precise conditions on when neural activities modularise, providing tools for inducing and elucidating modular representations in brains and machines.

Submitted 11 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

Comments: 37 pages, 12 figures. WD and KH contributed equally; LH and JHL contributed equally

arXiv:2410.04815 [pdf, other]

A Review of BioTree Construction in the Context of Information Fusion: Priors, Methods, Applications and Trends

Authors: Zelin Zang, Yongjie Xu, Chenrui Duan, Yue Yuan, Jinlin Wu, Zhen Lei, Stan Z. Li

Abstract: Biological tree (BioTree) analysis is a foundational tool in biology, enabling the exploration of evolutionary and differentiation relationships among organisms, genes, and cells. Traditional tree construction methods, while instrumental in early research, face significant challenges in handling the growing complexity and scale of modern biological data, particularly in integrating multimodal datasets. Advances in deep learning (DL) offer transformative opportunities by enabling the fusion of biological prior knowledge with data-driven models. These approaches address key limitations of traditional methods, facilitating the construction of more accurate and interpretable BioTrees. This review highlights critical biological priors essential for phylogenetic and differentiation tree analyses and explores strategies for integrating these priors into DL models to enhance accuracy and interpretability. Additionally, the review systematically examines commonly used data modalities and databases, offering a valuable resource for developing and evaluating multimodal fusion models. Traditional tree construction methods are critically assessed, focusing on their biological assumptions, technical limitations, and scalability issues. Recent advancements in DL-based tree generation methods are reviewed, emphasizing their innovative approaches to multimodal integration and prior knowledge incorporation. Finally, the review discusses diverse applications of BioTrees in various biological disciplines, from phylogenetics to developmental biology, and outlines future trends in leveraging DL to advance BioTree research. By addressing the challenges of data complexity and prior knowledge integration, this review aims to inspire interdisciplinary innovation at the intersection of biology and DL.

Submitted 15 February, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

Comments: 115 pages, 15 figures

arXiv:2410.03803 [pdf, other]

Text-guided Diffusion Model for 3D Molecule Generation

Authors: Yanchen Luo, Junfeng Fang, Sihang Li, Zhiyuan Liu, Jiancan Wu, An Zhang, Wenjie Du, Xiang Wang

Abstract: The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions, struggling with complex customizations described in detailed human language. To address this, we propose the text guidance instead, and introduce TextSMOG, a new Text-guided Small Molecule Generation Approach via 3D Diffusion Model which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG's proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.

Submitted 4 October, 2024; originally announced October 2024.

arXiv:2409.11174 [pdf, other]

Identifying Influential nodes in Brain Networks via Self-Supervised Graph-Transformer

Authors: Yanqing Kang, Di Zhu, Haiyang Zhang, Enze Shi, Sigang Yu, Jinru Wu, Xuhui Wang, Xuan Liu, Geng Chen, Xi Jiang, Tuo Zhang, Shu Zhang

Abstract: Studying influential nodes (I-nodes) in brain networks is of great significance in the field of brain imaging. Most existing studies consider brain connectivity hubs as I-nodes. However, this approach relies heavily on prior knowledge from graph theory, which may overlook the intrinsic characteristics of the brain network, especially when its architecture is not fully understood. In contrast, self-supervised deep learning can learn meaningful representations directly from the data. This approach enables the exploration of I-nodes for brain networks, which is also lacking in current studies. This paper proposes a Self-Supervised Graph Reconstruction framework based on Graph-Transformer (SSGR-GT) to identify I-nodes, which has three main characteristics. First, as a self-supervised model, SSGR-GT extracts the importance of brain nodes to the reconstruction. Second, SSGR-GT uses Graph-Transformer, which is well-suited for extracting features from brain graphs, combining both local and global characteristics. Third, multimodal analysis of I-nodes uses graph-based fusion technology, combining functional and structural brain information. The I-nodes we obtained are distributed in critical areas such as the superior frontal lobe, lateral parietal lobe, and lateral occipital lobe, with a total of 56 identified across different experiments. These I-nodes are involved in more brain networks than other regions, have longer fiber connections, and occupy more central positions in structural connectivity. They also exhibit strong connectivity and high node efficiency in both functional and structural networks. Furthermore, there is a significant overlap between the I-nodes and both the structural and functional rich-club. These findings enhance our understanding of the I-nodes within the brain network, and provide new insights for future research in further understanding the brain working mechanisms.

Submitted 17 September, 2024; originally announced September 2024.

arXiv:2408.15999 [pdf]

Q-MRS: A Deep Learning Framework for Quantitative Magnetic Resonance Spectra Analysis

Authors: Christopher J. Wu, Lawrence S. Kegeles, Jia Guo

Abstract: Magnetic resonance spectroscopy (MRS) is an established technique for studying tissue metabolism, particularly in central nervous system disorders. While powerful and versatile, MRS is often limited by challenges associated with data quality, processing, and quantification. Existing MRS quantification methods face difficulties in balancing model complexity and reproducibility during spectral modeling, often falling into the trap of either oversimplification or over-parameterization. To address these limitations, this study introduces a deep learning (DL) framework that employs transfer learning, in which the model is pre-trained on simulated datasets before it undergoes fine-tuning on in vivo data. The proposed framework showed promising performance when applied to the Philips dataset from the BIG GABA repository and represents an exciting advancement in MRS data analysis.

Submitted 28 August, 2024; originally announced August 2024.

Comments: 8 pages, 4 figures, and 3 tables for the main body; 9 pages, 4 figures, and 3 tables for the supplementary material

arXiv:2408.11356 [pdf, other]

One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning

Authors: Kelei He, Tiejun Dong, Jinhui Wu, Junfeng Zhang

Abstract: Understanding the structure of the protein-ligand complex is crucial to drug development. Existing virtual structure measurement and screening methods are dominated by docking and its derived methods combined with deep learning. However, the sampling and scoring methodologies have largely restricted the accuracy and efficiency. Here, we show that these two fundamental tasks can be accurately tackled with a single model, namely LigPose, based on multi-task geometric deep learning. By representing the ligand and the protein pair as a graph, LigPose directly optimizes the three-dimensional structure of the complex, with the learning of binding strength and atomic interactions as auxiliary tasks, enabling its one-step prediction ability without docking tools. Extensive experiments show LigPose achieved state-of-the-art performance on major tasks in drug research. Its considerable improvements indicate a promising paradigm of AI-based pipeline for drug development.

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.09106 [pdf, other]

Fragment-Masked Diffusion for Molecular Optimization

Authors: Kun Li, Xiantao Cai, Jia Wu, Shirui Pan, Huiting Xu, Bo Du, Wenbin Hu

Abstract: Molecular optimization is a crucial aspect of drug discovery, aimed at refining molecular structures to enhance drug efficacy and minimize side effects, ultimately accelerating the overall drug development process. Many molecular optimization methods have been proposed, significantly advancing drug discovery. These methods primarily rely on understanding the specific drug target structures or their hypothesized roles in combating diseases. However, challenges such as a limited number of available targets and a difficulty capturing clear structures hinder innovative drug development. In contrast, phenotypic drug discovery (PDD) does not depend on clear target structures and can identify hits with novel and unbiased polypharmacology signatures. As a result, PDD-based molecular optimization can reduce potential safety risks while optimizing phenotypic activity, thereby increasing the likelihood of clinical success. Therefore, we propose a fragment-masked molecular optimization method based on PDD (FMOP). FMOP employs a regression-free diffusion model to conditionally optimize the molecular masked regions, effectively generating new molecules with similar scaffolds. On the large-scale drug response dataset GDSCv2, we optimize the potential molecules across all 985 cell lines. The overall experiments demonstrate that the in-silico optimization success rate reaches 95.4%, with an average efficacy increase of 7.5%. Additionally, we conduct extensive ablation and visualization experiments, confirming that FMOP is an effective and robust molecular optimization method. The code is available at: https://anonymous.4open.science/r/FMOP-98C2.

Submitted 14 May, 2025; v1 submitted 17 August, 2024; originally announced August 2024.

Comments: 12 pages, 9 figures, 4 tables

arXiv:2407.04055 [pdf, other]

Benchmark on Drug Target Interaction Modeling from a Structure Perspective

Authors: Xinnan Zhang, Jialin Wu, Junyi Xie, Tianlong Chen, Kaixiong Zhou

Abstract: The prediction modeling of drug-target interactions is crucial to drug discovery and design, which has seen rapid advancements owing to deep learning technologies. Recently developed methods, such as those based on graph neural networks (GNNs) and Transformers, demonstrate exceptional performance across various datasets by effectively extracting structural information. However, the benchmarking of these novel methods often varies significantly in terms of hyperparameter settings and datasets, which limits algorithmic progress. In view of these, we conduct a comprehensive survey and benchmark for drug-target interaction modeling from a structure perspective, via integrating tens of explicit (i.e., GNN-based) and implicit (i.e., Transformer-based) structure learning algorithms. To this end, we first unify the hyperparameter setting within each class of structure learning methods. Moreover, we conduct a macroscopical comparison between these two classes of encoding strategies as well as the different featurization techniques that inform molecules' chemical and physical properties. We then carry out the microscopical comparison between all the integrated models across the six datasets, via comprehensively benchmarking their effectiveness and efficiency. Remarkably, the summarized insights from the benchmark studies lead to the design of model combos. We demonstrate that our combos can achieve new state-of-the-art performance on various datasets associated with cost-effective memory and computation. Our code is available at https://github.com/justinwjl/GTB-DTI/tree/main.

Submitted 4 July, 2024; originally announced July 2024.

Comments: Submitted to NIPS 2024 Dataset and Benchmark

arXiv:2405.14545 [pdf, other]

A Cross-Field Fusion Strategy for Drug-Target Interaction Prediction

Authors: Hongzhi Zhang, Xiuwen Gong, Shirui Pan, Jia Wu, Bo Du, Wenbin Hu

Abstract: Drug-target interaction (DTI) prediction is a critical component of the drug discovery process. In the drug development engineering field, predicting novel drug-target interactions is extremely crucial. However, although existing methods have achieved high accuracy levels in predicting known drugs and drug targets, they fail to utilize global protein information during DTI prediction. This leads to an inability to effectively predict the interactions between novel drugs and their targets. As a result, the cross-field information fusion strategy is employed to acquire local and global protein information. Thus, we propose the siamese drug-target interaction (SiamDTI) prediction method, which utilizes a double channel network structure for cross-field supervised learning. Experimental results on three benchmark datasets demonstrate that SiamDTI achieves higher accuracy levels than other state-of-the-art (SOTA) methods on novel drugs and targets. Additionally, SiamDTI's performance with known drugs and targets is comparable to that of SOTA approaches. The code is available at https://anonymous.4open.science/r/DDDTI-434D.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14536 [pdf, other]

Regressor-free Molecule Generation to Support Drug Response Prediction

Authors: Kun Li, Xiuwen Gong, Shirui Pan, Jia Wu, Bo Du, Wenbin Hu

Abstract: Drug response prediction (DRP) is a crucial phase in drug discovery, and the most important metric for its evaluation is the IC50 score. DRP results are heavily dependent on the quality of the generated molecules. Existing molecule generation methods typically employ classifier-based guidance, enabling sampling within the IC50 classification range. However, these methods fail to ensure the sampling space range's effectiveness, generating numerous ineffective molecules. Through experimental and theoretical study, we hypothesize that conditional generation based on the target IC50 score can obtain a more effective sampling space. As a result, we introduce regressor-free guidance molecule generation to ensure sampling within a more effective space and support DRP. Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels. To effectively map regression labels between drugs and cell lines, we design a common-sense numerical knowledge graph that constrains the order of text representations. Experimental results on the real-world dataset for the DRP task demonstrate our method's effectiveness in drug discovery. The code is available at: https://anonymous.4open.science/r/RMCD-DBD1.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 22 pages, 7 figures, 9 tables

arXiv:2404.16357 [pdf, other]

Reverse engineering the brain input: Network control theory to identify cognitive task-related control nodes

Authors: Zhichao Liang, Yinuo Zhang, Jushen Wu, Quanying Liu

Abstract: The human brain receives complex inputs when performing cognitive tasks, which range from external inputs via the senses to internal inputs from other brain regions. However, the explicit inputs to the brain during a cognitive task remain unclear. Here, we present an input identification framework for reverse engineering the control nodes and the corresponding inputs to the brain. The framework is verified with synthetic data generated by a predefined linear system, indicating it can robustly reconstruct data and recover the inputs. Then we apply the framework to the real motor-task fMRI data from 200 human subjects. Our results show that the model with sparse inputs can reconstruct neural dynamics in motor tasks ($EV=0.779$) and the identified 28 control nodes largely overlap with the motor system. Underpinned by network control theory, our framework offers a general tool for understanding brain inputs.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2403.03089 [pdf, other]

VQSynery: Robust Drug Synergy Prediction With Vector Quantization Mechanism

Authors: Jiawei Wu, Mingyuan Yan, Dianbo Liu

Abstract: The pursuit of optimizing cancer therapies is significantly advanced by the accurate prediction of drug synergy. Traditional methods, such as clinical trials, are reliable yet encumbered by extensive time and financial demands. The emergence of high-throughput screening and computational innovations has heralded a shift towards more efficient methodologies for exploring drug interactions. In this study, we present VQSynergy, a novel framework that employs the Vector Quantization (VQ) mechanism, integrated with gated residuals and a tailored attention mechanism, to enhance the precision and generalizability of drug synergy predictions. Our findings demonstrate that VQSynergy surpasses existing models in terms of robustness, particularly under Gaussian noise conditions, highlighting its superior performance and utility in the complex and often noisy domain of drug synergy research. This study underscores the potential of VQSynergy in revolutionizing the field through its advanced predictive capabilities, thereby contributing to the optimization of cancer treatment strategies.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.17997 [pdf]

StaPep: an open-source tool for the structure prediction and feature extraction of hydrocarbon-stapled peptides

Authors: Zhe Wang, Jianping Wu, Mengjun Zheng, Chenchen Geng, Borui Zhen, Wei Zhang, Hui Wu, Zhengyang Xu, Gang Xu, Si Chen, Xiang Li

Abstract: Many tools exist for extracting structural and physicochemical descriptors from linear peptides to predict their properties, but similar tools for hydrocarbon-stapled peptides are lacking. Here, we present StaPep, a Python-based toolkit designed for generating 2D/3D structures and calculating 21 distinct features for hydrocarbon-stapled peptides. The current version supports hydrocarbon-stapled peptides containing 2 non-standard amino acids (norleucine and 2-aminoisobutyric acid) and 6 nonnatural anchoring residues (S3, S5, S8, R3, R5 and R8). Then we established a hand-curated dataset of 201 hydrocarbon-stapled peptides and 384 linear peptides with sequence information and experimental membrane permeability, to showcase StaPep's application in artificial intelligence projects. A machine learning-based predictor utilizing the above calculated features was developed with an AUC of 0.85, for identifying cell-penetrating hydrocarbon-stapled peptides. StaPep's pipeline spans data retrieval, cleaning, structure generation, molecular feature calculation, and machine learning model construction for hydrocarbon-stapled peptides. The source code and dataset are freely available on GitHub: https://github.com/dahuilangda/stapep_package.

Submitted 27 February, 2024; originally announced February 2024.

Comments: 26 pages, 6 figures

arXiv:2402.10516 [pdf, other]

Generative AI for Controllable Protein Sequence Design: A Survey

Authors: Yiheng Zhu, Zitai Kong, Jialu Wu, Weize Liu, Yuqiang Han, Mingze Yin, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou

Abstract: The design of novel protein sequences with targeted functionalities underpins a central theme in protein engineering, impacting diverse fields such as drug discovery and enzymatic engineering. However, navigating this vast combinatorial search space remains a severe challenge due to time and financial constraints. This scenario is rapidly evolving as the transformative advancements in AI, particularly in the realm of generative models and optimization algorithms, have been propelling the protein design field towards an unprecedented revolution. In this survey, we systematically review recent advances in generative AI for controllable protein sequence design. To set the stage, we first outline the foundational tasks in protein sequence design in terms of the constraints involved and present key generative models and optimization algorithms. We then offer in-depth reviews of each design task and discuss the pertinent applications. Finally, we identify the unresolved challenges and highlight research opportunities that merit deeper exploration.

Submitted 16 February, 2024; originally announced February 2024.

Comments: 9 pages

arXiv:2402.02164 [pdf]

Hierarchical Structure Enhances the Convergence and Generalizability of Linear Molecular Representation

Authors: Juan-Ni Wu, Tong Wang, Li-Juan Tang, Hai-Long Wu, Ru-Qin Yu

Abstract: Language models demonstrate fundamental abilities in syntax, semantics, and reasoning, though their performance often depends significantly on the inputs they process. This study introduces TSIS (Simplified TSID) and its variants: TSISD (TSIS with Depth-First Search), TSISO (TSIS in Order), and TSISR (TSIS in Random), as integral components of the t-SMILES framework. These additions complete the framework's design, providing diverse approaches to molecular representation. Through comprehensive analysis and experiments employing deep generative models, including GPT, diffusion models, and reinforcement learning, the findings reveal that the hierarchical structure of t-SMILES is more straightforward to parse than initially anticipated. Furthermore, t-SMILES consistently outperforms other linear representations such as SMILES, SELFIES, and SAFE, demonstrating superior convergence speed and enhanced generalization capabilities.

Submitted 18 November, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

Comments: 26 pages, 6 figures

arXiv:2401.09500 [pdf, other]

MorphGrower: A Synchronized Layer-by-layer Growing Approach for Plausible Neuronal Morphology Generation

Authors: Nianzu Yang, Kaipeng Zeng, Haotian Lu, Yexin Wu, Zexin Yuan, Danni Chen, Shengdian Jiang, Jiaxiang Wu, Yimin Wang, Junchi Yan

Abstract: Neuronal morphology is essential for studying brain functioning and understanding neurodegenerative disorders. As acquiring real-world morphology data is expensive, computational approaches for morphology generation have been studied. Traditional methods heavily rely on expert-set rules and parameter tuning, making it difficult to generalize across different types of morphologies. Recently, MorphVAE was introduced as the sole learning-based method, but its generated morphologies lack plausibility, i.e., they do not appear realistic enough and most of the generated samples are topologically invalid. To fill this gap, this paper proposes MorphGrower, which mimics the natural growth mechanism of neurons for generation. Specifically, MorphGrower generates morphologies layer by layer, with each subsequent layer conditioned on the previously generated structure. During each layer generation, MorphGrower utilizes a pair of sibling branches as the basic generation block and generates branch pairs synchronously. This approach ensures topological validity and allows for fine-grained generation, thereby enhancing the realism of the final generated morphologies. Results on four real-world datasets demonstrate that MorphGrower outperforms MorphVAE by a notable margin. Importantly, the electrophysiological response simulation demonstrates the plausibility of our generated samples from a neuroscience perspective. Our code is available at https://github.com/Thinklab-SJTU/MorphGrower.

Submitted 27 May, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

arXiv:2311.12834 [pdf, other]

Knot data analysis using multiscale Gauss link integral

Authors: Li Shen, Hongsong Feng, Fengling Li, Fengchun Lei, Jie Wu, Guo-Wei Wei

Abstract: In the past decade, topological data analysis (TDA) has emerged as a powerful approach in data science. The main technique in TDA is persistent homology, which tracks topological invariants over the filtration of point cloud data using algebraic topology. Although knot theory and related subjects are a focus of study in mathematics, their success in practical applications is quite limited due to the lack of localization and quantization. We address these challenges by introducing knot data analysis (KDA), a new paradigm that incorporates curve segmentation and multiscale analysis into the Gauss link integral. The resulting multiscale Gauss link integral (mGLI) recovers the global topological properties of knots and links at an appropriate scale but offers multiscale feature vectors to capture the local structures and connectivities of each curve segment at various scales. The proposed mGLI significantly outperforms other state-of-the-art methods in benchmark protein flexibility analysis, including earlier persistent homology-based methods. Our approach enables the integration of artificial intelligence (AI) and KDA for general curve-like objects and data.

Submitted 2 October, 2023; originally announced November 2023.

arXiv:2311.02798 [pdf, other]

doi 10.1038/s41467-024-55082-4

Multi-channel learning for integrating structural hierarchies into context-dependent molecular representation

Authors: Yue Wan, Jialu Wu, Tingjun Hou, Chang-Yu Hsieh, Xiaowei Jia

Abstract: Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, the data scarcity, combined with the highly non-linear causal relationships between physicochemical and biological properties and conventional molecular featurization schemes, complicates the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that might be advantageous for downstream tasks. Yet, existing molecular SSL methods largely overlook chemical knowledge, including molecular structure similarity, scaffold composition, and the context-dependent aspects of molecular properties when operating over the chemical space. They also struggle to learn the subtle variations in structure-activity relationship. This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge. It leverages the structural hierarchy within the molecule, embeds them through distinct pre-training tasks across channels, and aggregates channel information in a task-specific manner during fine-tuning. Our approach demonstrates competitive performance across various molecular property benchmarks and offers strong advantages in particularly challenging yet ubiquitous scenarios like activity cliffs.

Submitted 12 January, 2025; v1 submitted 5 November, 2023; originally announced November 2023.

Journal ref: Nat Commun 16, 413 (2025)

arXiv:2310.08735 [pdf, other]

Noise driven phase transitions in eco-evolutionary systems

Authors: Jim Wu, David J. Schwab, Trevor GrandPre

Abstract: In complex ecosystems such as microbial communities, there is constant ecological and evolutionary feedback between the residing species and the environment occurring on concurrent timescales. Species respond and adapt to their surroundings by modifying their phenotypic traits, which in turn alters their environment and the resources available. To study this interplay between ecological and evolutionary mechanisms, we develop a consumer-resource model that incorporates phenotypic mutations. In the absence of noise, we find that phase transitions require finely-tuned interaction kernels. Additionally, we quantify the effects of noise on frequency dependent selection by defining a time-integrated mutation current, which accounts for the rate at which mutations and speciation occur. We find three distinct phases: homogeneous, patterned, and patterned traveling waves. The last phase represents one way in which co-evolution of species can happen in a fluctuating environment. Our results highlight the principal roles that noise and non-reciprocal interactions between resources and consumers play in phase transitions within eco-evolutionary systems.

Submitted 16 October, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.16684 [pdf, other]

Leveraging Side Information for Ligand Conformation Generation using Diffusion-Based Approaches

Authors: Jiamin Wu, He Cao, Yuan Yao

Abstract: Ligand molecule conformation generation is a critical challenge in drug discovery. Deep learning models have been developed to tackle this problem, particularly through the use of generative models in recent years. However, these models often generate conformations that lack meaningful structure and randomness due to the absence of essential side information. Examples of such side information include the chemical and geometric features of the target protein, ligand-target compound interactions, and ligand chemical properties. Without these constraints, the generated conformations may not be suitable for further selection and design of new drugs. To address this limitation, we propose a novel method for generating ligand conformations that leverages side information and incorporates flexible constraints into standard diffusion models. Drawing inspiration from the concept of message passing, we introduce a ligand-target message passing block, a mechanism that facilitates the exchange of information between target nodes and ligand nodes, thereby incorporating target node features. To capture non-covalent interactions, we introduce ligand-target compound inter and intra edges. To further improve the biological relevance of the generated conformations, we train energy models using scalar chemical features. These models guide the progress of the standard Denoising Diffusion Probabilistic Models, resulting in more biologically meaningful conformations. We evaluate the performance of SIDEGEN using the PDBBind-2020 dataset, comparing it against other methods. The results demonstrate improvements in both Aligned RMSD and Ligand RMSD evaluations. Specifically, our model outperforms GeoDiff (trained on PDBBind-2020) by 20% in terms of the median aligned RMSD metric.

Submitted 2 August, 2023; originally announced September 2023.

arXiv:2305.05086 [pdf]

Mechanical Evidence for the Phylogenetic Origin of the Red Panda's False Thumb as an Adaptation to Arboreal Locomotion

Authors: Braden Barnett, Yiqi Lyu, Kyle Pichney, Brian Sun, Jixiao Wu

Abstract: We constructed a modular, biomimetic red panda paw with which to experimentally investigate the evolutionary reason for the existence of the false thumbs of red pandas. These thumbs were once believed to have shared a common origin with the similar false thumbs of giant pandas; however, the discovery of a carnivorous fossil ancestor of the red panda that had false thumbs implies that the red panda did not evolve its thumbs to assist in eating bamboo, as the giant panda did, but rather evolved its thumbs for some other purpose. The leading proposal for this purpose is that the thumbs developed to aid arboreal locomotion. To test this hypothesis, we conducted grasp tests on rods 5-15 mm in diameter using a biomimetic paw with 0-16 mm interchangeable thumb lengths. The results of these tests demonstrated an optimal thumb length of 7 mm, which is just above the red panda's true thumb length of 5.5 mm. Given trends in the data that suggest that smaller thumbs are better suited to grasping larger diameter rods, we conclude that the red panda's thumb being sized below the optimum length suggests an adaptation toward grasping branches as opposed to relatively thinner food items, supporting the new proposal that the red panda's thumbs are an adaptation primarily to climbing rather than food manipulation.

Submitted 8 May, 2023; originally announced May 2023.

Comments: 14 pages, 10 figures

arXiv:2302.00545 [pdf, other]

An Out-of-Domain Synapse Detection Challenge for Microwasp Brain Connectomes

Authors: Jingpeng Wu, Yicong Li, Nishika Gupta, Kazunori Shinomiya, Pat Gunn, Alexey Polilov, Hanspeter Pfister, Dmitri Chklovskii, Donglai Wei

Abstract: The size of image stacks in connectomics studies now reaches the terabyte and often petabyte scales with a great diversity of appearance across brain regions and samples. However, manual annotation of neural structures, e.g., synapses, is time-consuming, which leads to limited training data often smaller than 0.001% of the test data in size. Domain adaptation and generalization approaches were proposed to address similar issues for natural images, which were less evaluated on connectomics data due to a lack of out-of-domain benchmarks.

Submitted 1 February, 2023; originally announced February 2023.

arXiv:2301.03424 [pdf, other]

An open unified deep graph learning framework for discovering drug leads

Authors: Yueming Yin, Haifeng Hu, Zhen Yang, Jitao Yang, Chun Ye, Jiansheng Wu, Wilson Wen Bin Goh

Abstract: Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.

Submitted 20 January, 2023; v1 submitted 5 December, 2022; originally announced January 2023.

arXiv:2301.01829 [pdf]

t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation

Authors: Juan-Ni Wu, Tong Wang, Yue Chen, Li-Juan Tang, Hai-Long Wu, Ru-Qin Yu

Abstract: Effective representation of molecules is a crucial factor affecting the performance of artificial intelligence models. This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES (tree-based SMILES) with three code algorithms: TSSA, TSDY and TSID. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph. Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show the feasibility of constructing a multi-code molecular description system, where various descriptions complement each other, enhancing the overall performance. In addition, it can avoid overfitting and achieve higher novelty scores while maintaining reasonable similarity on labeled low-resource datasets, regardless of whether the model is original, data-augmented, or pre-trained then fine-tuned. Furthermore, it significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. And it surpasses state-of-the-art fragment, graph and SMILES based approaches on ChEMBL, Zinc, and QM9.

Submitted 20 May, 2024; v1 submitted 4 January, 2023; originally announced January 2023.

arXiv:2210.17401 [pdf, other]

Towards a Better Model with Dual Transformer for Drug Response Prediction

Authors: Kun Li, Jia Wu, Bo Du, Sergey V. Petoukhov, Huiting Xu, Zheman Xiao, Wenbin Hu

Abstract: GNN-based methods have become a mainstream approach to drug response prediction and have achieved excellent results in recent years. Traditional GNN methods use only the atoms in a drug molecule as nodes to obtain the representation of the molecular graph through node information passing, whereas the method using the transformer can only extract information about the nodes. However, the covalent bonding and chirality of a drug molecule have a great influence on the pharmacological properties of the molecule, and this information is implied in the chemical bonds formed by the edges between the atoms. In addition, CNN methods for modelling cell line genomics sequences can only perceive local rather than global information about the sequence. To solve these problems, we propose a decoupled dual transformer structure with edge embedding for drug response prediction (TransEDRP), which represents the cell line genomics and the drug separately. For the drug branch, we encode the chemical bond information within the molecule as the edge embedding in the molecular graph and extract the global structural and biochemical information of the drug molecule using a graph transformer. For the cell line genomics branch, we use the multi-headed attention mechanism to globally represent the genomics sequence. Finally, the drug and genomics branches, which are different modalities, are fused to predict IC50 values through a transformer layer and a fully connected layer. Extensive experiments have shown that our method is better than the current mainstream approach in all evaluation indicators.

Submitted 10 December, 2024; v1 submitted 23 October, 2022; originally announced October 2022.

Comments: 28 pages, 4 figures, 5 tables

arXiv:2210.04552 [pdf]

Intrinsic motivation, Need for cognition, Grit, Growth Mindset and Academic Achievement in High School Students: Latent Profiles and Its Predictive Effects

Authors: Jun Wu, Shuoli Qi, Yueshan Zhong

Abstract: Recent efforts to identify non-cognitive predictors of academic achievement have especially focused on self-constructs, whose measurement is concerned with a specific domain (e.g., mathematics). However, other important factors, such as character and motivation, have received less attention. Additionally, the predictive accuracy of non-cognitive factors lacks evidence from subjects including English and Science. In this study, we take a person-centered approach and focus on students' intrinsic motivation, need for cognition, grit, and growth mindset. We mainly focus on how these factors predict students' mathematics, English, and science grades between 9th grade and 12th grade. The sample comprised 2,308 high school students in Boston (Female = 1,237; aged from 13 to 17). The research results indicated that: (1) four latent profiles of students emerged: High in grit students (n = 997, 43.2%, higher scores of grit); Moderate students (n = 905, 38.3%, moderate in all scores); High in intrinsic motivation students (n = 252, 11.8%, higher scores of intrinsic motivation); Low in grit students (n = 154, 6.7%, lower scores of grit); (2) students' gender, race, maternal education level, and social-economic ranking predicted the profiles; and (3) four profiles of students had a significant predictive effect on Mathematics, Science and English scores in both 9th grade and 12th grade. We discussed the importance of character education for adolescents and motivation for learning in high school.

Submitted 10 October, 2022; originally announced October 2022.

Comments: 24 pages, 2 tables, 2 figures

MSC Class: 62P15 (Primary)

ACM Class: G.2

Showing 1–50 of 102 results for author: Wu, J