Search | arXiv e-print repository

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

Authors: Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu

Abstract: Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.

Submitted 26 August, 2025; originally announced September 2025.

arXiv:2507.20925 [pdf, ps, other]

Zero-Shot Learning with Subsequence Reordering Pretraining for Compound-Protein Interaction

Authors: Hongzhi Zhang, Zhonglie Liu, Kun Meng, Jiameng Chen, Jia Wu, Bo Du, Di Lin, Yan Che, Wenbin Hu

Abstract: Given the vastness of chemical space and the ongoing emergence of previously uncharacterized proteins, zero-shot compound-protein interaction (CPI) prediction better reflects the practical challenges and requirements of real-world drug development. Although existing methods perform adequately during certain CPI tasks, they still face the following challenges: (1) Representation learning from local or complete protein sequences often overlooks the complex interdependencies between subsequences, which are essential for predicting spatial structures and binding properties. (2) Dependence on large-scale or scarce multimodal protein datasets demands significant training data and computational resources, limiting scalability and efficiency. To address these challenges, we propose a novel approach that pretrains protein representations for CPI prediction tasks using subsequence reordering, explicitly capturing the dependencies between protein subsequences. Furthermore, we apply length-variable protein augmentation to ensure excellent pretraining performance on small training datasets. To evaluate the model's effectiveness and zero-shot learning ability, we combine it with various baseline methods. The results demonstrate that our approach can improve the baseline model's performance on the CPI task, especially in the challenging zero-shot scenario. Compared to existing pre-training models, our model demonstrates superior performance, particularly in data-scarce scenarios where training samples are limited. Our implementation is available at https://github.com/Hoch-Zhang/PSRP-CPI.

Submitted 28 July, 2025; originally announced July 2025.

arXiv:2506.00854 [pdf, ps, other]

EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG

Authors: Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen, Anna Nai-Yun Tung, Hsiang Wei Hu, Yuan Chiao Cheng

Abstract: We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.

Submitted 8 July, 2025; v1 submitted 1 June, 2025; originally announced June 2025.

arXiv:2505.13940 [pdf, ps, other]

DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery

Authors: Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan, Wenbin Hu

Abstract: Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution. However, applying LLM agents to drug discovery is still constrained by challenges such as large-scale multimodal data processing, limited task automation, and poor support for domain-specific tools. To overcome these limitations, we introduce DrugPilot, an LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific workflows in drug discovery. DrugPilot enables multi-stage research processes by integrating structured tool use with a novel parameterized memory pool. The memory pool converts heterogeneous data from both public sources and user-defined inputs into standardized representations. This design supports efficient multi-turn dialogue, reduces information loss during data exchange, and enhances complex scientific decision-making. To support training and benchmarking, we construct a drug instruction dataset covering eight core drug discovery tasks. Under the Berkeley function-calling benchmark, DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively. These results highlight DrugPilot's potential as a versatile agent framework for computational science domains requiring automated, interactive, and data-integrated reasoning.

Submitted 28 July, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

Comments: 29 pages, 8 figures, 2 tables

arXiv:2505.12068 [pdf, ps, other]

Learning High-Order Relationships with Hypergraph Attention-based Spatio-Temporal Aggregation for Brain Disease Analysis

Authors: Wenqi Hu, Xuerui Su, Guanliang Li, Yidi Pan, Aijing Lin

Abstract: Traditional functional connectivity based on functional magnetic resonance imaging (fMRI) can only capture pairwise interactions between brain regions. Hypergraphs, which reveal high-order relationships among multiple brain regions, have been widely used for disease analysis. However, existing methods often rely on predefined hypergraph structures, limiting their ability to model complex patterns. Moreover, temporal information, an essential component of brain high-order relationships, is frequently overlooked. To address these limitations, we propose a novel framework that jointly learns informative and sparse high-order brain structures along with their temporal dynamics. Inspired by the information bottleneck principle, we introduce an objective that maximizes information and minimizes redundancy, aiming to retain disease-relevant high-order features while suppressing irrelevant signals. Our model comprises a multi-hyperedge binary mask module for hypergraph structure learning, a hypergraph self-attention aggregation module that captures spatial features through adaptive attention across nodes and hyperedges, and a spatio-temporal low-dimensional network for extracting discriminative spatio-temporal representations for disease classification. Experiments on benchmark fMRI datasets demonstrate that our method outperforms the state-of-the-art approaches and successfully identifies meaningful high-order brain interactions. These findings provide new insights into brain network modeling and the study of neuropsychiatric disorders.

Submitted 17 May, 2025; originally announced May 2025.

arXiv:2502.08975 [pdf, other]

Graph-structured Small Molecule Drug Discovery Through Deep Learning: Progress, Challenges, and Opportunities

Authors: Kun Li, Yida Xiong, Hongzhi Zhang, Xiantao Cai, Jia Wu, Bo Du, Wenbin Hu

Abstract: Due to their excellent drug-like and pharmacokinetic properties, small molecule drugs are widely used to treat various diseases, making them a critical component of drug discovery. In recent years, with the rapid development of deep learning (DL) techniques, DL-based small molecule drug discovery methods have achieved excellent performance in prediction accuracy, speed, and complex molecular relationship modeling compared to traditional machine learning approaches. These advancements enhance drug screening efficiency and optimization and provide more precise and effective solutions for various drug discovery tasks. Contributing to this field's development, this paper aims to systematically summarize and generalize the recent key tasks and representative techniques in graph-structured small molecule drug discovery. Specifically, we provide an overview of the major tasks in small molecule drug discovery and their interrelationships. Next, we analyze the six core tasks, summarizing the related methods, commonly used datasets, and technological development trends. Finally, we discuss key challenges, such as interpretability and out-of-distribution generalization, and offer our insights into future research directions for small molecule drug discovery.

Submitted 14 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 10 pages, 1 figure, 8 tables

arXiv:2501.15799 [pdf, other]

Can Molecular Evolution Mechanism Enhance Molecular Representation?

Authors: Kun Li, Longtao Hu, Xiantao Cai, Jia Wu, Wenbin Hu

Abstract: Molecular evolution is the process of simulating the natural evolution of molecules in chemical space to explore potential molecular structures and properties. The relationships between similar molecules are often described through transformations such as adding, deleting, and modifying atoms and chemical bonds, reflecting specific evolutionary paths. Existing molecular representation methods mainly focus on mining data, such as atomic-level structures and chemical bonds directly from the molecules, often overlooking their evolutionary history. Consequently, we aim to explore the possibility of enhancing molecular representations by simulating the evolutionary process. We extract and analyze the changes in the evolutionary pathway and explore combining it with existing molecular representations. Therefore, this paper proposes the molecular evolutionary network (MEvoN) for molecular representations. First, we construct the MEvoN using molecules with a small number of atoms and generate evolutionary paths utilizing similarity calculations. Then, by modeling the atomic-level changes, MEvoN reveals their impact on molecular properties. Experimental results show that the MEvoN-based molecular property prediction method significantly improves the performance of traditional end-to-end algorithms on several molecular datasets. The code is available at https://anonymous.4open.science/r/MEvoN-7416/.

Submitted 27 January, 2025; originally announced January 2025.

Comments: 9 pages, 6 figures, 5 tables

arXiv:2501.15007 [pdf, other]

Controllable Protein Sequence Generation with LLM Preference Optimization

Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu

Abstract: Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pre-trained protein large language models (LLMs) have shown promising results on protein sequence generation. However, to control sequence generation for specific attributes, existing work still exhibits poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi-listwise preference optimization strategy to improve generation quality and support multi-attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation.

Submitted 24 January, 2025; originally announced January 2025.

Comments: Accepted in the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

arXiv:2408.09106 [pdf, other]

Fragment-Masked Diffusion for Molecular Optimization

Authors: Kun Li, Xiantao Cai, Jia Wu, Shirui Pan, Huiting Xu, Bo Du, Wenbin Hu

Abstract: Molecular optimization is a crucial aspect of drug discovery, aimed at refining molecular structures to enhance drug efficacy and minimize side effects, ultimately accelerating the overall drug development process. Many molecular optimization methods have been proposed, significantly advancing drug discovery. These methods primarily rely on understanding the specific drug target structures or their hypothesized roles in combating diseases. However, challenges such as a limited number of available targets and difficulty capturing clear structures hinder innovative drug development. In contrast, phenotypic drug discovery (PDD) does not depend on clear target structures and can identify hits with novel and unbiased polypharmacology signatures. As a result, PDD-based molecular optimization can reduce potential safety risks while optimizing phenotypic activity, thereby increasing the likelihood of clinical success. Therefore, we propose a fragment-masked molecular optimization method based on PDD (FMOP). FMOP employs a regression-free diffusion model to conditionally optimize the molecular masked regions, effectively generating new molecules with similar scaffolds. On the large-scale drug response dataset GDSCv2, we optimize the potential molecules across all 985 cell lines. The overall experiments demonstrate that the in-silico optimization success rate reaches 95.4%, with an average efficacy increase of 7.5%. Additionally, we conduct extensive ablation and visualization experiments, confirming that FMOP is an effective and robust molecular optimization method. The code is available at: https://anonymous.4open.science/r/FMOP-98C2.

Submitted 14 May, 2025; v1 submitted 17 August, 2024; originally announced August 2024.

Comments: 12 pages, 9 figures, 4 tables

arXiv:2405.14545 [pdf, other]

A Cross-Field Fusion Strategy for Drug-Target Interaction Prediction

Authors: Hongzhi Zhang, Xiuwen Gong, Shirui Pan, Jia Wu, Bo Du, Wenbin Hu

Abstract: Drug-target interaction (DTI) prediction is a critical component of the drug discovery process. In the drug development engineering field, predicting novel drug-target interactions is extremely crucial. However, although existing methods have achieved high accuracy levels in predicting known drugs and drug targets, they fail to utilize global protein information during DTI prediction. This leads to an inability to effectively predict the interactions between novel drugs and their targets. As a result, the cross-field information fusion strategy is employed to acquire local and global protein information. Thus, we propose the siamese drug-target interaction prediction method (SiamDTI), which utilizes a double channel network structure for cross-field supervised learning. Experimental results on three benchmark datasets demonstrate that SiamDTI achieves higher accuracy levels than other state-of-the-art (SOTA) methods on novel drugs and targets. Additionally, SiamDTI's performance with known drugs and targets is comparable to that of SOTA approaches. The code is available at https://anonymous.4open.science/r/DDDTI-434D.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14536 [pdf, other]

Regressor-free Molecule Generation to Support Drug Response Prediction

Authors: Kun Li, Xiuwen Gong, Shirui Pan, Jia Wu, Bo Du, Wenbin Hu

Abstract: Drug response prediction (DRP) is a crucial phase in drug discovery, and the most important metric for its evaluation is the IC50 score. DRP results are heavily dependent on the quality of the generated molecules. Existing molecule generation methods typically employ classifier-based guidance, enabling sampling within the IC50 classification range. However, these methods fail to ensure the sampling space range's effectiveness, generating numerous ineffective molecules. Through experimental and theoretical study, we hypothesize that conditional generation based on the target IC50 score can obtain a more effective sampling space. As a result, we introduce regressor-free guidance molecule generation to ensure sampling within a more effective space and support DRP. Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels. To effectively map regression labels between drugs and cell lines, we design a common-sense numerical knowledge graph that constrains the order of text representations. Experimental results on the real-world dataset for the DRP task demonstrate our method's effectiveness in drug discovery. The code is available at: https://anonymous.4open.science/r/RMCD-DBD1.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 22 pages, 7 figures, 9 tables

arXiv:2312.10707 [pdf, other]

CLDR: Contrastive Learning Drug Response Models from Natural Language Supervision

Authors: Kun Li, Wenbin Hu

Abstract: Deep learning-based drug response prediction (DRP) methods can accelerate the drug discovery process and reduce R&D costs. Although the mainstream methods achieve high accuracy in predicting response regression values, the regression-aware representations of these methods are fragmented and fail to capture the continuity of the sample order. This phenomenon leads to models optimized to sub-optimal solution spaces, reducing generalization ability and potentially resulting in significant wasted costs in the drug discovery phase. In this paper, we propose CLDR, a contrastive learning framework with natural language supervision for DRP. CLDR converts regression labels into text, which is merged with the caption text of the drug response as a second modality of the samples compared to the traditional modalities (graph, sequence). In each batch, two modalities of one sample are considered positive pairs and the other pairs are considered negative pairs. At the same time, in order to enhance the continuous representation capability of the numerical text, a common-sense numerical knowledge graph is introduced. We validated several hundred thousand samples from the Genomics of Drug Sensitivity in Cancer dataset, observing that the average improvement of the DRP method ranges from 7.8% to 31.4% with the application of our framework. The experiments prove that CLDR effectively constrains the samples to a continuous distribution in the representation space, and achieves impressive prediction performance with only a few epochs of fine-tuning after pre-training. The code is available at: https://gitee.com/xiaoyibang/clipdrug.git.

Submitted 17 December, 2023; originally announced December 2023.

Comments: 9 pages, 4 figures, 3 tables

arXiv:2311.07624 [pdf]

Disordered hyperuniformity signals functioning and resilience of self-organized vegetation patterns

Authors: Wensi Hu, Quan-Xing Liu, Bo Wang, Nuo Xu, Lijuan Cui, Chi Xu

Abstract: In harsh environments, organisms may self-organize into spatially patterned systems in various ways. So far, studies of ecosystem spatial self-organization have primarily focused on apparent orders reflected by regular patterns. However, self-organized ecosystems may also have cryptic orders that can be unveiled only through certain quantitative analyses. Here we show that disordered hyperuniformity as a striking class of hidden orders can exist in spatially self-organized vegetation landscapes. By analyzing the high-resolution remotely sensed images across the American drylands, we demonstrate that it is not uncommon to find disordered hyperuniform vegetation states characterized by suppressed density fluctuations at long range. Such long-range hyperuniformity has been documented in a wide range of microscopic systems. Our finding contributes to expanding this domain to accommodate natural landscape ecological systems. We use theoretical modeling to propose that disordered hyperuniform vegetation patterning can arise from three generalized mechanisms prevalent in dryland ecosystems, including (1) critical absorbing states driven by an ecological legacy effect, (2) scale-dependent feedbacks driven by plant-plant facilitation and competition, and (3) density-dependent aggregation driven by plant-sediment feedbacks. Our modeling results also show that disordered hyperuniform patterns can help ecosystems cope with arid conditions with enhanced functioning of soil moisture acquisition. However, this advantage may come at the cost of slower recovery of ecosystem structure upon perturbations. Our work highlights that disordered hyperuniformity as a distinguishable but underexplored ecosystem self-organization state merits systematic studies to better understand its underlying mechanisms, functioning, and resilience.

Submitted 13 November, 2023; originally announced November 2023.

Comments: 34 pages, 6 figures; Supplementary Materials, 19 pages, 10 figures, 2 tables

arXiv:2310.12996 [pdf, other]

Zero-shot Learning of Drug Response Prediction for Preclinical Drug Screening

Authors: Kun Li, Yong Luo, Xiantao Cai, Wenbin Hu, Bo Du

Abstract: Conventional deep learning methods typically employ supervised learning for drug response prediction (DRP). This entails dependence on labeled response data from drugs for model training. However, practical applications in the preclinical drug screening phase demand that DRP models predict responses for novel compounds, often with unknown drug responses. This presents a challenge, rendering supervised deep learning methods unsuitable for such scenarios. In this paper, we propose a zero-shot learning solution for the DRP task in preclinical drug screening. Specifically, we propose a Multi-branch Multi-Source Domain Adaptation Test Enhancement Plug-in, called MSDA. MSDA can be seamlessly integrated with conventional DRP methods, learning invariant features from the prior response data of similar drugs to enhance real-time predictions of unlabeled compounds. We conducted experiments using the GDSCv2 and CellMiner datasets. The results demonstrate that MSDA efficiently predicts drug responses for novel compounds, leading to a general performance improvement of 5-10% in the preclinical drug screening phase. The significance of this solution resides in its potential to accelerate the drug discovery process, improve drug candidate assessment, and facilitate the success of drug discovery.

Submitted 5 October, 2023; originally announced October 2023.

Comments: 16 pages, 3 figures, 3 tables

arXiv:2303.00313 [pdf, other]

Deep Learning Methods for Small Molecule Drug Discovery: A Survey

Authors: Wenhao Hu, Yingying Liu, Xuanyu Chen, Wenhao Chai, Hangyue Chen, Hongwei Wang, Gaoang Wang

Abstract: With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted to the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. Most existing surveys, however, focus on only one of these applications, limiting the view of researchers in the community. In this paper, we present a comprehensive review of the aforementioned four aspects, and discuss the relationships among different applications. The latest literature and classical benchmarks are presented for a better understanding of the development of a variety of approaches. We commence by summarizing the molecule representation formats in these works, followed by an introduction of recently proposed approaches for each of the four tasks. Furthermore, we review a variety of commonly used datasets and evaluation metrics and compare the performance of deep learning-based models. Finally, we conclude by identifying remaining challenges and discussing future trends for deep learning methods in drug discovery.

Submitted 5 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2210.17401 [pdf, other]

Towards a Better Model with Dual Transformer for Drug Response Prediction

Authors: Kun Li, Jia Wu, Bo Du, Sergey V. Petoukhov, Huiting Xu, Zheman Xiao, Wenbin Hu

Abstract: GNN-based methods have become a mainstream approach to drug response prediction and have achieved excellent results in recent years. Traditional GNN methods use only the atoms in a drug molecule as nodes to obtain the representation of the molecular graph through node information passing, whereas the method using the transformer can only extract information about the nodes. However, the covalent bonding and chirality of a drug molecule have a great influence on the pharmacological properties of the molecule, and this information is implied in the chemical bonds formed by the edges between the atoms. In addition, CNN methods for modelling cell line genomics sequences can only perceive local rather than global information about the sequence. In order to solve the above problems, we propose the decoupled dual transformer structure with edge embedding for drug response prediction (TransEDRP), whose two branches represent cell line genomics and drugs, respectively. For the drug branch, we encode the chemical bond information within the molecule as the embedding of the edge in the molecular graph and extract the global structural and biochemical information of the drug molecule using a graph transformer. For the cell line genomics branch, we use the multi-headed attention mechanism to globally represent the genomics sequence. Finally, the drug and genomics branches, which are different modalities, are fused to predict IC50 values through the transformer layer and the fully connected layer. Extensive experiments have shown that our method is better than the current mainstream approach in all evaluation indicators.

Submitted 10 December, 2024; v1 submitted 23 October, 2022; originally announced October 2022.

Comments: 28 pages, 4 figures, 5 tables

arXiv:2008.11622 [pdf]

doi 10.1021/acsnano.0c03972

Effectiveness of Common Fabrics to Block Aqueous Aerosols of COVID Virus-like Nanoparticles

Authors: Steven R. Lustig, John J. S. Biswakarma, Devyesh Rana, Susan H. Tilford, Weike Hu, Ming Su, Michael S. Rosenblatt

Abstract: Layered systems of commonly available fabric materials can be used by the public and healthcare providers in face masks to reduce the risk of inhaling viruses with protection about equivalent or better than the filtration and adsorption offered by 5-layer N95 respirators. Over 70 different common fabric combinations and masks were evaluated under steady state, forced convection air flux with pulsed aerosols that simulate forceful respiration. The aerosols contain fluorescent virus-like nanoparticles to track transmission through materials that greatly assist the accuracy of detection, thus avoiding artifacts including pore flooding and the loss of aerosol due to evaporation and droplet break-up. Effective materials comprise both absorbent, hydrophilic layers and barrier, hydrophobic layers. Although the hydrophobic layers can adhere virus-like nanoparticles, they may also repel droplets from adjacent absorbent layers and prevent wicking transport across the fabric system. Effective designs are noted with absorbent layers comprising terry cloth towel, quilting cotton and flannel. Effective designs are noted with barrier layers comprising non-woven polypropylene, polyester and polyaramid.

Submitted 26 August, 2020; originally announced August 2020.

arXiv:2006.09928 [pdf, other]

Functional connectome fingerprinting: Identifying individuals and predicting cognitive function via deep learning

Authors: Biao Cai, Gemeng Zhang, Aiying Zhang, Li Xiao, Wenxing Hu, Julia M. Stephen, Tony W. Wilson, Vince D. Calhoun, Yu-Ping Wang

Abstract: The dynamic characteristics of functional network connectivity have been widely acknowledged and studied. Both shared and unique information has been shown to be present in the connectomes. However, very little is known about whether and how this common pattern can predict the individual variability of the brain, i.e. "brain fingerprinting", which attempts to reliably identify a particular individual from a pool of subjects. In this paper, we propose to enhance the individual uniqueness based on an autoencoder network. More specifically, we rely on the hypothesis that the common neural activities shared across individuals may lessen individual discrimination. By reducing contributions from shared activities, inter-subject variability can be enhanced. Results show that refined connectomes utilizing an autoencoder with sparse dictionary learning can successfully distinguish one individual from the remaining participants with reasonably high accuracy (up to 99.5% for the rest-rest pair). Furthermore, high-level cognitive behavior (e.g., fluid intelligence, executive function, and language comprehension) can also be better predicted using refined functional connectivity profiles. As expected, the high-order association cortices contributed more to both individual discrimination and behavior prediction. The proposed approach provides a promising way to enhance and leverage the individualized characteristics of brain networks.

Submitted 17 June, 2020; originally announced June 2020.

arXiv:2006.09454 [pdf, other]

Interpretable multimodal fusion networks reveal mechanisms of brain cognition

Authors: Wenxing Hu, Xianghe Meng, Yuntong Bai, Aiying Zhang, Biao Cai, Gemeng Zhang, Tony W. Wilson, Julia M. Stephen, Vince D. Calhoun, Yu-Ping Wang

Abstract: Multimodal fusion benefits disease diagnosis by providing a more comprehensive perspective. Developing algorithms is challenging due to data heterogeneity and the complex within- and between-modality associations. Deep-network-based data-fusion models have been developed to capture the complex associations, and the performance in diagnosis has been improved accordingly. Moving beyond diagnosis prediction, evaluation of disease mechanisms is critically important for biomedical research. Deep-network-based data-fusion models, however, are difficult to interpret, which makes it difficult to study biological mechanisms. In this work, we develop an interpretable multimodal fusion model, namely gCAM-CCL, which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps, which quantify pixel-level contributions of the input features. This is achieved by combining intermediate feature maps using gradient-based weights. Moreover, the estimated activation maps are class-specific, and the captured cross-data associations are interest/label related, which further facilitates class-specific analysis and biological mechanism analysis. We validate the gCAM-CCL model on a brain imaging-genetic study and show that gCAM-CCL performed well for both classification and mechanism analysis. Mechanism analysis suggests that during task-fMRI scans, several object recognition related regions of interest (ROIs) are first activated and then several downstream encoding ROIs get involved. Results also suggest that the higher cognition performing group may have stronger neurotransmission signaling while the lower cognition performing group may have problems in brain/neuron development, resulting from genetic variations.

Submitted 16 June, 2020; originally announced June 2020.

arXiv:2005.01200 [pdf, other]

Evolution of chemotactic hitchhiking

Authors: Gurdip Uppal, Weiyi Hu, Dervis Can Vural

Abstract: Bacteria typically reside in heterogeneous environments with various chemogradients where motile cells can gain an advantage over non-motile cells. Since motility is energetically costly, cells must optimize their swimming speed and behavior to maximize their fitness. Here we investigate how cheating strategies might evolve where slow or non-motile microbes exploit faster ones by sticking together and hitching a ride. Starting with physical and biological first principles, we computationally study the effects of sticking on the evolution of motility in a controlled chemostat environment. We find stickiness allows slow cheaters to dominate when nutrients are dispersed at intermediate distances. Here, slow microbes exploit faster ones until they consume the population, leading to a tragedy of the commons. For long races, slow microbes do gain an initial advantage from sticking, but eventually fall behind. Here, fast microbes are more likely to stick to other fast microbes, and cooperate to increase their own population. We therefore find the nature of the hitchhiking interaction, parasitic or mutualistic, depends on the nutrient distribution.

Submitted 3 May, 2020; originally announced May 2020.

Comments: 10 pages, 5 figures

arXiv:1901.11418 [pdf, other]

Sequential Bayesian Detection of Spike Activities from Fluorescence Observations

Authors: Zhuangkun Wei, Bin Li, Weisi Guo, Wenxiu Hu, Chenglin Zhao

Abstract: Extracting and detecting spike activities from fluorescence observations is an important step in understanding how neuron systems work. The main challenge is that the combination of ambient noise with dynamic baseline fluctuation often contaminates the observations, thereby deteriorating the reliability of spike detection. This may be even worse in the face of the nonlinear biological process, the coupling interactions between spikes and baseline, and the unknown critical parameters of an underlying physiological model, in which erroneous estimations of parameters will affect the detection of spikes, causing further error propagation. In this paper, we propose a random finite set (RFS) based Bayesian approach. The dynamic behaviors of the spike sequence, fluctuating baseline and unknown parameters are formulated as one RFS. This RFS state is capable of distinguishing the hidden active/silent states induced by spike and non-spike activities respectively, thereby negating the interaction role played by spikes and other factors. Then, premised on the RFS states, a Bayesian inference scheme is designed to simultaneously estimate the model parameters, baseline, and crucial spike activities. Our results demonstrate that the proposed scheme can gain an extra 12% detection accuracy in comparison with the state-of-the-art MLSpike method.

Submitted 31 January, 2019; originally announced January 2019.

arXiv:1604.06131 [pdf, ps, other]

doi 10.1039/C6SM00934D

Amoeboid swimming in a channel

Authors: Hao Wu, A. Farutin, W. -F. Hu, M. Thiébaud, S. Rafaï, P. Peyla, M. -C. Lai, C. Misbah

Abstract: Several micro-organisms, such as bacteria, algae, or spermatozoa, use flagella or cilia to swim in a fluid, while many other micro-organisms instead use ample shape deformation, described as amoeboid, to propel themselves by either crawling on a substrate or swimming. Many eukaryotic cells were believed to require an underlying substratum to migrate (crawl) by using membrane deformation (like blebbing or generation of lamellipodia) but there is now increasing evidence that a large variety of cells (including those of the immune system) can migrate without the assistance of focal adhesion, allowing them to swim as efficiently as they can crawl. This paper details the analysis of amoeboid swimming in a confined fluid by modeling the swimmer as an inextensible membrane deploying local active forces. The swimmer displays a rich behavior: it may settle into a straight trajectory in the channel or navigate from one wall to the other depending on its confinement. The nature of the swimmer is also found to be affected by confinement: the swimmer can behave, on the average over one swimming cycle, as a pusher at low confinement, and becomes a puller at higher confinement. The swimmer's nature is thus not an intrinsic property. The scaling of the swimmer velocity V with the force amplitude A is analyzed in detail showing that at small enough A, $V\sim A^2/η^2$, whereas at large enough A, V is independent of the force and is determined solely by the stroke frequency and swimmer size. This finding starkly contrasts with currently known results found from swimming models where motion is based on flagellar or ciliary activity, where $V\sim A/η$. To conclude, two definitions of efficiency as put forward in the literature are analyzed with distinct outcomes. We find that one type of efficiency has an optimum as a function of confinement while the other does not. Future perspectives are outlined.

Submitted 28 August, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: Advance Article, Soft Matter (2016), 16 pages, 18 figures

Journal ref: Soft Matter, 12, 7470-7484 (2016)

arXiv:1502.03975 [pdf, ps, other]

doi 10.1103/PhysRevE.92.050701

Amoeboid motion in confined geometry

Authors: Hao Wu, M. Thiébaud, W. -F. Hu, A. Farutin, S. Rafaï, M. -C. Lai, P. Peyla, C. Misbah

Abstract: Many eukaryotic cells undergo frequent shape changes (described as amoeboid motion) that enable them to move forward. We investigate the effect of confinement on a minimal model of an amoeboid swimmer. Complex pictures emerge: (i) The swimmer's nature (i.e., either pusher or puller) can be modified by confinement, thus suggesting that this is not an intrinsic property of the swimmer. This swimming nature transition stems from intricate internal degrees of freedom of membrane deformation. (ii) The swimming speed might increase with increasing confinement before decreasing again for stronger confinements. (iii) A straight amoeboid swimmer's trajectory in the channel can become unstable, and ample lateral excursions of the swimmer prevail. This happens for both pusher- and puller-type swimmers. For weak confinement, these excursions are symmetric, while they become asymmetric at stronger confinement, whereby the swimmer is located closer to one of the two walls. In this study, we combine numerical and theoretical analyses.

Submitted 4 November, 2015; v1 submitted 13 February, 2015; originally announced February 2015.

Comments: 5 pages, 7 figures

Journal ref: Phys. Rev. E 92, 050701 (2015)

arXiv:1403.3256 [pdf]

Parkinson disease is a TH17 dominant autoimmune disorder against accumulated alpha-synuclein

Authors: Wan-Chung Hu

Abstract: Parkinson disease is a very common neurodegenerative disorder. Patients usually undergo destruction of the substantia nigra and develop typical symptoms such as resting tremor, hypokinesia, and rigidity. However, the exact mechanism of Parkinson disease is still unknown, so it is called idiopathic Parkinsonism. According to my microarray analysis of peripheral blood leukocytes and substantia nigra brain tissue, I propose that Parkinson disease is actually a TH17 dominant autoimmune disease. Based on the microarray data in substantia nigra, HSP40, HSP70, HSP90, HSP27, HSP105, TLR5, TLR7, CEBPB, CEBPG, FOS, and caspase1 are significantly up-regulated. In peripheral leukocytes, NFKB1A, CEBPD, FOS, retinoic receptor alpha, suppressor of IKK epsilon, S100A11, G-CSF, MMP9, IL-1 receptor, IL-8 receptor, TNF receptor, caspase8, c1q receptor, cathepsin Z, HLA-G, complement receptor1, and complement 5a receptor are up-regulated. General immune related genes are also up-regulated including ILF2, CD22, CD3E, BLNK, ILF3, TCR alpha, TCR zeta, TCR delta, LAT, ITK, Ly9, and BANK1. The autoantigen is mainly alpha-synuclein. After knowing the exact disease pathophysiology, we can develop better drugs to prevent or control the detrimental disorder.

Submitted 21 November, 2013; originally announced March 2014.

arXiv:1311.4968 [pdf]

Unstable Angina is a syndrome correlated to mixed Th17 and Th1 immune disorder

Authors: Wan-Chung Hu

Abstract: Unstable angina is a common clinical manifestation of atherosclerosis. However, the detailed pathogenesis of unstable angina is still not known. Here, I propose that unstable angina is a mixed TH17 and TH1 immune disorder. Using microarray analysis, I find that TH1 and TH17 related cytokines, cytokine receptors, chemokines, complement, immune-related transcription factors, anti-bacterial genes, Toll-like receptors, and heat shock proteins are all up-regulated in peripheral leukocytes of unstable angina patients. In addition, H-ATPase, glycolytic genes, and platelet and RBC related genes are also up-regulated in peripheral leukocytes during unstable angina. This also implies that atherosclerosis is a mixed TH17 and TH1 autoimmune disease. If we know the etiology of unstable angina as well as atherosclerosis better, we can have better methods to control and prevent this detrimental illness.

Submitted 20 November, 2013; originally announced November 2013.

arXiv:1311.4747 [pdf]

Sepsis is a syndrome with hyperactivity of TH17-like innate immunity and hypoactivity of adaptive immunity

Authors: Wan-Chung Hu

Abstract: Currently, there are two major theories for the pathogenesis of sepsis: hyperimmune and hypoimmune. Hyperimmune theory suggests that cytokine storm causes the symptoms of sepsis. On the contrary, hypoimmune theory suggests that immunosuppression causes the manifestations of sepsis. Using microarray analysis, this study implies that hyperactivity of TH17-like innate immunity and failure of adaptive immunity are noted in sepsis patients. I find that innate immunity related genes are significantly up-regulated including CD14, TLR1,2,4,5,8, HSP70, CEBP proteins, AP1 (JUNB, FOSL2), TGF-β, IL-6, TGF-α, CSF2 receptor, TNFRSF1A, S100A binding proteins, CCR2, formyl peptide receptor2, amyloid proteins, pentraxin, defensins, CLEC5A, whole complement machinery, CPD, NCF, MMP, neutrophil elastase, caspases, IgG and IgA Fc receptors (CD64, CD32), ALOX5, PTGS, LTB4R, LTA4H, and ICAM1. The majority of adaptive immunity genes are down-regulated including MHC related genes, TCR genes, granzymes/perforin, CD40, CD8, CD3, TCR signaling, BCR signaling, T & B cell specific transcription factors, NK killer receptors, and TH17 helper specific transcription factors (STAT3, RORA, REL). In addition, Treg related genes are up-regulated including TGFβ, IL-15, STAT5B, SMAD2/4, CD36, and thrombospondin. Thus, both hyperimmune and hypoimmune mechanisms play important roles in the pathophysiology of sepsis.

Submitted 19 November, 2013; originally announced November 2013.

arXiv:1311.4384 [pdf]

Acute Respiratory Distress Syndrome is a TH17-like and Treg immune disease

Authors: Wan-Chung Hu

Abstract: Acute Respiratory Distress Syndrome (ARDS) is a very severe syndrome leading to respiratory failure and subsequent mortality. Sepsis is one of the leading causes of ARDS. Thus, extracellular bacteria play an important role in the pathophysiology of ARDS. Overactivated neutrophils are the major effector cells in ARDS. Thus, TH17-like innate immunity triggered by extracellular bacteria, with neutrophil activation, might account for the etiology of ARDS. Here, microarray analysis was employed to describe up-regulation of TH17-like innate immunity-related cytokines, including TGF-β and IL-6, in whole blood of ARDS patients. It was found that the innate TH17-related TLR1,2,4,5,8, HSP70, G-CSF, GM-CSF, complements, defensin, PMN chemokines, cathepsins, Fc receptors, NCFs, FOS, JunB, CEBPs, NFkB, and leukotriene B4 are all up-regulated. TGF-β secreting Treg cells play important roles in lung fibrosis. Up-regulation of Treg associated STAT5B and TGF-β with down-regulation of MHC genes, TCR genes, and co-stimulation molecule CD86 are noted. Key TH17 transcription factors, STAT3 and RORα, are down-regulated. Thus, the full adaptive TH17 helper CD4 T cells may not be successfully triggered. Many fibrosis promoting genes are also up-regulated including MMP8, MMP9, FGF13, TIMP1, TIMP2, PLOD1, P4HB, P4HA1, PDGFC, HMMR, HS2ST1, CHSY1, and CSGALNACT. Failure to induce successful adaptive immunity could also contribute to ARDS pathogenesis. Thus, ARDS is actually a TH17-like and Treg immune disorder.

Submitted 18 November, 2013; originally announced November 2013.

Showing 1–27 of 27 results for author: Hu, W