Search | arXiv e-print repository

Network structural change point detection and reconstruction for balanced neuronal networks

Authors: Kai Chen, Mingzhang Wang, Songting Li, Douglas Zhou

Abstract: Understanding brain dynamics and functions critically depends on knowledge of the network connectivity among neurons. However, the complexity of brain structural connectivity, coupled with continuous modifications driven by synaptic plasticity, makes its direct experimental measurement particularly challenging. Conventional connectivity inference methods based on neuronal recordings often assume a static underlying structural connectivity and require stable statistical features of neural activities, making them unsuitable for reconstructing structural connectivity that undergoes changes. To meet the need for reconstructing networks undergoing potential structural changes, we propose a unified network reconstruction framework that combines connectivity-induced change point detection (CPD) with a pairwise time-delayed correlation coefficient (TDCC) method. For general neuronal networks in balanced regimes, we develop a theoretical analysis for discriminating changes in structural connectivity based on the fluctuation of neuronal voltage time series. We then demonstrate a pairwise TDCC method to reconstruct the network using spike train recordings segmented at the detected change points. We show the effectiveness of our CPD-TDCC network reconstruction using large-scale network simulations with multiple neuronal models. Crucially, our method accommodates networks with changes in both network topologies and synaptic coupling strengths while retaining accuracy even with sparsely sampled subnetwork data, achieving a critical advancement for practical applications in real experimental situations. Our CPD-TDCC framework addresses the critical gap in network reconstruction by accounting for connectivity-induced change points, potentially offering a valuable tool for studying structure and dynamics in the cortical brain.
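As a rough illustration of the idea in this abstract, the sketch below computes a time-delayed correlation coefficient separately on each segment between change points. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`pearson`, `tdcc`, `segment_tdcc`), the use of a plain Pearson correlation at a fixed sample lag, the synthetic spike trains, and the hand-supplied change point are all hypothetical choices made for the example.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant signal as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def tdcc(x, y, lag):
    """Correlation between x(t) and y(t + lag), with lag >= 0 in samples."""
    if not 0 <= lag < len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    return pearson(x[: len(x) - lag], y[lag:])


def segment_tdcc(x, y, change_points, lag):
    """Evaluate tdcc independently on each segment between change points."""
    bounds = [0, *change_points, len(x)]
    return [tdcc(x[s:e], y[s:e], lag) for s, e in zip(bounds[:-1], bounds[1:])]


# Synthetic example (hypothetical data): y copies x with a 2-sample delay
# before the change point and is independent of x afterwards.
random.seed(0)
n, half, lag = 4000, 2000, 2
x = [1.0 if random.random() < 0.2 else 0.0 for _ in range(n)]
y = [0.0] * lag + x[: half - lag]
y += [1.0 if random.random() < 0.2 else 0.0 for _ in range(n - half)]

before, after = segment_tdcc(x, y, [half], lag)
print(f"TDCC before change: {before:.3f}, after change: {after:.3f}")
```

Computing the coefficient over the whole recording would blend the two regimes into one intermediate value, which is why segmenting at the change point matters in this toy setting.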

Submitted 3 July, 2025; originally announced July 2025.

Comments: 22 pages, 5 figures

arXiv:2507.02004 [pdf, ps, other]

STELLA: Self-Evolving LLM Agent for Biomedical Research

Authors: Ruofan Jin, Zaixi Zhang, Mengdi Wang, Le Cong

Abstract: The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self-evolving AI agent designed to overcome these limitations. STELLA employs a multi-agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state-of-the-art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity's Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity's Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2505.23839 [pdf, other]

GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Authors: Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang

Abstract: DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, raising concerns that they could generate harmful sequences such as pathogens or toxin-producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high-homology, non-pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer generation toward pathogen-like sequences, and (3) a BLAST-based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.22250 [pdf]

YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction

Authors: Mingzhuang Wang, Yvyang Li, Xiyang Zhang, Fei Tan, Qi Shi, Guotao Zhang, Siqi Chen, Yufei Liu, Lei Lei, Ming Zhou, Qiang Lin, Hongqiang Yang

Abstract: Coral reefs, crucial for sustaining marine biodiversity and ecological processes (e.g., nutrient cycling, habitat provision), face escalating threats, underscoring the need for efficient monitoring. Coral reef ecological monitoring faces dual challenges of low efficiency in manual analysis and insufficient segmentation accuracy in complex underwater scenarios. This study develops the YH-MINER system, establishing an intelligent framework centered on the Multimodal Large Model (MLLM) for "object detection-semantic segmentation-prior input". The system uses the object detection module (mAP@0.5=0.78) to generate spatial prior boxes for coral instances, driving the segment module to complete pixel-level segmentation in low-light and densely occluded scenarios. The segmentation masks and finetuned classification instructions are fed into the Qwen2-VL-based multimodal model as prior inputs, achieving a genus-level classification accuracy of 88% and simultaneously extracting core ecological metrics. Meanwhile, the system retains the scalability of the multimodal model through standardized interfaces, laying a foundation for future integration into multimodal agent-based underwater robots and supporting the full-process automation of "image acquisition-prior generation-real-time analysis".

Submitted 29 May, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.17922 [pdf, ps, other]

Bayesian ensemble learning for predicting health outcomes of multipollutant mixtures

Authors: Yu-Chien Ning, Xin Zhou, Francine Laden, Molin Wang

Abstract: We introduce the SoftBart approach from Bayesian ensemble learning to estimate the relationship between multipollutant mixtures and health outcomes under chronic exposure in epidemiology research. This approach offers several key advantages over existing methods: (1) it is computationally efficient and well-suited for analyzing large datasets; (2) it is flexible in estimating various correlated nonlinear functions simultaneously; and (3) it accurately identifies active variables within highly correlated multipollutant mixtures. Through simulations, we demonstrate the method's superiority by comparing its accuracy in estimating and quantifying uncertainties for both main and interaction effects with the commonly used method, BKMR. Last, we apply the method to analyze a multipollutant dataset with 10,110 participants from the Nurses' Health Study.

Submitted 23 May, 2025; originally announced May 2025.

Comments: 13 pages, 6 figures

arXiv:2505.09883 [pdf, other]

DeepPlantCRE: A Transformer-CNN Hybrid Framework for Plant Gene Expression Modeling and Cross-Species Generalization

Authors: Yingjun Wu, Jingyun Huang, Liang Ming, Pengcheng Deng, Maojun Wang, Zeyu Zhang

Abstract: The investigation of plant transcriptional regulation constitutes a fundamental basis for crop breeding, where cis-regulatory elements (CREs), as the key factor determining gene expression, have become the focus of crop genetic improvement research. Deep learning techniques, leveraging their exceptional capacity for high-dimensional feature extraction and nonlinear regulatory relationship modeling, have been extensively employed in this field. However, current methodologies present notable limitations: single CNN-based architectures struggle to capture long-range regulatory interactions, while existing CNN-Transformer hybrid models are prone to overfitting and generalize poorly in cross-species prediction contexts. To address these challenges, this study proposes DeepPlantCRE, a deep-learning framework for plant gene expression prediction and CRE Extraction. The model employs a Transformer-CNN hybrid architecture that achieves enhanced Accuracy, AUC-ROC, and F1-score metrics over existing baselines (DeepCRE and PhytoExpr), with improved generalization performance and reduced overfitting. Cross-species validation experiments conducted on gene expression datasets from Gossypium, Arabidopsis thaliana, Solanum lycopersicum, Sorghum bicolor, and Arabidopsis thaliana reveal that the model achieves peak prediction accuracy of 92.3%, particularly excelling in complex genomic data analysis. Furthermore, interpretability investigations using DeepLIFT and Transcription Factor Motif Discovery from the importance scores algorithm (TF-MoDISco) demonstrate that the derived motifs from our model exhibit high concordance with known transcription factor binding sites (TFBSs) such as MYR2, TSO1 in JASPAR plant database, substantiating the potential of biological interpretability and practical agricultural application of DeepPlantCRE.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.09873 [pdf, other]

Deep Learning and Explainable AI: New Pathways to Genetic Insights

Authors: Chenyu Wang, Chaoying Zuo, Zihan Su, Yuhang Xing, Lu Li, Maojun Wang, Zeyu Zhang

Abstract: Deep learning-based AI models have been extensively applied in genomics, achieving remarkable success across diverse applications. As these models gain prominence, there exists an urgent need for interpretability methods to establish trustworthiness in model-driven decisions. For genetic researchers, interpretable insights derived from these models hold significant value in providing novel perspectives for understanding biological processes. Current interpretability analyses in genomics predominantly rely on intuition and experience rather than rigorous theoretical foundations. In this review, we systematically categorize interpretability methods into input-based and model-based approaches, while critically evaluating their limitations through concrete biological application scenarios. Furthermore, we establish theoretical underpinnings to elucidate the origins of these constraints through formal mathematical demonstrations, aiming to assist genetic researchers in better understanding and designing models in the future. Finally, we provide feasible suggestions for future research on interpretability in the field of genetics.

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.01700 [pdf, other]

PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking

Authors: Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Chong Liu, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu

Abstract: Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, first, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real-time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting that modeling stereochemistry markedly improves structural plausibility. (4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts. The code, dataset, and leaderboard are released at https://github.com/CataAI/PoseX.

Submitted 21 May, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

arXiv:2503.20179 [pdf, other]

ProtoBERT-LoRA: Parameter-Efficient Prototypical Finetuning for Immunotherapy Study Identification

Authors: Shijia Zhang, Xiyu Ding, Kai Ding, Jacob Zhang, Kevin Galinsky, Mengrui Wang, Ryan P. Mayers, Zheyu Wang, Hadi Kharrazi

Abstract: Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories like Gene Expression Omnibus (GEO) is vital for cancer research yet remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. Our dataset was divided as follows: Training (20 positive, 20 negative), Prototype Set (10 positive, 10 negative), Validation (20 positive, 200 negative), and Test (71 positive, 765 negative). Evaluated on the test set, ProtoBERT-LoRA achieved an F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines and finetuned PubMedBERT. Application to 44,287 unlabeled studies reduced manual review efforts by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over stand-alone LoRA.
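The reported F1-score can be checked against the stated precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic and also computes the share of positives in the test split to show the class imbalance; the function name `f1_score` is just an illustrative helper, and only numbers quoted in the abstract are used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract.
f1 = f1_score(0.481, 0.887)

# Share of positive studies in the test split (71 positive, 765 negative).
positive_rate = 71 / (71 + 765)

print(f"F1 = {f1:.3f}")
print(f"positive share of test set = {positive_rate:.3f}")
```

With roughly one positive per twelve studies, a high-recall, moderate-precision operating point like this one is a common trade-off for screening tasks where missed studies cost more than extra manual review.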

Submitted 25 March, 2025; originally announced March 2025.

Comments: Submitted to AMIA 2025 Annual Symposium

arXiv:2502.00934 [pdf]

Optimizing Global Genomic Surveillance for Early Detection of Emerging SARS-CoV-2 Variants

Authors: Haogao Gu, Jifan Li, Wanying Sun, Mengting Li, Kathy Leung, Joseph T. Wu, Hsiang-Yu Yuan, Maggie H. Wang, Bingyi Yang, Matthew R. McKay, Ning Ning, Leo L. M. Poon

Abstract: Background: Global viral threats underscore the need for effective genomic surveillance, but high costs and uneven resource distribution hamper its implementation. Targeting surveillance to international travelers in major travel hubs may offer a more efficient strategy for the early detection of SARS-CoV-2 variants. Methods: We developed and calibrated a multiple-strain metapopulation model of global SARS-CoV-2 transmission using extensive epidemiological, phylogenetic, and high-resolution air travel data. We then compared baseline surveillance with various resource-allocation approaches that prioritize travelers, focusing on Omicron BA.1/BA.2 retrospectively and on hypothetical future variants under different emergence, transmission and vaccine effectiveness scenarios. Findings: Focusing existing surveillance resources on travelers at key global hubs significantly shortened detection delays without increasing total surveillance efforts. In retrospective analyses of Omicron BA.1/BA.2, traveler-targeted approaches consistently outperformed baseline strategies, even when overall resources were reduced. Simulations indicate that focusing surveillance on key travel hubs outperforms baseline practices in detecting future variants, across different possible origins, even with reduced resources. This approach also remains effective in future pandemic scenarios with varying reproductive numbers and vaccine effectiveness. Interpretation: These findings provide a quantitative, cost-effective framework for strengthening global genomic surveillance. By reallocating resources toward international travelers in select travel hubs, early detection of emerging variants can be enhanced, informing rapid public health interventions and bolstering preparedness for future pandemics.

Submitted 13 February, 2025; v1 submitted 2 February, 2025; originally announced February 2025.

arXiv:2412.16483 [pdf, other]

MOL-Mamba: Enhancing Molecular Representation with Structural & Electronic Insights

Authors: Jingjing Hu, Dan Guo, Zhan Si, Deguang Liu, Yunfeng Diao, Jing Zhang, Jinxing Zhou, Meng Wang

Abstract: Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of self-supervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom & Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.

Submitted 5 February, 2025; v1 submitted 20 December, 2024; originally announced December 2024.

Comments: Accepted by AAAI2025

arXiv:2412.12229 [pdf]

Efficacy of Temporal Interference Electrical Stimulation for Spinal Cord Injury Rehabilitation: A Case Series

Authors: Ruidong Cheng, Yuling Shao, Xi Li, Li Zhang, Zehao Sheng, Chenyang Li, Xu Xie, Huilin Mou, Weidong Chen, Shaomin Zhang, Yuchen Xu, Minmin Wang

Abstract: Spinal cord injury (SCI) is a debilitating condition that often results in significant motor and sensory deficits, impacting the quality of life. Current rehabilitation methods, including physical therapy and electrical stimulation, offer variable outcomes and often require invasive procedures. Temporal interference (TI) stimulation has emerged as a novel, non-invasive neuromodulation technique capable of targeting deep neural structures with precision, providing a promising alternative for SCI rehabilitation. This study explores the efficacy of TI stimulation as a non-invasive approach for improving motor and sensory function in patients with incomplete SCI. Three male patients with incomplete cervical SCI (AIS D) participated in a two-week intervention consisting of 14 sessions of TI stimulation targeting their injury sites. TI stimulation was delivered using frequencies of 1000 Hz and 1040 Hz, with assessments conducted pre- and post-intervention, including motor and sensory evaluations, functional scales, and imaging studies. All participants demonstrated significant improvements in neurological function, motor strength, sensory perception, and functional independence. Neurological levels of injury shifted upward in all cases, with one patient improving from C5 to C7. Graded Redefined Assessment of Strength, Sensibility and Prehension (GRASSP) results show additional strength, prehension and sensory outcomes obtained for the arm and hand functions of participants. Motor scores (UEMS and LEMS) increased, sensory scores for light touch and pin prick improved, and functional assessments, such as the Berg Balance Scale (BBS) and Barthel Index (BI), showed marked gains. Pain scores also decreased in two participants, highlighting additional therapeutic benefits.
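For readers unfamiliar with temporal interference, the two carrier frequencies quoted in this abstract imply a slow amplitude envelope at their difference, here 40 Hz. The sketch below checks this numerically using the sum-to-product identity. It is an idealized illustration only: it assumes two equal-amplitude sinusoids that add linearly, which the abstract does not state, and the function names are hypothetical.

```python
import math


def envelope_frequency(f1_hz, f2_hz):
    """Beat (envelope) frequency of two superposed sinusoids, in Hz."""
    return abs(f2_hz - f1_hz)


def summed_signal(t, f1_hz, f2_hz):
    """Sum of two unit-amplitude sinusoids at time t (seconds)."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)


def carrier_times_envelope(t, f1_hz, f2_hz):
    """Same signal written as a fast carrier times a slow envelope:
    sin(A) + sin(B) = 2 * cos((A - B) / 2) * sin((A + B) / 2)."""
    envelope = 2 * math.cos(math.pi * (f2_hz - f1_hz) * t)
    carrier = math.sin(math.pi * (f1_hz + f2_hz) * t)
    return envelope * carrier


f1_hz, f2_hz = 1000.0, 1040.0  # frequencies quoted in the abstract
beat_hz = envelope_frequency(f1_hz, f2_hz)

# Largest mismatch between the two forms over a 50 ms window.
max_error = max(
    abs(summed_signal(k * 1e-5, f1_hz, f2_hz)
        - carrier_times_envelope(k * 1e-5, f1_hz, f2_hz))
    for k in range(5000)
)
print(f"envelope frequency = {beat_hz:.0f} Hz, max identity error = {max_error:.2e}")
```

The magnitude of the envelope term, |2 cos(π Δf t)|, repeats every 1/Δf seconds, which is why the modulation rate equals the frequency difference rather than half of it.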

Submitted 16 December, 2024; originally announced December 2024.

Comments: 19 pages,1 table

arXiv:2410.20354 [pdf, other]

FoldMark: Protecting Protein Generative Models with Watermarking

Authors: Zaixi Zhang, Ruofan Jin, Kaidi Fu, Le Cong, Marinka Zitnik, Mengdi Wang

Abstract: Protein structure is key to understanding protein function and is essential for progress in bioengineering, drug discovery, and molecular biology. Recently, with the incorporation of generative AI, the power and accuracy of computational protein structure prediction/design have been improved significantly. However, ethical concerns such as copyright protection and harmful content generation (biosecurity) pose challenges to the wide implementation of protein generative models. Here, we investigate whether it is possible to embed watermarks into protein generative models and their outputs for copyright authentication and the tracking of generated structures. As a proof of concept, we propose a two-stage method FoldMark as a generalized watermarking strategy for protein generative models. FoldMark first pretrains a watermark encoder and decoder, which can slightly adjust protein structures to embed user-specific information and faithfully recover the information from the encoded structure. In the second step, protein generative models are fine-tuned with watermark-conditioned Low-Rank Adaptation (LoRA) modules to preserve generation quality while learning to generate watermarked structures with high recovery rates. Extensive experiments are conducted on open-source protein structure prediction models (e.g., ESMFold and MultiFlow) and de novo structure design models (e.g., FrameDiff and FoldFlow) and we demonstrate that our method is effective across all these generative models. Meanwhile, our watermarking framework only exerts a negligible impact on the original protein structure quality and is robust under potential post-processing and adaptive attacks.

Submitted 11 November, 2024; v1 submitted 27 October, 2024; originally announced October 2024.

arXiv:2409.19645 [pdf, other]

FlexSBDD: Structure-Based Drug Design with Flexible Protein Modeling

Authors: Zaixi Zhang, Mengdi Wang, Qi Liu

Abstract: Structure-based drug design (SBDD), which aims to generate 3D ligand molecules binding to target proteins, is a fundamental task in drug discovery. Existing SBDD methods typically treat protein as rigid and neglect protein structural change when binding with ligand molecules, leading to a big gap with real-world scenarios and inferior generation qualities (e.g., many steric clashes). To bridge the gap, we propose FlexSBDD, a deep generative model capable of accurately modeling the flexible protein-ligand complex structure for ligand molecule generation. FlexSBDD adopts an efficient flow matching framework and leverages E(3)-equivariant network with scalar-vector dual representation to model dynamic structural changes. Moreover, novel data augmentation schemes based on structure relaxation/sidechain repacking are adopted to boost performance. Extensive experiments demonstrate that FlexSBDD achieves state-of-the-art performance in generating high-affinity molecules and effectively modeling the protein's conformation change to increase favorable protein-ligand interactions (e.g., Hydrogen bonds) and decrease steric clashes.

Submitted 29 September, 2024; originally announced September 2024.

Comments: Accepted by NeurIPS 2024

arXiv:2409.09828 [pdf, other]

Latent Diffusion Models for Controllable RNA Sequence Generation

Authors: Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang

Abstract: This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures to support a wide range of functions. We utilize pretrained BERT-type models to encode raw RNA sequences into token-level, biologically meaningful representations. A Query Transformer is employed to compress such representations into a set of fixed-length latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models--surrogates for RNA functional properties--into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics. Further, we fine-tune the diffusion model on mRNA 5' untranslated regions (5'-UTRs) and optimize sequences for high translation efficiencies. Our guided diffusion model effectively generates diverse 5'-UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing rewards and structural stability trade-off. Our findings hold potential for advancing RNA sequence-function research and therapeutic RNA design.

Submitted 2 October, 2024; v1 submitted 15 September, 2024; originally announced September 2024.

arXiv:2408.05224 [pdf, ps, other]

Optimal Strategy for Stabilizing Protein Folding Intermediates

Authors: Mengshou Wang, Liangrong Pengb, Baoguo Jia, Liu Hong

Abstract: To manipulate the protein population at certain functional state through chemical stabilizers is crucial for protein-related studies. It not only plays a key role in protein structure analysis and protein folding kinetics, but also affects protein functionality to a large extent and thus has wide applications in medicine, food industry, etc. However, due to concerns about side effects or financial costs of stabilizers, identifying optimal strategies for enhancing protein stability with a minimal amount of stabilizers is of great importance. Here we prove that either for the fixed terminal time (including both finite and infinite cases) or the free one, the optimal control strategy for stabilizing the folding intermediates with a linear strategy for stabilizer addition belongs to the class of Bang-Bang controls. The corresponding optimal switching time is derived analytically, whose phase diagram with respect to several key parameters is explored in detail. The Bang-Bang control will be broken when nonlinear strategies for stabilizer addition are adopted. Our current study on optimal strategies for protein stabilizers not only offers deep insights into the general picture of protein folding kinetics, but also provides valuable theoretical guidance on treatments for protein-related diseases in medicine.

Submitted 28 July, 2024; originally announced August 2024.

Comments: 19 pages, 5 figures, 2 tables

MSC Class: 34Hxx; 92Cxx

arXiv:2407.20978 [pdf]

Are gene-by-environment interactions leveraged in multi-modality neural networks for breast cancer prediction?

Authors: Monica Isgut, Andrew Hornback, Yunan Luo, Asma Khimani, Neha Jain, May D. Wang

Abstract: Polygenic risk scores (PRSs) can significantly enhance breast cancer risk prediction when combined with clinical risk factor data. While many studies have explored the value-add of PRSs, little is known about the potential impact of gene-by-gene or gene-by-environment interactions towards enhancing the risk discrimination capabilities of multi-modal models combining PRSs with clinical data. In this study, we integrated data on 318 individual genotype variants along with clinical data in a neural network to explore whether gene-by-gene (i.e., between individual variants) and/or gene-by-environment (between clinical risk factors and variants) interactions could be leveraged jointly during training to improve breast cancer risk prediction performance. We benchmarked our approach against a baseline model combining traditional univariate PRSs with clinical data in a logistic regression model and ran an interpretability analysis to identify feature interactions. While our model did not demonstrate improved performance over the baseline, we discovered 248 (<1%) statistically significant gene-by-gene and gene-by-environment interactions out of the ~53.6k possible feature pairs, the most contributory of which included rs6001930 (MKL1) and rs889312 (MAP3K1), with age and menopause being the most heavily interacting non-genetic risk factors. We also modeled the significant interactions as a network of highly connected features, suggesting that potential higher-order interactions are captured by the model. Although gene-by-environment (or gene-by-gene) interactions did not enhance breast cancer risk prediction performance in neural networks, our study provides evidence that these interactions can be leveraged by these models to inform their predictions. This study represents the first application of neural networks to screen for interactions impacting breast cancer risk using real-world data.

Submitted 30 July, 2024; originally announced July 2024.

arXiv:2407.12296 [pdf]

Discovery of novel antimicrobial peptides with notable antibacterial potency by a LLM-based foundation model

Authors: Jike Wang, Jianwen Feng, Yu Kang, Peichen Pan, Jingxuan Ge, Yan Wang, Mingyang Wang, Zhenxing Wu, Xingcai Zhang, Jiameng Yu, Xujun Zhang, Tianyue Wang, Lirong Wen, Guangning Yan, Yafeng Deng, Hui Shi, Chang-Yu Hsieh, Zhihui Jiang, Tingjun Hou

Abstract: Large language models (LLMs) have shown remarkable advancements in chemistry and biomedical research, acting as versatile foundation models for various tasks. We introduce AMP-Designer, an LLM-based approach for swiftly designing novel antimicrobial peptides (AMPs) with desired properties. Within 11 days, AMP-Designer achieved the de novo design of 18 AMPs with broad-spectrum activity against Gram-negative bacteria. In vitro validation revealed a 94.4% success rate, with two candidates demonstrating exceptional antibacterial efficacy, minimal hemotoxicity, stability in human plasma, and low potential to induce resistance, as evidenced by significant bacterial load reduction in murine lung infection experiments. The entire process, from design to validation, concluded in 48 days. AMP-Designer excels in creating AMPs targeting specific strains despite limited data availability, with a top candidate displaying a minimum inhibitory concentration of 2.0 μg/ml against Propionibacterium acnes. Integrating advanced machine learning techniques, AMP-Designer demonstrates remarkable efficiency, paving the way for innovative solutions to antibiotic resistance.

Submitted 2 March, 2025; v1 submitted 16 July, 2024; originally announced July 2024.

Comments: 43 pages, 6 figures, 5 tables. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

arXiv:2407.07930 [pdf]

Token-Mol 1.0: Tokenized drug design with large language model

Authors: Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou

Abstract: Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.

Submitted 19 August, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07357 [pdf, ps, other]

A deep graph model for the signed interaction prediction in biological network

Authors: Shuyi Jin, Mengji Zhang, Meijie Wang, Lun Yu

Abstract: Predicting signed interactions in biological networks is crucial for understanding drug mechanisms and facilitating drug repurposing. While deep graph models have demonstrated success in modeling complex biological systems, existing approaches often fail to distinguish between positive and negative interactions, limiting their utility for precise pharmacological predictions. In this study, we propose a novel deep graph model, RGCNTD (Relational Graph Convolutional Network with Tensor Decomposition), designed to predict both polar (e.g., activation, inhibition) and non-polar (e.g., binding, affect) chemical-gene interactions. Our model integrates graph convolutional networks with tensor decomposition to enhance feature representation and incorporates a conflict-aware sampling strategy to resolve polarity ambiguities. We introduce new evaluation metrics, AUC_polarity and CP@500, to assess the model's ability to differentiate interaction types. Experimental results demonstrate that RGCNTD outperforms baseline models, achieving superior classification accuracy and improved discrimination of polar edges. Furthermore, we analyze the impact of subgraph components on predictive performance, revealing that additional network structures do not always enhance accuracy. These findings highlight the importance of polarity-aware modeling in drug discovery and network pharmacology, providing a robust framework for predicting complex biological interactions.

Submitted 17 March, 2025; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2406.07662 [pdf, other]

Progress Towards Decoding Visual Imagery via fNIRS

Authors: Michel Adamic, Wellington Avelino, Anna Brandenberger, Bryan Chiang, Hunter Davis, Stephen Fay, Andrew Gregory, Aayush Gupta, Raphael Hotter, Grace Jiang, Fiona Leng, Stephen Polcyn, Thomas Ribeiro, Paul Scotti, Michelle Wang, Marley Xiong, Jonathan Xu

Abstract: We demonstrate the possibility of reconstructing images from fNIRS brain activity and start building a prototype to match the required specs. By training an image reconstruction model on downsampled fMRI data, we discovered that cm-scale spatial resolution is sufficient for image generation. We obtained 71% retrieval accuracy with 1-cm resolution, compared to 93% on the full-resolution fMRI, and 20% with 2-cm resolution. With simulations and high-density tomography, we found that time-domain fNIRS can achieve 1-cm resolution, compared to 2-cm resolution for continuous-wave fNIRS. Lastly, we share designs for a prototype time-domain fNIRS device, consisting of a laser driver, a single photon detector, and a time-to-digital converter system.

Submitted 22 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2405.15158 [pdf, other]

ProtFAD: Introducing function-aware domains as implicit modality towards protein function prediction

Authors: Mingqing Wang, Zhiwei Nie, Yonghong He, Athanasios V. Vasilakos, Zhixiang Ren

Abstract: Protein function prediction is currently achieved by encoding its sequence or structure, where the sequence-to-function transcendence and high-quality structural data scarcity lead to obvious performance bottlenecks. Protein domains are "building blocks" of proteins that are functionally independent, and their combinations determine the diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information contained in the protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, and a domain-joint contrastive learning strategy to distinguish different protein functions while aligning the modalities. Specifically, we align the domain semantics with GO terms and text description to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks, and clearly differentiates proteins carrying distinct functions compared to the competitor. Our implementation is available at https://github.com/AI-HPC-Research-Team/ProtFAD.

Submitted 2 December, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Comments: 17 pages, 7 figures, 5 tables

arXiv:2405.12144 [pdf]

Alterations of electrocortical activity during hand movements induced by motor cortex glioma

Authors: Yihan Wu, Tao Chang, Siliang Chen, Xiaodong Niu, Yu Li, Yuan Fang, Lei Yang, Yixuan Zong, Yaoxin Yang, Yuehua Li, Mengsong Wang, Wen Yang, Yixuan Wu, Chen Fu, Xia Fang, Yuxin Quan, Xilin Peng, Qiang Sun, Marc M. Van Hulle, Yanhui Liu, Ning Jiang, Dario Farina, Yuan Yang, Jiayuan He, Qing Mao

Abstract: Glioma cells can reshape functional neuronal networks by hijacking neuronal synapses, leading to partial or complete neurological dysfunction. These mechanisms have been previously explored for language functions. However, the impact of glioma on sensorimotor functions is still unknown. Therefore, we recruited a control group of patients with unaffected motor cortex and a group of patients with glioma-infiltrated motor cortex, and recorded high-density electrocortical signals during finger movement tasks. The results showed that glioma suppresses task-related synchronization in the high-gamma band and reduces the power across all frequency bands. The resulting atypical motor information transmission model with discrete signaling pathways and delayed responses disrupts the stability of neuronal encoding patterns for finger movement kinematics across various temporal-spatial scales. These findings demonstrate that gliomas functionally invade neural circuits within the motor cortex. This result advances our understanding of motor function processing in chronic disease states, which is important to advance the surgical strategies and neurorehabilitation approaches for patients with malignant gliomas.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.11096 [pdf]

doi 10.17912/micropub.biology.001231

MicroBundlePillarTrack: A Python package for automated segmentation, tracking, and analysis of pillar deflection in cardiac microbundles

Authors: Hiba Kobeissi, Xining Gao, Samuel J. DePalma, Jourdan K. Ewoldt, Miranda C. Wang, Shoshana L. Das, Javiera Jilberto, David Nordsletten, Brendon M. Baker, Christopher S. Chen, Emma Lejeune

Abstract: Movies of human induced pluripotent stem cell (hiPSC)-derived engineered cardiac tissue (microbundles) contain abundant information about structural and functional maturity. However, extracting these data in a reproducible and high-throughput manner remains a major challenge. Furthermore, it is not straightforward to make direct quantitative comparisons across the multiple in vitro experimental platforms employed to fabricate these tissues. Here, we present "MicroBundlePillarTrack," an open-source optical flow-based package developed in Python to track the deflection of pillars in cardiac microbundles grown on experimental platforms with two different pillar designs ("Type 1" and "Type 2" design). Our software is able to automatically segment the pillars, track their displacements, and output time-dependent metrics for contractility analysis, including beating amplitude and rate, contractile force, and tissue stress. Because this software is fully automated, it will allow for both faster and more reproducible analyses of larger datasets and it will enable more reliable cross-platform comparisons as compared to existing approaches that require manual steps and are tailored to a specific experimental platform. To complement this open-source software, we share a dataset of 1,540 brightfield example movies on which we have tested our software. Through sharing this data and software, our goal is to directly enable quantitative comparisons across labs, and facilitate future collective progress via the biomedical engineering open-source data and software ecosystem.

Submitted 15 August, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

Comments: 8 main pages, 1 main figure, Supplementary Information included. microPublication Biology (2024)

MSC Class: 92F05; 74A05 ACM Class: J.2; J.3

arXiv:2404.18443 [pdf, other]

BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers

Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May D. Wang, Joyce C. Ho, Chao Zhang, Carl Yang

Abstract: Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due to the deficiency of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at https://huggingface.co/BMRetriever to ensure transparency, reproducibility, and application to new domains.

Submitted 3 October, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

Comments: Accepted to EMNLP 2024. The model and data are uploaded to https://github.com/ritaranx/BMRetriever

Journal ref: EMNLP 2024

arXiv:2404.18021 [pdf, other]

CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments

Authors: Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A. Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, Le Cong

Abstract: The introduction of genome engineering technology has transformed biomedical research, making it possible to make precise changes to genetic information. However, creating an efficient gene-editing system requires a deep understanding of CRISPR technology, and the complex experimental systems under investigation. While Large Language Models (LLMs) have shown promise in various tasks, they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case. Furthermore, we explore the ethical and regulatory considerations associated with automated gene-editing design, highlighting the need for responsible and transparent use of these tools. Our work aims to bridge the gap between beginner biological researchers and CRISPR genome engineering techniques, and demonstrate the potential of LLM agents in facilitating complex biological discovery tasks.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.02924 [pdf, other]

Accounting for contact network uncertainty in epidemic inferences

Authors: Maxwell H. Wang, Jukka-Pekka Onnela

Abstract: When modeling the dynamics of infectious disease, the incorporation of contact network information allows for the capture of the non-randomness and heterogeneity of realistic contact patterns. Oftentimes, it is assumed that the underlying contact pattern is known with perfect certainty. However, in realistic settings, the observed data often serves as an imperfect proxy of the actual contact patterns in the population. Furthermore, the epidemic in the real world are often not fully observed; event times such as infection and recovery times may be missing. In order to conduct accurate inferences on parameters of contagion spread, it is crucial to incorporate these sources of uncertainty. In this paper, we propose the use of Mixture Density Network compressed ABC (MDN-ABC) to learn informative summary statistics for the available data. This method will allow for Bayesian inference on the epidemic parameters of a contagious process, while accounting for imperfect observations on the epidemic and the contact network. We will demonstrate the use of this method on simulated epidemics and networks, and extend this framework to analyze the spread of Tattoo Skin Disease (TSD) among bottlenose dolphins in Shark Bay, Australia.

Submitted 15 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: 27 pages, 7 figures

arXiv:2404.00014 [pdf]

Deep Geometry Handling and Fragment-wise Molecular 3D Graph Generation

Authors: Odin Zhang, Yufei Huang, Shichen Cheng, Mengyao Yu, Xujun Zhang, Haitao Lin, Yundian Zeng, Mingyang Wang, Zhenxing Wu, Huifeng Zhao, Zaixi Zhang, Chenqing Hua, Yu Kang, Sunliang Cui, Peichen Pan, Chang-Yu Hsieh, Tingjun Hou

Abstract: Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level.

Submitted 15 March, 2024; originally announced April 2024.

arXiv:2403.00815 [pdf, other]

RAM-EHR: Retrieval Augmentation Meets Clinical Predictions on Electronic Health Records

Authors: Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Bowen Jin, May D. Wang, Joyce C. Ho, Carl Yang

Abstract: We present RAM-EHR, a Retrieval AugMentation pipeline to improve clinical predictions on Electronic Health Records (EHRs). RAM-EHR first collects multiple knowledge sources, converts them into text format, and uses dense retrieval to obtain information related to medical concepts. This strategy addresses the difficulties associated with complex names for the concepts. RAM-EHR then augments the local EHR predictive model co-trained with consistency regularization to capture complementary information from patient visits and summarized knowledge. Experiments on two EHR datasets show the efficacy of RAM-EHR over previous knowledge-enhanced baselines (3.4% gain in AUROC and 7.2% gain in AUPR), emphasizing the effectiveness of the summarized knowledge from RAM-EHR for clinical prediction tasks. The code will be published at https://github.com/ritaranx/RAM-EHR.

Submitted 26 July, 2024; v1 submitted 25 February, 2024; originally announced March 2024.

Comments: ACL 2024 (Oral)

Journal ref: ACL 2024

arXiv:2401.06173 [pdf, other]

Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization

Authors: Jiahao Qiu, Hui Yuan, Jinghong Zhang, Wentao Chen, Huazheng Wang, Mengdi Wang

Abstract: While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the combination of local search and bandit learning method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experiment results demonstrate that the algorithm is both sample-efficient and able to find top designs using reasonably small mutation counts.

Submitted 8 January, 2024; originally announced January 2024.

Comments: AAAI 2024

arXiv:2401.04246 [pdf, other]

Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules

Authors: Joseph C. Kim, David Bloore, Karan Kapoor, Jun Feng, Ming-Hong Hao, Mengdi Wang

Abstract: The Boltzmann distribution of a protein provides a roadmap to all of its functional states. Normalizing flows are a promising tool for modeling this distribution, but current methods are intractable for typical pharmacological targets; they become computationally intractable due to the size of the system, heterogeneity of intra-molecular potential energy, and long-range interactions. To remedy these issues, we present a novel flow architecture that utilizes split channels and gated attention to efficiently learn the conformational distribution of proteins defined by internal coordinates. We show that by utilizing a 2-Wasserstein loss, one can smooth the transition from maximum likelihood training to energy-based training, enabling the training of Boltzmann Generators for macromolecules. We evaluate our model and training strategy on villin headpiece HP35(nle-nle), a 35-residue subdomain, and protein G, a 56-residue protein. We demonstrate that standard architectures and training strategies, such as maximum likelihood alone, fail while our novel architecture and multi-stage training strategy are able to model the conformational distributions of protein G and HP35.

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.12989 [pdf, other]

Benchmarking and Analyzing In-context Learning, Fine-tuning and Supervised Learning for Biomedical Knowledge Curation: a focused study on chemical entities of biological interest

Authors: Emily Groves, Minhong Wang, Yusuf Abdulle, Holger Kunz, Jason Hoelscher-Obermaier, Ronin Wu, Honghan Wu

Abstract: Automated knowledge curation for biomedical ontologies is key to ensuring that they remain comprehensive, high-quality and up-to-date. In the era of foundational language models, this study compares and analyzes three NLP paradigms for curation tasks: in-context learning (ICL), fine-tuning (FT), and supervised learning (ML). Using the Chemical Entities of Biological Interest (ChEBI) database as a model ontology, three curation tasks were devised. For ICL, three prompting strategies were employed with GPT-4, GPT-3.5, and BioGPT. PubmedBERT was chosen for the FT paradigm. For ML, six embedding models were utilized for training Random Forest and Long-Short Term Memory models. Five setups were designed to assess ML and FT model performance across different data availability scenarios. Datasets for curation tasks included: task 1 (620,386), task 2 (611,430), and task 3 (617,381), maintaining a 50:50 positive versus negative ratio. For ICL models, GPT-4 achieved best accuracy scores of 0.916, 0.766 and 0.874 for tasks 1-3 respectively. In a direct comparison, ML (trained on ~260,000 triples) outperformed ICL in accuracy across all tasks (accuracy differences: +.11, +.22 and +.17). Fine-tuned PubmedBERT performed similarly to leading ML models in tasks 1 & 2 (F1 differences: -.014 and +.002), but worse in task 3 (-.048). Simulations revealed performance declines in both ML and FT models with smaller and more highly imbalanced training data, where ICL (particularly GPT-4) excelled in tasks 1 & 3. GPT-4 excelled in tasks 1 and 3 with less than 6,000 triples, surpassing ML/FT. ICL underperformed ML/FT in task 2. ICL-augmented foundation models can be good assistants for knowledge curation with correct prompting; however, they do not make the ML and FT paradigms obsolete. The latter two require task-specific data to beat ICL. In such cases, ML relies on small pretrained embeddings, minimizing computational demands.

Submitted 20 December, 2023; originally announced December 2023.

Comments: 26 pages, 5 figures, 14 tables

arXiv:2311.04238 [pdf, other]

Flexible Bayesian Inference on Partially Observed Epidemics

Authors: Maxwell H. Wang, Jukka-Pekka Onnela

Abstract: Individual-based models of contagious processes are useful for predicting epidemic trajectories and informing intervention strategies. In such models, the incorporation of contact network information can capture the non-randomness and heterogeneity of realistic contact dynamics. In this paper, we consider Bayesian inference on the spreading parameters of an SIR contagion on a known, static network, where information regarding individual disease status is known only from a series of tests (positive or negative disease status). When the contagion model is complex or information such as infection and removal times is missing, the posterior distribution can be difficult to sample from. Previous work has considered the use of Approximate Bayesian Computation (ABC), which allows for simulation-based Bayesian inference on complex models. However, ABC methods usually require the user to select reasonable summary statistics. Here, we consider an inference scheme based on the Mixture Density Network compressed ABC (MDN-ABC), which minimizes the expected posterior entropy in order to learn informative summary statistics. This allows us to conduct Bayesian inference on the parameters of a partially observed contagious process while also circumventing the need for manual summary statistic selection. This methodology can be extended to incorporate additional simulation complexities, including behavioral change after positive tests or false test results.

Submitted 6 November, 2023; originally announced November 2023.

Comments: 27 pages, 7 figures

arXiv:2308.01241 [pdf, other]

Digital Twin Brain: a simulation and assimilation platform for whole human brain

Authors: Wenlian Lu, Longbin Zeng, Xin Du, Wenyong Zhang, Shitong Xiang, Huarui Wang, Jiexiang Wang, Mingda Ji, Yubo Hou, Minglong Wang, Yuhao Liu, Zhongyu Chen, Qibao Zheng, Ningsheng Xu, Jianfeng Feng

Abstract: In this work, we present a computing platform named digital twin brain (DTB) that can simulate spiking neuronal networks at the scale of the whole human brain and, more importantly, with a personalized biological brain structure. In comparison to most brain simulations with a homogeneous global structure, we highlight that the sparseness, couplingness and heterogeneity in the sMRI, DTI and PET data of the brain have an essential impact on the efficiency of brain simulation, as shown by scaling experiments indicating that the DTB simulation of the human brain is a communication-intensive and memory-access-intensive computing system rather than a computation-intensive one. We utilize a number of optimization techniques to balance and integrate the computation loads and communication traffic from the heterogeneous biological structure onto general GPU-based HPC and achieve leading simulation performance for spiking neuronal networks at whole-human-brain scale. On the other hand, the biological structure, equipped with a mesoscopic data assimilation, enables the DTB to investigate brain cognitive function by a reverse-engineering method, which is demonstrated by a digital experiment of visual evaluation on the DTB. Furthermore, we believe that the developing DTB will be a promising and powerful platform for a large number of research directions including brain-inspired intelligence, brain disease medicine and brain-machine interface.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 12 pages, 11 figures

arXiv:2306.11768 [pdf, other]

Geometric Deep Learning for Structure-Based Drug Design: A Survey

Authors: Zaixi Zhang, Jiaxian Yan, Yining Huang, Qi Liu, Enhong Chen, Mengdi Wang, Marinka Zitnik

Abstract: Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates. Traditional approaches, rooted in physicochemical modeling and domain expertise, are often resource-intensive. Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, alongside breakthroughs in accurate protein structure predictions from tools like AlphaFold, have significantly propelled the field forward. This paper systematically reviews the state-of-the-art in geometric deep learning for SBDD. We begin by outlining foundational tasks in SBDD, discussing prevalent 3D protein representations, and highlighting representative predictive and generative models. Next, we provide an in-depth review of key tasks, including binding site prediction, binding pose generation, de novo molecule generation, linker design, protein pocket generation, and binding affinity prediction. For each task, we present formal problem definitions, key methods, datasets, evaluation metrics, and performance benchmarks. Lastly, we explore current challenges and future opportunities in SBDD. Challenges include oversimplified problem formulations, limited out-of-distribution generalization, biosecurity concerns related to the misuse of structural data, insufficient evaluation metrics and large-scale benchmarks, and the need for experimental validation and enhanced model interpretability. Opportunities lie in leveraging multimodal datasets, integrating domain knowledge, developing comprehensive benchmarks, establishing criteria aligned with clinical outcomes, and designing foundation models to expand the scope of design tasks. We also curate https://github.com/zaixizhang/Awesome-SBDD, reflecting ongoing contributions and new datasets in SBDD.

Submitted 15 November, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 28 pages, under review

arXiv:2212.06394 [pdf]

Tangent functional connectomes uncover more unique phenotypic traits

Authors: Kausar Abbas, Mintao Liu, Michael Wang, Duy Duong-Tran, Uttara Tipnis, Enrico Amico, Alan D. Kaplan, Mario Dzemidzic, David Kareken, Beau M. Ances, Jaroslaw Harezlak, Joaquín Goñi

Abstract: Functional connectomes (FCs) contain pairwise estimations of functional couplings based on the activity of pairs of brain regions. FCs are commonly represented as correlation matrices that are symmetric positive definite (SPD) lying on or inside the SPD manifold. Since the geometry on the SPD manifold is non-Euclidean, the inter-related entries of FCs undermine the use of Euclidean-based distances. By projecting FCs into a tangent space, we can obtain tangent functional connectomes (tangent-FCs). Tangent-FCs have shown a higher predictive power of behavior and cognition, but no studies have evaluated the effect of such projections with respect to fingerprinting. We hypothesize that tangent-FCs have a higher fingerprint than regular FCs. Fingerprinting was measured by identification rates (ID rates) on test-retest FCs as well as on monozygotic and dizygotic twins. Our results showed that identification rates are systematically higher when using tangent-FCs. Specifically, we found: (i) Riemann and log-Euclidean matrix references systematically led to higher ID rates. (ii) In tangent-FCs, main-diagonal regularization prior to tangent space projection was critical for ID rate when using Euclidean distance, whereas it barely affected ID rates when using correlation distance. (iii) ID rates were dependent on condition and fMRI scan length. (iv) Parcellation granularity was key for ID rates in FCs, as well as in tangent-FCs with fixed regularization, whereas optimal regularization of tangent-FCs mostly removed this effect. (v) Correlation distance in tangent-FCs outperformed any other configuration of distance on FCs or on tangent-FCs across the fingerprint gradient (here sampled by assessing test-retest, monozygotic and dizygotic twins). (vi) ID rates tended to be higher in task scans compared to resting-state scans when accounting for fMRI scan length.

Submitted 9 June, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: 31 pages, 10 figures, 2 tables
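
The tangent-space projection described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names `_sym_fun` and `tangent_project`, the parameter `tau`, and the choice to add the same main-diagonal regularization to both the FC and the reference matrix are all hypothetical. The formula used is the common matrix-logarithm projection log(R^(-1/2) C R^(-1/2)) at a reference matrix R.

```python
import numpy as np


def _sym_fun(mat, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    mat = (mat + mat.T) / 2.0  # remove rounding asymmetry before eigh
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * fun(vals)) @ vecs.T


def tangent_project(fc, reference, tau=0.0):
    """Project an FC matrix into the tangent space at `reference`.

    `tau` is a hypothetical main-diagonal regularization added to both
    matrices so that every eigenvalue is strictly positive.
    """
    eye = np.eye(fc.shape[0])
    fc_reg = fc + tau * eye
    ref_reg = reference + tau * eye
    # Whitening by the inverse square root of the reference matrix.
    whitener = _sym_fun(ref_reg, lambda v: 1.0 / np.sqrt(v))
    return _sym_fun(whitener @ fc_reg @ whitener, np.log)


# Toy data: 200 time points for 5 regions (synthetic, for illustration only).
rng = np.random.default_rng(0)
ts = rng.standard_normal((200, 5))
fc = np.corrcoef(ts, rowvar=False)
tangent = tangent_project(fc, np.eye(5), tau=0.01)
```

Projecting a matrix at itself as the reference gives the zero matrix, which is a quick sanity check on any implementation of this kind.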

arXiv:2211.05658 [pdf, other]

doi 10.1016/j.neuroimage.2023.120348

Multi-objective optimization via evolutionary algorithm (MOVEA) for high-definition transcranial electrical stimulation of the human brain

Authors: Mo Wang, Kexin Lou, Zeming Liu, Pengfei Wei, Quanying Liu

Abstract: Designing a transcranial electrical stimulation (TES) strategy requires considering multiple objectives, such as intensity in the target area, focality, stimulation depth, and avoidance zone, which are often mutually exclusive. A computational framework for optimizing different strategies and comparing trade-offs between these objectives is currently lacking. In this paper, we propose a general framework called multi-objective optimization via evolutionary algorithms (MOVEA) to address the non-convex optimization problem in designing TES strategies without predefined direction. MOVEA enables simultaneous optimization of multiple targets through Pareto optimization, generating a Pareto front after a single run without manual weight adjustment and allowing easy expansion to more targets. This Pareto front consists of optimal solutions that meet various requirements while respecting trade-off relationships between conflicting objectives such as intensity and focality. MOVEA is versatile and suitable for both transcranial alternating current stimulation (tACS) and transcranial temporal interference stimulation (tTIS) based on high definition (HD) and two-pair systems. We performed a comprehensive comparison between tACS and tTIS in terms of intensity, focality, and steerability for targets at different depths. MOVEA facilitates the optimization of TES based on specific objectives and constraints, advancing tTIS and tACS-based neuromodulation in understanding the causal relationship between brain regions and cognitive functions and in treating diseases. The code for MOVEA is available at https://github.com/ncclabsustech/MOVEA.

Submitted 3 April, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

Journal ref: NeuroImage, Volume 280, 2023
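
The notion of a Pareto front used in the MOVEA abstract above can be illustrated with a small non-dominated filter. This is only a sketch of the selection criterion, not the evolutionary algorithm in the MOVEA repository; the function name `pareto_front` and the two-objective toy scores (read here as intensity and focality, both maximized) are assumptions made for the example.

```python
import numpy as np


def pareto_front(scores):
    """Return indices of non-dominated rows; every column is maximized."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, row in enumerate(scores):
        # A row is dominated if another row is at least as good in every
        # objective and strictly better in at least one.
        at_least_as_good = np.all(scores >= row, axis=1)
        strictly_better = np.any(scores > row, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return keep


# Hypothetical (intensity, focality) scores for four candidate strategies.
candidates = [[0.9, 0.2], [0.5, 0.5], [0.4, 0.4], [0.2, 0.9]]
front = pareto_front(candidates)
print(front)  # [0, 1, 3]
```

The third candidate is dropped because the second is better in both objectives; the remaining three each represent a different trade-off, so none can be discarded without fixing weights in advance.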

arXiv:2210.05713 [pdf, other]

Explainable fMRI-based Brain Decoding via Spatial Temporal-pyramid Graph Convolutional Network

Authors: Ziyuan Ye, Youzhi Qu, Zhichao Liang, Mo Wang, Quanying Liu

Abstract: Brain decoding, aiming to identify the brain states using neural activity, is important for cognitive neuroscience and neural engineering. However, existing machine learning methods for fMRI-based brain decoding either suffer from low classification performance or poor explainability. Here, we address this issue by proposing a biologically inspired architecture, Spatial Temporal-pyramid Graph Convolutional Network (STpGCN), to capture the spatial-temporal graph representation of functional brain activities. By designing multi-scale spatial-temporal pathways and bottom-up pathways that mimic the information process and temporal integration in the brain, STpGCN is capable of explicitly utilizing the multi-scale temporal dependency of brain activities via graph, thereby achieving high brain decoding performance. Additionally, we propose a sensitivity analysis method called BrainNetX to better explain the decoding results by automatically annotating task-related brain regions from the brain-network standpoint. We conduct extensive experiments on fMRI data under 23 cognitive tasks from Human Connectome Project (HCP) S1200. The results show that STpGCN significantly improves brain decoding performance compared to competing baseline models; BrainNetX successfully annotates task-relevant brain regions. Post hoc analysis based on these regions further validates that the hierarchical structure in STpGCN significantly contributes to the explainability, robustness and generalization of the model. Our methods not only provide insights into information representation in the brain under multiple cognitive tasks but also indicate a bright future for fMRI-based brain decoding.

Submitted 8 October, 2022; originally announced October 2022.

arXiv:2208.04314 [pdf]

TripHLApan: predicting HLA molecules binding peptides based on triple coding matrix and transfer learning

Authors: Meng Wang, Chuqi Lei, Jianxin Wang, Yaohang Li, Min Li

Abstract: Human leukocyte antigen (HLA) is an important molecule family in the field of human immunity, which recognizes foreign threats and triggers immune responses by presenting peptides to T cells. In recent years, the synthesis of tumor vaccines to induce specific immune responses has become the forefront of cancer treatment. Computationally modeling the binding patterns between peptide and HLA can greatly accelerate the development of tumor vaccines. However, the performance of most prediction methods is very limited, and they cannot fully take advantage of the analysis of existing biological knowledge as the basis of modeling. In this paper, we propose TripHLApan, a novel pan-specific prediction model, for HLA molecular peptide binding prediction. TripHLApan exhibits powerful prediction ability by integrating triple coding matrix, BiGRU + Attention models, and transfer learning strategy. The comprehensive evaluations demonstrate the effectiveness of TripHLApan in predicting HLA-I and HLA-II peptide binding in different test environments. The predictive power of HLA-I is further demonstrated in the latest data set. In addition, we show that TripHLApan has strong binding reconstitution ability in the samples of a melanoma patient. In conclusion, TripHLApan is a powerful tool for predicting the binding of HLA-I and HLA-II molecular peptides for the synthesis of tumor vaccines.

Submitted 6 August, 2022; originally announced August 2022.

Comments: 25 pages, 7 figures

arXiv:2206.12240 [pdf, other]

PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

Authors: Sirui Liu, Jun Zhang, Haotian Chu, Min Wang, Boxin Xue, Ningxi Ni, Jialiang Yu, Yuhao Xie, Zhenyu Chen, Mengyun Chen, Yuan Liu, Piya Patra, Fan Xu, Jie Chen, Zidong Wang, Lijiang Yang, Fan Yu, Lei Chen, Yi Qin Gao

Abstract: Proteins are an essential component of human life and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open source datasets are far from sufficient to satisfy the needs of modern protein sequence-structure related research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB). In addition, we provide the benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of this dataset for training by participating in the CAMEO contest, in which our model won first place. We hope our PSP dataset together with the training benchmark can enable a broader community of AI/biology researchers for AI-driven protein related research.

Submitted 24 June, 2022; originally announced June 2022.

arXiv:2109.00123 [pdf, ps, other]

doi 10.1103/PhysRevE.104.034405

Regulatory Feedback Effects on Tissue Growth Dynamics in a Two-Stage Cell Lineage Model

Authors: Mao-Xiang Wang, Arthur Lander, Pik-Yin Lai

Abstract: Identifying the mechanism of intercellular feedback regulation is critical for the basic understanding of tissue growth control in organisms. In this paper, we analyze a tissue growth model consisting of a single lineage of two cell types regulated by negative feedback signalling molecules that undergo spatial diffusion. By deriving the fixed points for the uniform steady states and carrying out linear stability analysis, phase diagrams are obtained analytically for arbitrary parameters of the model. Two different generic growth modes are found: blow-up growth and final-state controlled growth, which are governed by the non-trivial fixed point and the trivial fixed point respectively, and can be sensitively switched by varying the negative feedback regulation on the proliferation of the stem cells. Analytic expressions for the characteristic time scales for these two growth modes are also derived. Remarkably, the trivial and non-trivial uniform steady states can coexist and a sharp transition occurs in the bistable regime as the relevant parameters are varied. Furthermore, the bistable growth properties allow external control to switch between these two growth modes. In addition, the condition for an early accelerated growth followed by a retarded growth can be derived. These analytical results are further verified by numerical simulations and provide insights on the growth behavior of the tissue. Our results are also discussed in the light of possible realistic biological experiments and tissue growth control strategy. Furthermore, by external feedback control of the concentration of regulatory molecules, it is possible to achieve a desired growth mode, as demonstrated with an analysis of boosted growth, catch-up growth and the design for the target of a linear growth dynamic.

Submitted 31 August, 2021; originally announced September 2021.

Comments: to be published in Physical Review E

arXiv:2104.10878 [pdf, other]

doi 10.3934/math.2022376

Comparing regional and provincial-wide COVID-19 models with physical distancing in British Columbia

Authors: Geoffrey McGregor, Jennifer Tippett, Andy T. S. Wan, Mengxiao Wang, Samuel W. K. Wong

Abstract: We study the effects of physical distancing measures on the spread of COVID-19 in regional areas within British Columbia, using the reported cases of the five provincial Health Authorities. Building on the Bayesian epidemiological model of Anderson et al. (2020), we propose a hierarchical regional Bayesian model with time-varying regional parameters from March to December 2020. In the absence of COVID-19 variants and vaccinations during this period, we examine the regionalized basic reproduction number, modelled prevalence, relative reduction in contact due to physical distancing, and proportion of anticipated cases that have been tested and reported. We observe significant differences between the regional and provincial-wide models and demonstrate that the hierarchical regional model can better estimate regional prevalence, especially in rural regions. These results indicate that it can be useful to apply similar regional models to other parts of Canada or other countries.

Submitted 13 November, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: 35 pages, 16 figures

Journal ref: AIMS Mathematics, 2022, 7(4): 6743-6778

arXiv:2104.01474 [pdf, other]

Thalamocortical contribution to solving credit assignment in neural systems

Authors: Mien Brabeeba Wang, Michael M. Halassa

Abstract: Animal brains evolved to optimize behavior in dynamically changing environments, selecting actions that maximize future rewards. A large body of experimental work indicates that such optimization changes the wiring of neural circuits, appropriately mapping environmental input onto behavioral outputs. A major unsolved scientific question is how optimal wiring adjustments, which must target the connections responsible for rewards, can be accomplished when the relation of sensory inputs, actions taken, and environmental context to rewards is ambiguous. The computational problems of properly targeting cues, contexts and actions that lead to reward are known as structural, contextual and temporal credit assignment respectively. In this review, we survey prior approaches to these three types of problems and advance the notion that the brain's specialized neural architectures provide efficient solutions. Within this framework, the thalamus with its cortical and basal ganglia interactions serves as a systems-level solution to credit assignment. Specifically, we propose that thalamocortical interaction is the locus of meta-learning where the thalamus provides cortical control functions that parametrize the cortical activity association space. By selecting among these control functions, the basal ganglia hierarchically guide thalamocortical plasticity across two timescales to enable meta-learning. The faster timescale establishes contextual associations to enable rapid behavioral flexibility while the slower one enables generalization to new contexts. Incorporating different thalamic control functions under this framework clarifies how thalamocortical-basal ganglia interactions may simultaneously solve the three credit assignment problems.

Submitted 3 April, 2021; originally announced April 2021.

arXiv:2103.00399 [pdf]

Hydrophobic interaction determines docking affinity of SARS CoV 2 variants with antibodies

Authors: Jiacheng Li, Chengyu Hou, Menghao Wang, Chencheng Liao, Shuai Guo, Liping Shi, Xiaoliang Ma, Hongchi Zhang, Shenda Jiang, Bing Zheng, Lin Ye, Lin Yang, Xiaodong He

Abstract: Preliminary epidemiologic, phylogenetic and clinical findings suggest that several novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants have increased transmissibility and decreased efficacy of several existing vaccines. Four mutations in the receptor-binding domain (RBD) of the spike protein are reported to contribute to increased transmission. Understanding the physical mechanism responsible for the affinity enhancement between the SARS-CoV-2 variants and ACE2 is the "urgent challenge" for developing blockers, vaccines and therapeutic antibodies against the coronavirus disease 2019 (COVID-19) pandemic. Based on a hydrophobic-interaction-based protein docking mechanism, this study reveals that the mutation N501Y clearly increased the hydrophobic attraction and decreased the hydrophilic repulsion between the RBD and ACE2, which most likely caused the increased transmissibility of the variants. By analyzing the mutation-induced hydrophobic surface changes in the attraction and repulsion at the binding site of the complexes of the SARS-CoV-2 variants and antibodies, we found that all of the mutations N501Y, E484K, K417N and L452R can selectively decrease or increase their binding affinity with some antibodies.

Submitted 28 February, 2021; originally announced March 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2008.11883

arXiv:2102.13276 [pdf, other]

Spectral Top-Down Recovery of Latent Tree Models

Authors: Yariv Aizenbud, Ariel Jaffe, Meng Wang, Amber Hu, Noah Amsel, Boaz Nadler, Joseph T. Chang, Yuval Kluger

Abstract: Modeling the distribution of high dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, recover the structure separately of multiple, possibly random subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.

Submitted 7 December, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

arXiv:2102.05440 [pdf]

Protein corona critically affects the bio-behaviors of SARS-CoV-2

Authors: Yue-wen Yin, Yan-jing Sheng, Min Wang, Song-di Ni, Hong-ming Ding, Yu-qiang Ma

Abstract: The outbreak of the coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has become a worldwide public health crisis. When the SARS-CoV-2 enters the biological fluids in the human body, different types of biomolecules (in particular proteins) may adsorb on its surface and alter its infection ability. Although great efforts have recently been devoted to the interaction of the specific antibodies with the SARS-CoV-2, it still remains largely unknown how the other serum proteins affect the infection of the SARS-CoV-2. In this work, we systematically investigate the interaction of serum proteins with the SARS-CoV-2 RBD by the molecular docking and the all-atom molecular dynamics simulations. It is found that the non-specific immunoglobulin (Ig) indeed cannot effectively bind to the SARS-CoV-2 RBD while the human serum albumin (HSA) may have some potential of blocking its infection (to ACE2). More importantly, we find that the RBD can cause the significant structural change of the Apolipoprotein E (ApoE), by which SARS-CoV-2 may hijack the metabolic pathway of the ApoE to facilitate its cell entry. The present study enhances the understanding of the role of protein corona in the bio-behaviors of SARS-CoV-2, which may aid the more precise and personalized treatment for COVID-19 infection in the clinic.

Submitted 10 February, 2021; originally announced February 2021.

Comments: 18 pages, 7 figures

arXiv:2005.14669 [pdf, other]

Mutations strengthened SARS-CoV-2 infectivity

Authors: Jiahui Chen, Rui Wang, Menglun Wang, Guo-Wei Wei

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infectivity is a major concern in coronavirus disease 2019 (COVID-19) prevention and economic reopening. However, rigorous determination of SARS-COV-2 infectivity is essentially impossible owing to its continuous evolution with over 13752 single nucleotide polymorphisms (SNP) variants in six different subtypes. We develop an advanced machine learning algorithm based on the algebraic topology to quantitatively evaluate the binding affinity changes of SARS-CoV-2 spike glycoprotein (S protein) and host angiotensin-converting enzyme 2 (ACE2) receptor following the mutations. Based on mutation-induced binding affinity changes, we reveal that five out of six SARS-CoV-2 subtypes have become either moderately or slightly more infectious, while one subtype has weakened its infectivity. We find that SARS-CoV-2 is slightly more infectious than SARS-CoV according to computed S protein-ACE2 binding affinity changes. Based on a systematic evaluation of all possible 3686 future mutations on the S protein receptor-binding domain (RBD), we show that most likely future mutations will make SARS-CoV-2 more infectious. Combining sequence alignment, probability analysis, and binding affinity calculation, we predict that a few residues on the receptor-binding motif (RBM), i.e., 452, 489, 500, 501, and 505, have very high chances to mutate into significantly more infectious COVID-19 strains.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 24 pages, 2 tables and 19 figures

arXiv:2005.11935 [pdf]

A Novel Approach of using AR and Smart Surgical Glasses Supported Trauma Care

Authors: Anurag Lal, Ming-Hsien Hu, Pei-Yuan Lee, Min Liang Wang

Abstract: BACKGROUND: Augmented reality (AR) is gaining popularity in varying field such as computer gaming and medical education fields. However, still few of applications in real surgeries. Orthopedic surgical applications are currently limited and underdeveloped. - METHODS: The clinic validation was prepared with the currently available AR equipment and software. A total of 1 Vertebroplasty, 2 ORIF Pelvis fracture, 1 ORIF with PFN for Proximal Femoral Fracture, 1 CRIF for distal radius fracture and 2 ORIF for Tibia Fracture cases were performed with fluoroscopy combined with AR smart surgical glasses system. - RESULTS: A total of 1 Vertebroplasty, 2 ORIF Pelvis fracture, 1 ORIF with PFN for Proximal Femoral Fracture, 1 CRIF for distal radius fracture and 2 ORIF for Tibia Fracture cases are performed to evaluate the benefits of AR surgery. Among the AR surgeries, surgeons wear the smart surgical are lot reduce of eyes of turns to focus on the monitors. This paper shows the potential ability of augmented reality technology for trauma surgery.

Submitted 25 May, 2020; originally announced May 2020.

Comments: 10 pages, 9 Figures, Conference. arXiv admin note: text overlap with arXiv:1801.01560 by other authors

arXiv:2002.07096 [pdf]

Visual Data Analysis and Simulation Prediction for COVID-19

Authors: Baoquan Chen, Mingyi Shi, Xingyu Ni, Liangwang Ruan, Hongda Jiang, Heyuan Yao, Mengdi Wang, Zhenhua Song, Qiang Zhou, Tong Ge

Abstract: The COVID-19 (formerly, 2019-nCoV) epidemic has become a global health emergency, as such, WHO declared PHEIC. China has taken the most hit since the outbreak of the virus, which could be dated as far back as late November by some experts. It was not until January 23rd that the Wuhan government finally recognized the severity of the epidemic and took a drastic measure to curtain the virus spread by closing down all transportation connecting the outside world. In this study, we seek to answer a few questions: How did the virus get spread from the epicenter Wuhan city to the rest of the country? To what extent did the measures, such as, city closure and community quarantine, help controlling the situation? More importantly, can we forecast any significant future development of the event had some of the conditions changed? By collecting and visualizing publicly available data, we first show patterns and characteristics of the epidemic development; we then employ a mathematical model of disease transmission dynamics to evaluate the effectiveness of some epidemic control measures, and more importantly, to offer a few tips on preventive measures.

Submitted 6 March, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

Comments: 19 pages, 21 figures, revised English version and originally Chinese version

arXiv:1911.03839 [pdf, ps, other]

In Vitro Fertilization (IVF) Cumulative Pregnancy Rate Prediction from Basic Patient Characteristics

Authors: Bo Zhang, Yuqi Cui, Meng Wang, Jingjing Li, Lei Jin, Dongrui Wu

Abstract: Tens of millions of women suffer from infertility worldwide each year. In vitro fertilization (IVF) is the best choice for many such patients. However, IVF is expensive, time-consuming, and both physically and emotionally demanding. The first question that a patient usually asks before the IVF is how likely she will conceive, given her basic medical examination information. This paper proposes three approaches to predict the cumulative pregnancy rate after multiple oocyte pickup cycles. Experiments on 11,190 patients showed that first clustering the patients into different groups and then building a support vector machine model for each group can achieve the best overall performance. Our model could be a quick and economic approach for reliably estimating the cumulative pregnancy rate for a patient, given only her basic medical examination information, well before starting the actual IVF procedure. The predictions can help the patient make optimal decisions on whether to use her own oocyte or donor oocyte, how many oocyte pickup cycles she may need, whether to use embryo frozen, etc. They will also reduce the patient's cost and time to pregnancy, and improve her quality of life.

Submitted 9 November, 2019; originally announced November 2019.

Showing 1–50 of 69 results for author: Wang, M