-
Sample-Efficient Reinforcement Learning Controller for Deep Brain Stimulation in Parkinson's Disease
Authors:
Harsh Ravivarapu,
Gaurav Bagwe,
Xiaoyong Yuan,
Chunxiu Yu,
Lan Zhang
Abstract:
Deep brain stimulation (DBS) is an established intervention for Parkinson's disease (PD), but conventional open-loop systems lack adaptability, are energy-inefficient due to continuous stimulation, and provide limited personalization to individual neural dynamics. Adaptive DBS (aDBS) offers a closed-loop alternative, using biomarkers such as beta-band oscillations to dynamically modulate stimulation. While reinforcement learning (RL) holds promise for personalized aDBS control, existing methods suffer from high sample complexity, unstable exploration in binary action spaces, and limited deployability on resource-constrained hardware.
We propose SEA-DBS, a sample-efficient actor-critic framework that addresses the core challenges of RL-based adaptive neurostimulation. SEA-DBS integrates a predictive reward model to reduce reliance on real-time feedback and employs Gumbel Softmax-based exploration for stable, differentiable policy updates in binary action spaces. Together, these components improve sample efficiency, exploration robustness, and compatibility with resource-constrained neuromodulatory hardware. We evaluate SEA-DBS on a biologically realistic simulation of Parkinsonian basal ganglia activity, demonstrating faster convergence, stronger suppression of pathological beta-band power, and resilience to post-training FP16 quantization. Our results show that SEA-DBS offers a practical and effective RL-based aDBS framework for real-time, resource-constrained neuromodulation.
Submitted 8 July, 2025;
originally announced July 2025.
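The SEA-DBS abstract above mentions Gumbel Softmax-based exploration over a binary action space. The following is a minimal sketch of Gumbel-Softmax sampling for two actions in plain Python. The function names, logits, and temperature are illustrative assumptions, not taken from the paper, and a real implementation would use an autodiff framework so that gradients flow through the relaxed sample.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample (a probability vector) over the logits."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical policy logits for the two actions: [no stimulation, stimulate].
probs = gumbel_softmax([0.2, 1.1], temperature=0.5, rng=rng)
action = probs.index(max(probs))  # hard action actually applied
```

Lower temperatures push the relaxed sample toward a one-hot vector, while higher temperatures keep it smoother; the value 0.5 here is arbitrary.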
-
Leveraging Transformer Models to Capture Multi-Scale Dynamics in Biomolecules by nano-GPT
Authors:
Wenqi Zeng,
Lu Zhang,
Yuan Yao
Abstract:
Long-term biomolecular dynamics are critical for understanding key evolutionary transformations in molecular systems. However, capturing these processes requires extended simulation timescales that often exceed the practical limits of conventional models. To address this, shorter simulations, initialized with diverse perturbations, are commonly used to sample phase space and explore a wide range of behaviors. Recent advances have leveraged language models to infer long-term behavior from short trajectories, but methods such as long short-term memory (LSTM) networks are constrained to low-dimensional reaction coordinates, limiting their applicability to complex systems. In this work, we present nano-GPT, a novel deep learning model inspired by the GPT architecture, specifically designed to capture long-term dynamics in molecular systems with fine-grained conformational states and complex transitions. The model employs a two-pass training mechanism that incrementally replaces molecular dynamics (MD) tokens with model-generated predictions, effectively mitigating accumulation errors inherent in the training window. We validate nano-GPT on three distinct systems: a four-state model potential; alanine dipeptide, a well-studied simple molecule; and the Fip35 WW domain, a complex biomolecular system. Our results show that nano-GPT effectively captures long-timescale dynamics by learning high-order dependencies through the attention mechanism, offering a novel perspective for interpreting biomolecular processes.
Submitted 3 July, 2025;
originally announced July 2025.
-
PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding
Authors:
Kangcong Li,
Peng Ye,
Chongjun Tu,
Lin Zhang,
Chunfeng Song,
Jiamin Wu,
Tao Yang,
Qihao Zheng,
Tao Chen
Abstract:
While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves a 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other approaches. It can also be generalized to any model, enhancing long-context performance and interpretability without structural overhauls.
Submitted 18 June, 2025;
originally announced June 2025.
-
Uncertainty-Aware Metabolic Stability Prediction with Dual-View Contrastive Learning
Authors:
Peijin Guo,
Minghui Li,
Hewen Pan,
Bowen Chen,
Yang Wu,
Zikang Guo,
Leo Yu Zhang,
Shengshan Hu,
Shengqing Hu
Abstract:
Accurate prediction of molecular metabolic stability (MS) is critical for drug research and development but remains challenging due to the complex interplay of molecular interactions. Despite recent advances in graph neural networks (GNNs) for MS prediction, current approaches face two critical limitations: (1) incomplete molecular modeling due to atom-centric message-passing mechanisms that disregard bond-level topological features, and (2) prediction frameworks that lack reliable uncertainty quantification. To address these challenges, we propose TrustworthyMS, a novel contrastive learning framework designed for uncertainty-aware metabolic stability prediction. First, a molecular graph topology remapping mechanism synchronizes atom-bond interactions through edge-induced feature propagation, capturing both localized electronic effects and global conformational constraints. Second, contrastive topology-bond alignment enforces consistency between molecular topology views and bond patterns via feature alignment, enhancing representation robustness. Third, uncertainty modeling through Beta-Binomial uncertainty quantification enables simultaneous prediction and confidence calibration under epistemic uncertainty. Through extensive experiments, our results demonstrate that TrustworthyMS outperforms current state-of-the-art methods in terms of predictive performance.
Submitted 1 June, 2025;
originally announced June 2025.
-
Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE
Authors:
Qi Jiang,
Lei Zhang,
Longquan Li,
Lin Wan
Abstract:
Time-series single-cell RNA-sequencing (scRNA-seq) datasets offer unprecedented insights into the dynamics and heterogeneity of cellular systems. These systems exhibit multiscale collective behaviors driven by intricate intracellular gene regulatory networks and intercellular interactions of molecules. However, inferring interacting cell population dynamics from time-series scRNA-seq data remains a significant challenge, as cells are isolated and destroyed during sequencing. To address this, we introduce scIMF, a single-cell deep generative Interacting Mean Field model, designed to learn collective multi-cellular dynamics. Our approach leverages a transformer-enhanced stochastic differential equation network to simultaneously capture cell-intrinsic dynamics and intercellular interactions. Through extensive benchmarking on multiple scRNA-seq datasets, scIMF outperforms existing methods in reconstructing gene expression at held-out time points, demonstrating that modeling cell-cell communication enhances the accuracy of multicellular dynamics characterization. Additionally, our model provides biologically interpretable insights into cell-cell interactions during dynamic processes, offering a powerful tool for understanding complex cellular systems.
Submitted 22 May, 2025;
originally announced May 2025.
-
Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability
Authors:
Douglas Jiang,
Zilin Dai,
Luxuan Zhang,
Qiyi Yu,
Haoqi Sun,
Feng Tian
Abstract:
Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
Submitted 11 May, 2025;
originally announced May 2025.
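The abstract above describes computing a cell embedding as an expression-weighted average over the top N most highly expressed genes. A minimal sketch of that aggregation step follows. The function name, the toy two-dimensional embeddings, the expression values, and the choice to normalize weights so they sum to one are all assumptions for illustration; in practice the per-gene vectors would come from a language-model embedding of each gene's description.

```python
def expression_weighted_embedding(expression, gene_embeddings, top_n):
    """Average the embeddings of the top_n expressed genes, weighted by expression."""
    # Rank genes by expression level, highest first.
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only genes that have an embedding, then take the top_n of those.
    top = [(g, x) for g, x in ranked if g in gene_embeddings][:top_n]
    total = sum(x for _, x in top)
    if not top or total <= 0:
        raise ValueError("no expressed genes with embeddings")
    dim = len(gene_embeddings[top[0][0]])
    cell = [0.0] * dim
    for gene, x in top:
        weight = x / total  # assumed normalization: weights sum to 1
        for i, value in enumerate(gene_embeddings[gene]):
            cell[i] += weight * value
    return cell


# Toy inputs (hypothetical values, 2-dimensional embeddings for readability).
expression = {"SOD1": 6.0, "TARDBP": 3.0, "FUS": 1.0}
gene_embeddings = {"SOD1": [1.0, 0.0], "TARDBP": [0.0, 1.0], "FUS": [1.0, 1.0]}
cell_vector = expression_weighted_embedding(expression, gene_embeddings, top_n=2)
```

With `top_n=2`, only the two most highly expressed genes contribute, with weights 6/9 and 3/9.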
-
GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization
Authors:
Yan Zhuang,
Minheng Chen,
Chao Cao,
Tong Chen,
Jing Zhang,
Xiaowei Yu,
Yanjun Lyu,
Lu Zhang,
Tianming Liu,
Dajiang Zhu
Abstract:
Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.
Submitted 31 March, 2025; v1 submitted 25 March, 2025;
originally announced March 2025.
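The abstract above refers to modularity maximization as the partitioning objective. As a rough illustration of the quantity involved, the sketch below computes the standard Newman modularity Q of a hard partition of an unweighted, undirected graph without self-loops. This is not the paper's differentiable spectral formulation, which the abstract does not spell out; the function name and the toy graph are assumptions.

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph and a hard partition."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Number of edges whose endpoints share a community.
    intra = sum(1 for u, v in edges if community[u] == community[v])
    # Total degree per community.
    degree_sum = {}
    for node, k in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + k
    # Q = sum over communities of (intra-edge fraction - expected fraction).
    return intra / m - sum((d / (2 * m)) ** 2 for d in degree_sum.values())


# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph along the bridge gives Q = 6/7 - 1/2, whereas placing every node in one community gives Q = 0, which is why the split is preferred under this objective.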
-
NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products
Authors:
Yuheng Ding,
Bo Qiang,
Yiran Zhou,
Jie Yu,
Qi Li,
Liangren Zhang,
Yusong Wang,
Zhenmin Liu
Abstract:
Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such a one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well-suited for the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy that is especially tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification with synthesized molecule-focused baselines to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, in a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Finally, we apply our method to virtual screening, illustrating informative natural product representations that can lead to more effective identification of potential drug candidates.
Submitted 18 May, 2025; v1 submitted 22 March, 2025;
originally announced March 2025.
-
Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens
Authors:
Shuqi Lu,
Haowei Lin,
Lin Yao,
Zhifeng Gao,
Xiaohong Ji,
Weinan E,
Linfeng Zhang,
Guolin Ke
Abstract:
Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
Submitted 21 March, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
-
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning
Authors:
Yang Tan,
Chen Liu,
Jingyuan Gao,
Banghao Wu,
Mingchen Li,
Ruilin Wang,
Lingrong Zhang,
Huiqun Yu,
Guisheng Fan,
Liang Hong,
Bingxin Zhou
Abstract:
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with a choice of command-line execution or a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced at https://github.com/tyang816/VenusFactory.
Submitted 19 March, 2025;
originally announced March 2025.
-
Core-Periphery Principle Guided State Space Model for Functional Connectome Classification
Authors:
Minheng Chen,
Xiaowei Yu,
Jing Zhang,
Tong Chen,
Chao Cao,
Yan Zhuang,
Yanjun Lyu,
Lu Zhang,
Tianming Liu,
Dajiang Zhu
Abstract:
Understanding the organization of human brain networks has become a central focus in neuroscience, particularly in the study of functional connectivity, which plays a crucial role in diagnosing neurological disorders. Advances in functional magnetic resonance imaging and machine learning techniques have significantly improved brain network analysis. However, traditional machine learning approaches struggle to capture the complex relationships between brain regions, while deep learning methods, particularly Transformer-based models, face computational challenges due to their quadratic complexity in long-sequence modeling. To address these limitations, we propose a Core-Periphery State-Space Model (CP-SSM), an innovative framework for functional connectome classification. Specifically, we introduce Mamba, a selective state-space model with linear complexity, to effectively capture long-range dependencies in functional brain networks. Furthermore, inspired by the core-periphery (CP) organization, a fundamental characteristic of brain networks that enhances efficient information transmission, we design CP-MoE, a CP-guided Mixture-of-Experts that improves the representation learning of brain connectivity patterns. We evaluate CP-SSM on two benchmark fMRI datasets: ABIDE and ADNI. Experimental results demonstrate that CP-SSM surpasses Transformer-based models in classification performance while significantly reducing computational complexity. These findings highlight the effectiveness and efficiency of CP-SSM in modeling brain functional connectivity, offering a promising direction for neuroimaging-based neurological disease diagnosis.
Submitted 18 March, 2025;
originally announced March 2025.
-
Advanced Deep Learning Methods for Protein Structure Prediction and Design
Authors:
Yichao Zhang,
Ningyuan Deng,
Xinyuan Song,
Ziqian Bi,
Tianyang Wang,
Zheyu Yao,
Keyu Chen,
Ming Li,
Qian Niu,
Junyu Liu,
Benji Peng,
Sen Zhang,
Ming Liu,
Li Zhang,
Xuanhe Pan,
Jinlang Wang,
Pohsun Feng,
Yizhu Wen,
Lawrence KQ Yan,
Hongming Tseng,
Yan Zhong,
Yunze Wang,
Ziyuan Qin,
Bowen Jing,
Junjie Yang, et al. (3 additional authors not shown)
Abstract:
After the developers of AlphaFold were awarded the Nobel Prize, protein prediction with deep learning once again became a hot topic. This volume comprehensively explores advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion-based frameworks and novel pairwise attention modules. The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture, thereby illustrating the current state of the art in computational protein modelling. Subsequent chapters focus on practical applications, presenting case studies that range from individual protein predictions to complex biomolecular interactions. Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored. The later sections review the industry landscape of protein design, highlighting the transformative role of artificial intelligence in biotechnology and discussing emerging market trends and future challenges. Supplementary appendices provide essential resources such as databases and open-source tools, making this volume a valuable reference for researchers and students.
Submitted 29 March, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling
Authors:
Shuqi Lu,
Xiaohong Ji,
Bohang Zhang,
Lin Yao,
Siyuan Liu,
Zhifeng Gao,
Linfeng Zhang,
Guolin Ke
Abstract:
Molecular pretrained representations (MPR) have emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that modeling only these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
Submitted 18 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification
Authors:
Jing Zhang,
Xiaowei Yu,
Tong Chen,
Chao Cao,
Mingheng Chen,
Yan Zhuang,
Yanjun Lyu,
Lu Zhang,
Li Su,
Tianming Liu,
Dajiang Zhu
Abstract:
Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling a system-level artificial neural network, called BrainNet-MoE, for brain modeling and diagnosis. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-networks under different conditions. A disease gate mechanism guides the specialization of expert groups, while a transformer layer enables communication between all sub-networks, generating a comprehensive whole-brain representation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.
Submitted 5 March, 2025;
originally announced March 2025.
-
VenusMutHub: A systematic evaluation of protein mutation effect predictors on small-scale experimental data
Authors:
Liang Zhang,
Hua Pang,
Chenghao Zhang,
Song Li,
Yang Tan,
Fan Jiang,
Mingchen Li,
Yuanxi Yu,
Ziyi Zhou,
Banghao Wu,
Bingxin Zhou,
Hao Liu,
Pan Tan,
Liang Hong
Abstract:
In protein engineering, while computational models are increasingly used to predict mutation effects, their evaluations primarily rely on high-throughput deep mutational scanning (DMS) experiments that use surrogate readouts, which may not adequately capture the complex biochemical properties of interest. Many proteins and their functions cannot be assessed through high-throughput methods due to technical limitations or the nature of the desired properties, and this is particularly true in real industrial applications. The desired test datasets are therefore small-size (~10-100) experimental datasets for each protein, covering as many proteins and as many properties as possible; such datasets, however, are lacking. Here, we present VenusMutHub, a comprehensive benchmark study using 905 small-scale experimental datasets curated from published literature and public databases, spanning 527 proteins across diverse functional properties including stability, activity, binding affinity, and selectivity. These datasets feature direct biochemical measurements rather than surrogate readouts, providing a more rigorous assessment of model performance in predicting mutations that affect specific molecular functions. We evaluate 23 computational models across various methodological paradigms, such as sequence-based, structure-informed and evolutionary approaches. This benchmark provides practical guidance for selecting appropriate prediction methods in protein engineering applications where accurate prediction of specific functional properties is crucial.
Submitted 10 March, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
Association of normalization, non-differentially expressed genes and data source with machine learning performance in intra-dataset or cross-dataset modelling of transcriptomic and clinical data
Authors:
Fei Deng,
Lanjing Zhang
Abstract:
Cross-dataset testing is critical for examining the performance of machine learning (ML) models. However, most studies on modelling transcriptomic and clinical data conducted only intra-dataset testing. It is also unclear whether normalization and non-differentially expressed genes (NDEG) can improve cross-dataset modeling performance of ML. We thus aim to understand whether normalization, NDEG and data source are associated with performance of ML in cross-dataset testing. The transcriptomic and clinical data shared by the lung adenocarcinoma cases in TCGA and ONCOSG were used. The best cross-dataset ML performance was reached using transcriptomic data alone and was statistically better than that achieved using transcriptomic and clinical data. The best balanced accuracy, area under curve and accuracy were significantly better in ML algorithms trained on TCGA and tested on ONCOSG than in those trained on ONCOSG and tested on TCGA (p<0.05 for all). Normalization and NDEG greatly improved intra-dataset ML performances in both datasets, but not in cross-dataset testing. Strikingly, modelling transcriptomic data of ONCOSG alone outperformed modelling transcriptomic and clinical data whereas including clinical data in TCGA did not significantly impact ML performance, suggesting limited clinical data value or an overwhelming influence of transcriptomic data in TCGA. Performance gains in intra-dataset testing were more pronounced for ML models trained on ONCOSG than TCGA. Among the six ML models compared, Support vector machine was the most frequent best-performer in both intra-dataset and cross-dataset testing. Therefore, our data show data source, normalization and NDEG are associated with intra-dataset and cross-dataset ML performance in modelling transcriptomic and clinical data.
Submitted 26 February, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Authors:
Yingying Sun,
Jun A,
Zhiwei Liu,
Rui Sun,
Liujia Qian,
Samuel H. Payne,
Wout Bittremieux,
Markus Ralser,
Chen Li,
Yi Chen,
Zhen Dong,
Yasset Perez-Riverol,
Asif Khan,
Chris Sander,
Ruedi Aebersold,
Juan Antonio Vizcaíno,
Jonathan R Krieger,
Jianhua Yao,
Han Wen,
Linfeng Zhang,
Yunping Zhu,
Yue Xuan,
Benjamin Boyang Sun,
Liang Qiao,
Henning Hermjakob
, et al. (37 additional authors not shown)
Abstract:
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Submitted 21 February, 2025;
originally announced February 2025.
-
Conditional Success of Adaptive Therapy: The Role of Treatment-Holiday Thresholds Revealed by Mathematical Modeling
Authors:
Lanfei Sun,
Haifeng Zhang,
Kai Kang,
Xiaoxin Wang,
Leyi Zhang,
Yanan Cai,
Changjing Zhuge,
Lei Zhang
Abstract:
Adaptive therapy (AT) improves cancer treatment by controlling the competition between sensitive and resistant cells through treatment holidays. This study highlights the critical role of treatment-holiday thresholds in AT for tumors composed of drug-sensitive and resistant cells. Using a Lotka-Volterra model, the research compares AT with maximum tolerated dose therapy and intermittent therapy, showing that AT's success largely depends on the threshold at which treatment is paused and resumed, as well as on the competition between sensitive and resistant cells. Three scenarios of comparison between AT and other therapies are identified: uniform-decline, conditional-improve, and uniform-improve, illustrating that optimizing the treatment-holiday threshold is crucial for AT effectiveness. Tumor composition, including initial tumor burden and the proportion of resistant cells, influences outcomes. Adjusting threshold values enables AT to suppress resistant subclones, preserving sensitive cells, ultimately improving progression-free survival. These findings emphasize the importance of personalized treatment strategies potentially enhancing long-term therapeutic outcomes.
Submitted 15 February, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Classification of Mild Cognitive Impairment Based on Dynamic Functional Connectivity Using Spatio-Temporal Transformer
Authors:
Jing Zhang,
Yanjun Lyu,
Xiaowei Yu,
Lu Zhang,
Chao Cao,
Tong Chen,
Minheng Chen,
Yan Zhuang,
Tianming Liu,
Dajiang Zhu
Abstract:
Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer's disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide valuable information when identifying brain conditions. In this paper, we propose a novel framework that jointly learns the embedding of both spatial and temporal information within dFC based on the transformer architecture. Specifically, we first construct dFC networks from rs-fMRI data through a sliding window strategy. Then, we simultaneously employ a temporal block and a spatial block to capture higher-order representations of dynamic spatio-temporal dependencies, via mapping them into an efficient fused feature representation. To further enhance the robustness of these feature representations by reducing the dependency on labeled data, we also introduce a contrastive learning strategy to manipulate different brain states. Experimental results on 345 subjects with 570 scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the superiority of our proposed method for MCI (Mild Cognitive Impairment, the prodromal stage of AD) prediction, highlighting its potential for early identification of AD.
Submitted 27 January, 2025;
originally announced January 2025.
-
Normalization and selecting non-differentially expressed genes improve machine learning modelling of cross-platform transcriptomic data
Authors:
Fei Deng,
Catherine H Feng,
Nan Gao,
Lanjing Zhang
Abstract:
Normalization is a critical step in quantitative analyses of biological processes. Recent works show that cross-platform integration and normalization enable machine learning (ML) training on RNA microarray and RNA-seq data, but no independent datasets were used in their studies. Therefore, it is unclear how to improve ML modelling performance on independent RNA array and RNA-seq based datasets. Inspired by the house-keeping genes that are commonly used in experimental biology, this study tests the hypothesis that non-differentially expressed genes (NDEG) may improve normalization of transcriptomic data and subsequently cross-platform modelling performance of ML models. Microarray and RNA-seq datasets of the TCGA breast cancer were used as independent training and test datasets, respectively, to classify the molecular subtypes of breast cancer. NDEG (p>0.85) and differentially expressed genes (DEG, p<0.05) were selected based on the p values of ANOVA analysis and used for subsequent data normalization and classification, respectively. Models trained based on data from one platform were used for testing on the other platform. Our data show that NDEG and DEG gene selection could effectively improve the model classification performance. Normalization methods based on parametric statistical analysis were inferior to those based on nonparametric statistics. In this study, the LOG_QN and LOG_QNZ normalization methods combined with the neural network classification model seem to achieve better performance. Therefore, NDEG-based normalization appears useful for cross-platform testing on completely independent datasets. However, more studies are required to examine whether NDEG-based normalization can improve ML classification performance in other datasets and other omic data types.
Submitted 24 January, 2025;
originally announced January 2025.
-
A new perspective on brain stimulation interventions: Optimal stochastic tracking control of brain network dynamics
Authors:
Kangli Dong,
Siya Chen,
Ying Dan,
Lu Zhang,
Xinyi Li,
Wei Liang,
Yue Zhao,
Yu Sun
Abstract:
Network control theory (NCT) has recently been utilized in neuroscience to facilitate our understanding of brain stimulation effects. A particularly useful branch of NCT is optimal control, which focuses on applying theoretical and computational principles of control theory to design optimal strategies to achieve specific goals in neural processes. However, most existing research focuses on optimally controlling brain network dynamics from the original state to a target state at a specific time point. In this paper, we present the first investigation of introducing optimal stochastic tracking control strategy to synchronize the dynamics of the brain network to a target dynamics rather than to a target state at a specific time point. We utilized fMRI data from healthy groups, and cases of stroke and post-stroke aphasia. For all participants, we utilized a gradient descent optimization method to estimate the parameters for the brain network dynamic system. We then utilized optimal stochastic tracking control techniques to drive original unhealthy dynamics by controlling a certain number of nodes to synchronize with target healthy dynamics. Results show that the energy associated with optimal stochastic tracking control is negatively correlated with the intrinsic average controllability of the brain network system, while the energy of the optimal state approaching control is significantly related to the target state value. For a 100-dimensional brain network system, controlling the five nodes with the lowest tracking energy can achieve relatively acceptable dynamics control effects. Our results suggest that stochastic tracking control is more aligned with the objective of brain stimulation interventions, and is closely related to the intrinsic characteristics of the brain network system, potentially representing a new direction for future brain network optimal control research.
Submitted 16 January, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
Large Language Models for Bioinformatics
Authors:
Wei Ruan,
Yanjun Lyu,
Jing Zhang,
Jiazhang Cai,
Peng Shu,
Yang Ge,
Yao Lu,
Shang Gao,
Yue Wang,
Peilong Wang,
Lin Zhao,
Tao Wang,
Yufang Liu,
Luyang Fang,
Ziyu Liu,
Zhengliang Liu,
Yiwei Li,
Zihao Wu,
Junhao Chen,
Hanqi Jiang,
Yi Pan,
Zhenyuan Yang,
Jingyuan Chen,
Shizhe Liang,
Wei Zhang
, et al. (30 additional authors not shown)
Abstract:
With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
Submitted 9 January, 2025;
originally announced January 2025.
-
Pan-infection Foundation Framework Enables Multiple Pathogen Prediction
Authors:
Lingrui Zhang,
Haonan Wu,
Nana Jin,
Chenqing Zheng,
Jize Xie,
Qitai Cai,
Jun Wang,
Qin Cao,
Xubin Zheng,
Jiankun Wang,
Lixin Cheng
Abstract:
Host-response-based diagnostics can improve the accuracy of diagnosing bacterial and viral infections, thereby reducing inappropriate antibiotic prescriptions. However, the existing cohorts, with limited sample sizes and coarse infection types, are unable to support the exploration of an accurate and generalizable diagnostic model. Here, we curate the largest infection host-response transcriptome data, including 11,247 samples across 89 blood transcriptome datasets from 13 countries and 21 platforms. We build a diagnostic model for pathogen prediction starting from a pan-infection model as foundation (AUC = 0.97) based on the pan-infection dataset. Then, we utilize knowledge distillation to efficiently transfer the insights from this "teacher" model to four lightweight pathogen "student" models, i.e., staphylococcal infection (AUC = 0.99), streptococcal infection (AUC = 0.94), HIV infection (AUC = 0.93), and RSV infection (AUC = 0.94), as well as a sepsis "student" model (AUC = 0.99). The proposed knowledge distillation framework not only facilitates the diagnosis of pathogens using pan-infection data, but also enables an across-disease study from pan-infection to sepsis. Moreover, the framework enables high-degree lightweight design of diagnostic models, which is expected to be adaptively deployed in clinical settings.
Submitted 31 December, 2024;
originally announced January 2025.
-
Efficacy of Temporal Interference Electrical Stimulation for Spinal Cord Injury Rehabilitation: A Case Series
Authors:
Ruidong Cheng,
Yuling Shao,
Xi Li,
Li Zhang,
Zehao Sheng,
Chenyang Li,
Xu Xie,
Huilin Mou,
Weidong Chen,
Shaomin Zhang,
Yuchen Xu,
Minmin Wang
Abstract:
Spinal cord injury (SCI) is a debilitating condition that often results in significant motor and sensory deficits, impacting the quality of life. Current rehabilitation methods, including physical therapy and electrical stimulation, offer variable outcomes and often require invasive procedures. Temporal interference (TI) stimulation has emerged as a novel, non-invasive neuromodulation technique capable of targeting deep neural structures with precision, providing a promising alternative for SCI rehabilitation. This study explores the efficacy of TI stimulation as a non-invasive approach for improving motor and sensory function in patients with incomplete SCI. Three male patients with incomplete cervical SCI (AIS D) participated in a two-week intervention consisting of 14 sessions of TI stimulation targeting their injury sites. TI stimulation was delivered using frequencies of 1000 Hz and 1040 Hz, with assessments conducted pre- and post-intervention, including motor and sensory evaluations, functional scales, and imaging studies. All participants demonstrated significant improvements in neurological function, motor strength, sensory perception, and functional independence. Neurological levels of injury shifted upward in all cases, with one patient improving from C5 to C7. Graded Redefined Assessment of Strength, Sensibility and Prehension (GRASSP) results show additional strength, prehension and sensory outcomes obtained for the arm and hand functions of participants. Motor scores (UEMS and LEMS) increased, sensory scores for light touch and pin prick improved, and functional assessments, such as the Berg Balance Scale (BBS) and Barthel Index (BI), showed marked gains. Pain scores also decreased in two participants, highlighting additional therapeutic benefits.
Submitted 16 December, 2024;
originally announced December 2024.
-
Language model driven: a PROTAC generation pipeline with dual constraints of structure and property
Authors:
Jinsong Shao,
Qineng Gong,
Zeyu Yin,
Yu Chen,
Yajie Hao,
Lei Zhang,
Linlin Jiang,
Min Yao,
Jinlong Li,
Fubo Wang,
Li Wang
Abstract:
The imperfect modeling of ternary complexes has limited the application of computer-aided drug discovery tools in PROTAC research and development. In this study, an AI-assisted PROTAC molecule design pipeline named LM-PROTAC (language model driven Proteolysis Targeting Chimera) was developed by embedding a transformer-based generative model with dual constraints on structure and properties, referred to as the DCT. This study utilized the fragmentation representation of molecules and developed a language model driven pipeline. First, a language model driven affinity model for protein compounds was used to screen molecular fragments with high affinity for the target protein. Second, structural and physicochemical properties of these fragments were constrained during the generation process to meet specific scenario requirements. Finally, a two-round screening of the preliminary generated molecules was performed using a multidimensional property prediction model to generate a batch of PROTAC molecules capable of degrading disease-relevant target proteins for validation in in vitro experiments, thus achieving a complete solution for AI-assisted PROTAC drug generation. Taking the key tumor target Wnt3a as an example, the LM-PROTAC pipeline successfully generated PROTAC molecules capable of inhibiting Wnt3a. The results show that DCT can efficiently generate PROTACs that target and hydrolyse Wnt3a.
Submitted 12 December, 2024;
originally announced December 2024.
-
Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data
Authors:
Junhao Liu,
Siwei Xu,
Lei Zhang,
Jing Zhang
Abstract:
Over the past decade, the revolution in single-cell sequencing has enabled the simultaneous molecular profiling of various modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues and uncover underlying disease mechanisms. Among all the analytical steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is usually labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract essential biological knowledge, such as marker genes, potentially promoting more efficient and automated cell type annotations. To thoroughly evaluate the capability of modern instruction-tuned LLMs in automating the cell type identification process, we introduce SOAR, a comprehensive benchmarking study of LLMs for cell type annotation tasks in single-cell genomics. Specifically, we assess the performance of 8 instruction-tuned LLMs across 11 datasets, spanning multiple cell types and species. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, while extending their application to multiomics data through cross-modality translation. Additionally, we evaluate the effectiveness of chain-of-thought (CoT) prompting techniques in generating detailed biological insights during the annotation process. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning, advancing the automation of cell type annotation in genomics research.
Submitted 3 December, 2024;
originally announced December 2024.
-
Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics
Authors:
Minghao Han,
Dingkang Yang,
Jiabei Cheng,
Xukun Zhang,
Linhao Qu,
Zizhi Chen,
Lihua Zhang
Abstract:
Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at https://github.com/Hanminghao/UMPIRE.
Submitted 30 November, 2024;
originally announced December 2024.
-
Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
Authors:
Zijun Min,
Bingshuai Liu,
Liang Zhang,
Jia Song,
Jinsong Su,
Song He,
Xiaochen Bo
Abstract:
The field of bioinformatics has seen significant progress, making the cross-modal text-molecule retrieval task increasingly vital. This task focuses on accurately retrieving molecule structures based on textual descriptions, by effectively aligning textual descriptions and molecules to assist researchers in identifying suitable molecular candidates. However, many existing approaches overlook the details inherent in molecule sub-structures. In this work, we introduce the Optimal TRansport-based Multi-grained Alignments model (ORMA), a novel approach that facilitates multi-grained alignments between textual descriptions and molecules. Our model features a text encoder and a molecule encoder. The text encoder processes textual descriptions to generate both token-level and sentence-level representations, while molecules are modeled as hierarchical heterogeneous graphs, encompassing atom, motif, and molecule nodes to extract representations at these three levels. A key innovation in ORMA is the application of Optimal Transport (OT) to align tokens with motifs, creating multi-token representations that integrate multiple token alignments with their corresponding motifs. Additionally, we employ contrastive learning to refine cross-modal alignments at three distinct scales: token-atom, multitoken-motif, and sentence-molecule, ensuring that the similarities between correctly matched text-molecule pairs are maximized while those of unmatched pairs are minimized. To our knowledge, this is the first attempt to explore alignments at both the motif and multi-token levels. Experimental results on the ChEBI-20 and PCdes datasets demonstrate that ORMA significantly outperforms existing state-of-the-art (SOTA) models.
Submitted 4 November, 2024;
originally announced November 2024.
-
MassSpecGym: A benchmark for the discovery and identification of molecules
Authors:
Roman Bushuiev,
Anton Bushuiev,
Niek F. de Jonge,
Adamo Young,
Fleming Kretschmer,
Raman Samusevich,
Janne Heirman,
Fei Wang,
Luke Zhang,
Kai Dührkop,
Marcus Ludwig,
Nils A. Haupt,
Apurva Kalia,
Corinna Brungs,
Robin Schmid,
Russell Greiner,
Bo Wang,
David S. Wishart,
Li-Ping Liu,
Juho Rousu,
Wout Bittremieux,
Hannes Rost,
Tytus D. Mak,
Soha Hassoun,
Florian Huber
, et al. (5 additional authors not shown)
Abstract:
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
Submitted 14 February, 2025; v1 submitted 30 October, 2024;
originally announced October 2024.
-
ZIF-90 treats fungal keratitis by promoting macrophage apoptosis and inhibiting inflammatory response
Authors:
Xueyun Fu,
Jing Lin,
Qian Wang,
Lina Zhang,
Ziyi Wang,
Menghui Chi,
Daohao Li,
Guiqiu Zhao,
Cui Li
Abstract:
Fungal keratitis is a severe vision-threatening corneal infection with a prognosis influenced by fungal virulence and the host's immune defense mechanisms. The immune system, through its regulation of the inflammatory response, ensures cells and tissues can effectively activate defense mechanisms in response to infection and injury. However, there is still a lack of effective drugs that attenuate fungal virulence while relieving the inflammatory response caused by fungal keratitis. Therefore, finding effective treatments to solve these problems is particularly important.
We synthesized ZIF-90 by water-based synthesis and characterized it by SEM, XRD, etc. In vitro experiments included CCK-8 and ELISA. These evaluations verified the disruptive effects of ZIF-90 on Aspergillus fumigatus spore adhesion, morphology, and cell membrane, and the effect of ZIF-90 on apoptosis. In addition, to investigate whether the metal-ligand zinc and the organic ligand imidazole act as essential factors in ZIF-90, we investigated the in vitro antimicrobial and anti-inflammatory effects of ZIF-8, ZIF-67, and MOF-74 (Zn) by MIC and ELISA experiments.
ZIF-90 has therapeutic effects on fungal keratitis and can break the protective organelles of Aspergillus fumigatus, such as the cell wall. In addition, ZIF-90 can avoid excessive inflammatory response by promoting apoptosis of inflammatory cells. The results demonstrated that both zinc ions and imidazole possessed antimicrobial and anti-inflammatory effects. In addition, ZIF-90 exhibited better biocompatibility compared to ZIF-8, ZIF-67, and MOF-74 (Zn).
ZIF-90 has anti-inflammatory and antifungal effects and preferable biocompatibility, and has great potential for the treatment of fungal keratitis.
Submitted 27 October, 2024;
originally announced October 2024.
-
Zeolitic Imidazolate Framework-8 offers an anti-inflammatory and antifungal method in the treatment of Aspergillus fungus keratitis in vitro and in vivo
Authors:
Xueyun Fu,
Xue Tian,
Jing Lin,
Qian Wang,
Lingwen Gu,
Ziyi Wang,
Menghui Chi,
Bing Yu,
Zhuhui Feng,
Wenyao Liu,
Lina Zhang,
Cui Li,
Guiqiu Zhao
Abstract:
Background: Fungal keratitis is a serious blinding eye disease. Traditional drugs used to treat fungal keratitis commonly have the disadvantages of low bioavailability, poor dispersion, and limited permeability. Purpose: To develop a new method for the treatment of fungal keratitis with improved bioavailability, dispersion, and permeability. Methods: Zeolitic Imidazolate Framework-8 (ZIF-8) was formed by zinc ions and 2-methylimidazole linked by coordination bonds and characterized by Scanning electron microscopy (SEM), X-ray diffraction (XRD), and Zeta potential. The safety of ZIF-8 on HCECs and RAW 264.7 cells was detected by Cell Counting Kit-8 (CCK-8). The anti-inflammatory effects of ZIF-8 on RAW 264.7 cells were evaluated by Quantitative Real-Time PCR Experiments (qPCR) and Enzyme-linked immunosorbent assay (ELISA). Clinical score, Colony-Forming Units (CFU). In vivo, treatment with ZIF-8 reduced corneal fungal load and mitigated neutrophil infiltration in fungal keratitis, which effectively reduced the severity of keratitis in mice and alleviated the infiltration of inflammatory factors in the mouse cornea. In addition, ZIF-8 reduces the inflammatory response by downregulating the expression of pro-inflammatory cytokines TNF-α, IL-6, and IL-1β after Aspergillus fumigatus infection in vivo and in vitro. Conclusion: ZIF-8 has a significant anti-inflammatory and antifungal effect, which provides a new solution for the treatment of fungal keratitis.
Submitted 29 October, 2024; v1 submitted 29 September, 2024;
originally announced October 2024.
-
Bounding the number of reticulation events for displaying multiple trees in a phylogenetic network
Authors:
Yufeng Wu,
Louxin Zhang
Abstract:
Reconstructing a parsimonious phylogenetic network that displays multiple phylogenetic trees is an important problem in the theory of phylogenetics, where the complexity of the inferred networks is measured by reticulation numbers. The reticulation number for a set of trees is defined as the minimum number of reticulations in a phylogenetic network that displays those trees. A mathematical problem is bounding the reticulation number for multiple trees over a fixed number of taxa. While this problem has been extensively studied for two trees, much less is known about the upper bounds on the reticulation numbers for three or more arbitrary trees. In this paper, we present a few non-trivial upper bounds on reticulation numbers for three or more trees.
Submitted 26 August, 2024;
originally announced August 2024.
-
Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures
Authors:
Ce Liu,
Jun Wang,
Zhiqiang Cai,
Yingxu Wang,
Huizhen Kuang,
Kaihui Cheng,
Liwei Zhang,
Qingkun Su,
Yining Tang,
Fenglei Cao,
Limei Han,
Siyu Zhu,
Yuan Qi
Abstract:
Despite significant progress in static protein structure collection and prediction, the dynamic behavior of proteins, one of their most vital characteristics, has been largely overlooked in prior research. This oversight can be attributed to the limited availability, diversity, and heterogeneity of dynamic protein datasets. To address this gap, we propose to enhance existing prestigious static 3D protein structural databases, such as the Protein Data Bank (PDB), by integrating dynamic data and additional physical properties. Specifically, we introduce a large-scale dataset, Dynamic PDB, encompassing approximately 12.6K proteins, each subjected to all-atom molecular dynamics (MD) simulations lasting 1 microsecond to capture conformational changes. Furthermore, we provide a comprehensive suite of physical properties, including atomic velocities and forces, potential and kinetic energies of proteins, and the temperature of the simulation environment, recorded at 1 picosecond intervals throughout the simulations. For benchmarking purposes, we evaluate state-of-the-art methods on the proposed dataset for the task of trajectory prediction. To demonstrate the value of integrating richer physical properties in the study of protein dynamics and related model design, we base our approach on the SE(3) diffusion model and incorporate these physical properties into the trajectory prediction process. Preliminary results indicate that this straightforward extension of the SE(3) model yields improved accuracy, as measured by MAE and RMSD, when the proposed physical properties are taken into consideration. https://fudan-generative-vision.github.io/dynamicPDB/ .
Submitted 18 September, 2024; v1 submitted 22 August, 2024;
originally announced August 2024.
-
LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library
Authors:
Tianhao Yu,
Cai Yao,
Zhuorui Sun,
Feng Shi,
Lin Zhang,
Kangjie Lyu,
Xuan Bai,
Andong Liu,
Xicheng Zhang,
Jiali Zou,
Wenshou Wang,
Chris Lai,
Kai Wang
Abstract:
In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing for organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.
Submitted 3 May, 2025; v1 submitted 12 August, 2024;
originally announced August 2024.
-
Quantum Long Short-Term Memory for Drug Discovery
Authors:
Liang Zhang,
Yin Xu,
Mohan Wu,
Liang Wang,
Hua Xu
Abstract:
Quantum computing combined with machine learning (ML) is an extremely promising research area, with numerous studies demonstrating that quantum machine learning (QML) is expected to solve scientific problems more effectively than classical ML. In this work, we successfully apply QML to drug discovery, showing that QML can significantly improve model performance and achieve faster convergence compared to classical ML. Moreover, we demonstrate that the model accuracy of the QML improves as the number of qubits increases. We also introduce noise to the QML model and find that it has little effect on our experimental conclusions, illustrating the high robustness of the QML model. This work highlights the potential of quantum computing to yield significant benefits for scientific advancement as qubit quantity increases and qubit quality improves in the future.
Submitted 29 July, 2024;
originally announced July 2024.
-
Prompting Whole Slide Image Based Genetic Biomarker Prediction
Authors:
Ling Zhang,
Boxiang Yun,
Xingran Xie,
Qingli Li,
Xinxing Li,
Yan Wang
Abstract:
Prediction of genetic biomarkers, e.g., microsatellite instability and BRAF in colorectal cancer, is crucial for clinical decision making. In this paper, we propose a whole slide image (WSI) based genetic biomarker prediction method via prompting techniques. Our work aims at addressing the following challenges: (1) extracting foreground instances related to genetic biomarkers from gigapixel WSIs, and (2) the interaction among the fine-grained pathological components in WSIs. Specifically, we leverage large language models to generate medical prompts that serve as prior knowledge in extracting instances associated with genetic biomarkers. We adopt a coarse-to-fine approach to mine biomarker information within the tumor microenvironment. This involves extracting instances related to genetic biomarkers using coarse medical prior knowledge, grouping pathology instances into fine-grained pathological components and mining their interactions. Experimental results on two colorectal cancer datasets show the superiority of our method, achieving 91.49% in AUC for MSI classification. The analysis further shows the clinical interpretability of our method. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/PromptBio.
Submitted 26 June, 2024;
originally announced July 2024.
-
How big does a population need to be before demographers can ignore individual-level randomness in demographic events?
Authors:
John Bryant,
Tahu Kukutai,
Junni L. Zhang
Abstract:
When studying a national-level population, demographers can safely ignore the effect of individual-level randomness on age-sex structure. When studying a single community, or group of communities, however, the potential importance of individual-level randomness is less clear. We seek to measure the effect of individual-level randomness in births and deaths on standard summary indicators of age-sex structure, for populations of different sizes, focusing on demographic conditions typical of historical populations. We conduct a microsimulation experiment where we simulate events and age-sex structure under a range of settings for demographic rates and population size. The experiment results suggest that individual-level randomness strongly affects age-sex structure for populations of about 100, but has a much smaller effect on populations of 1,000, and a negligible effect on populations of 10,000. Our conclusion is that analyses of age-sex structure in historical populations with sizes on the order of 100 must account for individual-level randomness in demographic events. Analyses of populations with sizes on the order of 1,000 may need to make some allowance for individual-level variation, but other issues, such as measurement error, probably deserve more attention. Analyses of populations of 10,000 can safely ignore individual-level variation.
Submitted 20 June, 2024;
originally announced June 2024.
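The scaling described in the abstract above (strong effects at about 100 people, negligible at 10,000) can be illustrated with a toy simulation. This is only a minimal sketch and not the authors' microsimulation: it tracks deaths over a single year, ignores births and age-sex structure, and the names `simulate_death_rates` and `relative_spread` as well as the death probability of 0.03 are assumptions made for illustration.

```python
import random
import statistics


def simulate_death_rates(pop_size, death_prob=0.03, n_reps=100, seed=0):
    """Return the observed one-year death rate in each replicate population.

    Every individual dies independently with probability death_prob
    (a hypothetical value), so the only source of variation between
    replicates is individual-level randomness.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_reps):
        deaths = sum(1 for _ in range(pop_size) if rng.random() < death_prob)
        rates.append(deaths / pop_size)
    return rates


def relative_spread(rates):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.stdev(rates) / statistics.mean(rates)


if __name__ == "__main__":
    for size in (100, 1_000, 10_000):
        cv = relative_spread(simulate_death_rates(size))
        print(f"population {size:>6}: relative spread of death rate = {cv:.3f}")
```

Because each death is an independent Bernoulli draw, the relative spread of the observed rate shrinks roughly with the square root of the population size, which is consistent with the qualitative pattern the abstract reports.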
-
PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
Authors:
Sajib Acharjee Dip,
Uddip Acharjee Shuvo,
Tran Chau,
Haoqiu Song,
Petra Choi,
Xuan Wang,
Liqing Zhang
Abstract:
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
Submitted 18 June, 2024;
originally announced June 2024.
-
Efficient and Precise Force Field Optimization for Biomolecules Using DPA-2
Authors:
Junhan Chang,
Duo Zhang,
Yuqing Deng,
Hongrui Lin,
Zhirong Liu,
Linfeng Zhang,
Hang Zheng,
Xinyan Wang
Abstract:
Molecular simulations are essential tools in computational chemistry, enabling the prediction and understanding of molecular interactions and thermodynamic properties of biomolecules. However, traditional force fields face significant challenges in accurately representing novel molecules and complex chemical environments due to the labor-intensive process of manually setting optimization parameters and the high computational cost of quantum mechanical calculations. To overcome these difficulties, we fine-tuned a high-accuracy DPA-2 pre-trained model and applied it to optimize force field parameters on-the-fly, significantly reducing computational costs. Our method combines this fine-tuned DPA-2 model with a node-embedding-based similarity metric, allowing seamless augmentation to new chemical species without manual intervention. We applied this process to the TYK2 inhibitor and PTP1B systems and demonstrated its effectiveness through the improvement of free energy perturbation calculation results. This advancement contributes valuable insights and tools for the computational chemistry community.
Submitted 14 June, 2024;
originally announced June 2024.
-
Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification
Authors:
Yunling Ma,
Chaojun Zhang,
Xiaochuan Wang,
Qianqian Wang,
Liang Cao,
Limei Zhang,
Mingxia Liu
Abstract:
Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity would result in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but usually ignore the overfitting problem of the model on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reduce the dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.
Submitted 6 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Learning Human-Aligned Representations with Contrastive Learning and Generative Similarity
Authors:
Raja Marjieh,
Sreejan Kumar,
Declan Campbell,
Liyi Zhang,
Gianluca Bencomo,
Jake Snell,
Thomas L. Griffiths
Abstract:
Humans rely on effective representations to learn from few examples and abstract useful information from sensory data. Inducing such representations in machine learning models has been shown to improve their performance on various benchmarks such as few-shot learning and robustness. However, finding effective training procedures to achieve that goal can be challenging as psychologically rich training data such as human similarity judgments are expensive to scale, and Bayesian models of human inductive biases are often intractable for complex, realistic domains. Here, we address this challenge by leveraging a Bayesian notion of generative similarity whereby two data points are considered similar if they are likely to have been sampled from the same distribution. This measure can be applied to complex generative processes, including probabilistic programs. We incorporate generative similarity into a contrastive learning objective to enable learning of embeddings that express human cognitive representations. We demonstrate the utility of our approach by showing that it can be used to capture human-like representations of shape regularity, abstract Euclidean geometric concepts, and semantic hierarchies for natural images.
Submitted 31 January, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Uni-Mol Docking V2: Towards Realistic and Accurate Binding Pose Prediction
Authors:
Eric Alcaide,
Zhifeng Gao,
Guolin Ke,
Yaqi Li,
Linfeng Zhang,
Hang Zheng,
Gengmo Zhou
Abstract:
In recent years, machine learning (ML) methods have emerged as promising alternatives for molecular docking, offering the potential for high accuracy without incurring prohibitive computational costs. However, recent studies have indicated that these ML models may overfit to quantitative metrics while neglecting the physical constraints inherent in the problem. In this work, we present Uni-Mol Docking V2, which demonstrates a remarkable improvement in performance, accurately predicting the binding poses of 77+% of ligands in the PoseBusters benchmark with an RMSD value of less than 2.0 Å, and 75+% passing all quality checks. This represents a significant increase from the 62% achieved by the previous Uni-Mol Docking model. Notably, our Uni-Mol Docking approach generates chemically accurate predictions, circumventing issues such as chirality inversions and steric clashes that have plagued previous ML models. Furthermore, we observe enhanced performance in terms of high-quality predictions (RMSD values of less than 1.0 Å and 1.5 Å) and physical soundness when Uni-Mol Docking is combined with more physics-based methods like Uni-Dock. Our results represent a significant advancement in the application of artificial intelligence for scientific research, adopting a holistic approach to ligand docking that is well-suited for industrial applications in virtual screening and drug design. The code, data and service for Uni-Mol Docking are publicly available for use and further development in https://github.com/dptech-corp/Uni-Mol.
Submitted 20 May, 2024;
originally announced May 2024.
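The Uni-Mol Docking V2 abstract above reports the fraction of ligands whose predicted pose has an RMSD below 2.0 Å. A minimal sketch of that kind of metric follows, assuming predicted and reference poses list the same atoms in the same order. It is a plain coordinate RMSD, not the benchmark's own evaluation code, which may additionally handle molecular symmetry; the function names `rmsd` and `success_rate` are hypothetical.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two poses.

    Each pose is a list of (x, y, z) coordinates in angstroms, with atoms
    assumed to be in the same order in both lists.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(pred, ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(pose_pairs, threshold=2.0):
    """Fraction of (predicted, reference) pairs with RMSD below the threshold."""
    hits = sum(1 for pred, ref in pose_pairs if rmsd(pred, ref) < threshold)
    return hits / len(pose_pairs)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    close_pose = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]  # shifted 1 angstrom along x
    far_pose = [(3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]    # shifted 3 angstroms along x
    pairs = [(close_pose, reference), (far_pose, reference)]
    print(f"success rate at 2.0 angstroms: {success_rate(pairs):.2f}")
```

The same `success_rate` helper can be called with `threshold=1.0` or `threshold=1.5` to mirror the stricter cut-offs the abstract mentions.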
-
A Vector Representation for Phylogenetic Trees
Authors:
Cedric Chauve,
Caroline Colijn,
Louxin Zhang
Abstract:
Good representations for phylogenetic trees and networks are important for optimizing storage efficiency and implementation of scalable methods for the inference and analysis of evolutionary trees for genes, genomes and species. We introduce a new representation for rooted phylogenetic trees that encodes a binary tree on n taxa as a vector of length 2n in which each taxon appears exactly twice. Using this new tree representation, we introduce a novel tree rearrangement operator, called a HOP, that results in a tree space of diameter n and a quadratic neighbourhood size. We also introduce a novel metric, the HOP distance, which is the minimum number of HOPs to transform a tree into another tree. The HOP distance can be computed in near-linear time, a rare instance of a tree rearrangement distance that is tractable. Our experiments show that the HOP distance is better correlated to the Subtree-Prune-and-Regraft distance than the widely used Robinson-Foulds distance. We also describe how the novel tree representation we introduce can be further generalized to tree-child networks.
Submitted 11 May, 2024;
originally announced May 2024.
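The abstract above compares the HOP distance against the Robinson-Foulds distance. For readers unfamiliar with the latter, here is a minimal sketch of the rooted (cluster-based) Robinson-Foulds distance: the number of clusters, i.e. leaf sets below internal nodes, found in one tree but not the other. It does not implement the HOP distance or the paper's vector encoding, neither of which is specified in the abstract; the nested-tuple tree format and the function names are assumptions for illustration.

```python
def clusters(tree):
    """Return the set of clusters of a rooted tree.

    A tree is a nested tuple whose leaves are taxon names (strings).
    A cluster is the frozenset of taxa below one internal node.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    leaves_below(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Size of the symmetric difference between the two cluster sets."""
    return len(clusters(tree_a) ^ clusters(tree_b))


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    caterpillar = ((("A", "B"), "C"), "D")
    print(robinson_foulds(balanced, caterpillar))
```

Both example trees share the clusters {A, B} and {A, B, C, D}, so only the clusters {C, D} and {A, B, C} contribute to the distance.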
-
Latent Chemical Space Searching for Plug-in Multi-objective Molecule Generation
Authors:
Ningfeng Liu,
Jie Yu,
Siyu Xiu,
Xinfang Zhao,
Siyu Lin,
Bo Qiang,
Ruqiu Zheng,
Hongwei Jin,
Liangren Zhang,
Zhenming Liu
Abstract:
Molecular generation, an essential method for identifying new drug structures, has been supported by advancements in machine learning and computational technology. However, challenges remain in multi-objective generation, model adaptability, and practical application in drug discovery. In this study, we developed a versatile 'plug-in' molecular generation model that incorporates multiple objectives related to target affinity, drug-likeness, and synthesizability, facilitating its application in various drug development contexts. We improved Particle Swarm Optimization (PSO) in the context of drug discovery, and identified PSO-ENP as the optimal variant for multi-objective molecular generation and optimization through comparative experiments. The model also incorporates a novel target-ligand affinity predictor, enhancing the model's utility by supporting three-dimensional information and improving synthetic feasibility. Case studies focused on generating and optimizing drug-like big marine natural products were performed, underscoring PSO-ENP's effectiveness and demonstrating its considerable potential for practical drug discovery applications.
Submitted 9 April, 2024;
originally announced April 2024.
-
Enhancing the efficiency of protein language models with minimal wet-lab data through few-shot learning
Authors:
Ziyi Zhou,
Liang Zhang,
Yuanxi Yu,
Mingchen Li,
Liang Hong,
Pan Tan
Abstract:
Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Recently, due to their capacity and representation ability, pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without experimental data. However, their predictions are limited in accuracy as well as interpretability. Furthermore, such deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity. By combining the techniques of meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. The experiments across 87 deep mutational scanning datasets underscore its superiority over both unsupervised and supervised approaches, revealing its potential in facilitating AI-guided protein design.
△ Less
Submitted 2 February, 2024;
originally announced February 2024.
-
Computing the Bounds of the Number of Reticulations in a Tree-Child Network That Displays a Set of Trees
Authors:
Yufeng Wu,
Louxin Zhang
Abstract:
A phylogenetic network is an evolutionary model that uses a rooted directed acyclic graph (instead of a tree) to model an evolutionary history of species in which reticulate events (e.g., hybrid speciation or horizontal gene transfer) occurred. A tree-child network is a kind of phylogenetic network with structural constraints. Existing approaches for tree-child network reconstruction can be slow for large data. In this paper, we present several computational approaches for bounding from below the number of reticulations in a tree-child network that displays a given set of rooted binary phylogenetic trees. In addition, we also present some theoretical results on bounding from above the number of reticulations. Through simulation, we demonstrate that the new lower bounds on the reticulation number for tree-child networks can practically be computed for large tree data. The bounds can provide estimates of reticulation for relatively large data.
Submitted 29 November, 2023;
originally announced November 2023.
-
Neural Atoms: Propagating Long-range Interaction in Molecular Graphs through Efficient Communication Channel
Authors:
Xuan Li,
Zhanke Zhou,
Jiangchao Yao,
Yu Rong,
Lu Zhang,
Bo Han
Abstract:
Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs mainly excel in leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method to abstract the collective information of atomic groups into a few $\textit{Neural Atoms}$ by implicitly projecting the atoms of a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection to the traditional LRI calculation method, Ewald Summation. The Neural Atom can enhance GNNs to capture LRI by approximating the potential LRI of the molecule. We conduct extensive experiments on four long-range graph benchmarks, covering graph-level and link-level tasks on molecular graphs. We achieve up to a 27.32% and 38.27% improvement in the 2D and 3D scenarios, respectively. Empirically, our method can be equipped with an arbitrary GNN to help capture LRI. Code and datasets are publicly available at https://github.com/tmlr-group/NeuralAtom.
Submitted 31 March, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Semantic reconstruction of continuous language from MEG signals
Authors:
Bo Wang,
Xiran Xu,
Longxiang Zhang,
Boda Xiao,
Xihong Wu,
Jing Chen
Abstract:
Decoding language from neural signals holds considerable theoretical and practical importance. Previous research has indicated the feasibility of decoding text or speech from invasive neural signals. However, when using non-invasive neural signals, significant challenges are encountered due to their low quality. In this study, we proposed a data-driven approach for decoding the semantics of language from magnetoencephalography (MEG) signals recorded while subjects were listening to continuous speech. First, a multi-subject decoding model was trained using contrastive learning to reconstruct continuous word embeddings from MEG data. Subsequently, a beam search algorithm was adopted to generate text sequences based on the reconstructed word embeddings. Given a candidate sentence in the beam, a language model was used to predict the subsequent words. The word embeddings of the subsequent words were correlated with the reconstructed word embedding. These correlations were then used as a measure of the probability for the next word. The results showed that the proposed continuous word embedding model can effectively leverage both subject-specific and subject-shared information. Additionally, the decoded text exhibited significant similarity to the target text, with an average BERTScore of 0.816, a score comparable to that in the previous fMRI study.
Submitted 14 September, 2023;
originally announced September 2023.
-
DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs
Authors:
Youwei Liang,
Ruiyi Zhang,
Li Zhang,
Pengtao Xie
Abstract:
A ChatGPT-like system for drug compounds could be a game-changer in pharmaceutical research, accelerating drug discovery, enhancing our understanding of structure-activity relationships, guiding lead optimization, aiding drug repurposing, reducing the failure rate, and streamlining clinical trials. In this work, we make an initial attempt towards enabling ChatGPT-like capabilities on drug molecule graphs, by developing a prototype system, DrugChat. DrugChat works in a similar way to ChatGPT. Users upload a compound molecule graph and ask various questions about this compound. DrugChat will answer these questions in a multi-turn, interactive manner. The DrugChat system consists of a graph neural network (GNN), a large language model (LLM), and an adaptor. The GNN takes a compound molecule graph as input and learns a representation for this graph. The adaptor transforms the graph representation produced by the GNN into another representation that is acceptable to the LLM. The LLM takes the compound representation transformed by the adaptor and users' questions about this compound as inputs and generates answers. All these components are trained end-to-end. To train DrugChat, we collected instruction tuning datasets which contain 10,834 drug compounds and 143,517 question-answer pairs. The code and data are available at https://github.com/UCSD-AI4H/drugchat.
Submitted 18 May, 2023;
originally announced September 2023.
-
Deep neural network improves the estimation of polygenic risk scores for breast cancer
Authors:
Adrien Badré,
Li Zhang,
Wellington Muchero,
Justin C. Reynolds,
Chongle Pan
Abstract:
Polygenic risk scores (PRS) estimate the genetic risk of an individual for a complex disease based on many genetic variants across the whole genome. In this study, we compared a series of computational models for estimation of breast cancer PRS. A deep neural network (DNN) was found to outperform alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA and LDpred. In the test cohort with 50% prevalence, the Area Under the receiver operating characteristic Curve (AUC) was 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LDpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by DNN in the case population followed a bi-modal distribution composed of two normal distributions with distinctly different means. This suggests that DNN was able to separate the case population into a high-genetic-risk case sub-population with an average PRS significantly higher than the control population and a normal-genetic-risk case sub-population with an average PRS similar to the control population. This allowed DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p-values by association studies, but were important for DNN prediction. These variants may be associated with the phenotype through non-linear relationships.
Submitted 24 July, 2023;
originally announced July 2023.