-
Structure-Aware Contrastive Learning with Fine-Grained Binding Representations for Drug Discovery
Authors:
Jing Lan,
Hexiao Ding,
Hongzhao Chen,
Yufeng Jiang,
Nga-Chun Ng,
Gwing Kei Yip,
Gerald W. Y. Cheng,
Yunlin Mao,
Jing Cai,
Liang-ting Lin,
Jung Sun Yoo
Abstract:
Accurate identification of drug-target interactions (DTI) remains a central challenge in computational pharmacology, where sequence-based methods offer scalability. This work introduces a sequence-based drug-target interaction framework that integrates structural priors into protein representations while maintaining high-throughput screening capability. Evaluated across multiple benchmarks, the model achieves state-of-the-art performance on Human and BioSNAP datasets and remains competitive on BindingDB. In virtual screening tasks, it surpasses prior methods on LIT-PCBA, yielding substantial gains in AUROC and BEDROC. Ablation studies confirm the critical role of learned aggregation, bilinear attention, and contrastive alignment in enhancing predictive robustness. Embedding visualizations reveal improved spatial correspondence with known binding pockets and highlight interpretable attention patterns over ligand-residue contacts. These results validate the framework's utility for scalable and structure-aware DTI prediction.
Submitted 18 September, 2025;
originally announced September 2025.
-
Quantum-Boosted High-Fidelity Deep Learning
Authors:
Feng-ao Wang,
Shaobo Chen,
Yao Xuan,
Junwei Liu,
Qi Gao,
Hongdong Zhu,
Junjie Hou,
Lixin Yuan,
Jinyu Cheng,
Chenxin Yi,
Hai Wei,
Yin Ma,
Tao Xu,
Kai Wen,
Yixue Li
Abstract:
A fundamental limitation of probabilistic deep learning is its predominant reliance on Gaussian priors. This simplistic assumption prevents models from accurately capturing the complex, non-Gaussian landscapes of natural data, particularly in demanding domains like complex biological data, severely hindering the fidelity of the model for scientific discovery. The physically grounded Boltzmann distribution offers a more expressive alternative, but it is computationally intractable on classical computers. To date, quantum approaches have been hampered by the insufficient qubit scale and operational stability required for the iterative demands of deep learning. Here, we bridge this gap by introducing the Quantum Boltzmann Machine-Variational Autoencoder (QBM-VAE), a large-scale and long-time stable hybrid quantum-classical architecture. Our framework leverages a quantum processor for efficient sampling from the Boltzmann distribution, enabling its use as a powerful prior within a deep generative model. Applied to million-scale single-cell datasets from multiple sources, the QBM-VAE generates a latent space that better preserves complex biological structures, consistently outperforming conventional Gaussian-based deep learning models like VAE and SCVI in essential tasks such as omics data integration, cell-type classification, and trajectory inference. It also provides a typical example of introducing a physics prior into deep learning to drive the model to acquire scientific discovery capabilities that break through data limitations. This work provides a demonstration of a practical quantum advantage in deep learning on a large-scale scientific problem and offers a transferable blueprint for developing hybrid quantum AI models.
Submitted 14 August, 2025;
originally announced August 2025.
-
ARTreeFormer: A Faster Attention-based Autoregressive Model for Phylogenetic Inference
Authors:
Tianyu Xie,
Yicong Mao,
Cheng Zhang
Abstract:
Probabilistic modeling over the combinatorially large space of tree topologies remains a central challenge in phylogenetic inference. Previous approaches often necessitate pre-sampled tree topologies, limiting their modeling capability to a subset of the entire tree space. A recent advancement is ARTree, a deep autoregressive model that offers unrestricted distributions for tree topologies. However, its reliance on repetitive tree traversals and inefficient local message passing for computing topological node representations may hamper the scalability to large datasets. This paper proposes ARTreeFormer, a novel approach that harnesses fixed-point iteration and attention mechanisms to accelerate ARTree. By introducing a fixed-point iteration algorithm for computing the topological node embeddings, ARTreeFormer allows fast vectorized computation, especially on CUDA devices. This, together with an attention-based global message passing scheme, significantly improves the computation speed of ARTree while maintaining great approximation performance. We demonstrate the effectiveness and efficiency of our method on a benchmark of challenging real data phylogenetic inference problems.
Submitted 24 July, 2025;
originally announced July 2025.
-
Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression
Authors:
Jindong Wang,
Yutong Mao,
Xiao Liu,
Wenrui Hao
Abstract:
Alzheimer's disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progression, integrating longitudinal multimodal imaging, biomarker, and clinical data. Unlike conventional models with prespecified dynamics, our approach directly learns patient-specific disease operators governing the spatiotemporal evolution of amyloid, tau, and neurodegeneration biomarkers. Using Laplacian eigenfunction bases, we construct geometry-aware neural operators capable of capturing complex brain dynamics. Embedded within a digital twin paradigm, the framework enables individualized predictions, simulation of therapeutic interventions, and in silico clinical trials. Applied to AD clinical data, our method achieves high prediction accuracy exceeding 90% across multiple biomarkers, substantially outperforming existing approaches. This work offers a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases.
Submitted 21 July, 2025;
originally announced July 2025.
-
Towards Unified Neural Decoding with Brain Functional Network Modeling
Authors:
Di Wu,
Linghao Bu,
Yifei Jia,
Lu Cao,
Siyuan Li,
Siyu Chen,
Yueqian Zhou,
Sheng Fan,
Wenjie Ren,
Dengchang Wu,
Kang Wang,
Yue Zhang,
Yuehui Ma,
Jie Yang,
Mohamad Sawan
Abstract:
Recent achievements in implantable brain-computer interfaces (iBCIs) have demonstrated the potential to decode cognitive and motor behaviors with intracranial brain recordings; however, individual physiological and electrode implantation heterogeneities have constrained current approaches to neural decoding within single individuals, rendering interindividual neural decoding elusive. Here, we present Multi-individual Brain Region-Aggregated Network (MIBRAIN), a neural decoding framework that constructs a whole functional brain network model by integrating intracranial neurophysiological recordings across multiple individuals. MIBRAIN leverages self-supervised learning to derive generalized neural prototypes and supports group-level analysis of brain-region interactions and inter-subject neural synchrony. To validate our framework, we recorded stereoelectroencephalography (sEEG) signals from a cohort of individuals performing Mandarin syllable articulation. Both real-time online and offline decoding experiments demonstrated significant improvements in both audible and silent articulation decoding, enhanced decoding accuracy with increased multi-subject data integration, and effective generalization to unseen subjects. Furthermore, neural predictions for regions without direct electrode coverage were validated against authentic neural data. Overall, this framework paves the way for robust neural decoding across individuals and offers insights for practical clinical applications.
Submitted 30 May, 2025;
originally announced June 2025.
-
AI Agent Behavioral Science
Authors:
Lin Chen,
Yunke Zhang,
Jie Feng,
Haoye Chai,
Honglin Zhang,
Bingbing Fan,
Yibo Ma,
Shiyuan Zhang,
Nian Li,
Tianhui Liu,
Nicholas Sukiennik,
Keyu Zhao,
Yu Li,
Ziyi Liu,
Fengli Xu,
Yong Li
Abstract:
Recent advances in large language models (LLMs) have enabled the development of AI agents that exhibit increasingly human-like behaviors, including planning, adaptation, and social dynamics across diverse, interactive, and open-ended scenarios. These behaviors are not solely the product of the internal architectures of the underlying models, but emerge from their integration into agentic systems operating within specific contexts, where environmental factors, social cues, and interaction feedback shape behavior over time. This evolution necessitates a new scientific perspective: AI Agent Behavioral Science. Rather than focusing only on internal mechanisms, this perspective emphasizes the systematic observation of behavior, design of interventions to test hypotheses, and theory-guided interpretation of how AI agents act, adapt, and interact over time. We systematize a growing body of research across individual agent, multi-agent, and human-agent interaction settings, and further demonstrate how this perspective informs responsible AI by treating fairness, safety, interpretability, accountability, and privacy as behavioral properties. By unifying recent findings and laying out future directions, we position AI Agent Behavioral Science as a necessary complement to traditional model-centric approaches, providing essential tools for understanding, evaluating, and governing the real-world behavior of increasingly autonomous AI systems.
Submitted 12 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
Single cell resolution 3D imaging and segmentation within intact live tissues
Authors:
G. Paci,
P. Vicente-Munuera,
I. Fernandez-Mosquera,
A. Miranda,
K. Lau,
Q. Zhang,
R. Barrientos,
Y. Mao
Abstract:
Epithelial cells form diverse structures from squamous spherical organoids to densely packed pseudostratified tissues. Quantification of cellular properties in these contexts requires high-resolution deep imaging and computational techniques to obtain faithful three-dimensional (3D) structural features. Here, we describe a detailed step-by-step protocol for sample preparation, imaging and deep-learning-assisted cell segmentation to achieve accurate quantification of fluorescently labelled individual cells in 3D within live tissues. We share the lessons learned through troubleshooting 3D imaging of Drosophila wing discs, including considerations on the choice of microscopy modality and settings (objective, sample mounting) and available segmentation methods. In addition, we include a computational pipeline alongside custom code to assist replication of the protocol. While we focus on the segmentation of cell outlines from membrane labelling, this protocol applies to a wide variety of samples, and we believe it will be valuable for studying other tissues that demand complex analysis in 3D.
Submitted 31 January, 2025;
originally announced January 2025.
-
ProteinWeaver: A Divide-and-Assembly Approach for Protein Backbone Design
Authors:
Yiming Ma,
Fei Ye,
Yi Zhou,
Zaixiang Zheng,
Dongyu Xue,
Quanquan Gu
Abstract:
Nature creates diverse proteins through a 'divide and assembly' strategy. Inspired by this idea, we introduce ProteinWeaver, a two-stage framework for protein backbone design. Our method first generates individual protein domains and then employs an SE(3) diffusion model to flexibly assemble these domains. A key challenge lies in the assembling step, given the complex and rugged nature of the inter-domain interaction landscape. To address this challenge, we employ preference alignment to discern complex relationships between structure and interaction landscapes through comparative analysis of generated samples. Comprehensive experiments demonstrate that ProteinWeaver: (1) generates high-quality, novel protein backbones through versatile domain assembly; (2) outperforms RFdiffusion, the current state-of-the-art in backbone design, by 13% and 39% for long-chain proteins; (3) shows the potential for cooperative function design through illustrative case studies. To sum up, by introducing a 'divide-and-assembly' paradigm, ProteinWeaver advances protein engineering and opens new avenues for functional protein design.
Submitted 27 November, 2024; v1 submitted 8 November, 2024;
originally announced November 2024.
-
The Updated Genome Warehouse: Enhancing Data Value, Security, and Usability to Address Data Expansion
Authors:
Yingke Ma,
Xuetong Zhao,
Yaokai Jia,
Zhenxian Han,
Caixia Yu,
Zhuojing Fan,
Zhang Zhang,
Jingfa Xiao,
Wenming Zhao,
Yiming Bao,
Meili Chen
Abstract:
The Genome Warehouse (GWH), accessible at https://ngdc.cncb.ac.cn/gwh, is an extensively utilized public repository dedicated to the deposition, management and sharing of genome assembly sequences, annotations, and metadata. This paper highlights noteworthy enhancements to the GWH since the 2021 version, emphasizing substantial advancements in web interfaces for data submission, database functionality updates, and resource integration. Key updates include the reannotation of released prokaryotic genomes, mirroring of genome resources from National Center for Biotechnology Information (NCBI) GenBank and RefSeq, integration of Poxviridae sequences, implementation of an online batch submission system, enhancements to the quality control system, advanced search capabilities, and the introduction of a controlled-access mechanism for human genome data. These improvements collectively augment the ease and security of data submission and access as well as genome data value, thereby fostering heightened convenience and utility for researchers in the genomic field.
Submitted 23 November, 2024;
originally announced November 2024.
-
RatGene: Gene deletion-addition algorithms using growth to production ratio for growth-coupled production in constraint-based metabolic networks
Authors:
Yier Ma,
Takeyuki Tamura
Abstract:
In computational metabolic design, it is often necessary to modify the original constraint-based metabolic networks to lead to growth-coupled production, where cell growth forces target metabolite production. However, in genome-scale models, finding strategies to simultaneously delete and add genes to induce growth-coupled production is challenging. This is particularly true when heavy computation is necessary due to numerous gene deletions and additions. In this study, we mathematically defined related problems, proved NP-hardness and/or NP-completeness, and developed an algorithm named RatGene that (1) automatically integrates multiple constraint-based metabolic networks, (2) identifies gene deletion-addition strategies by a growth-to-production ratio-based approach, and (3) eliminates redundant gene additions and deletions. The results of computational experiments demonstrated that the RatGene-based approach can significantly improve the success ratio for identifying the strategies for growth-coupled production. RatGene can facilitate a more rational approach to computational metabolic design for the production of useful substances using microorganisms by concurrently considering both gene deletions and additions.
Submitted 18 November, 2024;
originally announced November 2024.
-
Double-Strand Break Clustering: An Economical and Effective Strategy for DNA Repair
Authors:
Junyi Chen,
Wenzong Ma,
Yuqi Ma,
Gen Yang
Abstract:
In mammalian cells, repair centers for DNA double-strand breaks (DSBs) have been identified. However, previous research has predominantly relied on methods that induce specific DSBs by cutting particular DNA sequences. The clustering of non-specifically induced DSBs, especially those induced by environmental stresses such as irradiation, and its spatiotemporal properties remain unclear. In this study, we used Dragonfly microscopy to induce high-precision damage in cells and discovered that DSBs cluster during the early stages of the DNA damage response (DDR) and repair, but not during the repair plateau phase. Early in DDR, DSBs clustered into existing 53BP1 foci. DSB clustering at different stages has different implications for DNA repair. By controlling the distance between adjacent damage points, we found that the probability of DSB clustering remains constant at distances of 0.8-1.4 μm, while clustering does not occur beyond 1.4 μm. Within the 0.8 μm range, the probability of clustering significantly increases due to the phase separation effect of 53BP1. Using a Monte Carlo approach, we developed a dynamic model of 53BP1 foci formation, fission, and fusion. This model accurately predicts experimental outcomes and further demonstrates the temporal and spatial influences on DSB clustering. These results show that, similarly to specifically induced DSBs, non-specifically induced DSBs can also cluster. The extent of DSB clustering is influenced by both temporal and spatial factors, which provides new insights into the dynamics of DSB clustering and the role of 53BP1 in DNA repair processes. Such findings could enhance our understanding of DNA damage responses and help us improve DNA repair therapies in disease.
Submitted 4 October, 2024;
originally announced October 2024.
-
Objectively Evaluating the Reliability of Cell Type Annotation Using LLM-Based Strategies
Authors:
Wenjin Ye,
Yuanchen Ma,
Junkai Xiang,
Hongjie Liang,
Tao Wang,
Qiuling Xiang,
Andy Peng Xiang,
Wu Song,
Weiqiang Li,
Weijun Huang
Abstract:
Reliability in cell type annotation is challenging in single-cell RNA-sequencing data analysis because both expert-driven and automated methods can be biased or constrained by their training data, especially for novel or rare cell types. Although large language models (LLMs) are useful, our evaluation found that only a few matched expert annotations due to biased data sources and inflexible training inputs. To overcome these limitations, we developed the LICT (Large language model-based Identifier for Cell Types) software package using a multi-model fusion and "talk-to-machine" strategy. Tested across various single-cell RNA sequencing datasets, our approach significantly improved annotation reliability, especially in datasets with low cellular heterogeneity. Notably, we established objective criteria to assess annotation reliability using the "talk-to-machine" approach, which addresses discrepancies between our annotations and expert ones, enabling reliable evaluation even without reference data. This strategy enhances annotation credibility and sets the stage for advancing future LLM-based cell type annotation methods.
Submitted 23 September, 2024;
originally announced September 2024.
-
ProteinBench: A Holistic Evaluation of Protein Foundation Models
Authors:
Fei Ye,
Zaixiang Zheng,
Dongyu Xue,
Yuning Shen,
Lihao Wang,
Yiming Ma,
Yan Wang,
Xinyou Wang,
Xiangxin Zhou,
Quanquan Gu
Abstract:
Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we publicly release the evaluation dataset, code, a leaderboard, and a general modular toolkit for further analysis. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.
Submitted 7 October, 2024; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Enhancing Terrestrial Net Primary Productivity Estimation with EXP-CASA: A Novel Light Use Efficiency Model Approach
Authors:
Guanzhou Chen,
Kaiqi Zhang,
Xiaodong Zhang,
Hong Xie,
Haobo Yang,
Xiaoliang Tan,
Tong Wang,
Yule Ma,
Qing Wang,
Jinzhou Cao,
Weihong Cui
Abstract:
The Light Use Efficiency model, epitomized by the CASA model, is extensively applied in the quantitative estimation of vegetation Net Primary Productivity (NPP). However, the classic CASA model is marked by significant complexity: the estimation of environmental stress parameters, in particular, necessitates multi-source observation data, adding to the complexity and uncertainty of the model's operation. Additionally, the saturation effect of the Normalized Difference Vegetation Index (NDVI), a key variable in the CASA model, weakens the accuracy of CASA's NPP predictions in densely vegetated areas. To address these limitations, this study introduces the Exponential-CASA (EXP-CASA) model. The EXP-CASA model effectively improves the CASA model by using novel functions for estimating the fraction of absorbed photosynthetically active radiation (FPAR) and environmental stress, and by utilizing long-term observational data from FLUXNET and MODIS surface reflectance data. In a comparative analysis of NPP estimation accuracy among four different NPP products, EXP-CASA ($R^2 = 0.68$, $\mathrm{RMSE} = 1.1\ \mathrm{g\,C \cdot m^{-2} \cdot d^{-1}}$) outperforms others, followed by GLASS-NPP, and lastly MODIS-NPP and classic CASA. Additionally, this research assesses the EXP-CASA model's adaptability to various vegetation indices, evaluates the sensitivity and stability of its parameters over time, and compares its accuracy against other leading NPP estimation products. The findings reveal that the EXP-CASA model exhibits strong adaptability to diverse vegetation indices and stability of model parameters over time series. By introducing a novel estimation approach that optimizes model construction, the EXP-CASA model remarkably improves the accuracy of NPP estimations and paves the way for global-scale, consistent, and continuous assessment of vegetation NPP.
Submitted 28 June, 2024;
originally announced June 2024.
-
BrainMAE: A Region-aware Self-supervised Learning Framework for Brain Signals
Authors:
Yifan Yang,
Yutong Mao,
Xufu Liu,
Xiao Liu
Abstract:
The human brain is a complex, dynamic network, which is commonly studied using functional magnetic resonance imaging (fMRI) and modeled as a network of regions of interest (ROIs) for understanding various brain functions. Recent studies utilize deep learning approaches to learn the brain network representation based on the functional connectivity (FC) profile, broadly falling into two main categories. The Fixed-FC approaches, utilizing the FC profile which represents the linear temporal relation within the brain network, are limited by failing to capture informative brain temporal dynamics. On the other hand, the Dynamic-FC approaches, modeling the evolving FC profile over time, often exhibit less satisfactory performance due to challenges in handling the inherently noisy nature of fMRI data.
To address these challenges, we propose Brain Masked Auto-Encoder (BrainMAE) for learning representations directly from fMRI time-series data. Our approach incorporates two essential components: a region-aware graph attention mechanism designed to capture the relationships between different brain ROIs, and a novel self-supervised masked autoencoding framework for effective model pre-training. These components enable the model to capture rich temporal dynamics of brain activity while maintaining resilience to inherent noise in fMRI data. Our experiments demonstrate that BrainMAE consistently outperforms established baseline methods by significant margins in four distinct downstream tasks. Finally, leveraging the model's inherent interpretability, our analysis of model-generated representations reveals findings that resonate with ongoing research in the field of neuroscience.
Submitted 24 June, 2024;
originally announced June 2024.
-
Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?
Authors:
Kangyu Zheng,
Yingzhou Lu,
Zaixi Zhang,
Zhongwei Wan,
Yao Ma,
Marinka Zitnik,
Tianfan Fu
Abstract:
Currently, the field of structure-based drug design (SBDD) is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of sixteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. The empirical results show that 1D/2D methods achieve competitive performance compared with 3D-based methods that use the 3D structure of the target protein explicitly. Also, AutoGrow4, a 2D molecular graph-based genetic algorithm, dominates SBDD in terms of optimization ability. The relevant code is available at https://github.com/zkysfls/2024-sbdd-benchmark.
Submitted 4 June, 2024;
originally announced June 2024.
-
Augmentation-based Unsupervised Cross-Domain Functional MRI Adaptation for Major Depressive Disorder Identification
Authors:
Yunling Ma,
Chaojun Zhang,
Xiaochuan Wang,
Qianqian Wang,
Liang Cao,
Limei Zhang,
Mingxia Liu
Abstract:
Major depressive disorder (MDD) is a common mental disorder that typically affects a person's mood, cognition, behavior, and physical health. Resting-state functional magnetic resonance imaging (rs-fMRI) data are widely used for computer-aided diagnosis of MDD. While multi-site fMRI data can provide more data for training reliable diagnostic models, significant cross-site data heterogeneity can result in poor model generalizability. Many domain adaptation methods are designed to reduce the distributional differences between sites to some extent, but they usually ignore the problem of the model overfitting on the source domain. Intuitively, target data augmentation can alleviate the overfitting problem by forcing the model to learn more generalized features and reducing its dependence on source domain data. In this work, we propose a new augmentation-based unsupervised cross-domain fMRI adaptation (AUFA) framework for automatic diagnosis of MDD. The AUFA consists of 1) a graph representation learning module for extracting rs-fMRI features with spatial attention, 2) a domain adaptation module for feature alignment between source and target data, 3) an augmentation-based self-optimization module for alleviating model overfitting on the source domain, and 4) a classification module. Experimental results on 1,089 subjects suggest that AUFA outperforms several state-of-the-art methods in MDD identification. Our approach not only reduces data heterogeneity between different sites, but also localizes disease-related functional connectivity abnormalities and provides interpretability for the model.
Submitted 6 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Incorporating changeable attitudes toward vaccination into an SIR infectious disease model
Authors:
Yi Jiang,
Kristin M. Kurianski,
Jane HyoJin Lee,
Yanping Ma,
Daniel Cicala,
Glenn Ledder
Abstract:
We develop a mechanistic model that classifies individuals both in terms of epidemiological status (SIR) and vaccination attitude (willing or unwilling), with the goal of discovering how disease spread is influenced by changing opinions about vaccination. Analysis of the model identifies existence and stability criteria for both disease-free and endemic disease equilibria. The analytical results, supported by numerical simulations, show that attitude changes induced by disease prevalence can destabilize endemic disease equilibria, resulting in limit cycles.
Submitted 14 August, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
-
Hill Function-based Model of Transcriptional Response: Impact of Nonspecific Binding and RNAP Interactions
Authors:
Wenjia Shi,
Yao Ma,
Peilin Hu,
Mi Pang,
Xiaona Huang,
Yiting Dang,
Yuxin Xie,
Danni Wu
Abstract:
The Hill function is one of the most widely used models of gene transcription regulation. Because it is a fitting function, it may lack an underlying physical picture, yet its fitting parameters can provide information about biochemical reactions, such as the number of transcription factors (TFs) and the binding energy between regulatory elements. However, it remains unclear when and how much biochemical information the Hill function can provide beyond fitting. Here, starting from the interactions between TFs and RNA polymerase during transcription regulation and the association-dissociation reactions of both at specific and nonspecific sites on DNA, we derive the regulatory effect of TFs as a fold change. We find that, for a weak promoter, the fold change reduces to the regulatory factor (Freg), which is closely correlated with the Hill function. By directly comparing and fitting Freg with the Hill function, we analyze and discuss the fitting parameters and the corresponding biochemical reaction parameters in Freg, considering a single TF as well as multiple TFs with cooperativity and basic logic effects. We conclude that the strength of the promoter and the interactions between TFs determine whether the Hill function can reflect the corresponding biochemical information. Our findings highlight the role of the Hill function in modeling and fitting transcriptional regulation, which also benefits the preparation of synthetic regulatory elements.
Submitted 3 March, 2024;
originally announced March 2024.
-
Guidelines in Wastewater-based Epidemiology of SARS-CoV-2 with Diagnosis
Authors:
Madiha Fatima,
Zhihua Cao,
Aichun Huang,
Shengyuan Wu,
Xinxian Fan,
Yi Wang,
Liu Jiren,
Ziyun Zhu,
Qiongrou Ye,
Yuan Ma,
Joseph K. F Chow,
Peng Jia,
Yangshou Liu,
Yubin Lin,
Manjun Ye,
Tong Wu,
Zhixun Li,
Cong Cai,
Wenhai Zhang,
Cheris H. Q. Ding,
Yuanzhe Cai,
Feijuan Huang
Abstract:
With the global spread and increasing transmission rate of SARS-CoV-2, more and more laboratories and researchers are turning their attention to wastewater-based epidemiology (WBE), hoping it can become an effective tool for large-scale testing and provide more accurate predictions of the number of infected individuals. Based on cases of sewage sampling and testing in regions such as Hong Kong, Brazil, and the United States, the feasibility of detecting the novel coronavirus in sewage is extremely high. This study reviews domestic and international achievements in detecting SARS-CoV-2 through WBE and summarizes four aspects of COVID-19, namely sampling methods, virus decay rate calculation, standardized population coverage of the watershed, and algorithm prediction, and provides ideas for combining field modeling with epidemic prevention and control. Moreover, we highlight some diagnostic techniques for detecting the virus in sewage samples. Our review offers a new approach to identifying the research gaps in wastewater-based epidemiology and diagnosis, and we also outline the future prospects of our analysis.
Submitted 26 December, 2023;
originally announced January 2024.
-
Exploring the Impacts of Land Use/Cover Change on Ecosystem Services in Multiple Scenarios --The Case of Sichuan-Chongqing Region, China
Authors:
Ran Chen,
Jing Zhao,
Xiaomin Luo,
Xinxue Yan,
Xi Zheng,
Yijun Mao,
Xiaoping Fu,
Xueqi Yao,
Sijia Jiang
Abstract:
To improve the environment of the ecosystem, China has implemented the Green-for-Grain Program for two decades, which has resulted in an imbalance among ecology, economy and food. This study focuses on the "ecology-food" imbalance problem, taking the Sichuan-Chongqing Region as an example, and sets up future scenarios to predict the distribution of ecosystem services (ESs). We first forecast land use/cover change in 2050 under four different scenarios: Natural Development Scenarios; Arable Land Conservation Scenarios; Ecological Priority Scenarios; Ecology-Arable Land Harmonization Scenarios. Then we assess changes in five ESs: habitat quality (HQ), crop production (CP), soil conservation (SC), water yield, and carbon storage (CS) from 1990 to 2020 and 2050. Finally, we reveal the spatial distribution of ESs. The following conclusions are obtained: (1) From 1990 to 2020, CS, SC, and HQ show an increasing trend with growth rates of 1.68%, 0.08%, and 0.46%; CP shows a reduction rate of 2.75%. (2) S4 has an increase in arable land, and CP has increased by 7.56% compared to S1, reversing the trend of reduced CP under S1. (3) The high-high anomalies area of CP under S4 is basically the same as that under S2, which proves that S4 is a scenario policy that can be referred to for future development.
Submitted 19 December, 2023;
originally announced January 2024.
-
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA
Authors:
Kaiyuan Yang,
Fabio Musio,
Yihui Ma,
Norman Juchler,
Johannes C. Paetzold,
Rami Al-Maskari,
Luciano Höher,
Hongwei Bran Li,
Ibrahim Ethem Hamamci,
Anjany Sekuboyina,
Suprosanna Shit,
Houjing Huang,
Chinmay Prabhakar,
Ezequiel de la Rosa,
Bastian Wittmann,
Diana Waldmannstetter,
Florian Kofler,
Fernando Navarro,
Martin Menten,
Ivan Ezhov,
Daniel Rueckert,
Iris N. Vos,
Ynte M. Ruigrok,
Birgitta K. Velthuis,
Hugo J. Kuijf
, et al. (88 additional authors not shown)
Abstract:
The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neurovascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two non-invasive angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited datasets with annotations on CoW anatomy, especially for CTA. Therefore, we organized the TopCoW challenge with the release of an annotated CoW dataset. The TopCoW dataset is the first public dataset with voxel-level annotations for 13 CoW vessel components, enabled by virtual reality technology. It is also the first large dataset using 200 pairs of MRA and CTA from the same patients. As part of the benchmark, we invited submissions worldwide and attracted over 250 registered participants from six continents. The submissions were evaluated on both internal and external test datasets of 226 scans from over five centers. The top performing teams achieved over 90% Dice scores at segmenting the CoW components, over 80% F1 scores at detecting key CoW components, and over 70% balanced accuracy at classifying CoW variants for nearly all test sets. The best algorithms also showed clinical potential in classifying fetal-type posterior cerebral artery and locating aneurysms with CoW anatomy. TopCoW demonstrated the utility and versatility of CoW segmentation algorithms for a wide range of downstream clinical applications with explainability. The annotated datasets and best performing algorithms have been released as public Zenodo records to foster further methodological development and clinical tool building.
Submitted 8 July, 2025; v1 submitted 29 December, 2023;
originally announced December 2023.
-
Digital twin brain: a bridge between biological intelligence and artificial intelligence
Authors:
Hui Xiong,
Congying Chu,
Lingzhong Fan,
Ming Song,
Jiaqi Zhang,
Yawei Ma,
Ruonan Zheng,
Junyang Zhang,
Zhengyi Yang,
Tianzi Jiang
Abstract:
In recent years, advances in neuroscience and artificial intelligence have paved the way for unprecedented opportunities for understanding the complexity of the brain and its emulation by computational systems. Cutting-edge advancements in neuroscience research have revealed the intricate relationship between brain structure and function, while the success of artificial neural networks highlights the importance of network architecture. Now is the time to bring them together to better unravel how intelligence emerges from the brain's multiscale repositories. In this review, we propose the Digital Twin Brain (DTB) as a transformative platform that bridges the gap between biological and artificial intelligence. It consists of three core elements: the brain structure that is fundamental to the twinning process, bottom-layer models to generate brain functions, and its wide spectrum of applications. Crucially, brain atlases provide a vital constraint, preserving the brain's network organization within the DTB. Furthermore, we highlight open questions that invite joint efforts from interdisciplinary fields and emphasize the far-reaching implications of the DTB. The DTB can offer unprecedented insights into the emergence of intelligence and neurological disorders, which holds tremendous promise for advancing our understanding of both biological and artificial intelligence, and ultimately propelling the development of artificial general intelligence and facilitating precision mental healthcare.
Submitted 2 August, 2023;
originally announced August 2023.
-
Deep Learning Approach to Predict Hemorrhage in Moyamoya Disease
Authors:
Meng Zhao,
Yonggang Ma,
Qian Zhang,
Jizong Zhao
Abstract:
Objective: Reliable tools to predict moyamoya disease (MMD) patients at risk for hemorrhage could have significant value. The aim of this paper is to develop three machine learning classification algorithms to predict hemorrhage in moyamoya disease.
Methods: Clinical data of consecutive MMD patients who were admitted to our hospital between 2009 and 2015 were reviewed. Demographic, clinical, and radiographic data were analyzed to develop artificial neural network (ANN), support vector machine (SVM), and random forest models.
Results: We extracted 33 parameters, including 11 demographic and 22 radiographic features as input for model development. Of all compared classification results, ANN achieved the highest overall accuracy of 75.7% (95% CI, 68.6%-82.8%), followed by SVM with 69.2% (95% CI, 56.9%-81.5%) and random forest with 70.0% (95% CI, 57.0%-83.0%).
Conclusions: The proposed ANN framework can be a potentially effective tool to predict the possibility of hemorrhage among adult MMD patients based on clinical information and radiographic features.
Submitted 31 January, 2023;
originally announced February 2023.
-
SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials
Authors:
Peter Eastman,
Pavan Kumar Behara,
David L. Dotson,
Raimondas Galvelis,
John E. Herr,
Josh T. Horton,
Yuezhi Mao,
John D. Chodera,
Benjamin P. Pritchard,
Yuanqing Wang,
Gianni De Fabritiis,
Thomas E. Markland
Abstract:
Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready to use potential functions for use in molecular simulations.
Submitted 23 November, 2022; v1 submitted 21 September, 2022;
originally announced September 2022.
-
Discovering novel systemic biomarkers in photos of the external eye
Authors:
Boris Babenko,
Ilana Traynis,
Christina Chen,
Preeti Singh,
Akib Uddin,
Jorge Cuadros,
Lauren P. Daskivich,
April Y. Maa,
Ramasamy Kim,
Eugene Yu-Chuan Kang,
Yossi Matias,
Greg S. Corrado,
Lily Peng,
Dale R. Webster,
Christopher Semturs,
Jonathan Krause,
Avinash V. Varadarajan,
Naama Hammel,
Yun Liu
Abstract:
External eye photos were recently shown to reveal signs of diabetic retinal disease and elevated HbA1c. In this paper, we evaluate if external eye photos contain information about additional systemic medical conditions. We developed a deep learning system (DLS) that takes external eye photos as input and predicts multiple systemic parameters, such as those related to the liver (albumin, AST); kidney (eGFR estimated using the race-free 2021 CKD-EPI creatinine equation, the urine ACR); bone & mineral (calcium); thyroid (TSH); and blood count (Hgb, WBC, platelets). Development leveraged 151,237 images from 49,015 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA. Evaluation focused on 9 pre-specified systemic parameters and leveraged 3 validation sets (A, B, C) spanning 28,869 patients with and without diabetes undergoing eye screening in 3 independent sites in Los Angeles County, CA, and the greater Atlanta area, GA. We compared against baseline models incorporating available clinicodemographic variables (e.g. age, sex, race/ethnicity, years with diabetes). Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST>36, calcium<8.6, eGFR<60, Hgb<11, platelets<150, ACR>=300, and WBC<4 on validation set A (a patient population similar to the development sets), where the AUC of DLS exceeded that of the baseline by 5.2-19.4%. On validation sets B and C, with substantial patient population differences compared to the development sets, the DLS outperformed the baseline for ACR>=300 and Hgb<11 by 7.3-13.2%. Our findings provide further evidence that external eye photos contain important biomarkers of systemic health spanning multiple organ systems. Further work is needed to investigate whether and how these biomarkers can be translated into clinical impact.
Submitted 18 July, 2022;
originally announced July 2022.
-
HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle
Authors:
Guoxia Wang,
Xiaomin Fang,
Zhihua Wu,
Yiqun Liu,
Yang Xue,
Yingfei Xiang,
Dianhai Yu,
Fan Wang,
Yanjun Ma
Abstract:
Accurate protein structure prediction can significantly accelerate the development of life science. The accuracy of AlphaFold2, a frontier end-to-end structure prediction system, is already close to that of the experimental determination techniques. Due to the complex model architecture and large memory consumption, it requires lots of computational resources and time to implement the training and inference of AlphaFold2 from scratch. The cost of running the original AlphaFold2 is expensive for most individuals and institutions. Therefore, reducing this cost could accelerate the development of life science. We implement AlphaFold2 using PaddlePaddle, namely HelixFold, to improve training and inference speed and reduce memory consumption. The performance is improved by operator fusion, tensor fusion, and hybrid parallelism computation, while the memory is optimized through Recompute, BFloat16, and memory read/write in-place. Compared with the original AlphaFold2 (implemented with Jax) and OpenFold (implemented with PyTorch), HelixFold needs only 7.5 days to complete the full end-to-end training and only 5.3 days when using hybrid parallelism, while both AlphaFold2 and OpenFold take about 11 days. HelixFold saves 1x training time. We verified that HelixFold's accuracy could be on par with AlphaFold2 on the CASP14 and CAMEO datasets. HelixFold's code is available on GitHub for free download: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold, and we also provide stable web services on https://paddlehelix.baidu.com/app/drug/protein/forecast.
Submitted 13 July, 2022; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Graph-based Molecular Representation Learning
Authors:
Zhichun Guo,
Kehan Guo,
Bozhao Nan,
Yijun Tian,
Roshni G. Iyer,
Yihong Ma,
Olaf Wiest,
Xiangliang Zhang,
Wei Wang,
Chuxu Zhang,
Nitesh V. Chawla
Abstract:
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science. In particular, it encodes molecules as numerical vectors preserving the molecular structures and features, on top of which the downstream tasks (e.g., property prediction) can be performed. Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning. In this survey, we systematically review these graph-based molecular representation techniques, especially the methods incorporating chemical domain knowledge. Specifically, we first introduce the features of 2D and 3D molecular graphs. Then we summarize and categorize MRL methods into three groups based on their input. Furthermore, we discuss some typical chemical applications supported by MRL. To facilitate studies in this fast-developing area, we also list the benchmarks and commonly used datasets in the paper. Finally, we share our thoughts on future research directions.
Submitted 28 November, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Local vaccination and systemic tumor suppression via irradiation and manganese adjuvant in mice
Authors:
Chunyang Lu,
Jing Qian,
Jianfeng Lv,
Jintao Han,
Xiaoyi Sun,
Junyi Chen,
Siwei Ding,
Zhusong Mei,
Yulan Liang,
Yuqi Ma,
Ye Zhao,
Chen Lin,
Yanying Zhao,
Yixing Geng,
Wenjun Ma,
Yugang Wang,
Xueqing Yan,
Gen Yang
Abstract:
In this study, 4T-1 luc cells were irradiated with protons at an ultra-high (FLASH) dose rate or with gamma rays at a conventional dose rate, and were then used to vaccinate mice subcutaneously three times, with or without an Mn immuno-enhancing adjuvant. One week later, we injected untreated 4T-1 luc cells on the other side of the vaccinated mice and found that these later-injected cells almost never grew into tumors (1/17), whereas all controls without previous vaccination grew tumors (18/18). The result is very interesting, and the findings may help to explore in situ tumor vaccination as well as new combined radiotherapy strategies to effectively ablate primary and disseminated tumors. To the best of our knowledge, this is the first paper reporting the highly efficient induction of systemic vaccination that suppresses the progression of metastasized/disseminated tumors.
Submitted 26 April, 2021;
originally announced April 2021.
-
The Butterfly Effect in Primary Visual Cortex
Authors:
Jizhao Liu,
Jing Lian,
J C Sprott,
Qidong Liu,
Yide Ma
Abstract:
Exploring and establishing artificial neural networks with electrophysiological characteristics and high computational efficiency is a popular topic in the field of computer vision. Inspired by the working mechanism of primary visual cortex, pulse-coupled neural network (PCNN) can exhibit the characteristics of synchronous oscillation, refractory period, and exponential decay. However, electrophysiological evidence shows that the neurons exhibit highly complex non-linear dynamics when stimulated by external periodic signals. This chaos phenomenon, also known as the "butterfly effect", cannot be explained by any PCNN model. In this work, we analyze the main obstacle preventing PCNN models from imitating real primary visual cortex. We consider neuronal excitation as a stochastic process. We then propose a novel neural network, called continuous-coupled neural network (CCNN). Theoretical analysis indicates that the dynamic behavior of CCNN is distinct from PCNN. Numerical results show that the CCNN model exhibits periodic behavior under DC stimulus, and exhibits chaotic behavior under AC stimulus, which is consistent with the results of real neurons. Furthermore, the image and video processing mechanisms of the CCNN model are analyzed. Experimental results on image segmentation indicate that the CCNN model has better performance than state-of-the-art visual cortex neural network models.
Submitted 23 July, 2022; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Protein corona critically affects the bio-behaviors of SARS-CoV-2
Authors:
Yue-wen Yin,
Yan-jing Sheng,
Min Wang,
Song-di Ni,
Hong-ming Ding,
Yu-qiang Ma
Abstract:
The outbreak of the coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has become a worldwide public health crisis. When the SARS-CoV-2 enters the biological fluids in the human body, different types of biomolecules (in particular proteins) may adsorb on its surface and alter its infection ability. Although great efforts have recently been devoted to the interaction of the specific antibodies with the SARS-CoV-2, it still remains largely unknown how the other serum proteins affect the infection of the SARS-CoV-2. In this work, we systematically investigate the interaction of serum proteins with the SARS-CoV-2 RBD by the molecular docking and the all-atom molecular dynamics simulations. It is found that the non-specific immunoglobulin (Ig) indeed cannot effectively bind to the SARS-CoV-2 RBD while the human serum albumin (HSA) may have some potential of blocking its infection (to ACE2). More importantly, we find that the RBD can cause the significant structural change of the Apolipoprotein E (ApoE), by which SARS-CoV-2 may hijack the metabolic pathway of the ApoE to facilitate its cell entry. The present study enhances the understanding of the role of protein corona in the bio-behaviors of SARS-CoV-2, which may aid the more precise and personalized treatment for COVID-19 infection in the clinic.
Submitted 10 February, 2021;
originally announced February 2021.
-
Accurate Evaluation on the Interactions of SARS-CoV-2 with Its Receptor ACE2 and Antibodies CR3022/CB6
Authors:
Hong-ming Ding,
Yue-wen Yin,
Song-di Ni,
Yan-jing Sheng,
Yu-qiang Ma
Abstract:
The spread of the coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has become a global health crisis. The binding affinity of SARS-CoV-2 (in particular the receptor binding domain, RBD) to its receptor angiotensin converting enzyme 2 (ACE2) and to antibodies is of great importance in understanding the infectivity of COVID-19 and evaluating candidate therapeutics for COVID-19. In this work, we propose a new method based on molecular mechanics/Poisson-Boltzmann surface area (MM/PBSA) to accurately calculate the free energy of SARS-CoV-2 RBD binding to ACE2 and antibodies. The calculated binding free energy of SARS-CoV-2 RBD to ACE2 is -13.3 kcal/mol, and that of SARS-CoV RBD to ACE2 is -11.4 kcal/mol, which agree well with the experimental results (-11.3 kcal/mol and -10.1 kcal/mol, respectively). Moreover, we take two recently reported antibodies as examples and calculate the free energy of antibody binding to SARS-CoV-2 RBD, which is also consistent with the experimental findings. Further, within the framework of the modified MM/PBSA, we determine the key residues and the main driving forces for the SARS-CoV-2 RBD/CB6 interaction by the computational alanine scanning method. The present study offers a computationally efficient and numerically reliable method to evaluate the free energy of SARS-CoV-2 binding to other proteins, which may stimulate the development of therapeutics against COVID-19 in real applications.
△ Less
Submitted 17 January, 2021;
originally announced February 2021.
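The binding free energies quoted in the abstract above can be related to dissociation constants through the standard thermodynamic relation ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The sketch below is only an illustration of that conversion, not part of the authors' MM/PBSA method; the temperature of 298.15 K and the helper names `kd_from_dg` and `dg_from_kd` are assumptions introduced here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_dg(dg_kcal_per_mol, temperature_k=298.15):
    """Dissociation constant (mol/L, 1 M standard state) from a binding free energy.

    The default temperature is an assumption; the abstract does not state one.
    """
    return math.exp(dg_kcal_per_mol / (R_KCAL * temperature_k))


def dg_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant (mol/L)."""
    return R_KCAL * temperature_k * math.log(kd_molar)


# Free energies quoted in the abstract (kcal/mol).
calculated = {"SARS-CoV-2 RBD/ACE2": -13.3, "SARS-CoV RBD/ACE2": -11.4}
experimental = {"SARS-CoV-2 RBD/ACE2": -11.3, "SARS-CoV RBD/ACE2": -10.1}

for name in calculated:
    kd_calc = kd_from_dg(calculated[name])
    kd_exp = kd_from_dg(experimental[name])
    print(f"{name}: Kd(calculated) = {kd_calc:.3e} M, Kd(experimental) = {kd_exp:.3e} M")
```

Because Kd depends exponentially on ΔG, a difference of about 1.4 kcal/mol at room temperature corresponds to roughly a tenfold change in Kd, which is useful to keep in mind when comparing calculated and measured values.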
-
A new parsimonious method for classifying Cancer Tissue-of-Origin Based on DNA Methylation 450K data
Authors:
Shen Jia,
Yulin Zhang,
Yiming Mao,
Jiawei Gao,
Yixuan Chen,
Yuxuan Jiang,
Haochen Luo,
Kebo Lv,
Jionglong Su
Abstract:
DNA methylation is a well-studied genetic modification that regulates gene transcription in eukaryotes. Its alterations have been recognized as a significant component of cancer development. In this study, we use the DNA methylation 450k data from The Cancer Genome Atlas to evaluate the efficacy of DNA methylation data on cancer classification for 30 cancer types. We propose a new method for gene selection in high-dimensional data (over 450 thousand). Variance filtering is first introduced for dimension reduction and recursive feature elimination (RFE) is then used for feature selection. We address the problem of selecting a small subset of genes from a large number of methylated sites, and our parsimonious model is demonstrated to be efficient, achieving an accuracy over 91% and outperforming other studies which use DNA microarrays and RNA-seq data. The performance of 20 models, which are based on 4 estimators (Random Forest, Decision Tree, Extra Tree and Support Vector Machine) and 5 classifiers (k-Nearest Neighbours, Support Vector Machine, XGBoost, LightGBM and Multi-Layer Perceptron), is compared and the robustness of the RFE algorithm is examined. Results suggest that the combined model of Extra Tree plus CatBoost classifier offers the best performance in cancer identification, with an overall validation accuracy of 91%, 92.3%, 93.3% and 93.5% for 20, 30, 40 and 50 features, respectively. The biological functions in cancer development of the 50 selected genes are also explored through enrichment analysis, and the results show that 12 out of 16 of our top features have already been identified as cancer-specific; we also propose some more genes to be tested in future studies. Therefore, our method may be utilized as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
Submitted 3 January, 2021;
originally announced January 2021.
-
Deep manifold learning reveals hidden dynamics of proteasome autoregulation
Authors:
Zhaolong Wu,
Shuwen Zhang,
Wei Li Wang,
Yinping Ma,
Yuanchen Dong,
Youdong Mao
Abstract:
The 2.5-MDa 26S proteasome maintains proteostasis and regulates myriad cellular processes. How polyubiquitylated substrate interactions regulate proteasome activity is not understood. Here we introduce a deep manifold learning framework, named AlphaCryo4D, which enables atomic-level cryogenic electron microscopy (cryo-EM) reconstructions of a nonequilibrium conformational continuum and reconstitutes hidden dynamics of proteasome autoregulation in the act of substrate degradation. AlphaCryo4D integrates 3D deep residual learning with manifold embedding of free-energy landscapes, which directs 3D clustering via an energy-based particle-voting algorithm. In blind assessments using simulated heterogeneous cryo-EM datasets, AlphaCryo4D achieved 3D classification accuracy three times that of conventional methods and reconstructed continuous conformational changes of a 130-kDa protein at sub-3-angstrom resolution. By using AlphaCryo4D to analyze a single experimental cryo-EM dataset, we identified 64 conformers of the substrate-bound human 26S proteasome, revealing conformational entanglement of two regulatory particles in the doubly capped holoenzymes and their energetic differences with singly capped ones. Novel ubiquitin-binding sites are discovered on the RPN2, RPN10 and Alpha5 subunits to remodel polyubiquitin chains for deubiquitylation and recycling. Importantly, AlphaCryo4D choreographs single-nucleotide-exchange dynamics of the proteasomal AAA-ATPase motor during translocation initiation, which upregulates proteolytic activity by allosterically promoting nucleophilic attack. Our systemic analysis illuminates a grand hierarchical allostery for proteasome autoregulation.
Submitted 13 June, 2021; v1 submitted 23 December, 2020;
originally announced December 2020.
-
Universal Differential Equations for Scientific Machine Learning
Authors:
Christopher Rackauckas,
Yingbo Ma,
Julius Martensen,
Collin Warner,
Kirill Zubov,
Rohit Supekar,
Dominic Skinner,
Ali Ramadhan,
Alan Edelman
Abstract:
In the context of science, the well-known adage "a picture is worth a thousand words" might well be "a model is worth a thousand datasets." In this manuscript we introduce the SciML software ecosystem as a tool for mixing the information of physical laws and scientific models with data-driven machine learning approaches. We describe a mathematical object, which we denote universal differential equations (UDEs), as the unifying framework connecting the ecosystem. We show how a wide variety of applications, from automatically discovering biological mechanisms to solving high-dimensional Hamilton-Jacobi-Bellman equations, can be phrased and efficiently handled through the UDE formalism and its tooling. We demonstrate the generality of the software tooling to handle stochasticity, delays, and implicit constraints. This funnels the wide variety of SciML applications into a core set of training mechanisms which are highly optimized, stabilized for stiff equations, and compatible with distributed parallelism and GPU accelerators.
Submitted 2 November, 2021; v1 submitted 13 January, 2020;
originally announced January 2020.
-
The Intrinsic Properties of Brain Based on the Network Structure
Authors:
Xiang Zou,
Lie Yao,
Donghua Zhao,
Liang Chen,
Ying Mao
Abstract:
Objective: The brain is a fantastic organ that helps creatures adapt to the environment. The network is the most essential structure of the brain, but the capability of a simple network is still not very clear. In this study, we try to explain some brain functions using only the network property. Methods: Every network can be reduced to an equivalent simplified network, which is expressed by an equation set. The dynamics of the equation set can be described by some basic equations, which are based on mathematical derivation. Results: (1) In a closed network, the stability is based on the excitatory/inhibitory synapse proportion. Spike probabilities in the assembly can satisfy the solution of a nonlinear equation set. (2) Network activity can spontaneously evolve into a certain distribution under different stimulation, which is closely related to decision making. (3) Short memory can be formed by coupling of network assemblies. Conclusion: The essential property of a network may contribute to some important brain functions.
Submitted 1 November, 2019;
originally announced November 2019.
-
On the theoretical prediction of microalgae growth for parallel flow
Authors:
C. Y. Ma
Abstract:
Established microalgae growth models are currently semi-empirical or involve a considerable number of fitting coefficients, and these numerous fitting coefficients reduce the predictive ability of the models. Furthermore, the predicted results of the established models depend on the size of the photobioreactor (PBR), light intensity, flow and concentration field. The growth mechanism of microalgae in PBR cultivation is not clearly understood. Owing to these factors, it is difficult to predict microalgae growth by theoretical methods. We developed an exploratory bridging microalgae growth model to predict the microalgae growth rate in PBRs by using the nondimensional method, which is effective in fluid dynamics and heat transfer. The analytical solution of the growth rate was obtained for parallel flow. The nondimensional growth rate is expressed as a function of the Reynolds number and the Schmidt number, and can be used for arbitrary parallel flow because the solution is expressed in nondimensional quantities. The theoretically predicted growth rate is compared with the experimentally measured microalgae growth rate at the order-of-magnitude level. The nondimensional method is successfully applied to the microalgae growth problem for the first time. The general nondimensional solution can unify the numerous experimental data for different laboratory conditions and give direction to the currently disordered microalgae growth problem. The nondimensional solution may be useful to explain the growth mechanism of microalgae and to design large-scale PBRs for microalgae biofuel production. The significance of the work is to give a theoretical foundation and methodology for a biological theory of microalgae growth.
Submitted 10 May, 2022; v1 submitted 5 August, 2019;
originally announced August 2019.
-
Binary Classification of Alzheimer Disease using sMRI Imaging modality and Deep Learning
Authors:
Ahsan Bin Tufail,
Qiu-Na Zhang,
Yong-Kui Ma
Abstract:
Alzheimer's disease (AD) is an irreversible, devastating neurodegenerative disorder associated with progressive impairment of memory and cognitive functions. Its early diagnosis is crucial for the development of possible future treatment option(s). Structural magnetic resonance imaging (sMRI) plays an important role in understanding the anatomical changes related to AD, especially in its early stages. Conventional methods require the expertise of domain experts, extract hand-picked features such as gray matter substructures, and train a classifier to distinguish AD subjects from healthy subjects. Different from these methods, this paper proposes to construct multiple deep 2D convolutional neural networks (2D-CNNs) to learn the various features from local brain images, which are combined to make the final classification for AD diagnosis. The whole brain image was passed through two transfer learning architectures, Inception version 3 and Xception, as well as a custom Convolutional Neural Network (CNN) built with the help of separable convolutional layers, which can automatically learn the generic features from imaging data for classification. Our study is conducted using cross-sectional T1-weighted structural MRI brain images from the Open Access Series of Imaging Studies (OASIS) database to maintain the size and contrast over different MRI scans. Experimental results show that the transfer learning approaches exceed the performance of non-transfer-learning-based approaches, demonstrating the effectiveness of these approaches for the binary AD classification task.
Submitted 3 April, 2020; v1 submitted 8 September, 2018;
originally announced September 2018.
-
Penalized matrix decomposition for denoising, compression, and improved demixing of functional imaging data
Authors:
E. Kelly Buchanan,
Ian Kinsella,
Ding Zhou,
Rong Zhu,
Pengcheng Zhou,
Felipe Gerhard,
John Ferrante,
Ying Ma,
Sharon Kim,
Mohammed Shaik,
Yajie Liang,
Rongwen Lu,
Jacob Reimer,
Paul Fahey,
Taliah Muhammad,
Graham Dempsey,
Elizabeth Hillman,
Na Ji,
Andreas Tolias,
Liam Paninski
Abstract:
Calcium imaging has revolutionized systems neuroscience, providing the ability to image large neural populations with single-cell resolution. The resulting datasets are quite large, which has presented a barrier to routine open sharing of this data, slowing progress in reproducible research. State of the art methods for analyzing this data are based on non-negative matrix factorization (NMF); these approaches solve a non-convex optimization problem, and are effective when good initializations are available, but can break down in low-SNR settings where common initialization approaches fail. Here we introduce an approach to compressing and denoising functional imaging data. The method is based on a spatially-localized penalized matrix decomposition (PMD) of the data to separate (low-dimensional) signal from (temporally-uncorrelated) noise. This approach can be applied in parallel on local spatial patches and is therefore highly scalable, does not impose non-negativity constraints or require stringent identifiability assumptions (leading to significantly more robust results compared to NMF), and estimates all parameters directly from the data, so no hand-tuning is required. We have applied the method to a wide range of functional imaging data (including one-photon, two-photon, three-photon, widefield, somatic, axonal, dendritic, calcium, and voltage imaging datasets): in all cases, we observe ~2-4x increases in SNR and compression rates of 20-300x with minimal visible loss of signal, with no adjustment of hyperparameters; this in turn facilitates the process of demixing the observed activity into contributions from individual neurons. We focus on two challenging applications: dendritic calcium imaging data and voltage imaging data in the context of optogenetic stimulation. In both cases, we show that our new approach leads to faster and much more robust extraction of activity from the data.
Submitted 17 July, 2018;
originally announced July 2018.
-
The Tensor Memory Hypothesis
Authors:
Volker Tresp,
Yunpu Ma
Abstract:
We discuss memory models which are based on tensor decompositions using latent representations of entities and events. We show how episodic memory and semantic memory can be realized and discuss how new memory traces can be generated from sensory input: Existing memories are the basis for perception and new memories are generated via perception. We relate our mathematical approach to the hippocampal memory indexing theory. We describe the first detailed mathematical models for the complete processing pipeline from sensory input and its semantic decoding, i.e., perception, to the formation of episodic and semantic memories and their declarative semantic decodings. Our main hypothesis is that perception includes an active semantic decoding process, which relies on latent representations of entities and predicates, and that episodic and semantic memories depend on the same decoding process. We contribute to the debate between the leading memory consolidation theories, i.e., the standard consolidation theory (SCT) and the multiple trace theory (MTT). The latter is closely related to the complementary learning systems (CLS) framework. In particular, we show explicitly how episodic memory can teach the neocortex to form a semantic memory, which is a core issue in MTT and CLS.
Submitted 28 August, 2017; v1 submitted 9 August, 2017;
originally announced August 2017.
-
White matter deficits underlie the loss of consciousness level and predict recovery outcome in disorders of consciousness
Authors:
Xuehai Wu,
Jiaying Zhang,
Zaixu Cui,
Weijun Tang,
Chunhong Shao,
Jin Hu,
Jianhong Zhu,
Liangfu Zhou,
Yao Zhao,
Lu Lu,
Gang Chen,
Georg Northoff,
Gaolang Gong,
Ying Mao,
Yong He
Abstract:
This study aimed to identify white matter (WM) deficits underlying the loss of consciousness in disorder of consciousness (DOC) patients using Diffusion Tensor Imaging (DTI) and to demonstrate the potential value of DTI parameters in predicting recovery outcomes of DOC patients. With 30 DOC patients (8 comatose, 8 unresponsive wakefulness syndrome/vegetative state, and 14 minimally conscious state) and 25 patient controls, we performed group comparison of DTI parameters across 48 core WM regions of interest (ROIs) using Analysis of Covariance. Compared with controls, DOC patients had decreased fractional anisotropy (FA) and increased diffusivities in widespread WM areas. The corresponding DTI parameters of those WM deficits in DOC patients significantly correlated with the consciousness level evaluated by the Coma Recovery Scale Revised (CRS-R) and the Glasgow Coma Scale (GCS). As for predicting the recovery outcomes (i.e., regaining consciousness or not, grouped by their Glasgow Outcome Scale more than 2 or not) at 3 months post scan, radial diffusivity of the left superior cerebellar peduncle and FA of the right sagittal stratum reached an accuracy of 87.5% and 75%, respectively. Our findings showed multiple WM deficits underlying the loss of consciousness level, and demonstrated the potential value of these WM areas in predicting the recovery outcomes of DOC patients who have lost awareness of the environment and themselves.
Submitted 24 November, 2016;
originally announced November 2016.
-
Unsupervised cryo-EM data clustering through adaptively constrained K-means algorithm
Authors:
Yaofang Xu,
Jiayi Wu,
Chang-Cheng Yin,
Youdong Mao
Abstract:
In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.
Submitted 7 September, 2016;
originally announced September 2016.
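The abstract above describes adding a constraint term to the K-means objective so that class sizes do not vary too widely, but it does not give the form of that term. The sketch below is therefore only a stand-in under stated assumptions: it uses a fixed-weight penalty `lam * (current cluster size)` added to the squared distance during assignment, which is not the authors' adaptive constraint. The function name `size_penalized_kmeans` and the parameter `lam` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def size_penalized_kmeans(points, centers, lam=0.0, n_iter=20):
    """Lloyd-style K-means with an assumed size penalty in the assignment step.

    Each point is assigned to the cluster minimizing
        squared_distance + lam * (points already assigned to that cluster in this pass).
    With lam = 0 this reduces to ordinary K-means assignment.
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        counts = [0] * len(centers)
        for i, p in enumerate(points):
            costs = [sq_dist(p, c) + lam * counts[k] for k, c in enumerate(centers)]
            labels[i] = min(range(len(centers)), key=costs.__getitem__)
            counts[labels[i]] += 1
        # Update step: move each non-empty cluster center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels, centers


# Toy data: nine closely spaced points and one distant point (1D for readability).
points = [(0.1 * i,) for i in range(9)] + [(10.0,)]
init = [(0.0,), (10.0,)]

for lam in (0.0, 100.0):
    labels, _ = size_penalized_kmeans(points, init, lam=lam)
    print("lam =", lam, "class sizes =", [labels.count(k) for k in range(2)])
```

The toy data deliberately exaggerates the effect: a large `lam` forces near-equal class sizes even at the cost of geometric accuracy, which is why a practical constraint would need its weight chosen or adapted with care.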
-
Unsupervised single-particle deep clustering via statistical manifold learning
Authors:
Jiayi Wu,
Yong-Bei Ma,
Charles Congdon,
Bevin Brett,
Shuobing Chen,
Qi Ouyang,
Youdong Mao
Abstract:
Motivation: Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. Traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into wrong classes as the signal-to-noise ratio (SNR) of the image data decreases, yet demand increased computational cost. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data analysis. Results: Here we introduce a statistical manifold learning algorithm for unsupervised single-particle deep clustering. We show that statistical manifold learning improves classification accuracy by about 40% in the absence of input references for lower-SNR data. Applications to several experimental datasets suggest that our deep clustering approach can detect subtle structural differences among classes. Through code optimization over the Intel high-performance computing (HPC) processors, our software implementation can generate thousands of reference-free class averages within several hours from hundreds of thousands of single-particle cryo-EM images, which allows significant improvement in ab initio 3D reconstruction resolution and quality. Our approach has been successfully applied in several structural determination projects. We expect that it provides a powerful computational tool for analyzing highly heterogeneous structural data and assisting in computational purification of single-particle datasets for high-resolution reconstruction.
Submitted 31 December, 2016; v1 submitted 15 April, 2016;
originally announced April 2016.
-
On the parameters affecting dual-target-function evaluation of single-particle selection from cryo-electron micrographs
Authors:
Zhou Yu,
Wei Li Wang,
Luis R. Castillo-Menendez,
Joseph Sodroski,
Youdong Mao
Abstract:
In the analysis of frozen hydrated biomolecules by single-particle cryo-electron microscopy, template-based particle picking by a target function called fast local correlation (FLC) allows a large number of particle images to be automatically picked from micrographs. A second, independent target function based on maximum likelihood (ML) can be used to align the images and verify the presence of signal in the picked particles. Although the paradigm of this dual-target-function (DTF) evaluation of single-particle selection has been practiced in recent years, it remains unclear how the performance of this DTF approach is affected by the signal-to-noise ratio of the images and by the choice of references for FLC and ML. Here we examine this problem through a systematic study of simulated data, followed by experimental substantiation. We quantitatively pinpoint the critical signal-to-noise ratio (SNR), at which the DTF approach starts losing its ability to select and verify particles from cryo-EM micrographs. A Gaussian model is shown to be as effective in picking particles as a single projection view of the imaged molecule in the tested cases. For both simulated micrographs and real cryo-EM data of the 173-kDa glucose isomerase complex, we found that the use of a Gaussian model to initialize the target functions suppressed the detrimental effect of reference bias in template-based particle selection. Given a sufficient signal-to-noise ratio in the images and the appropriate choice of references, the DTF approach can expedite the automated assembly of single-particle data sets.
Submitted 23 September, 2015;
originally announced September 2015.
-
Dual-target function validation of single-particle selection from low-contrast cryo-electron micrographs
Authors:
Youdong Mao,
Luis R. Castillo-Menendez,
Joseph Sodroski
Abstract:
Weak-signal detection and single-particle selection from low-contrast micrographs of frozen hydrated biomolecules by cryo-electron microscopy (cryo-EM) present a practical challenge. Cryo-EM image contrast degrades as the size of biomolecules of structural interest decreases. When the image contrast falls into a range where the location or presence of single particles becomes ambiguous, a need arises for objective computational approaches to detect weak signal and to select and verify particles from these low-contrast micrographs. Here we propose an objective validation scheme for low-contrast particle selection using a combination of two different target functions. In an implementation of this dual-target function (DTF) validation, a first target function of fast local correlation was used to select particles through template matching, followed by signal validation through a second target function of maximum likelihood. By a systematic study of simulated data, we found that such an implementation of DTF validation is capable of selecting and verifying particles from cryo-EM micrographs with a signal-to-noise ratio as low as 0.002. Importantly, we demonstrated that DTF validation can robustly evade over-fitting or reference bias from the particle-picking template, allowing true signal to emerge from amidst heavy noise in an objective fashion. The DTF approach allows efficient assembly of a large number of single-particle cryo-EM images of smaller biomolecules or specimens containing contrast-degrading agents like detergents in a semi-automatic manner.
Submitted 10 September, 2013;
originally announced September 2013.
-
Translocation of stiff polymers through a nanopore driven by binding particles
Authors:
Wancheng Yu,
Yiding Ma,
Kaifu Luo
Abstract:
We investigate the translocation of stiff polymers in the presence of binding particles through a nanopore by two-dimensional Langevin dynamics simulations. We find that the mean translocation time shows a minimum as a function of the binding energy $ε$ and the particle concentration $φ$, due to the interplay of the force from binding and the frictional force. In particular, for strong binding the translocation proceeds with a decreasing translocation velocity induced by a significant increase of the frictional force. In addition, both $ε$ and $φ$ have a notable impact on the distribution of the translocation time. With increasing $ε$ and $φ$, it undergoes a transition from an asymmetric and broad distribution under weak binding to a nearly Gaussian one under strong binding, and its width becomes gradually narrower.
Submitted 5 December, 2012;
originally announced December 2012.
-
Determine dynamical behaviors by the Lyapunov function in competitive Lotka-Volterra systems
Authors:
Ying Tang,
Ruoshi Yuan,
Yian Ma
Abstract:
Global dynamical behaviors of the competitive Lotka-Volterra system, even in three dimensions, are not fully understood. The Lyapunov function can provide such knowledge once it is constructed. In this paper, we explicitly construct the Lyapunov function in three examples of the competitive Lotka-Volterra system for the whole state space: (1) the general 2-dimensional case; (2) a 3-dimensional model; (3) the model of May-Leonard. The dynamics of these examples include a bistable case and cyclical behavior. The first two examples are the generalized gradient system defined in the Appendixes, while the model of May-Leonard is not. Our method is helpful for understanding the limit cycle problems in the general 3-dimensional case.
Submitted 29 October, 2012;
originally announced October 2012.
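As a small numerical illustration of what a Lyapunov function provides for the 2-dimensional competitive Lotka-Volterra system named in the abstract above, the sketch below integrates a symmetric, weakly competitive example and checks that a Lyapunov function decreases along the trajectory. It uses the classical Volterra-type logarithmic function V(x) = Σ (x_i - x_i* - x_i* ln(x_i/x_i*)), which is not the construction proposed in the paper. The parameter values, the initial condition and all function names are assumptions chosen for illustration.

```python
import math

# dx_i/dt = x_i * (R[i] - sum_j A[i][j] * x_j), with a symmetric competition matrix.
A = [[1.0, 0.5], [0.5, 1.0]]          # assumed illustrative values
R = [1.0, 1.0]                        # assumed growth rates
X_STAR = (2.0 / 3.0, 2.0 / 3.0)       # coexistence fixed point, solves A x* = R


def rhs(x):
    """Right-hand side of the 2-species competitive Lotka-Volterra equations."""
    return [x[i] * (R[i] - sum(A[i][j] * x[j] for j in range(2))) for i in range(2)]


def rk4_step(x, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(x)
    k2 = rhs([x[i] + 0.5 * dt * k1[i] for i in range(2)])
    k3 = rhs([x[i] + 0.5 * dt * k2[i] for i in range(2)])
    k4 = rhs([x[i] + dt * k3[i] for i in range(2)])
    return [x[i] + dt / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(2)]


def lyapunov(x):
    """Volterra-type Lyapunov function; for symmetric positive-definite A,
    dV/dt = -(x - x*)^T A (x - x*) <= 0 along solutions with x > 0."""
    return sum(x[i] - X_STAR[i] - X_STAR[i] * math.log(x[i] / X_STAR[i]) for i in range(2))


def simulate(x0, dt=0.01, steps=5000):
    """Integrate from x0 and return the list of visited states."""
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = rk4_step(x, dt)
        trajectory.append(x)
    return trajectory


trajectory = simulate((0.1, 1.5))
values = [lyapunov(x) for x in trajectory]
print("V at start:", values[0], "V at end:", values[-1])
print("final state:", trajectory[-1])
```

Along the computed trajectory V should shrink toward zero as the state approaches the coexistence point, which is the kind of global information a Lyapunov function gives without solving the equations in closed form.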
-
Barcoding-free BAC Pooling Enables Combinatorial Selective Sequencing of the Barley Gene Space
Authors:
Stefano Lonardi,
Denisa Duma,
Matthew Alpert,
Francesca Cordero,
Marco Beccuti,
Prasanna R. Bhat,
Yonghui Wu,
Gianfranco Ciardo,
Burair Alsaihati,
Yaqin Ma,
Steve Wanamaker,
Josh Resnik,
Timothy J. Close
Abstract:
We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).
Submitted 19 December, 2011;
originally announced December 2011.
-
Attenuation of transcriptional bursting in mRNA transport
Authors:
Li-ping Xiong,
Yu-qiang Ma,
Lei-Han Tang
Abstract:
Due to the stochastic nature of biochemical processes, the copy number of any given type of molecule inside a living cell often exhibits large temporal fluctuations. Here, we develop analytic methods to investigate how the noise arising from a bursting input is reshaped by a transport reaction which is either linear or of the Michaelis-Menten type. A slow transport rate smooths out fluctuations at the output end and minimizes the impact of bursting on downstream cellular activities. In the context of gene expression in eukaryotic cells, our results indicate that transcriptional bursting can be substantially attenuated by the transport of mRNA from the nucleus to the cytoplasm. Saturation of the transport mediators or nuclear pores contributes further to the noise reduction. We suggest that mRNA transport should be taken into account in the interpretation of relevant experimental data on transcriptional bursting.
Submitted 23 July, 2009;
originally announced July 2009.
-
Self-Sustained Collective Oscillation Generated in an Array of Non-Oscillatory Cells
Authors:
Yue Ma,
Kenichi Yoshikawa
Abstract:
Oscillations represent a ubiquitous phenomenon in biological systems. The conventional models of biological periodic oscillations are usually proposed as interconnecting transcriptional feedback loops. Some specific proteins function as transcription factors, which in turn negatively regulate the expression of the genes that encode those "clock proteins". These loops may lead to rhythmic changes in the gene expression of a cell. In the case of multi-cellular tissue, the collective oscillation is often obtained from synchronization of these cells, which manifest themselves as autonomous oscillators. In contrast, here we propose a different scenario for the occurrence of collective oscillation in a multi-cellular system, one that requires neither intrinsically oscillatory cells nor periodic external stimulation. It is a coupling-induced oscillation that takes into account wave propagation due to the intracellular communication.
Submitted 25 March, 2009; v1 submitted 9 September, 2008;
originally announced September 2008.