Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing

Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun

Abstract: Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to CTC's complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the learning difficulty of protein sequences, through a sampling process, based on the model's estimated protein generation capability, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self-refining inference-time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine-grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90%, based on sampled training over various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2504.02698 [pdf, other]

SCMPPI: Supervised Contrastive Multimodal Framework for Predicting Protein-Protein Interactions

Authors: Shengrui XU, Tianchi Lu, Zikun Wang, Jixiu Zhai

Abstract: Protein-protein interaction (PPI) prediction plays a pivotal role in deciphering cellular functions and disease mechanisms. To address the limitations of traditional experimental methods and existing computational approaches in cross-modal feature fusion and false-negative suppression, we propose SCMPPI, a novel supervised contrastive multimodal framework. By effectively integrating sequence-based features (AAC, DPC, ESMC-CKSAAP) with network topology (Node2Vec embeddings) and incorporating an enhanced contrastive learning strategy with negative sample filtering, SCMPPI achieves superior prediction performance. Extensive experiments on eight benchmark datasets demonstrate its state-of-the-art accuracy (98.13%) and AUC (99.69%), along with excellent cross-species generalization (AUC > 99%). Successful applications in CD9 networks, Wnt pathway analysis, and cancer-specific networks further highlight its potential for disease target discovery, establishing SCMPPI as a powerful tool for multimodal biological data analysis.

Submitted 27 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

Comments: 20 pages, 9 figures, conference

MSC Class: 92C40; 68T07 ACM Class: I.2.6; J.3
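
The sequence-based features named in the SCMPPI abstract can be illustrated with a minimal sketch. It assumes that AAC and DPC denote the conventional amino acid composition and dipeptide composition descriptors (the abstract does not expand the acronyms). The function names `aac` and `dpc` are hypothetical and are not taken from the SCMPPI code.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each ordered residue pair (400 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return [0.0] * (len(AMINO_ACIDS) ** 2)
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    return [counts.get(a + b, 0) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


if __name__ == "__main__":
    features = aac("ACCA") + dpc("ACCA")
    print(len(features))  # 20 AAC values followed by 400 DPC values
```

Both descriptors are fixed-length regardless of sequence length, which is what makes them easy to concatenate with other feature vectors such as graph embeddings.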

arXiv:2503.09672 [pdf]

doi 10.1109/TAFFC.2024.3435060

Mechanoreceptive A$β$ primary afferents discriminate naturalistic social touch inputs at a functionally relevant time scale

Authors: Shan Xu, Steven C. Hauser, Saad S. Nagi, James A. Jablonski, Merat Rezaei, Ewa Jarocka, Andrew G. Marshall, Håkan Olausson, Sarah McIntyre, Gregory J. Gerling

Abstract: Interpersonal touch is an important channel of social emotional interaction. How these physical skin-to-skin touch expressions are processed in the peripheral nervous system is not well understood. From microneurography recordings in humans, we evaluated the capacity of six subtypes of cutaneous mechanoreceptive afferents to differentiate human-delivered social touch expressions. Leveraging statistical and classification analyses, we found that single units of multiple mechanoreceptive A$β$ subtypes, especially slowly adapting type II (SA-II) and fast adapting hair follicle afferents (HFA), can reliably differentiate social touch expressions at accuracies similar to human recognition. We then identified the most informative firing patterns of SA-II and HFA afferents, which indicate that average durations of 3-4 s of firing provide sufficient discriminative information. Those two subtypes also exhibit robust tolerance to spike-timing shifts of up to 10-20 ms, varying with touch expressions due to their specific firing properties. Greater shifts in spike-timing, however, can change a firing pattern's envelope to resemble that of another expression and drastically compromise an afferent's discrimination capacity. Altogether, the findings indicate that SA-II and HFA afferents differentiate the skin contact of social touch at time scales relevant for such interactions, which are 1-2 orders of magnitude longer than those for non-social touch.

Submitted 12 March, 2025; originally announced March 2025.

Comments: 14 pages, 8 figures, includes supplementary materials at the end of the document

Journal ref: IEEE Transactions on Affective Computing, 2025, 16(1), 346-359

arXiv:2503.04483 [pdf, ps, other]

InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference

Authors: Tianyu Cui, Song-Jun Xu, Artem Moskalev, Shuwei Li, Tommaso Mansi, Mangal Prakash, Rui Liao

Abstract: Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets when using textual embeddings as priors, and further boosts performance by 11.1% when integrating labeled data as priors.

Submitted 8 June, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: ICML 2025

arXiv:2501.00983 [pdf, other]

Critical Dynamics and Cyclic Memory Retrieval in Non-reciprocal Hopfield Networks

Authors: Shuyue Xue, Mohammad Maghrebi, George I. Mias, Carlo Piermarocchi

Abstract: We study Hopfield networks with non-reciprocal coupling inducing switches between memory patterns. Dynamical phase transitions occur between phases of no memory retrieval, retrieval of multiple point-attractors, and limit-cycle attractors. The limit cycle phase is bounded by two critical regions: a Hopf bifurcation line and a fold bifurcation line, each with unique dynamical critical exponents and sensitivity to perturbations. A Master Equation approach numerically verifies the critical behavior predicted analytically. We discuss how these networks could model biological processes near a critical threshold of cyclic instability evolving through multi-step transitions.

Submitted 1 January, 2025; originally announced January 2025.

arXiv:2412.02915 [pdf, other]

Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

Authors: Junhao Liu, Siwei Xu, Lei Zhang, Jing Zhang

Abstract: Over the past decade, the revolution in single-cell sequencing has enabled the simultaneous molecular profiling of various modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues and uncover underlying disease mechanisms. Among all the analytical steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is usually labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract essential biological knowledge, such as marker genes, potentially promoting more efficient and automated cell type annotations. To thoroughly evaluate the capability of modern instruction-tuned LLMs in automating the cell type identification process, we introduce SOAR, a comprehensive benchmarking study of LLMs for cell type annotation tasks in single-cell genomics. Specifically, we assess the performance of 8 instruction-tuned LLMs across 11 datasets, spanning multiple cell types and species. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, while extending their application to multiomics data through cross-modality translation. Additionally, we evaluate the effectiveness of chain-of-thought (CoT) prompting techniques in generating detailed biological insights during the annotation process. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning, advancing the automation of cell type annotation in genomics research.

Submitted 3 December, 2024; originally announced December 2024.

arXiv:2410.05278 [pdf, other]

Dumpling GNN: Hybrid GNN Enables Better ADC Payload Activity Prediction Based on Chemical Structure

Authors: Shengjie Xu, Lingxi Xie

Abstract: Antibody-drug conjugates (ADCs) have emerged as a promising class of targeted cancer therapeutics, but the design and optimization of their cytotoxic payloads remain challenging. This study introduces DumplingGNN, a novel hybrid Graph Neural Network architecture specifically designed for predicting ADC payload activity based on chemical structure. By integrating Message Passing Neural Networks (MPNN), Graph Attention Networks (GAT), and GraphSAGE layers, DumplingGNN effectively captures multi-scale molecular features and leverages both 2D topological and 3D structural information. We evaluate DumplingGNN on a comprehensive ADC payload dataset focusing on DNA Topoisomerase I inhibitors, as well as on multiple public benchmarks from MoleculeNet. DumplingGNN achieves state-of-the-art performance across several datasets, including BBBP (96.4% ROC-AUC), ToxCast (78.2% ROC-AUC), and PCBA (88.87% ROC-AUC). On our specialized ADC payload dataset, it demonstrates exceptional accuracy (91.48%), sensitivity (95.08%), and specificity (97.54%). Ablation studies confirm the synergistic effects of the hybrid architecture and the critical role of 3D structural information in enhancing predictive accuracy. The model's strong interpretability, enabled by attention mechanisms, provides valuable insights into structure-activity relationships. DumplingGNN represents a significant advancement in molecular property prediction, with particular promise for accelerating the design and optimization of ADC payloads in targeted cancer therapy development.

Submitted 23 September, 2024; originally announced October 2024.

arXiv:2409.11443 [pdf, ps, other]

Dynamics of solutions to a multi-patch epidemic model with a saturation incidence mechanism

Authors: Yawo Ezunkpe, Cynthia T. Nnolum, Rachidi B. Salako, Shuwen Xue

Abstract: This study examines the behavior of solutions in a multi-patch epidemic model that includes a saturation incidence mechanism. When the fatality rate due to the disease is not null, our findings show that the solutions of the model tend to stabilize at disease-free equilibria. Conversely, when the disease-induced fatality rate is null, the dynamics of the model become more intricate. Notably, in this scenario, while the saturation effect reduces the basic reproduction number $\mathcal{R}_0$, it can also lead to a backward bifurcation of the endemic equilibria curve at $\mathcal{R}_0=1$. Provided certain fundamental assumptions are satisfied, we offer a detailed analysis of the global dynamics of solutions based on the value of $\mathcal{R}_0$. Additionally, we investigate the asymptotic profiles of endemic equilibria as population dispersal rates tend to zero. To support and illustrate our theoretical findings, we conduct numerical simulations.

Submitted 16 September, 2024; originally announced September 2024.

Comments: 44 pages

MSC Class: 34D05; 34D23; 92D25; 92D30; 37N25

arXiv:2407.08224 [pdf, other]

stEnTrans: Transformer-based deep learning for spatial transcriptomics enhancement

Authors: Shuailin Xue, Fangfang Zhu, Changmiao Wang, Wenwen Min

Abstract: The spatial location of cells within tissues and organs is crucial for the manifestation of their specific functions. Spatial transcriptomics technology enables comprehensive measurement of the gene expression patterns in tissues while retaining spatial information. However, current popular spatial transcriptomics techniques either have shallow sequencing depth or low resolution. We present stEnTrans, a deep learning method based on Transformer architecture that provides comprehensive predictions for gene expression in unmeasured areas or unexpectedly lost areas and enhances gene expression in original and imputed spots. Utilizing a self-supervised learning approach, stEnTrans establishes proxy tasks on gene expression profiles without requiring additional data, mining intrinsic features of the tissues as supervisory information. We evaluate stEnTrans on six datasets, and the results indicate superior performance in enhancing spot resolution and predicting gene expression in unmeasured areas compared to other deep learning and traditional interpolation methods. Additionally, our method can help discover spatial patterns in spatial transcriptomics data and yields enrichment of more biologically significant pathways. Our source code is available at https://github.com/shuailinxue/stEnTrans.

Submitted 11 July, 2024; originally announced July 2024.

Comments: ISBRA2024, Code: https://github.com/shuailinxue/stEnTrans

arXiv:2406.19659 [pdf]

Object Space is Embodied

Authors: Shan Xu, Xinran Feng, Yuannan Li, Jia Liu

Abstract: The perceived similarity between objects has often been attributed to their physical and conceptual features, such as appearance and animacy, and the theoretical framework of object space is accordingly conceived. Here, we extend this framework by proposing that object space may also be defined by embodied features, specifically action possibilities that objects afford to an agent (i.e., affordance) and their spatial relation with the agent (i.e., situatedness). To test this proposal, we quantified the embodied features with a set of action atoms. We found that embodied features explained the subjective similarity among familiar objects along with the objects' visual features. This observation was further replicated with novel objects. Our study demonstrates that embodied features, which place objects within an ecological context, are essential in constructing object space in the human visual system, emphasizing the importance of incorporating embodiment as a fundamental dimension in our understanding of the visual world.

Submitted 5 August, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.14358 [pdf]

The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing

Authors: Yuannan Li, Shan Xu, Jia Liu

Abstract: The ability to manipulate logical-mathematical symbols (LMS), encompassing tasks such as calculation, reasoning, and programming, is a cognitive skill arguably unique to humans. Considering the relatively recent emergence of this ability in human evolutionary history, it has been suggested that LMS processing may build upon more fundamental cognitive systems, possibly through neuronal recycling. Previous studies have pinpointed two primary candidates, natural language processing and spatial cognition. Existing comparisons between these domains largely relied on task-level comparison, which may be confounded by task idiosyncrasy. The present study instead compared the neural correlates at the domain level with both automated meta-analysis and synthesized maps based on three representative LMS tasks, reasoning, calculation, and mental programming. Our results revealed a more substantial cortical overlap between LMS processing and spatial cognition, in contrast to language processing. Furthermore, in regions activated by both spatial and language processing, the multivariate activation pattern for LMS processing exhibited greater multivariate similarity to spatial cognition than to language processing. A hierarchical clustering analysis further indicated that typical LMS tasks were indistinguishable from spatial cognition tasks at the neural level, suggesting an inherent connection between these two cognitive processes. Taken together, our findings support the hypothesis that spatial cognition is likely the basis of LMS processing, which may shed light on the limitations of large language models in logical reasoning, particularly those trained exclusively on textual data without explicit emphasis on spatial content.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.10391 [pdf, other]

BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

Authors: Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, Dong Yuan, Wanli Ouyang, Xihui Liu

Abstract: RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.

Submitted 12 December, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by NeurIPS 2024 Dataset and Benchmark Track
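
The Attention with Linear Biases (ALiBi) scheme mentioned in the BEACON abstract can be sketched as follows. This is an illustrative sketch only, not the BEACON-B implementation: the helper names are hypothetical, the slope schedule is the geometric one commonly used for power-of-two head counts, and the symmetric distance `|i - j|` is an assumption suited to bidirectional encoders (the original causal formulation only penalizes earlier positions).

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*h/num_heads) for h = 1..num_heads.

    Assumes num_heads is a power of two, as in the usual ALiBi schedule.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias matrix added to attention logits: -slope * |i - j|.

    Assumption: symmetric distance, so tokens on both sides are penalized
    linearly by how far they are from the query position.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    bias = alibi_bias(4, slopes[0])
    for row in bias:
        print(row)
```

Because the bias depends only on relative distance, no positional embedding is added to the token inputs, and the same rule applies to sequences longer than those seen in training.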

arXiv:2404.04660 [pdf]

The study of periphery uniqueness and balance in ecological networks

Authors: Shipeng Xu

Abstract: The study of ecological networks is crucial for modern conservation biology, addressing habitat fragmentation and biodiversity loss, especially in complex regions. These networks, including corridors, sources, and nodes, are key for species movement and ecosystem functioning. The Periphery Analysis Model (PAM) is introduced as a new approach to study the periphery of these networks, focusing on peripheral nodes' role in environmental change response and network resilience. PAM, drawing from graph theory, complex network analysis, and landscape ecology, uses the Periphery Uniqueness Index (PuI) and the Periphery Balance Index (PbI) to measure peripheral nodes' attributes and balance. It also offers derived indices for a detailed understanding of the periphery's influence. By revealing the periphery's defining characteristics, PAM enhances knowledge of ecological networks' structural features, providing insights for biodiversity, connectivity, and ecosystem health. The research encourages integrating PAM into conservation strategies to inform policy for ecosystem preservation amid environmental challenges.

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2403.19900 [pdf]

Mechanochemical bistability of intestinal organoids enables robust morphogenesis

Authors: Shi-Lei Xue, Qiutan Yang, Prisca Liberali, Edouard Hannezo

Abstract: How pattern and form are generated in a reproducible manner during embryogenesis remains poorly understood. Intestinal organoid morphogenesis involves a number of mechanochemical regulators, including cell-type specific cytoskeletal forces and osmotically-driven lumen volume changes. However, whether and how these forces are coordinated in time and space via feedbacks to ensure robust morphogenesis remains unclear. Here, we propose a minimal physical model of organoid morphogenesis with local cellular mechano-sensation, where lumen volume changes can impact epithelial shape via both direct mechanical (passive) and indirect mechanosensitive (active) mechanisms. We show how mechano-sensitive feedbacks on cytoskeletal tension generically give rise to morphological bistability, where both bulged (open) and budded (closed) crypt states are possible and dependent on the history of volume changes. Such bistability can explain several paradoxical experimental observations, such as the importance of the timing of lumen shrinkage and robustness of the final morphogenetic state to mechanical perturbations. More quantitatively, we performed mechanical and pharmacological experiments to validate the key modelling assumptions and make quantitative predictions on organoid morphogenesis. This suggests that bistability arising from feedbacks between cellular tensions and fluid pressure could be a general mechanism to allow for the coordination of multicellular shape changes in developing systems.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2402.17179 [pdf, other]

Molecule Design by Latent Prompt Transformer

Authors: Deqian Kong, Yuhao Huang, Jianwen Xie, Edouardo Honig, Ming Xu, Shuanghong Xue, Pei Lin, Sanping Zhou, Sheng Zhong, Nanning Zheng, Ying Nian Wu

Abstract: This work explores the challenging problem of molecule design by framing it as a conditional generative modeling task, where target biological properties or desired chemical constraints serve as conditioning variables. We propose the Latent Prompt Transformer (LPT), a novel generative model comprising three components: (1) a latent vector with a learnable prior distribution modeled by a neural transformation of Gaussian white noise; (2) a molecule generation model based on a causal Transformer, which uses the latent vector as a prompt; and (3) a property prediction model that predicts a molecule's target properties and/or constraint values using the latent prompt. LPT can be learned by maximum likelihood estimation on molecule-property pairs. During property optimization, the latent prompt is inferred from target properties and constraints through posterior sampling and then used to guide the autoregressive molecule generation. After initial training on existing molecules and their properties, we adopt an online learning algorithm to progressively shift the model distribution towards regions that support desired target properties. Experiments demonstrate that LPT not only effectively discovers useful molecules across single-objective, multi-objective, and structure-constrained optimization tasks, but also exhibits strong sample efficiency.

Submitted 31 October, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2312.12094 [pdf, other]

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

Authors: Linglin Jing, Sheng Xu, Yifan Wang, Yuzhe Zhou, Tao Shen, Zhigang Ji, Hui Fang, Zhen Li, Siqi Sun

Abstract: Accurate identification of protein nucleic-acid-binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge, for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1-Score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind.

Submitted 20 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI-24
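
Note on the metrics named in the CrossBind abstract: F1-Score and MCC can both be computed from binary confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions; the function names and the example counts are illustrative assumptions and are not taken from the CrossBind code or results.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for per-residue binding / non-binding predictions.
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_mcc = mcc(tp=8, tn=8, fp=2, fn=2)
```

Unlike F1, MCC also uses the true-negative count, which is why it is often reported alongside F1 when the classes are imbalanced.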

arXiv:2312.11584 [pdf, other]

ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing

Authors: Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao, Cheng Chang, Siqi Sun

Abstract: De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing. The source code is available at https://github.com/BEAM-Labs/ContraNovo.

Submitted 18 December, 2023; originally announced December 2023.

Comments: This paper has been accepted by AAAI 2024

arXiv:2308.16713 [pdf, other]

Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model

Authors: Hongtai Jing, Zhengtao Gao, Sheng Xu, Tao Shen, Zhangzhi Peng, Shwai He, Tao You, Shuang Ye, Wei Lin, Siqi Sun

Abstract: In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% non-redundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold, and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials.

Submitted 31 August, 2023; originally announced August 2023.

arXiv:2305.08929 [pdf, other]

doi 10.15212/AMM-2024-0047

AF2-Mutation: Adversarial Sequence Mutations against AlphaFold2 on Protein Tertiary Structure Prediction

Authors: Zhongju Yuan, Tao Shen, Sheng Xu, Leiye Yu, Ruobing Ren, Siqi Sun

Abstract: Deep learning-based approaches, such as AlphaFold2 (AF2), have significantly advanced protein tertiary structure prediction, achieving results comparable to real biological experimental methods. While AF2 has shown limitations in predicting the effects of mutations, its robustness against sequence mutations remains to be determined. Starting with the wild-type (WT) sequence, we investigate adversarial sequences generated via an evolutionary approach, which AF2 predicts to be substantially different from WT. Our experiments on CASP14 reveal that by modifying merely three residues in the protein sequence using a combination of replacement, deletion, and insertion strategies, the alteration in AF2's predictions, as measured by the Local Distance Difference Test (lDDT), reaches 46.61. Moreover, when applied to a specific protein, SPNS2, our proposed algorithm successfully identifies biologically meaningful residues critical to protein structure determination and potentially indicates alternative conformations, thus significantly expediting the experimental process.

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2302.10406 [pdf]

Time to Embrace Natural Language Processing (NLP)-based Digital Pathology: Benchmarking NLP- and Convolutional Neural Network-based Deep Learning Pipelines

Authors: Min Cen, Xingyu Li, Bangwei Guo, Jitendra Jonnagaddala, Hong Zhang, Xu Steven Xu

Abstract: NLP-based computer vision models, particularly vision transformers, have been shown to outperform CNN models in many imaging tasks. However, most digital pathology artificial-intelligence models are based on CNN architectures, probably owing to a lack of data regarding NLP models for pathology images. In this study, we developed digital pathology pipelines to benchmark the five most recently proposed NLP models (vision transformer (ViT), Swin Transformer, MobileViT, CMT, and Sequencer2D) and four popular CNN models (ResNet18, ResNet50, MobileNetV2, and EfficientNet) to predict biomarkers in colorectal cancer (microsatellite instability, CpG island methylator phenotype, and BRAF mutation). Hematoxylin and eosin-stained whole-slide images from Molecular and Cellular Oncology and The Cancer Genome Atlas were used as training and external validation datasets, respectively. Cross-study external validations revealed that the NLP-based models significantly outperformed the CNN-based models in biomarker prediction tasks, improving the overall prediction and precision up to approximately 10% and 26%, respectively. Notably, compared with existing models in the current literature using large training datasets, our NLP models achieved state-of-the-art predictions for all three biomarkers using a relatively small training dataset, suggesting that large training datasets are not a prerequisite for NLP models or transformers, and NLP may be more suitable for clinical studies in which small training datasets are commonly collected.
The superior performance of Sequencer2D suggests that further research and innovation on both transformer and bidirectional long short-term memory architectures are warranted in the field of digital pathology. NLP models can replace classic CNN architectures and become the new workhorse backbone in the field of digital pathology.

Submitted 20 February, 2023; originally announced February 2023.

arXiv:2302.08062 [pdf]

doi 10.1111/2041-210X.14229

Fossil Image Identification using Deep Learning Ensembles of Data Augmented Multiviews

Authors: Chengbin Hou, Xinyu Lin, Hanhui Huang, Sheng Xu, Junxuan Fan, Yukun Shi, Hairong Lv

Abstract: Identification of fossil species is crucial to evolutionary studies. Recent advances from deep learning have shown promising prospects in fossil image identification. However, the quantity and quality of labeled fossil images are often limited due to fossil preservation, conditioned sampling, and expensive and inconsistent label annotation by domain experts, which pose great challenges to training deep learning-based image classification models. To address these challenges, we follow the idea of the wisdom of crowds and propose a multiview ensemble framework, which collects Original (O), Gray (G), and Skeleton (S) views of each fossil image reflecting its different characteristics to train multiple base models, and then makes the final decision via soft voting. Experiments on the largest fusulinid dataset with 2400 images show that the proposed OGS consistently outperforms baselines (using a single model for each view), and obtains superior or comparable performance compared to OOO (using three base models for three identical Original views). Besides, as the training data decreases, the proposed framework achieves more gains. When considering the identification consistency estimation with respect to human experts, OGS receives the highest agreement with the original labels of the dataset and with the re-identifications of two human experts. The validation performance provides a quantitative estimation of consistency across different experts and genera.
We conclude that the proposed framework can present state-of-the-art performance in the fusulinid fossil identification case study. This framework is designed for general fossil identification and it is expected to see applications to other fossil datasets in future work. The source code is publicly available at https://github.com/houchengbin/Fossil-Image-Identification to benefit future research in fossil image identification.

Submitted 1 February, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: published in Methods in Ecology and Evolution

Journal ref: Methods in Ecology and Evolution, 14, 3020-3034 (2023)
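
Note on the soft voting mentioned in the abstract above: in a soft-voting ensemble, each base model outputs class probabilities, the probabilities are averaged, and the class with the highest mean is chosen. The sketch below illustrates this for three hypothetical base models (one per O, G, and S view); the function name and the probability values are made-up assumptions and are not taken from the authors' repository.

```python
from typing import List, Sequence, Tuple


def soft_vote(prob_per_model: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average per-class probabilities across models and pick the argmax.

    prob_per_model[m][c] is model m's predicted probability for class c.
    Returns (winning class index, averaged probability vector).
    """
    n_models = len(prob_per_model)
    n_classes = len(prob_per_model[0])
    averaged = [
        sum(model_probs[c] for model_probs in prob_per_model) / n_models
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical outputs for one image from the O, G, and S base models.
original_view = [0.6, 0.3, 0.1]
gray_view = [0.2, 0.5, 0.3]
skeleton_view = [0.3, 0.5, 0.2]

predicted_class, mean_probs = soft_vote([original_view, gray_view, skeleton_view])
```

Averaging probabilities lets a confident model outweigh hesitant ones, whereas hard (majority) voting counts each model's top choice equally.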

arXiv:2211.02234 [pdf, other]

A Latent Space Model for HLA Compatibility Networks in Kidney Transplantation

Authors: Zhipeng Huang, Kevin S. Xu

Abstract: Kidney transplantation is the preferred treatment for people suffering from end-stage renal disease. Successful kidney transplants still fail over time, known as graft failure; however, the time to graft failure, or graft survival time, can vary significantly between different recipients. A significant biological factor affecting graft survival times is the compatibility between the human leukocyte antigens (HLAs) of the donor and recipient. We propose to model HLA compatibility using a network, where the nodes denote different HLAs of the donor and recipient, and edge weights denote compatibilities of the HLAs, which can be positive or negative. The network is indirectly observed, as the edge weights are estimated from transplant outcomes rather than directly observed. We propose a latent space model for such indirectly-observed weighted and signed networks. We demonstrate that our latent space model can not only result in more accurate estimates of HLA compatibilities, but can also be incorporated into survival analysis models to improve accuracy for the downstream task of predicting graft survival times.

Submitted 3 November, 2022; originally announced November 2022.

Comments: This work has been accepted to BIBM 2022

arXiv:2208.11518 [pdf]

Prognostic Significance of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images in Colorectal Cancers

Authors: Anran Liu, Xingyu Li, Hongyi Wu, Bangwei Guo, Jitendra Jonnagaddala, Hong Zhang, Xu Steven Xu

Abstract: Purpose: Tumor-infiltrating lymphocytes (TILs) have significant prognostic values in cancers. However, very few automated, deep-learning-based TIL scoring algorithms have been developed for colorectal cancers (CRC). Methods: We developed an automated, multiscale LinkNet workflow for quantifying cellular-level TILs for CRC tumors using H&E-stained images. The predictive performance of the automatic TIL scores (TIL) for disease progression and overall survival was evaluated using two international datasets, including 554 CRC patients from The Cancer Genome Atlas (TCGA) and 1130 CRC patients from Molecular and Cellular Oncology (MCO). Results: The LinkNet model provided an outstanding precision (0.9508), recall (0.9185), and overall F1 score (0.9347). Clear dose-response relationships were observed between TILs and the risk of disease progression or death in both TCGA and MCO cohorts. Both univariate and multivariate Cox regression analyses for the TCGA data demonstrated that patients with high TILs had significant (approx. 75%) reduction of risk for disease progression. In both MCO and TCGA studies, the TIL-high group was significantly associated with improved overall survival in univariate analysis (30% and 54% reduction in risk, respectively). However, potential confounding was observed in the MCO dataset. The favorable effects of high TILs were consistently observed in different subgroups according to known risk factors. Conclusion: A deep-learning workflow for automatic TIL quantification based on LinkNet was successfully developed.

Submitted 15 September, 2022; v1 submitted 23 August, 2022; originally announced August 2022.

arXiv:2208.10495 [pdf]

doi 10.1002/cjp2.312

Predicting microsatellite instability and key biomarkers in colorectal cancer from H&E-stained images: Achieving SOTA predictive performance with fewer data using Swin Transformer

Authors: Bangwei Guo, Xingyu Li, Jitendra Jonnagaddala, Hong Zhang, Xu Steven Xu

Abstract: Artificial intelligence (AI) models have been developed for predicting clinically relevant biomarkers, including microsatellite instability (MSI), for colorectal cancers (CRC). However, the current deep-learning networks are data-hungry and require large training datasets, which are often lacking in the medical domain. In this study, based on the latest Hierarchical Vision Transformer using Shifted Windows (Swin-T), we developed an efficient workflow for biomarkers in CRC (MSI, hypermutation, chromosomal instability, CpG island methylator phenotype, BRAF, and TP53 mutation) that only required relatively small datasets, but achieved the state-of-the-art (SOTA) predictive performance. Our Swin-T workflow not only substantially outperformed published models in an intra-study cross-validation experiment using the TCGA-CRC-DX dataset (N = 462), but also showed excellent generalizability in cross-study external validation and delivered a SOTA AUROC of 0.90 for MSI using the MCO dataset for training (N = 1065) and the same TCGA-CRC-DX for testing. Similar performance (AUROC=0.91) was achieved by Echle and colleagues using approximately 8000 training samples (ResNet18) on the same testing dataset. Swin-T was extremely efficient using small training datasets and exhibited robust predictive performance with only 200-500 training samples. These data indicate that Swin-T may be 5-10 times more efficient than the current state-of-the-art algorithms for MSI based on ResNet18 and ShuffleNet.
Furthermore, the Swin-T models showed promise as pre-screening tests for MSI status and BRAF mutation status, which could exclude and reduce the samples before the subsequent standard testing in a cascading diagnostic workflow to allow turnaround time reduction and cost saving.

Submitted 11 September, 2022; v1 submitted 21 August, 2022; originally announced August 2022.

arXiv:2208.09813 [pdf, other]

Genome-wide nucleotide-resolution model of single-strand break site reveals species evolutionary hierarchy

Authors: Sheng Xu, Junkang Wei, Yu Li

Abstract: Single-strand breaks (SSBs) are the major DNA damage in the genome arising spontaneously as the outcome of genotoxins and intermediates of DNA transactions. SSBs play a crucial role in various biological processes and show a non-random distribution in the genome. Several SSB detection approaches such as S1 END-seq and SSiNGLe-ILM emerged to characterize the genomic landscape of SSB with nucleotide resolution. However, these sequencing-based methods are costly and unfeasible for large-scale analysis of diverse species. Thus, we proposed the first computational approach, SSBlazer, which is an explainable and scalable deep learning framework for genome-wide nucleotide-resolution SSB site prediction. We demonstrated that SSBlazer can accurately predict SSB sites and sufficiently alleviate false positives by constructing an imbalanced dataset to simulate the realistic SSB distribution. The model interpretation analysis reveals that SSBlazer captures the pattern of individual CpG in genomic context and the motif of TGCC in the center region as critical features. Besides, SSBlazer is a lightweight model with robust cross-species generalization ability in the cross-species evaluation, which enables the large-scale genome-wide application in diverse species. Strikingly, the putative SSB genomic landscapes of 216 vertebrates reveal a negative correlation between SSB frequency and evolutionary hierarchy, suggesting that the genome tends toward integrity during evolution.

Submitted 21 August, 2022; originally announced August 2022.

arXiv:2207.09598 [pdf, other]

Deep learning-based identification of sub-nuclear structures in FIB-SEM images

Authors: Niraj Gupta, Eric J. Roberts, Song Pang, C. Shan Xu, Harald F. Hess, Fan Wu, Abby Dernburg, Danielle Jorgens, Petrus H. Zwart, Vignesh Kasinath

Abstract: Three-dimensional volumetric imaging of cells allows for in situ visualization, thus preserving contextual insights into cellular processes. Despite recent advances in machine learning methods, morphological analysis of sub-nuclear structures has proven challenging due to both the shallow contrast profile and the technical limitation in feature detection. Here, we present a convolutional neural network, supervised deep learning-based approach which can identify sub-nuclear structures with 90% accuracy. We develop and apply this model to C. elegans gonads imaged using focused ion beam milling combined with scanning electron microscopy resulting in the accurate identification and segmentation of all sub-nuclear structures including entire chromosomes. We discuss in depth the architecture, parameterization, and optimization of the deep learning model, as well as provide evaluation metrics to assess the quality of the network prediction. Lastly, we highlight specific aspects of the model that can be optimized for its broad application to other volumetric imaging data as well as in situ cryo-electron tomography.

Submitted 19 July, 2022; originally announced July 2022.

Comments: 15 pages, 10 figures, for eventual peer-reviewed journal publication

arXiv:2206.00455 [pdf]

A robust and lightweight deep attention multiple instance learning algorithm for predicting genetic alterations

Authors: Bangwei Guo, Xingyu Li, Miaomiao Yang, Hong Zhang, Xu Steven Xu

Abstract: Deep-learning models based on whole-slide digital pathology images (WSIs) have become increasingly popular for predicting molecular biomarkers. Instance-based models have been the mainstream strategy for predicting genetic alterations using WSIs although bag-based models along with self-attention mechanism-based algorithms have been proposed for other digital pathology applications. In this paper, we proposed a novel Attention-based Multiple Instance Mutation Learning (AMIML) model for predicting gene mutations. AMIML was comprised of successive 1-D convolutional layers, a decoder, and a residual weight connection to facilitate further integration of a lightweight attention mechanism to detect the most predictive image patches. Using data for 24 clinically relevant genes from four cancer cohorts in The Cancer Genome Atlas (TCGA) studies (UCEC, BRCA, GBM and KIRC), we compared AMIML with one popular instance-based model and four recently published bag-based models (e.g., CHOWDER, HE2RNA, etc.). AMIML demonstrated excellent robustness, not only outperforming all the five baseline algorithms in the vast majority of the tested genes (17 out of 24), but also providing near-best-performance for the other seven genes. Conversely, the performance of the baseline published algorithms varied across different cancers/genes.
In addition, compared to the published models for genetic alterations, AMIML provided a significant improvement for predicting a wide range of genes (e.g., KMT2C, TP53, and SETD2 for KIRC; ERBB2, BRCA1, and BRCA2 for BRCA; JAK1, POLE, and MTOR for UCEC) as well as produced outstanding predictive models for other clinically relevant gene mutations, which have not been reported in the current literature. Furthermore, with the flexible and interpretable attention-based MIL pooling mechanism, AMIML could further zero-in and detect predictive image patches.

Submitted 31 May, 2022; originally announced June 2022.

arXiv:2204.01593 [pdf]

Optimize Deep Learning Models for Prediction of Gene Mutations Using Unsupervised Clustering

Authors: Zihan Chen, Xingyu Li, Miaomiao Yang, Hong Zhang, Xu Steven Xu

Abstract: Deep learning has become the mainstream methodological choice for analyzing and interpreting whole-slide digital pathology images (WSIs). It is commonly assumed that tumor regions carry most predictive information. In this paper, we proposed an unsupervised clustering-based multiple-instance learning method, and applied it to develop deep-learning models for prediction of gene mutations using WSIs from three cancer types in The Cancer Genome Atlas (TCGA) studies (CRC, LUAD, and HNSCC). We showed that unsupervised clustering of image patches could help identify predictive patches, exclude patches lacking predictive information, and therefore improve prediction on gene mutations in all three different cancer types, compared with the WSI based method without selection of image patches and models based on only tumor regions. Additionally, our proposed algorithm outperformed two recently published baseline algorithms leveraging unsupervised clustering to assist model prediction. The unsupervised-clustering-based approach for mutation prediction allows identification of the spatial regions related to mutation of a specific gene via the resolved probability scores, highlighting the heterogeneity of a predicted genotype in the tumor microenvironment. Finally, our study also demonstrated that selection of tumor regions of WSIs is not always the best way to identify patches for prediction of gene mutations, and other tissue types in the tumor micro-environment may provide better prediction ability for gene mutations than tumor tissues.

Submitted 24 April, 2022; v1 submitted 31 March, 2022; originally announced April 2022.

arXiv:2201.09637 [pdf, other]

DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations

Authors: Yuanfeng Ji, Lu Zhang, Jiaxiang Wu, Bingzhe Wu, Long-Kai Huang, Tingyang Xu, Yu Rong, Lanqing Li, Jie Ren, Ding Xue, Houtim Lai, Shaoyong Xu, Jing Feng, Wei Liu, Ping Luo, Shuigeng Zhou, Junzhou Huang, Peilin Zhao, Yatao Bian

Abstract: AI-aided drug discovery (AIDD) is gaining increasing popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper and more efficient. In spite of its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with noise, which is inevitable in real world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction, which involves both macromolecule (protein target) and small-molecule (drug compound). In contrast to only providing fixed datasets, DrugOOD offers an automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations and rigorous benchmarking of state-of-the-art OOD algorithms. Since the molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems. Extensive empirical studies have shown a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that can allow for OOD generalization under noise for AIDD.

Submitted 24 January, 2022; originally announced January 2022.

Comments: 54 pages, 11 figures

arXiv:2201.02748 [pdf]

doi 10.1109/TOH.2021.3137833

Subtle Contact Nuances in the Delivery of Human-to-Human Touch Distinguish Emotional Sentiment

Authors: Shan Xu, Chang Xu, Sarah McIntyre, Håkan Olausson, Gregory J. Gerling

Abstract: We routinely communicate distinct social and emotional sentiments through nuanced touch. For example, we might gently hold another's arm to offer a sense of calm, yet intensively hold another's arm to express excitement or anxiety. As this example indicates, distinct sentiments may be shaped by the subtlety in one's touch delivery. This work investigates how slight distinctions in skin-to-skin contact influence both the recognition of cued emotional messages (e.g., anger, sympathy) and the rating of emotional content (i.e., arousal, valence). By self-selecting preferred gestures (e.g., holding, stroking), touchers convey distinct messages by touching the receiver's forearm. Skin-to-skin contact attributes (e.g., velocity, depth, area) are optically tracked in high resolution. Contact is then examined within gesture, between messages. The results indicate touchers subtly, but significantly, vary contact attributes of a gesture to communicate distinct messages, which are recognizable by receivers. This tuning also correlates with receivers' arousal and valence. For instance, arousal increases with velocity for stroking, and depth for holding. Moreover, as shown here with human-to-human touch, valence is tied with velocity, which is the same trend as reported with brushes. The findings indicate that subtle nuance in skin-to-skin contact is important in conveying social messages and inducing emotions.

Submitted 7 January, 2022; originally announced January 2022.

arXiv:2111.06425 [pdf, other]

Multiple Hypothesis Hypergraph Tracking for Posture Identification in Embryonic Caenorhabditis elegans

Authors: Andrew Lauziere, Evan Ardiel, Stephen Xu, Hari Shroff

Abstract: Current methods in multiple object tracking (MOT) rely on independent object trajectories undergoing predictable motion to effectively track large numbers of objects. Adversarial conditions such as volatile object motion and imperfect detections create a challenging tracking landscape in which established methods may yield inadequate results. Multiple hypothesis hypergraph tracking (MHHT) is developed to perform MOT among interdependent objects amid noisy detections. The method extends traditional multiple hypothesis tracking (MHT) via hypergraphs to model correlated object motion, allowing for robust tracking in challenging scenarios. MHHT is applied to perform seam cell tracking during late-stage embryogenesis in embryonic C. elegans.

Submitted 8 July, 2022; v1 submitted 11 November, 2021; originally announced November 2021.

arXiv:2111.01969 [pdf, other]

PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism

Authors: Yingying Wu, Shusheng Xu, Shing-Tung Yau, Yi Wu

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate. Natural selection can generate favorable mutations with improved fitness advantages; however, the identified coronaviruses may be the tip of the iceberg, and potentially more fatal variants of concern (VOCs) may emerge over time. Understanding the patterns of emerging VOCs and forecasting mutations that may lead to gain of function or immune escape is urgently required. Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage. In order to identify complex dependencies between the elements of each input sequence, PhyloTransformer utilizes advanced modeling techniques, including a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+) from Performer, and the Masked Language Model (MLM) from Bidirectional Encoder Representations from Transformers (BERT). PhyloTransformer was trained with 1,765,297 genetic sequences retrieved from the Global Initiative for Sharing All Influenza Data (GISAID) database. Firstly, we compared the prediction accuracy of novel mutations and novel combinations using extensive baseline models; we found that PhyloTransformer outperformed every baseline method with statistical significance. Secondly, we examined predictions of mutations in each nucleotide of the receptor binding motif (RBM), and we found our predictions were precise and accurate. Thirdly, we predicted modifications of N-glycosylation sites to identify mutations associated with altered glycosylation that may be favored during viral evolution. We anticipate that PhyloTransformer may guide proactive vaccine design for effective targeting of future SARS-CoV-2 variants.

Submitted 2 November, 2021; originally announced November 2021.

arXiv:2110.04921 [pdf, other]

doi 10.1364/OE.445001

Increasing a microscope's effective field of view via overlapped imaging and machine learning

Authors: Xing Yao, Vinayak Pathak, Haoran Xi, Amey Chaware, Colin Cooke, Kanghyun Kim, Shiqi Xu, Yuting Li, Timothy Dunn, Pavan Chandra Konda, Kevin C. Zhou, Roarke Horstmeyer

Abstract: This work demonstrates a multi-lens microscopic imaging system that overlaps multiple independent fields of view on a single sensor for high-efficiency automated specimen analysis. Automatic detection, classification and counting of various morphological features of interest is now a crucial component of both biomedical research and disease diagnosis. While convolutional neural networks (CNNs) have dramatically improved the accuracy of counting cells and sub-cellular features from acquired digital image data, the overall throughput is still typically hindered by the limited space-bandwidth product (SBP) of conventional microscopes. Here, we show both in simulation and experiment that overlapped imaging and co-designed analysis software can achieve accurate detection of diagnostically-relevant features for several applications, including counting of white blood cells and the malaria parasite, leading to multi-fold increase in detection and processing throughput with minimal reduction in accuracy.

Submitted 10 October, 2021; originally announced October 2021.

arXiv:2107.01422 [pdf, other]

Imaging dynamics beneath turbid media via parallelized single-photon detection

Authors: Shiqi Xu, Xi Yang, Wenhui Liu, Joakim Jonsson, Ruobing Qian, Pavan Chandra Konda, Kevin C. Zhou, Lucas Kreiss, Qionghai Dai, Haoqian Wang, Edouard Berrocal, Roarke Horstmeyer

Abstract: Noninvasive optical imaging through dynamic scattering media has numerous important biomedical applications but still remains a challenging task. While standard diffuse imaging methods measure optical absorption or fluorescent emission, it is also well-established that the temporal correlation of scattered coherent light diffuses through tissue much like optical intensity. Few works to date, however, have aimed to experimentally measure and process such temporal correlation data to demonstrate deep-tissue video reconstruction of decorrelation dynamics. In this work, we utilize a single-photon avalanche diode (SPAD) array camera to simultaneously monitor the temporal dynamics of speckle fluctuations at the single-photon level from 12 different phantom tissue surface locations delivered via a customized fiber bundle array. We then apply a deep neural network to convert the acquired single-photon measurements into video of scattering dynamics beneath rapidly decorrelating tissue phantoms. We demonstrate the ability to reconstruct images of transient (0.1-0.4 s) dynamic events occurring up to 8 mm beneath a decorrelating tissue phantom with millimeter-scale resolution, and highlight how our model can flexibly extend to monitor flow speed within buried phantom vessels.

Submitted 12 June, 2022; v1 submitted 3 July, 2021; originally announced July 2021.

arXiv:2012.03303 [pdf, other]

doi 10.1016/j.bpj.2021.06.020

A Tridomain Model for Potassium Clearance in Optic Nerve of Necturus

Authors: Yi Zhu, Shixin Xu, Robert S. Eisenberg, Huaxiong Huang

Abstract: The accumulation of potassium in the narrow space outside nerve cells is a classical subject of biophysics that has received much attention recently. It may be involved in potassium accumulation including spreading depression, perhaps migraine and some kinds of epilepsy, even (speculatively) learning. Quantitative analysis is likely to help evaluate the role of potassium clearance from the extracellular space after a train of action potentials. Clearance involves three structures that extend down the length of the nerve: glia, extracellular space, and axon and so need to be described as systems distributed in space in the tradition used for electrical potential in the 'cable equations' of nerve since the work of Hodgkin in 1937. A three-compartment model is proposed here for the optic nerve and is used to study the accumulation of potassium and its clearance. The model allows the convection, diffusion, and electrical migration of water and ions. We depend on the data of Orkand et al. to ensure the relevance of our model and align its parameters with the anatomy and properties of membranes, channels, and transporters: our model fits their experimental data quite well. The aligned model shows that glia has an important role in buffering potassium, as expected. The model shows that potassium is cleared mostly by convective flow through the syncytia of glia driven by osmotic pressure differences. A simplified model might be possible, but it must involve flow down the length of the optic nerve. It is easy for compartment models to neglect this flow. Our model can be used for structures quite different from the optic nerve that might have different distributions of channels and transporters in its three compartments. It can be generalized to include a fourth (distributed) compartment representing blood vessels to deal with the glymphatic flow into the circulatory system.

Submitted 16 May, 2021; v1 submitted 6 December, 2020; originally announced December 2020.

Comments: 35 pages, 13 figures

MSC Class: 92C05 92C37 35Q92

arXiv:1908.08190 [pdf, ps, other]

doi 10.1088/1478-3975/ab6754

Diversity in Biology: definitions, quantification, and models

Authors: Song Xu, Lucas Böttcher, Tom Chou

Abstract: Diversity indices are useful single-number metrics for characterizing a complex distribution of a set of attributes across a population of interest. The utility of these different metrics or sets of metrics depends on the context and application, and whether a predictive mechanistic model exists. In this topical review, we first summarize the relevant mathematical principles underlying heterogeneity in a large population before outlining the various definitions of 'diversity' and providing examples of scientific topics in which its quantification plays an important role. We then review how diversity has been a ubiquitous concept across multiple fields including ecology, immunology, cellular barcoding experiments, and socioeconomic studies. Since many of these applications involve sampling of populations, we also review how diversity in small samples is related to the diversity in the entire population. Features that arise in each of these applications are highlighted.

Submitted 4 March, 2020; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: Revised, corrected, and in press, 22 pages, 9 figures, 1 table

arXiv:1902.03510 [pdf]

doi 10.1109/TCBB.2018.2886334

Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition

Authors: Xiao-Hui Yang, Li Tian, Yun-Mei Chen, Li-Jun Yang, Shuang Xu, Wen-Ming Wu

Abstract: Sparse representation based classification (SRC) methods have achieved remarkable results. SRC methods, however, still suffer from the need for sufficient training samples, insufficient use of test samples, and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is first proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, a functional analysis of candidate pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.

Submitted 27 June, 2019; v1 submitted 9 February, 2019; originally announced February 2019.

Comments: 14 pages, 19 figures, 10 tables

Journal ref: IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018

arXiv:1810.04162 [pdf, other]

doi 10.1016/j.bpj.2019.02.007

A Bidomain Model for Lens Microcirculation

Authors: Yi Zhu, Shixin Xu, Robert S. Eisenberg, Huaxiong Huang

Abstract: There exists a large body of research on the lens of the mammalian eye over the past several decades. The objective of the current work is to provide a link between the most recent computational models and some of the pioneering work in the 1970s and 80s. We introduce a general non-electro-neutral model to study the microcirculation in the lens of the eye. It describes the steady state relationships among ion fluxes, water flow and electric field inside cells, and in the narrow extracellular spaces between cells in the lens. Using asymptotic analysis, we derive a simplified model based on physiological data and compare our results with those in the literature. We show that our simplified model can be reduced further to the first generation models while our full model is consistent with the most recent computational models. In addition, our simplified model captures the main features of the full model. Our results serve as a useful intermediate link between the computational models and the first generation analytical models.

Submitted 22 May, 2019; v1 submitted 9 October, 2018; originally announced October 2018.

Journal ref: Biophysical Journal, 2019, 116, (6), pp. 1171-1184

arXiv:1808.08948 [pdf, ps, other]

doi 10.1088/1751-8121/aadcb4

Immigration-induced phase transition in a regulated multispecies birth-death process

Authors: Song Xu, Tom Chou

Abstract: Power-law-distributed species counts or clone counts arise in many biological settings such as multispecies cell populations, population genetics, and ecology. This empirical observation that the number of species $c_{k}$ represented by $k$ individuals scales as negative powers of $k$ is also supported by a series of theoretical birth-death-immigration (BDI) models that consistently predict many low-population species, a few intermediate-population species, and very high-population species. However, we show how a simple global population-dependent regulation in a neutral BDI model destroys the power law distributions. Simulation of the regulated BDI model shows a high probability of observing a high-population species that dominates the total population. Further analysis reveals that the origin of this breakdown is associated with the failure of a mean-field approximation for the expected species abundance distribution. We find an accurate estimate for the expected distribution $\langle c_k \rangle$ by mapping the problem to a lower-dimensional Moran process, allowing us to also straightforwardly calculate the covariances $\langle c_k c_\ell \rangle$. Finally, we exploit the concepts associated with energy landscapes to explain the failure of the mean-field assumption by identifying a phase transition in the quasi-steady-state species counts triggered by a decreasing immigration rate.

Submitted 27 August, 2018; originally announced August 2018.

Comments: 22 pages, 8 figures, accepted to J. Phys. A

arXiv:1806.00646 [pdf]

Osmosis through a Semi-permeable Membrane: a Consistent Approach to Interactions

Authors: Shixin Xu, Bob Eisenberg, Zilong Song, Huaxiong Huang

Abstract: The movement of ionic solutions is an essential part of biology and technology. Fluidics, from nano- to micro- to microfluidics, is a burgeoning area of technology which is all about the movement of ionic solutions, on various scales. Many cells, tissues, and organs of animals and plants depend on osmosis, as the movement of fluids is called in biology. Indeed, the movement of fluids through channel proteins (that have a hole down their middle) is fluidics on an atomic scale. Ionic fluids are complex fluids, with energy stored in many ways. Ionic fluids flow driven by gradients of concentration, chemical and electrical potential, and hydrostatic pressure. Each flow is classically described by its own field theory, independent of the others, but of course, in reality every gradient drives every kind of flow to a varying extent. Combining field equations is tricky and so the theory of complex fluids derives the equations, rather than assumes their interactions. When field equations are derived, rather than assumed, their variables are consistent. That is to say all variables satisfy all equations under all conditions with one set of parameters. Here we treat a classical osmotic cell in this spirit, using a sharp interface method to derive boundary conditions consistent with all flows and fields. We allow volume to change with concentration, since changes of volume are a property of ionic solutions known to all who make them in the laboratory. We consider flexible and inflexible membranes. We show how to combine the energetics of the membrane with the energetics of the surrounding complex fluids. The results seem general but need application to specific situations of technological, biological and experimental importance before the consequences of consistency can be understood.

Submitted 7 June, 2018; v1 submitted 2 June, 2018; originally announced June 2018.

Comments: typos corrected; equations reformatted a bit; masking of part of Fig.1 corrected

arXiv:1802.01980 [pdf]

Layered structure and leveled function of a human brain

Authors: Shengyong Xu, Jingjing Xu, Rujun Dai

Abstract: The anatomically layered structure of a human brain results in leveled functions. In all these levels of different functions, comparison, feedback and imitation are the universal and crucial mechanisms. Languages, symbols and tools play key roles in the development of the human brain and of civilization as a whole.

Submitted 4 February, 2018; originally announced February 2018.

arXiv:1712.08309 [pdf]

Bacterial cooperation leads to heteroresistance

Authors: Shilian Xu, Jiaru Yang, Chong Yin

Abstract: By challenging E. coli with sublethal norfloxacin for 10 days, Henry Lee and James Collins suggest that bacterial altruism leads to population-wide resistance. By analyzing the experimental data in detail, we suggest that bacterial cooperation leads to population-wide resistance under norfloxacin pressure, and we propose the bacteria shield as a possible feedback mechanism of less resistant bacteria. In the bacteria shield, less resistant bacteria sacrifice themselves in large numbers to consume norfloxacin and thereby relieve the norfloxacin burden on highly resistant bacteria. Thus, because highly resistant and less resistant bacteria are extracted from the same bacterial population, bacterial cooperation leads to heteroresistance.

Submitted 22 December, 2017; originally announced December 2017.

arXiv:1711.05042 [pdf]

A memory mechanism based on two dimensional code of neurosome pattern

Authors: Shengyong Xu, Jingjing Xu

Abstract: We have recognized that 2D codes, i.e., a group of strongly connected neurosomes that can be simultaneously excited, are the basic data carriers for memory in a brain. An echoing mechanism between two neighboring layers of neurosomes is assumed to establish temporary memory, and repeating processes enhance the formation of long-term memory. Creation and degradation of memory information are statistical. The maximum capacity of memory storage in a human brain is estimated to be one billion 2D codes. By triggering one or more neurosomes in a neurosome-based 2D code, the whole strongly connected neurosome network is capable of exciting simultaneously and projecting its excitation onto an analysis layer of neurons in the cortex, thus retrieving the stored memory data. The capability of comparing two 2D codes in the analysis layer is one of the major brain functions.

Submitted 14 November, 2017; originally announced November 2017.

Comments: 9 pages, 2 figures

arXiv:1711.00045 [pdf]

Retention Time of Peptides in Liquid Chromatography Is Well Estimated upon Deep Transfer Learning

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: A fully automatic predictor of peptide retention time (RT) in liquid chromatography (LC), termed DeepRT, was developed using a deep learning approach, an ensemble of Residual Network (ResNet) and Long Short-Term Memory (LSTM). In contrast to the traditional predictor based on hand-crafted features for peptides, DeepRT learns features from raw amino acid sequences and makes relatively accurate predictions of peptide RTs, with an R2 of 0.987 for unmodified peptides. Furthermore, by virtue of transfer learning, DeepRT enables utilization of peptide datasets generated under different LC conditions and with different modification status, resulting in RT prediction with an R2 of 0.992 for unmodified peptides and 0.978 for post-translationally modified peptides. Even though the chromatographic behaviors of peptides are quite complicated, the study here demonstrated that peptide RT prediction could be largely improved by deep transfer learning. The DeepRT software is freely available at https://github.com/horsepurve/DeepRT, under Apache2 open source License.

Submitted 31 October, 2017; originally announced November 2017.

Comments: 13-page research article

arXiv:1705.05368 [pdf]

DeepRT: deep learning for peptide retention time prediction in proteomics

Authors: Chunwei Ma, Zhiyong Zhu, Jun Ye, Jiarui Yang, Jianguo Pei, Shaohang Xu, Ruo Zhou, Chang Yu, Fan Mo, Bo Wen, Siqi Liu

Abstract: Accurate predictions of peptide retention times (RT) in liquid chromatography have many applications in mass spectrometry-based proteomics. Herein, we present DeepRT, a deep learning based software for peptide retention time prediction. DeepRT automatically learns features directly from the peptide sequences using the deep convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) model, which eliminates the need to use hand-crafted features or rules. After the feature learning, principal component analysis (PCA) was used for dimensionality reduction, then three conventional machine learning methods were utilized to perform modeling. Two published datasets were used to evaluate the performance of DeepRT and we demonstrate that DeepRT greatly outperforms previous state-of-the-art approaches ELUDE and GPTime.

Submitted 15 May, 2017; originally announced May 2017.

arXiv:1404.7766 [pdf]

Genome-wide Scan of Archaic Hominin Introgressions in Eurasians Reveals Complex Admixture History

Authors: Ya Hu, Yi Wang, Qiliang Ding, Yungang He, Minxian Wang, Jiucun Wang, Shuhua Xu, Li Jin

Abstract: Introgressions from Neanderthals and Denisovans were detected in modern humans. Introgressions from other archaic hominins were also implicated; however, identifying them poses a great technical challenge. Here, we introduced an approach for identifying introgressions from all possible archaic hominins in Eurasian genomes, without referring to archaic hominin sequences. We focused on mutations that emerged in archaic hominins after their divergence from modern humans (denoted as archaic-specific mutations), and identified introgressive segments which showed significant enrichment of archaic-specific mutations over the rest of the genome. Furthermore, boundaries of introgressions were identified using a dynamic programming approach to partition the whole genome into segments which contained different levels of archaic-specific mutations. We found that detected introgressions shared more archaic-specific mutations with Altai Neanderthal than they shared with Denisovan, and 60.3% of archaic hominin introgressions were from Neanderthals. Furthermore, we detected more introgressions from two unknown archaic hominins who diverged from modern humans approximately 859 and 3,464 thousand years ago. The latter unknown archaic hominin contributed to the genomes of the common ancestors of modern humans and Neanderthals. In total, archaic hominin introgressions comprised 2.4% of Eurasian genomes. The above results suggested a complex admixture history among hominins. The proposed approach could also facilitate admixture research across species.

Submitted 30 April, 2014; originally announced April 2014.

Comments: 42 Pages, 1 Table, 4 Figures, 1 Supplementary Table, and 10 Supplementary Figures

arXiv:1311.7328 [pdf]

H3K4 mono- and di-methyltransferase MLL4 is required for enhancer activation during cell differentiation

Authors: Ji-Eun Lee, Chaochen Wang, Shiliyang Xu, Young-Wook Cho, Lifeng Wang, Xuesong Feng, Vittorio Sartorelli, Anne Baldridge, Weiqun Peng, Kai Ge

Abstract: Enhancers play a central role in cell-type-specific gene expression and are marked by H3K4me1/2. Active enhancers are further marked by H3K27ac. However, the methyltransferases responsible for H3K4me1/2 on enhancers remain elusive. Furthermore, how these enzymes function on enhancers to regulate cell-type-specific gene expression is unclear. Here we identify MLL4 (KMT2D) as a major mammalian H3K4 mono- and di-methyltransferase with partial functional redundancy with MLL3 (KMT2C). Using adipogenesis and myogenesis as model systems, we show that MLL4 exhibits cell-type- and differentiation-stage-specific genomic binding and is predominantly localized on enhancers. MLL4 co-localizes with lineage-determining transcription factors (TFs) on active enhancers during differentiation. Deletion of MLL4 markedly decreases H3K4me1/2, H3K27ac, Polymerase II and Mediator levels on enhancers and leads to severe defects in cell-type-specific gene expression and cell differentiation. Together, these findings identify MLL4 as a major mammalian H3K4 mono- and di-methyltransferase essential for enhancer activation during cell differentiation.

Submitted 28 November, 2013; originally announced November 2013.

Comments: eLife 2013

arXiv:1311.2180 [pdf, ps, other]

Adaptive Epidemic Dynamics in Networks: Thresholds and Control

Authors: Shouhuai Xu, Wenlian Lu, Li Xu, Zhenxin Zhan

Abstract: Theoretical modeling of computer virus/worm epidemic dynamics is an important problem that has attracted many studies. However, most existing models are adapted from biological epidemic ones. Although biological epidemic models can certainly be adapted to capture some computer virus spreading scenarios (especially when the so-called homogeneity assumption holds), the problem of computer virus spreading is not well understood because it has many important perspectives that are not necessarily accommodated in the biological epidemic models. In this paper we initiate the study of such a perspective, namely that of adaptive defense against epidemic spreading in arbitrary networks. More specifically, we investigate a non-homogeneous Susceptible-Infectious-Susceptible (SIS) model where the model parameters may vary with respect to time. In particular, we focus on two scenarios we call semi-adaptive defense and fully-adaptive defense, which accommodate implicit and explicit dependency relationships between the model parameters, respectively. In the semi-adaptive defense scenario, the model's input parameters are given; the defense is semi-adaptive because the adjustment is implicitly dependent upon the outcome of virus spreading. For this scenario, we present a set of sufficient conditions (some are more general or succinct than others) under which the virus spreading will die out; such sufficient conditions are also known as epidemic thresholds in the literature. In the fully-adaptive defense scenario, some input parameters are not known (i.e., the aforementioned sufficient conditions are not applicable) but the defender can observe the outcome of virus spreading. For this scenario, we present adaptive control strategies under which the virus spreading will die out or will be contained to a desired level.

Submitted 1 April, 2016; v1 submitted 9 November, 2013; originally announced November 2013.

Comments: 20 pages, 8 figures. This paper was submitted in March 2009, revised in August 2009, and accepted in December 2009. However, the paper was not officially published until 2014 due to non-technical reasons

ACM Class: K.6.5

Journal ref: ACM Transactions on Autonomous and Adaptive Systems (TAAS), 8(4), Article 19, 2014

arXiv:1307.2506 [pdf, other]

Landscape construction in non-gradient dynamics: A case from evolution

Authors: Song Xu, Xinan Wang, Shuyun Jiao

Abstract: Adaptive landscape has been a fundamental concept in many branches of modern biology since Wright's first proposition in 1932. Meanwhile, the general existence of landscape remains controversial. The causes include the mixed uses of different landscape definitions with their own different aims and advantages. Sometimes the difficulty and the impossibility of the landscape construction for complex models are also equated. To clarify these confusions, based on a recent formulation of Wright's theory, the current authors construct generalized adaptive landscape in a two-loci population model with non-gradient dynamics, where the conventional gradient landscape does not exist. On the generalized landscape, a population moves along an evolutionary trajectory which always increases or conserves adaptiveness but does not necessarily follow the steepest gradient direction. Comparisons of different aspects of various landscapes lead to a conclusion that the generalized landscape is a possible direction to continue the exploration of Wright's theory for complex dynamics.

Submitted 9 December, 2015; v1 submitted 9 July, 2013; originally announced July 2013.

Comments: arXiv admin note: text overlap with arXiv:q-bio/0605020 by other authors

arXiv:1304.4337 [pdf, other]

doi 10.1103/PhysRevE.89.012724

Two-timescale evolution on a singular landscape

Authors: Song Xu, Shuyun Jiao, Pengyao Jiang, Ping Ao

Abstract: Under the effect of strong genetic drift, it is highly probable to observe gene fixation or gene loss in a population, shown by infinite peaks on a coherently constructed potential energy landscape. It is then important to ask what such singular peaks imply, with or without the effects of other biological factors. We studied the stochastic escape time from the infinite potential peaks in the Wright-Fisher model, where the typical two-scale diffusion dynamics was observed via computer simulations. We numerically found the average escape time for all the bi-stable cases and analytically approximated the results under weak mutations and selections by calculating the mean first passage time (MFPT) in a singular potential peak. Our results showed that Kramers' classical escape formula can be extended to the models with non-Gaussian probability distributions, overcoming constraints in previous methods. The constructed landscape provides a global and coherent description of the system's evolutionary dynamics, allowing new biological results to be generated.

Submitted 16 April, 2013; originally announced April 2013.

Comments: arXiv admin note: text overlap with arXiv:1108.1484
