-
GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Authors:
Changxi Chi,
Jun Xia,
Jingbo Zhou,
Jiabei Cheng,
Chang Yu,
Stan Z. Li
Abstract:
Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information and rely solely on simple evaluation metrics to construct coarse-grained GRNs. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.
Submitted 5 May, 2025;
originally announced May 2025.
-
On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing
Authors:
Samantha Petti,
Carlos Martí-Gómez,
Justin B. Kinney,
Juannan Zhou,
David M. McCandlish
Abstract:
Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence-to-function maps and (ii) decomposing sequence-function maps to elucidate the contributions of individual subsequences. Because each sequence-function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires "gauge-fixing," i.e., defining a unique representation for each map. Recent work has established that most existing gauge-fixed representations arise as the unique solutions to $L_2$-regularized regression in an overparameterized "weight space" where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in "function space," i.e. the space of all real-valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge. We also show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges. Next, we derive the distribution of gauge-fixed weights implied by the Gaussian process posterior and demonstrate that even for long sequences this distribution can be efficiently computed for product-kernel priors using a kernel trick. Finally, we characterize the implicit function space priors associated with the most common weight space regularizers. Overall, our framework unifies and extends our ability to infer and interpret sequence-function relationships.
Submitted 26 April, 2025;
originally announced April 2025.
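The gauge freedom mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' code: it assumes a purely additive (site-independent) model over a hypothetical 4-letter alphabet with made-up weights, and the helper names `evaluate` and `to_zero_sum_gauge` are illustrative. It shows that shifting each position's weights to mean zero, while moving the mean into the constant term, changes the parameters but not the function.

```python
from itertools import product

ALPHABET = "ACGT"   # assumed alphabet, for illustration only
LENGTH = 2          # assumed sequence length


def evaluate(theta0, theta, seq):
    """Additive model: f(s) = theta0 + sum over positions l of theta[l][s_l]."""
    return theta0 + sum(theta[l][c] for l, c in enumerate(seq))


def to_zero_sum_gauge(theta0, theta):
    """Shift each position's weights to mean zero and absorb the mean into theta0."""
    new_theta0 = theta0
    new_theta = []
    for position_weights in theta:
        mean = sum(position_weights.values()) / len(position_weights)
        new_theta0 += mean
        new_theta.append({c: w - mean for c, w in position_weights.items()})
    return new_theta0, new_theta


# One arbitrary parameterization (values are made up).
theta0 = 1.0
theta = [
    {"A": 0.5, "C": -0.2, "G": 0.0, "T": 1.5},
    {"A": -1.0, "C": 0.3, "G": 0.7, "T": 0.0},
]

g0, g = to_zero_sum_gauge(theta0, theta)

# Compare the two parameterizations on every sequence of the given length.
sequences = ["".join(p) for p in product(ALPHABET, repeat=LENGTH)]
max_diff = max(
    abs(evaluate(theta0, theta, s) - evaluate(g0, g, s)) for s in sequences
)
print("largest difference between the two parameterizations:", max_diff)
```

Both parameter sets describe the same sequence-function map, so the individual weights only become interpretable once a convention such as the mean-zero one used here is fixed.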
-
Advanced Deep Learning Methods for Protein Structure Prediction and Design
Authors:
Yichao Zhang,
Ningyuan Deng,
Xinyuan Song,
Ziqian Bi,
Tianyang Wang,
Zheyu Yao,
Keyu Chen,
Ming Li,
Qian Niu,
Junyu Liu,
Benji Peng,
Sen Zhang,
Ming Liu,
Li Zhang,
Xuanhe Pan,
Jinlang Wang,
Pohsun Feng,
Yizhu Wen,
Lawrence KQ Yan,
Hongming Tseng,
Yan Zhong,
Yunze Wang,
Ziyuan Qin,
Bowen Jing,
Junjie Yang
, et al. (3 additional authors not shown)
Abstract:
After AlphaFold won the Nobel Prize, protein prediction with deep learning once again became a hot topic. We comprehensively explore advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion based frameworks and novel pairwise attention modules. The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture, thereby illustrating the current state of the art in computational protein modelling. Subsequent chapters focus on practical applications, presenting case studies that range from individual protein predictions to complex biomolecular interactions. Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored. The later sections review the industry landscape of protein design, highlighting the transformative role of artificial intelligence in biotechnology and discussing emerging market trends and future challenges. Supplementary appendices provide essential resources such as databases and open source tools, making this volume a valuable reference for researchers and students.
Submitted 29 March, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
MOL-Mamba: Enhancing Molecular Representation with Structural & Electronic Insights
Authors:
Jingjing Hu,
Dan Guo,
Zhan Si,
Deguang Liu,
Yunfeng Diao,
Jing Zhang,
Jinxing Zhou,
Meng Wang
Abstract:
Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of self-supervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom & Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.
Submitted 5 February, 2025; v1 submitted 20 December, 2024;
originally announced December 2024.
-
Discovering Multi-omic Biomarkers for Prostate Cancer Severity Using Machine Learning
Authors:
Jefferson Zhou,
Kahn Rhrissorrakrai
Abstract:
Prostate cancer is the second most common form of cancer, though most patients have a positive prognosis with many experiencing long-term survival with current treatment options. Yet each treatment carries varying levels of intensity and side effects, so determining the severity of prostate cancer is an important criterion in selecting the most appropriate treatment. The Gleason score is the most common grading system used to judge the severity of prostate cancer, but much of the grading process can be affected by human error or subjectivity. Finding biomarkers for prostate cancer Gleason scores in a quantitative, machine-driven approach could enable pathologists to validate their assessment of a patient's cancer sample by examining such biomarkers. In our study, we identified biomarkers from multi-omics data using machine learning, statistical tools, and deep learning to train models against the Gleason score and capture the most important features that could potentially serve as biomarkers for the Gleason score. Through this process, multiple genes, such as COL1A1 and SFRP4, and cell cycle pathways, such as G2M checkpoint, E2F targets, and the PLK1 pathway, were found to be important predictive features for particular Gleason scores. The combination of these analytical methods shows potential for more accurate grading of prostate cancer, and greater understanding of biological processes behind prostate cancer severity that could provide additional therapeutic targets.
Submitted 29 October, 2024;
originally announced October 2024.
-
A Selfish Herd with a Target
Authors:
Thomas Stemler,
Shannon Dee Algar,
Jesse Zhou
Abstract:
One of the most striking phenomena in biological systems is the tendency for biological agents to spatially aggregate, and subsequently display further collective behaviours such as rotational motion. One prominent explanation for why agents tend to aggregate is known as the selfish herd hypothesis (SHH). The SHH proposes that each agent has a "domain of danger" whose area is proportional to the risk of predation. The SHH proposes that aggregation occurs as a result of agents seeking to minimise the area of their domain. Subsequent attempts to model the SHH have had varying success in displaying aggregation, and have mostly been unable to exhibit further collective behaviours, such as aligned motion or milling. Here, we introduce a model that seeks to generalise the principles of previous SHH models, by allowing agents to aim for domains of a specific (possibly non-minimal) area or a range of areas and study the resulting collective dynamics. Moreover, the model incorporates the lack of information that biological agents have by limiting the range of movement and vision of the agents. The model shows that the possibility of further collective motion is heavily dependent on the domain area the agents aim for - with several distinct phases of collective behaviour.
Submitted 16 October, 2024;
originally announced October 2024.
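The "domain of danger" in the abstract above can be made concrete with a rough sketch. It assumes, as in the classical selfish herd formulation, that an agent's domain is the set of points closer to it than to any other agent (its Voronoi cell), here clipped to a unit square and approximated by counting grid points. The function name `domain_areas`, the square arena, and the agent positions are illustrative assumptions, not taken from the paper.

```python
def domain_areas(agents, resolution=200):
    """Approximate each agent's domain area inside the unit square.

    Every grid-cell centre is assigned to its nearest agent; an agent's
    area is the fraction of cell centres assigned to it.
    """
    counts = [0] * len(agents)
    for i in range(resolution):
        for j in range(resolution):
            x = (i + 0.5) / resolution
            y = (j + 0.5) / resolution
            nearest = min(
                range(len(agents)),
                key=lambda k: (agents[k][0] - x) ** 2 + (agents[k][1] - y) ** 2,
            )
            counts[nearest] += 1
    cell_area = 1.0 / resolution ** 2
    return [count * cell_area for count in counts]


# Two agents close together and one isolated agent (made-up positions).
agents = [(0.1, 0.5), (0.2, 0.5), (0.8, 0.5)]
areas = domain_areas(agents)
for position, area in zip(agents, areas):
    print(position, round(area, 3))
```

Under this reading, a model in which agents aim for a target domain area would compare each entry of `areas` with the agent's target and move so as to reduce the gap; how that movement rule is defined is specific to the paper and is not sketched here.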
-
Brain-JEPA: Brain Dynamics Foundation Model with Gradient Positioning and Spatiotemporal Masking
Authors:
Zijian Dong,
Ruilin Li,
Yilei Wu,
Thuan Tinh Nguyen,
Joanna Su Xian Chong,
Fang Ji,
Nathanael Ren Jie Tong,
Christopher Li Hsian Chen,
Juan Helen Zhou
Abstract:
We introduce Brain-JEPA, a brain dynamics foundation model with the Joint-Embedding Predictive Architecture (JEPA). This pioneering model achieves state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction through fine-tuning. Furthermore, it excels in off-the-shelf evaluations (e.g., linear probing) and demonstrates superior generalizability across different ethnic groups, surpassing the previous large model for brain activity significantly. Brain-JEPA incorporates two innovative techniques: Brain Gradient Positioning and Spatiotemporal Masking. Brain Gradient Positioning introduces a functional coordinate system for brain functional parcellation, enhancing the positional encoding of different Regions of Interest (ROIs). Spatiotemporal Masking, tailored to the unique characteristics of fMRI data, addresses the challenge of heterogeneous time-series patches. These methodologies enhance model performance and advance our understanding of the neural circuits underlying cognition. Overall, Brain-JEPA is paving the way to address pivotal questions of building brain functional coordinate system and masking brain activity at the AI-neuroscience interface, and setting a potentially new paradigm in brain activity analysis through downstream adaptation.
Submitted 28 September, 2024;
originally announced September 2024.
-
Towards Within-Class Variation in Alzheimer's Disease Detection from Spontaneous Speech
Authors:
Jiawen Kang,
Dongrui Han,
Lingwei Meng,
Jingyan Zhou,
Jinchao Li,
Xixin Wu,
Helen Meng
Abstract:
Alzheimer's Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detection tasks lack fine-grained labels, simplistic binary classification may overlook two crucial aspects: within-class differences and instance-level imbalance. The former compels the model to map AD samples with varying degrees of impairment to a single diagnostic label, disregarding certain changes in cognitive function, while the latter biases the model towards overrepresented severity levels. This work presents early efforts to address these challenges. We propose two novel methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting the two problems respectively. Experiments on the ADReSS and ADReSSo datasets demonstrate that the proposed methods significantly improve detection accuracy. Further analysis reveals that SoTD effectively harnesses the strengths of multiple component models, while InRe substantially alleviates model over-fitting. These findings provide insights for developing more robust and reliable AD detection models.
Submitted 21 September, 2024;
originally announced September 2024.
-
Prompt Your Brain: Scaffold Prompt Tuning for Efficient Adaptation of fMRI Pre-trained Model
Authors:
Zijian Dong,
Yilei Wu,
Zijiao Chen,
Yichi Zhang,
Yueming Jin,
Juan Helen Zhou
Abstract:
We introduce Scaffold Prompt Tuning (ScaPT), a novel prompt-based framework for adapting large-scale functional magnetic resonance imaging (fMRI) pre-trained models to downstream tasks, with high parameter efficiency and improved performance compared to fine-tuning and baselines for prompt tuning. The full fine-tuning updates all pre-trained parameters, which may distort the learned feature space and lead to overfitting with limited training data which is common in fMRI fields. In contrast, we design a hierarchical prompt structure that transfers the knowledge learned from high-resource tasks to low-resource ones. This structure, equipped with a Deeply-conditioned Input-Prompt (DIP) mapping module, allows for efficient adaptation by updating only 2% of the trainable parameters. The framework enhances semantic interpretability through attention mechanisms between inputs and prompts, and it clusters prompts in the latent space in alignment with prior knowledge. Experiments on public resting state fMRI datasets reveal ScaPT outperforms fine-tuning and multitask-based prompt tuning in neurodegenerative diseases diagnosis/prognosis and personality trait prediction, even with fewer than 20 participants. It highlights ScaPT's efficiency in adapting pre-trained fMRI models to low-resource tasks.
Submitted 20 August, 2024;
originally announced August 2024.
-
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
Authors:
Zhiyuan Chen,
Tianhao Chen,
Chenggang Xie,
Yang Xue,
Xiaonan Zhang,
Jingbo Zhou,
Xiaomin Fang
Abstract:
Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
Submitted 12 July, 2024;
originally announced July 2024.
-
NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics
Authors:
Jingbo Zhou,
Shaorong Chen,
Jun Xia,
Sizhe Liu,
Tianze Ling,
Wenjie Du,
Yue Liu,
Jianwei Yin,
Stan Z. Li
Abstract:
Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the high-throughput analysis of protein composition in biological tissues. Many deep learning methods have been developed for the de novo peptide sequencing task, i.e., predicting the peptide sequence for the observed mass spectrum. However, two key challenges seriously hinder the further advancement of this important task. Firstly, since there is no consensus on the evaluation datasets, the empirical results in different research papers are often not comparable, leading to unfair comparisons. Secondly, the current methods are usually limited to amino acid-level or peptide-level precision and recall metrics. In this work, we present the first unified benchmark, NovoBench, for de novo peptide sequencing, which comprises diverse mass spectrum data, integrated models, and comprehensive evaluation metrics. Recent impressive methods, including DeepNovo, PointNovo, Casanovo, InstaNovo, AdaNovo, and $π$-HelixNovo, are integrated into our framework. In addition to amino acid-level and peptide-level precision and recall, we evaluate the models' performance in terms of identifying post-translational modifications (PTMs), efficiency, and robustness to peptide length, noise peaks, and missing fragment ratio, which are important influencing factors yet are seldom considered. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development.
Submitted 31 October, 2024; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Density estimation for ordinal biological sequences and its applications
Authors:
Wei-Chia Chen,
Juannan Zhou,
David M. McCandlish
Abstract:
Biological sequences do not come at random. Instead, they appear with particular frequencies that reflect properties of the associated system or phenomenon. Knowing how biological sequences are distributed in sequence space is thus a natural first step toward understanding the underlying mechanisms. Here we propose a new method for inferring the probability distribution from which a sample of biological sequences were drawn for the case where the sequences are composed of elements that admit a natural ordering. Our method is based on Bayesian field theory, a physics-based machine learning approach, and can be regarded as a nonparametric extension of the traditional maximum entropy estimate. As an example, we use it to analyze the aneuploidy data pertaining to gliomas from The Cancer Genome Atlas project. In addition, we demonstrate two follow-up analyses that can be performed with the resulting probability distribution. One of them is to investigate the associations among the sequence sites. This provides us a way to infer the governing biological grammar. The other is to study the global geometry of the probability landscape, which allows us to look at the problem from an evolutionary point of view. It can be seen that this methodology enables us to learn from a sample of sequences about how a biological system or phenomenon in the real world works.
Submitted 17 April, 2024;
originally announced April 2024.
-
Strangers in a foreign land: 'Yeastizing' plant enzymes
Authors:
Kristen Van Gelder,
Steffen N. Lindner,
Andrew D. Hanson,
Juannan Zhou
Abstract:
Expressing plant metabolic pathways in microbial platforms is an efficient, cost-effective solution for producing many desired plant compounds. As eukaryotic organisms, yeasts are often the preferred platform. However, expression of plant enzymes in a yeast frequently leads to failure because the enzymes are poorly adapted to the foreign yeast cellular environment. Here we first summarize current engineering approaches for optimizing performance of plant enzymes in yeast. A critical limitation of these approaches is that they are labor-intensive and must be customized for each individual enzyme, which significantly hinders the establishment of plant pathways in cellular factories. In response to this challenge, we propose the development of a cost-effective computational pipeline to redesign plant enzymes for better adaptation to the yeast cellular milieu. This proposition is underpinned by compelling evidence that plant and yeast enzymes exhibit distinct sequence features that are generalizable across enzyme families. Consequently, we introduce a data-driven machine learning framework designed to extract 'yeastizing' rules from natural protein sequence variations, which can be broadly applied to all enzymes. Additionally, we discuss the potential to integrate the machine learning model into a full design-build-test-cycle.
Submitted 19 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information
Authors:
Jun Xia,
Shaorong Chen,
Jingbo Zhou,
Tianze Ling,
Wenjie Du,
Sizhe Liu,
Stan Z. Li
Abstract:
Tandem mass spectrometry has played a pivotal role in advancing proteomics, enabling the analysis of protein composition in biological samples. Despite the development of various deep learning methods for identifying amino acid sequences (peptides) responsible for observed spectra, challenges persist in de novo peptide sequencing. Firstly, prior methods struggle to identify amino acids with post-translational modifications (PTMs) due to their lower frequency in training data compared to canonical amino acids, further resulting in decreased peptide-level identification precision. Secondly, diverse types of noise and missing peaks in mass spectra reduce the reliability of training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates conditional mutual information (CMI) between the spectrum and each amino acid/peptide, using CMI for adaptive model training. Extensive experiments demonstrate AdaNovo's state-of-the-art performance on a 9-species benchmark, where the peptides in the training set are almost completely disjoint from the peptides of the test sets. Moreover, AdaNovo excels in identifying amino acids with PTMs and exhibits robustness against data noise. The supplementary materials contain the official code.
Submitted 15 March, 2024; v1 submitted 9 March, 2024;
originally announced March 2024.
-
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models
Authors:
Lihang Liu,
Shanzhuo Zhang,
Donglong He,
Xianbin Ye,
Jingbo Zhou,
Xiaonan Zhang,
Yaoyao Jiang,
Weiming Diao,
Hang Yin,
Hua Chai,
Fan Wang,
Jingzhou He,
Liang Zheng,
Yonghui Li,
Xiaomin Fang
Abstract:
Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, which raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating its exceptional precision and robust transferability in predicting binding conformations. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and the volume of pre-training data. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility. HelixDock demonstrates outstanding capabilities on both cross-docking and structure-based virtual screening benchmarks.
Submitted 22 May, 2024; v1 submitted 21 October, 2023;
originally announced October 2023.
-
SI-SD: Sleep Interpreter through awake-guided cross-subject Semantic Decoding
Authors:
Hui Zheng,
Zhong-Tao Chen,
Hai-Teng Wang,
Jian-Yang Zhou,
Lin Zheng,
Pei-Yang Lin,
Yun-Zhe Liu
Abstract:
Understanding semantic content from brain activity during sleep represents a major goal in neuroscience. While studies in rodents have shown spontaneous neural reactivation of memories during sleep, capturing the semantic content of human sleep poses a significant challenge due to the absence of well-annotated sleep datasets and the substantial differences in neural patterns between wakefulness and sleep. To address these challenges, we designed a novel cognitive neuroscience experiment and collected a comprehensive, well-annotated electroencephalography (EEG) dataset from 134 subjects during both wakefulness and sleep. Leveraging this benchmark dataset, we developed SI-SD that enhances sleep semantic decoding through the position-wise alignment of neural latent sequence between wakefulness and sleep. In the 15-way classification task, our model achieves 24.12% and 21.39% top-1 accuracy on unseen subjects for NREM 2/3 and REM sleep, respectively, surpassing all other baselines. With additional fine-tuning, decoding performance improves to 30.32% and 31.65%, respectively. Besides, inspired by previous neuroscientific findings, we systematically analyze how the "Slow Oscillation" event impacts decoding performance in NREM 2/3 sleep -- decoding performance on unseen subjects further improves to 40.02%. Together, our findings and methodologies contribute to a promising neuro-AI framework for decoding brain activity during sleep.
Submitted 19 May, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
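
To put the top-1 accuracies quoted in the SI-SD abstract above in context, it can help to compare them with chance level. The sketch below is only illustrative: the function names are hypothetical, and it assumes the 15 classes are balanced, which the abstract does not state.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


def chance_level(n_classes):
    """Expected top-1 accuracy of uniform random guessing (balanced classes assumed)."""
    return 1.0 / n_classes


if __name__ == "__main__":
    # Toy scores for two samples over two classes (made-up numbers).
    print(top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1]))
    # Chance level for a 15-way task, as a percentage.
    print(round(100 * chance_level(15), 2))
```

Under the balanced-class assumption, uniform guessing in a 15-way task gives about 6.67% top-1 accuracy, which is the baseline the reported figures should be read against.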
-
Automated Bioinformatics Analysis via AutoBA
Authors:
Juexiao Zhou,
Bin Zhang,
Xiuying Chen,
Haoyang Li,
Xiaopeng Xu,
Siyuan Chen,
Xin Gao
Abstract:
With the fast-growing and evolving omics data, the demand for streamlined and adaptable tools to handle the analysis continues to grow. In response to this need, we introduce Auto Bioinformatics Analysis (AutoBA), an autonomous AI agent based on a large language model designed explicitly for conventional omics data analysis. AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. Through rigorous validation by expert bioinformaticians, AutoBA's robustness and adaptability are affirmed across a diverse range of omics analysis cases, including whole genome sequencing (WGS), RNA sequencing (RNA-seq), single-cell RNA-seq, ChIP-seq, and spatial transcriptomics. AutoBA's unique capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA deploys the analysis locally, preserving data privacy. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents a convenient tool, offering robustness and adaptability for complex omics data analysis.
Submitted 6 September, 2023;
originally announced September 2023.
-
Beyond the Snapshot: Brain Tokenized Graph Transformer for Longitudinal Brain Functional Connectome Embedding
Authors:
Zijian Dong,
Yilei Wu,
Yu Xiao,
Joanna Su Xian Chong,
Yueming Jin,
Juan Helen Zhou
Abstract:
Under the framework of network-based neurodegeneration, brain functional connectome (FC)-based Graph Neural Networks (GNN) have emerged as a valuable tool for the diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's disease (AD). However, these models are tailored for brain FC at a single time point instead of characterizing FC trajectory. Discerning how FC evolves with disease progression, particularly at the predementia stages such as cognitively normal individuals with amyloid deposition or individuals with mild cognitive impairment (MCI), is crucial for delineating disease spreading patterns and developing effective strategies to slow down or even halt disease advancement. In this work, we proposed the first interpretable framework for brain FC trajectory embedding with application to neurodegenerative disease diagnosis and prognosis, namely Brain Tokenized Graph Transformer (Brain TokenGT). It consists of two modules: 1) Graph Invariant and Variant Embedding (GIVE) for generation of node and spatio-temporal edge embeddings, which were tokenized for downstream processing; 2) Brain Informed Graph Transformer Readout (BIGTR) which augments previous tokens with trainable type identifiers and non-trainable node identifiers and feeds them into a standard transformer encoder to readout. We conducted extensive experiments on two public longitudinal fMRI datasets of the AD continuum for three tasks, including differentiating MCI from controls, predicting dementia conversion in MCI, and classification of amyloid positive or negative cognitively normal individuals. Based on brain FC trajectory, the proposed Brain TokenGT approach outperformed all the other benchmark models and at the same time provided excellent interpretability. The code is available at https://github.com/ZijianD/Brain-TokenGT.git
Submitted 12 July, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Pattern formation in a predator-prey model with Allee effect and hyperbolic mortality on networked and non-networked environments
Authors:
Yong Ye,
Jiaying Zhou
Abstract:
With the development of network science, Turing patterns have been shown to form in discrete media such as complex networks, opening up the possibility of exploring them as a generation mechanism in the context of biology, chemistry, and physics. Turing instability in the predator-prey system has been widely studied in recent years. We hope to use the predator-prey interaction relationship in biological populations to explain the influence of network topology on pattern formation. In this paper, we establish a predator-prey model with weak Allee effect, analyze and verify the Turing instability conditions on the large ER (Erdős-Rényi) random network with the help of Turing stability theory and numerical experiments, and obtain the Turing instability region. The results indicate that diffusion plays a decisive role in the generation of spatial patterns, whether in continuous or discrete media. For spatiotemporal patterns, different initial values can also bring about changes in the pattern. When we analyze the model based on the network framework, we find that the average degree of the network has an important impact on the model, and different average degrees will lead to changes in the distribution pattern of the population.
Submitted 4 October, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Dirichlet Diffusion Score Model for Biological Sequence Generation
Authors:
Pavel Avdeyev,
Chenlai Shi,
Yuhao Tan,
Kseniia Dudnyk,
Jian Zhou
Abstract:
Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.
Submitted 16 June, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Generation of 3D Molecules in Pockets via Language Model
Authors:
Wei Feng,
Lvwei Wang,
Zaiyun Lin,
Yanhao Zhu,
Han Wang,
Jianqiang Dong,
Rong Bai,
Huting Wang,
Jielong Zhou,
Wei Peng,
Bo Huang,
Wenbiao Zhou
Abstract:
Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted an increasing interest in the field of structure-based drug design, but they struggle to capture important 3D spatial interactions and often produce undesirable molecular structures. To address these challenges, we introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology. A new molecular representation, fragment-based SMILES with local and global coordinates, was developed to assist the model in learning molecular topologies and atomic spatial positions. Additionally, we trained a separate noncovalent interaction predictor to provide essential binding pattern information for the generative model. Lingo3DMol can efficiently traverse drug-like chemical spaces, preventing the formation of unusual structures. The Directory of Useful Decoys-Enhanced (DUD-E) dataset was used for evaluation. Lingo3DMol outperformed state-of-the-art methods in terms of drug-likeness, synthetic accessibility, pocket binding mode, and molecule generation speed.
Submitted 11 December, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Cell Population Growth Kinetics in the Presence of Stochastic Heterogeneity of Cell Phenotype
Authors:
Yue Wang,
Joseph X. Zhou,
Edoardo Pedrini,
Irit Rubin,
May Khalil,
Roberto Taramelli,
Hong Qian,
Sui Huang
Abstract:
Recent studies at individual cell resolution have revealed phenotypic heterogeneity in nominally clonal tumor cell populations. The heterogeneity affects cell growth behaviors, which can result in departure from the idealized uniform exponential growth of the cell population. Here we measured the stochastic time courses of growth of an ensemble of populations of HL60 leukemia cells in cultures, starting with distinct initial cell numbers to capture a departure from the uniform exponential growth model for the initial growth ("take-off"). Although the cultures were derived from the same cell clone, we observed significant variations in the early growth patterns of individual cultures with statistically significant differences in growth dynamics, which could be explained by the presence of inter-converting subpopulations with different growth rates, and which could last for many generations. Based on the hypothesis of existence of multiple subpopulations, we developed a branching process model that was consistent with the experimental observations.
Submitted 18 October, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
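
The idea of inter-converting subpopulations with different growth rates, mentioned in the abstract above, can be illustrated with a toy two-type branching process. This is a minimal sketch under stated assumptions, not the authors' model: the function name, the discrete time steps, the absence of cell death, and every probability value are hypothetical choices made only for illustration.

```python
import random


def simulate_two_type_growth(n_fast, n_slow, steps,
                             p_div_fast=0.5, p_div_slow=0.1,
                             p_switch=0.05, seed=0):
    """Toy discrete-time two-type branching process (all parameters hypothetical).

    At each step every cell may switch type with probability p_switch,
    then divides into two cells with a type-specific probability.
    There is no cell death in this sketch. Returns a list of
    (n_fast, n_slow) tuples, one per time point including the start.
    """
    rng = random.Random(seed)
    history = [(n_fast, n_slow)]
    for _ in range(steps):
        new_fast = 0
        new_slow = 0
        for starts_fast, count in ((True, n_fast), (False, n_slow)):
            for _ in range(count):
                is_fast = starts_fast
                # Interconversion between the two phenotypes.
                if rng.random() < p_switch:
                    is_fast = not is_fast
                # Division probability depends on the current phenotype.
                p_div = p_div_fast if is_fast else p_div_slow
                offspring = 2 if rng.random() < p_div else 1
                if is_fast:
                    new_fast += offspring
                else:
                    new_slow += offspring
        n_fast, n_slow = new_fast, new_slow
        history.append((n_fast, n_slow))
    return history


if __name__ == "__main__":
    # Same total starting size, different initial phenotype mix.
    for fast, slow in ((10, 0), (5, 5), (0, 10)):
        trajectory = simulate_two_type_growth(fast, slow, steps=10, seed=1)
        print((fast, slow), [f + s for f, s in trajectory])
```

Running cultures that start with the same total cell number but a different phenotype mix, or with different seeds, gives a feel for how early growth curves can diverge even when all cells come from one clone.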
-
Reconstructing high-order sequence features of dynamic functional connectivity networks based on diversified covert attention patterns for Alzheimer's disease classification
Authors:
Zhixiang Zhang,
Biao Jie,
Zhengdong Wang,
Jie Zhou,
Yang Yang
Abstract:
Recent studies have applied deep learning methods such as convolutional recurrent neural networks (CRNs) and Transformers to the classification of brain diseases such as Alzheimer's disease (AD) based on dynamic functional connectivity networks (dFCNs), achieving better performance than traditional machine learning methods. However, in CRNs, the continuous convolution operations used to obtain high-order aggregation features may overlook the non-linear correlation between different brain regions, because convolution is in essence a linear weighted sum of local elements. Inspired by modern neuroscience research on covert attention in the nervous system, we introduce the self-attention mechanism, a core module of Transformers, to model diversified covert attention patterns and apply these patterns to reconstruct high-order sequence features of dFCNs in order to learn complex dynamic changes in brain information flow. Therefore, we propose a novel CRN method based on diversified covert attention patterns, DCA-CRN, which combines the advantages of CRNs in capturing local spatio-temporal features and sequence change patterns, as well as Transformers in learning global and high-order correlation features. Experimental results on the ADNI and ADHD-200 datasets demonstrate the prediction performance and generalization ability of our proposed method.
Submitted 4 September, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Representational dissimilarity metric spaces for stochastic neural networks
Authors:
Lyndon R. Duong,
Jingyang Zhou,
Josue Nassar,
Jules Berman,
Jeroen Olieslagers,
Alex H. Williams
Abstract:
Quantifying similarity between neural representations -- e.g. hidden layer activation vectors -- is a perennial problem in deep learning and neuroscience research. Existing methods compare deterministic responses (e.g. artificial networks that lack stochastic layers) or averaged responses (e.g., trial-averaged firing rates in biological data). However, these measures of _deterministic_ representational similarity ignore the scale and geometric structure of noise, both of which play important roles in neural computation. To rectify this, we generalize previously proposed shape metrics (Williams et al. 2021) to quantify differences in _stochastic_ representations. These new distances satisfy the triangle inequality, and thus can be used as a rigorous basis for many supervised and unsupervised analyses. Leveraging this novel framework, we find that the stochastic geometries of neurobiological representations of oriented visual gratings and naturalistic scenes respectively resemble untrained and trained deep network representations. Further, we are able to more accurately predict certain network attributes (e.g. training hyperparameters) from its position in stochastic (versus deterministic) shape space.
Submitted 3 February, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Lateral predictive coding revisited: Internal model, symmetry breaking, and response time
Authors:
Zhen-Ye Huang,
Xin-Yi Fan,
Jianwen Zhou,
Hai-Jun Zhou
Abstract:
Predictive coding is a promising theoretical framework in neuroscience for understanding information transmission and perception. It posits that the brain perceives the external world through internal models and updates these models under the guidance of prediction errors. Previous studies on predictive coding emphasized top-down feedback interactions in hierarchical multi-layered networks but largely ignored lateral recurrent interactions. We perform analytical and numerical investigations in this work on the effects of single-layer lateral interactions. We consider a simple predictive response dynamics and run it on the MNIST dataset of hand-written digits. We find that learning will generally break the interaction symmetry between peer neurons, and that high input correlation between two neurons does not necessarily bring strong direct interactions between them. The optimized network responds to familiar input signals much faster than to novel or random inputs, and it significantly reduces the correlations between the output states of pairs of neurons.
Submitted 18 July, 2022;
originally announced July 2022.
-
Pattern formation of parasite-host model induced by fear effect
Authors:
Yong Ye,
Yi Zhao,
Jiaying Zhou
Abstract:
In this paper, based on the epidemiological microparasite model, a parasite-host model is established by considering the fear effect of susceptible individuals on infectors. We explored the pattern formation with the help of numerical simulation, and analyzed to what degree the fear effect, infected host mortality, population diffusion rate, and reduced reproduction ability of infected hosts affect population activities. Theoretically, we give the general conditions for the stability of the model without diffusion and for the Turing instability caused by diffusion. Our results indicate how fear affects the distribution of the uninfected and infected hosts in the habitat and quantify the influence of the fear factor on the spatiotemporal pattern of the population. In addition, we analyze the influence of natural death rate, reproduction ability of infected hosts, and diffusion level of uninfected (infected) hosts on the spatiotemporal pattern, respectively. The results show that the growth of the pattern induced by an intensified fear effect follows a definite order: cold spots → cold spots-stripes → cold stripes → hot stripes → hot spots-stripes → hot spots. Interestingly, the natural mortality and the fear effect have opposite effects on the growth order of the pattern. From the perspective of biological significance, we find that the degree of fear effect can reshape the distribution of population to meet the previous rule.
Submitted 18 May, 2022;
originally announced May 2022.
-
RCMNet: A deep learning model assists CAR-T therapy for leukemia
Authors:
Ruitao Zhang,
Xueying Han,
Ijaz Gul,
Shiyao Zhai,
Ying Liu,
Yongbing Zhang,
Yuhan Dong,
Lan Ma,
Dongmei Yu,
Jin Zhou,
Peiwu Qin
Abstract:
Acute leukemia is a type of blood cancer with a high mortality rate. Current therapeutic methods include bone marrow transplantation, supportive therapy, and chemotherapy. Although a satisfactory remission of the disease can be achieved, the risk of recurrence is still high. Therefore, novel treatments are in demand. Chimeric antigen receptor-T (CAR-T) therapy has emerged as a promising approach to treat and cure acute leukemia. To harness the therapeutic potential of CAR-T cell therapy for blood diseases, reliable cell morphological identification is crucial. Nevertheless, the identification of CAR-T cells is a major challenge owing to their phenotypic similarity with other blood cells. To address this substantial clinical challenge, herein we first construct a CAR-T dataset with 500 original microscopy images after staining. Following that, we create a novel integrated model called RCMNet (ResNet18 with CBAM and MHSA) that combines the convolutional neural network (CNN) and Transformer. The model shows 99.63% top-1 accuracy on the public dataset. Compared with previous reports, our model obtains satisfactory results for image classification. When tested on the CAR-T cell dataset, only a decent performance is observed, which is attributed to the limited size of the dataset. Transfer learning is adapted for RCMNet and a maximum of 83.36% accuracy has been achieved, which is higher than other SOTA models. The study evaluates the effectiveness of RCMNet on a big public dataset and translates it to a clinical dataset for diagnostic applications.
Submitted 6 May, 2022;
originally announced May 2022.
-
Structure-aware Protein Self-supervised Learning
Authors:
Can Chen,
Jingbo Zhou,
Fan Wang,
Xue Liu,
Dejing Dou
Abstract:
Protein representation learning methods have shown great potential to yield useful representation for many downstream tasks, especially on protein classification. Moreover, a few recent studies have shown great promise in addressing insufficient labels of proteins with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences without considering the important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage the available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method. The code of the proposed method is available at https://github.com/GGchen1997/STEPS_Bioinformatics.
Submitted 8 April, 2023; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity
Authors:
Shuangli Li,
Jingbo Zhou,
Tong Xu,
Liang Huang,
Fan Wang,
Haoyi Xiong,
Weili Huang,
Dejing Dou,
Hui Xiong
Abstract:
Drug discovery often relies on the successful prediction of protein-ligand binding affinity. Recent advances have shown great promise in applying graph neural networks (GNNs) for better affinity prediction by learning the representations of protein-ligand complexes. However, existing solutions usually treat protein-ligand complexes as topological graph data, thus the biomolecular structural information is not fully utilized. The essential long-range interactions among atoms are also neglected in GNN models. To this end, we propose a structure-aware interactive graph neural network (SIGN) which consists of two components: polar-inspired graph attention layers (PGAL) and pairwise interactive pooling (PiPool). Specifically, PGAL iteratively performs the node-edge aggregation process to update embeddings of nodes and edges while preserving the distance and angle information among atoms. Then, PiPool is adopted to gather interactive edges with a subsequent reconstruction loss to reflect the global interactions. Exhaustive experimental study on two benchmarks verifies the superiority of SIGN.
Submitted 20 July, 2021;
originally announced July 2021.
-
DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science
Authors:
Mufei Li,
Jinjing Zhou,
Jiajing Hu,
Wenxuan Fan,
Yangkang Zhang,
Yaxin Gu,
George Karypis
Abstract:
Graph neural networks (GNNs) constitute a class of deep learning methods for graph data. They have wide applications in chemistry and biology, such as molecular property prediction, reaction prediction and drug-target interaction prediction. Despite the interest, GNN-based modeling is challenging as it requires graph data pre-processing and modeling in addition to programming and deep learning. Here we present DGL-LifeSci, an open-source package for deep learning on graphs in life science. DGL-LifeSci is a python toolkit based on RDKit, PyTorch and Deep Graph Library (DGL). DGL-LifeSci allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction and molecule generation. With its command-line interfaces, users can perform modeling without any background in programming and deep learning. We test the command-line interfaces using standard benchmarks MoleculeNet, USPTO, and ZINC. Compared with previous implementations, DGL-LifeSci achieves a speed up by up to 6x. For modeling flexibility, DGL-LifeSci provides well-optimized modules for various stages of the modeling pipeline. In addition, DGL-LifeSci provides pre-trained models for reproducing the test experiment results and applying models without training. The code is distributed under an Apache-2.0 License and is freely accessible at https://github.com/awslabs/dgl-lifesci.
Submitted 27 June, 2021;
originally announced June 2021.
-
ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction
Authors:
Xiaomin Fang,
Lihang Liu,
Jieqiong Lei,
Donglong He,
Shanzhuo Zhang,
Jingbo Zhou,
Fan Wang,
Hua Wu,
Haifeng Wang
Abstract:
Effective molecular representation learning is of great importance to facilitate molecular property prediction, which is a fundamental task for the drug and material industry. Recent advances in graph neural networks (GNNs) have shown great promise in applying GNNs for molecular representation learning. Moreover, a few recent studies have also demonstrated successful applications of self-supervised learning methods to pre-train the GNNs to overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing the molecular geometry information. Yet the three-dimensional (3D) spatial structure of a molecule, a.k.a. molecular geometry, is one of the most critical factors for determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). First, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. To be specific, we devised double graphs for a molecule: the first one encodes the atom-bond relations; the second one encodes bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and show that ChemRL-GEM can significantly outperform all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of 8.8% on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.
Submitted 22 February, 2022; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Eigenvalue spectrum of neural networks with arbitrary Hebbian length
Authors:
Jianwen Zhou,
Zijian Jiang,
Tianqi Hou,
Ziming Chen,
K Y Michael Wong,
Haiping Huang
Abstract:
Associative memory is a fundamental function in the brain. Here, we generalize the standard associative memory model to include long-range Hebbian interactions at the learning stage, corresponding to a large synaptic integration window. In our model, the Hebbian length can be arbitrarily large. The spectral density of the coupling matrix is derived using the replica method, which is also shown to be consistent with the results obtained by applying the free probability method. The maximal eigenvalue is then obtained by an iterative equation, related to the paramagnetic to spin glass transition in the model. Altogether, this work establishes the connection between the associative memory with arbitrary Hebbian length and the asymptotic eigen-spectrum of the neural-coupling matrix.
Submitted 26 March, 2021;
originally announced March 2021.
-
Associative memory model with arbitrary Hebbian length
Authors:
Zijian Jiang,
Jianwen Zhou,
Tianqi Hou,
K. Y. Michael Wong,
Haiping Huang
Abstract:
Conversion of temporal to spatial correlations in the cortex is one of the most intriguing functions in the brain. The learning at synapses triggering the correlation conversion can take place in a wide integration window, whose influence on the correlation conversion remains elusive. Here, we propose a generalized associative memory model with arbitrary Hebbian length. The model can be analytically solved, and predicts that a small Hebbian length can already significantly enhance the correlation conversion, i.e., the stimulus-induced attractor can be highly correlated with a significant number of patterns in the stored sequence, thereby facilitating state transitions in the neural representation space. Moreover, an anti-Hebbian component is able to reshape the energy landscape of memories, akin to the function of sleep. Our work thus establishes the fundamental connection between associative memory, Hebbian length, and correlation conversion in the brain.
Submitted 26 March, 2021;
originally announced March 2021.
-
Epidemic spreading under mutually independent intra- and inter-host pathogen evolution
Authors:
Xiyun Zhang,
Zhongyuan Ruan,
Muhua Zheng,
Jie Zhou,
Stefano Boccaletti,
Baruch Barzel
Abstract:
The dynamics of epidemic spreading is often reduced to the single control parameter $R_0$, whose value, above or below unity, determines the state of the contagion. If, however, the pathogen evolves as it spreads, $R_0$ may change over time, potentially leading to a mutation-driven spread, in which an initially sub-pandemic pathogen undergoes a breakthrough mutation. To predict the boundaries of this pandemic phase, we introduce here a modeling framework to couple the network spreading patterns with the intra-host evolutionary dynamics. For many pathogens these two processes, intra- and inter-host, are driven by different selection forces. And yet here we show that even in the extreme case when these two forces are mutually independent, mutations can still fundamentally alter the pandemic phase-diagram, whose transitions are now shaped, not just by $R_0$, but also by the balance between the epidemic and the evolutionary timescales. If mutations are too slow, the pathogen prevalence decays prior to the appearance of a critical mutation. On the other hand, if mutations are too rapid, the pathogen evolution becomes volatile and, once again, it fails to spread. Between these two extremes, however, we identify a broad range of conditions in which an initially sub-pandemic pathogen can break through to gain widespread prevalence.
Submitted 4 November, 2022; v1 submitted 19 February, 2021;
originally announced February 2021.
-
Distance-aware Molecule Graph Attention Network for Drug-Target Binding Affinity Prediction
Authors:
Jingbo Zhou,
Shuangli Li,
Liang Huang,
Haoyi Xiong,
Fan Wang,
Tong Xu,
Hui Xiong,
Dejing Dou
Abstract:
Accurately predicting the binding affinity between drugs and proteins is an essential step for computational drug discovery. Since graph neural networks (GNNs) have demonstrated remarkable success in various graph-related tasks, GNNs have been considered as a promising tool to improve the binding affinity prediction in recent years. However, most of the existing GNN architectures can only encode the topological graph structure of drugs and proteins without considering the relative spatial information among their atoms. Whereas, different from other graph datasets such as social networks and commonsense knowledge graphs, the relative spatial position and chemical bonds among atoms have significant impacts on the binding affinity. To this end, in this paper, we propose a diStance-aware Molecule graph Attention Network (S-MAN) tailored to drug-target binding affinity prediction. As a dedicated solution, we first propose a position encoding mechanism to integrate the topological structure and spatial position information into the constructed pocket-ligand graph. Moreover, we propose a novel edge-node hierarchical attentive aggregation structure which has edge-level aggregation and node-level aggregation. The hierarchical attentive aggregation can capture spatial dependencies among atoms, as well as fuse the position-enhanced information with the capability of discriminating multiple spatial relations among atoms. Finally, we conduct extensive experiments on two standard datasets to demonstrate the effectiveness of S-MAN.
Submitted 17 December, 2020;
originally announced December 2020.
-
Wide-field Decodable Orthogonal Fingerprints of Single Nanoparticles Unlock Multiplexed Digital Assays
Authors:
Jiayan Liao,
Jiajia Zhou,
Yiliao Song,
Baolei Liu,
Yinghui Chen,
Fan Wang,
Chaohao Chen,
Jun Lin,
Xueyuan Chen,
Jie Lu,
Dayong Jin
Abstract:
The control in optical uniformity of single nanoparticles and tuning their diversity in orthogonal dimensions, dot to dot, holds the key to unlock nanoscience and applications. Here we report that the time-domain emissive profile from single upconversion nanoparticle, including the rising, decay and peak moment of the excited state population (T2 profile), can be arbitrarily tuned by upconversion schemes, including interfacial energy migration, concentration dependency, energy transfer, and isolation of surface quenchers. This allows us to significantly increase the coding capacity at the nanoscale. We further implement both time-resolved wide-field imaging and deep-learning techniques to decode these fingerprints, showing high accuracies at high throughput. These high-dimensional optical fingerprints provide a new horizon for applications spanning from sub-diffraction-limit data storage, security inks, to high-throughput single-molecule digital assays and super-resolution imaging.
Submitted 15 November, 2020;
originally announced November 2020.
-
Virus Transmission Risk in Urban Rail Systems: A Microscopic Simulation-based Analysis of Spatio-temporal Characteristics
Authors:
Jiali Zhou,
Haris N. Koutsopoulos
Abstract:
Transmission risk of air-borne diseases in public transportation systems is a concern. The paper proposes a modified Wells-Riley model for risk analysis in public transportation systems to capture the passenger flow characteristics, including spatial and temporal patterns in terms of number of boarding, alighting passengers, and number of infectors. The model is utilized to assess overall risk as a function of OD flows, actual operations, and factors such as mask wearing, and ventilation. The model is integrated with a microscopic simulation model of subway operations (SimMETRO). Using actual data from a subway system, a case study explores the impact of different factors on transmission risk, including mask-wearing, ventilation rates, infectiousness levels of disease and carrier rates. In general, mask-wearing and ventilation are effective under various demand levels, infectiousness levels, and carrier rates. Mask-wearing is more effective in mitigating risks. Impacts from operations and service frequency are also evaluated, emphasizing the importance of maintaining reliable, frequent operations in lowering transmission risks. Risk spatial patterns are also explored, highlighting locations of higher risk.
Submitted 16 August, 2020;
originally announced August 2020.
-
Relationship between manifold smoothness and adversarial vulnerability in deep learning with local errors
Authors:
Zijian Jiang,
Jianwen Zhou,
Haiping Huang
Abstract:
Artificial neural networks can achieve impressive performances, and even outperform humans in some specific tasks. Nevertheless, unlike biological brains, the artificial neural networks suffer from tiny perturbations in sensory input, under various kinds of adversarial attacks. It is therefore necessary to study the origin of the adversarial vulnerability. Here, we establish a fundamental relationship between geometry of hidden representations (manifold perspective) and the generalization capability of the deep networks. For this purpose, we choose a deep neural network trained by local errors, and then analyze emergent properties of trained networks through the manifold dimensionality, manifold smoothness, and the generalization capability. To explore effects of adversarial examples, we consider independent Gaussian noise attacks and fast-gradient-sign-method (FGSM) attacks. Our study reveals that a high generalization accuracy requires a relatively fast power-law decay of the eigen-spectrum of hidden representations. Under Gaussian attacks, the relationship between generalization accuracy and power-law exponent is monotonic, while a non-monotonic behavior is observed for FGSM attacks. Our empirical study provides a route towards a final mechanistic interpretation of adversarial vulnerability under adversarial attacks.
Submitted 23 December, 2020; v1 submitted 4 July, 2020;
originally announced July 2020.
-
Weakly-correlated synapses promote dimension reduction in deep neural networks
Authors:
Jianwen Zhou,
Haiping Huang
Abstract:
By controlling synaptic and neural correlations, deep learning has achieved empirical successes in improving classification performances. How synaptic correlations affect neural correlations to produce disentangled hidden representations remains elusive. Here we propose a simplified model of dimension reduction, taking into account pairwise correlations among synapses, to reveal the mechanism underlying how the synaptic correlations affect dimension reduction. Our theory determines the synaptic-correlation scaling form requiring only mathematical self-consistency, for both binary and continuous synapses. The theory also predicts that weakly-correlated synapses encourage dimension reduction compared to their orthogonal counterparts. In addition, these synapses slow down the decorrelation process along the network depth. These two computational roles are explained by the proposed mean-field equation. The theoretical predictions are in excellent agreement with numerical simulations, and the key features are also captured by a deep learning with Hebbian rules.
Submitted 20 June, 2020;
originally announced June 2020.
-
Assessing the Impact of COVID-19 on the Objective and Analysis of Oncology Clinical Trials -- Application of the Estimand Framework
Authors:
Evgeny Degtyarev,
Kaspar Rufibach,
Yue Shentu,
Godwin Yung,
Michelle Casey,
Stefan Englert,
Feng Liu,
Yi Liu,
Oliver Sailer,
Jonathan Siegel,
Steven Sun,
Rui Tang,
Jiangxiu Zhou
Abstract:
COVID-19 outbreak has rapidly evolved into a global pandemic. The impact of COVID-19 on patient journeys in oncology represents a new risk to interpretation of trial results and its broad applicability for future clinical practice. We identify key intercurrent events that may occur due to COVID-19 in oncology clinical trials with a focus on time-to-event endpoints and discuss considerations pertaining to the other estimand attributes introduced in the ICH E9 addendum. We propose strategies to handle COVID-19 related intercurrent events, depending on their relationship with malignancy and treatment and the interpretability of data after them. We argue that the clinical trial objective from a world without COVID-19 pandemic remains valid. The estimand framework provides a common language to discuss the impact of COVID-19 in a structured and transparent manner. This demonstrates that the applicability of the framework may even go beyond what it was initially intended for.
Submitted 21 June, 2020; v1 submitted 8 June, 2020;
originally announced June 2020.
-
DeepScaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning
Authors:
Yibo Li,
Jianxing Hu,
Yanxing Wang,
Jielong Zhou,
Liangren Zhang,
Zhenming Liu
Abstract:
The ultimate goal of drug design is to find novel compounds with desirable pharmacological properties. Designing molecules retaining particular scaffolds as the core structures of the molecules is one of the efficient ways to obtain potential drug candidates with desirable properties. We proposed a scaffold-based molecular generative model for scaffold-based drug discovery, which performs molecule generation based on a wide spectrum of scaffold definitions, including BM-scaffolds, cyclic skeletons, as well as scaffolds with specifications on side-chain properties. The model can generalize the learned chemical rules of adding atoms and bonds to a given scaffold. Furthermore, the generated compounds were evaluated by molecular docking in DRD2 targets and the results demonstrated that this approach can be effectively applied to solve several drug design problems, including the generation of compounds containing a given scaffold and de novo drug design of potential drug candidates with specific docking scores. Finally, a command line interface is created.
Submitted 4 September, 2019; v1 submitted 20 August, 2019;
originally announced August 2019.
-
Cellular reproduction number, generation time and growth rate differ between human- and avian-adapted influenza strains
Authors:
Ada W. C. Yan,
Jie Zhou,
Catherine A. A. Beauchemin,
Colin A. Russell,
Wendy S. Barclay,
Steven Riley
Abstract:
When analysing in vitro data, growth kinetics of influenza strains are often compared by computing their growth rates, which are sometimes used as proxies for fitness. However, analogous to mechanistic epidemic models, the growth rate can be defined as a function of two parameters: the basic reproduction number (the average number of cells each infected cell infects) and the mean generation time (the average length of a replication cycle). Using a mechanistic model, previously published data from experiments in human lung cells, and newly generated data, we compared estimates of all three parameters for six influenza A strains. Using previously published data, we found that the two human-adapted strains (pre-2009 seasonal H1N1, and pandemic H1N1) had a lower basic reproduction number, shorter mean generation time and slower growth rate than the two avian-adapted strains (H5N1 and H7N9). These same differences were then observed in data from new experiments where two strains were engineered to have different internal proteins (pandemic H1N1 and H5N1), but the same surface proteins (PR8), confirming our initial findings and implying that differences between strains were driven by internal genes. Also, the model predicted that the human-adapted strains underwent more replication cycles than the avian-adapted strains by the time of peak viral load, potentially accumulating mutations more quickly. These results suggest that the in vitro reproduction number, generation time and growth rate differ between human-adapted and avian-adapted influenza strains, and thus could be used to assess host adaptation of internal proteins to inform pandemic risk assessment.
Submitted 19 March, 2019;
originally announced March 2019.
-
OPENMENDEL: A Cooperative Programming Project for Statistical Genetics
Authors:
Hua Zhou,
Janet S. Sinsheimer,
Christopher A. German,
Sarah S. Ji,
Douglas M. Bates,
Benjamin B. Chu,
Kevin L. Keys,
Juhyun Kim,
Seyoon Ko,
Gordon D. Mosher,
Jeanette C. Papp,
Eric M. Sobel,
Jing Zhai,
Jin J. Zhou,
Kenneth Lange
Abstract:
Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
Submitted 13 February, 2019;
originally announced February 2019.
-
Protein corona composition of PEGylated nanoparticles correlates strongly with amino acid composition of protein surface
Authors:
Giovanni Settanni,
Jiajia Zhou,
Tongchuan Suo,
Susanne Schöttler,
Katharina Landfester,
Friederike Schmid,
Volker Mailänder
Abstract:
Extensive molecular dynamics simulations reveal that the interactions between proteins and poly(ethylene glycol) (PEG) can be described in terms of the surface composition of the proteins. PEG molecules accumulate around non-polar residues while avoiding polar ones. A solvent-accessible-surface-area model of protein adsorption on PEGylated nanoparticles accurately fits a large set of data on the composition of the protein corona recently obtained by label-free proteomic mass spectrometry.
Submitted 28 December, 2016;
originally announced December 2016.
-
Mitochondrial Ca2+ uptake in skeletal muscle health and disease
Authors:
Jingsong Zhou,
Kamal Dhakal,
Jianxun Yi
Abstract:
Muscle uses Ca2+ as a messenger to control contraction and relies on ATP to maintain the intracellular Ca2+ homeostasis. Mitochondria are the major sub-cellular organelle of ATP production. With a negative inner membrane potential, mitochondria take up Ca2+ from their surroundings, a process called mitochondrial Ca2+ uptake. Under physiological conditions, Ca2+ uptake into mitochondria promotes ATP production. Excessive uptake causes mitochondrial Ca2+ overload, which activates downstream adverse responses leading to cell dysfunction. Moreover, mitochondrial Ca2+ uptake could shape spatio-temporal patterns of intracellular Ca2+ signaling. Malfunction of mitochondrial Ca2+ uptake is implicated in muscle degeneration. Unlike non-excitable cells, mitochondria in muscle cells experience dramatic changes of intracellular Ca2+ levels. Besides the sudden elevation of Ca2+ level induced by action potentials, Ca2+ transients in muscle cells can be as short as a few milliseconds during a single twitch or as long as minutes during tetanic contraction, which raises the question whether mitochondrial Ca2+ uptake is fast and big enough to shape intracellular Ca2+ signaling during excitation-contraction coupling and creates technical challenges for quantification of the dynamic changes of Ca2+ inside mitochondria. This review focuses on characterization of mitochondrial Ca2+ uptake in skeletal muscle and its role in muscle physiology and diseases.
Submitted 28 July, 2016;
originally announced July 2016.
-
AptaTRACE: Elucidating Sequence-Structure Binding Motifs by Uncovering Selection Trends in HT-SELEX Experiments
Authors:
Phuong Dao,
Jan Hoinka,
Yijie Wang,
Mayumi Takahashi,
Jiehua Zhou,
Fabrizio Costa,
John Rossi,
John Burnett,
Rolf Backofen,
Teresa M. Przytycka
Abstract:
Aptamers, short synthetic RNA/DNA molecules binding specific targets with high affinity and specificity, are utilized in an increasing spectrum of bio-medical applications. Aptamers are identified in vitro via the Systematic Evolution of Ligands by Exponential Enrichment (SELEX) protocol. SELEX selects binders through an iterative process that, starting from a pool of random ssDNA/RNA sequences, amplifies target-affine species through a series of selection cycles. HT-SELEX, which combines SELEX with high throughput sequencing, has recently transformed aptamer development and has opened the field to even more applications. HT-SELEX is capable of generating over half a billion data points, challenging computational scientists with the task of identifying aptamer properties such as sequence structure motifs that determine binding. While currently available motif finding approaches suggest partial solutions to this question, none possess the generality or scalability required for HT-SELEX data, and they do not take advantage of important properties of the experimental procedure.
We present AptaTRACE, a novel approach for the identification of sequence-structure binding motifs in HT-SELEX derived aptamers. Our approach leverages the experimental design of the SELEX protocol and identifies sequence-structure motifs that show a signature of selection. Because of its unique approach, AptaTRACE can uncover motifs even when these are present in only a minuscule fraction of the pool. Due to these features, our method can help to reduce the number of selection cycles required to produce aptamers with the desired properties, thus reducing cost and time of this rather expensive procedure. The performance of the method on simulated and real data indicates that AptaTRACE can detect sequence-structure motifs even in highly challenging data.
Submitted 5 April, 2016;
originally announced April 2016.
-
Fast Genome-Wide QTL Analysis Using Mendel
Authors:
Hua Zhou,
Jin Zhou,
Tao Hu,
Eric M Sobel,
Kenneth Lange
Abstract:
Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and X-linked loci, (e) correctly deals with missing data in traits, covariates, and genotypes, (f) allows for covariate adjustment and constraints among parameters, (g) uses either theoretical or SNP-based empirical kinship matrix for additive polygenic effects, (h) allows extra variance components such as dominant polygenic effects and household effects, (i) detects and reports outlier individuals and pedigrees, and (j) allows for robust estimation via the $t$-distribution. The current paper assesses these capabilities on the genetics analysis workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at http://www.genetics.ucla.edu/software.
Submitted 30 July, 2014;
originally announced July 2014.
-
Relative Stability of Network States in Boolean Network Models of Gene Regulation in Development
Authors:
Joseph Xu Zhou,
Areejit Samal,
Aymeric Fouquier d'Hèrouël,
Nathan D. Price,
Sui Huang
Abstract:
Progress in cell type reprogramming has revived the interest in Waddington's concept of the epigenetic landscape. Recently researchers developed the quasi-potential theory to represent the Waddington's landscape. The Quasi-potential U(x), derived from interactions in the gene regulatory network (GRN) of a cell, quantifies the relative stability of network states, which determine the effort required for state transitions in a multi-stable dynamical system. However, quasi-potential landscapes, originally developed for continuous systems, are not suitable for discrete-valued networks which are important tools to study complex systems. In this paper, we provide a framework to quantify the landscape for discrete Boolean networks (BNs). We apply our framework to study pancreas cell differentiation where an ensemble of BN models is considered based on the structure of a minimal GRN for pancreas development. We impose biologically motivated structural constraints (corresponding to specific type of Boolean functions) and dynamical constraints (corresponding to stable attractor states) to limit the space of BN models for pancreas development. In addition, we enforce a novel functional constraint corresponding to the relative ordering of attractor states in BN models to restrict the space of BN models to the biological relevant class. We find that BNs with canalyzing/sign-compatible Boolean functions best capture the dynamics of pancreas cell differentiation. This framework can also determine the genes' influence on cell state transitions, and thus can facilitate the rational design of cell reprogramming protocols.
Submitted 12 October, 2015; v1 submitted 23 July, 2014;
originally announced July 2014.
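As a rough illustration of the discrete setting this abstract describes, the sketch below enumerates the attractors of a small synchronous Boolean network and counts how many initial states flow into each one. It is not the authors' landscape framework: the function name `find_attractors`, the synchronous update scheme, and the two-gene toggle switch are all assumptions chosen for illustration, and basin size is only a crude stand-in for relative stability.

```python
from itertools import product


def find_attractors(update_fns):
    """Map each attractor (a tuple of states) to the size of its basin.

    update_fns[i] takes the full state tuple and returns the next value of
    gene i. All genes are updated synchronously (an assumption).
    """
    n = len(update_fns)

    def step(state):
        return tuple(int(f(state)) for f in update_fns)

    basins = {}
    for start in product((0, 1), repeat=n):
        seen = {}   # state -> position along the trajectory
        path = []
        state = start
        while state not in seen:
            seen[state] = len(path)
            path.append(state)
            state = step(state)
        # The trajectory re-entered a visited state: the tail is the attractor.
        cycle = path[seen[state]:]
        # Rotate the cycle so the same attractor always gets the same key.
        i = cycle.index(min(cycle))
        key = tuple(cycle[i:] + cycle[:i])
        basins[key] = basins.get(key, 0) + 1
    return basins


# Hypothetical toy network: two genes that repress each other.
toggle = [lambda s: not s[1], lambda s: not s[0]]
basins = find_attractors(toggle)
for attractor, size in sorted(basins.items()):
    print(attractor, size)
```

For this toy network the two fixed points correspond to the two mutually exclusive expression states, and the remaining two states form an oscillation that exists only under synchronous updating.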
-
Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction
Authors:
Jian Zhou,
Olga G. Troyanskaya
Abstract:
Predicting protein secondary structure is a fundamental problem in protein structure prediction. Here we present a new supervised generative stochastic network (GSN) based method to predict local secondary structure with deep hierarchical representations. GSN is a recently proposed deep learning technique (Bengio & Thibodeau-Laufer, 2013) for globally training deep generative models. We present the supervised extension of GSN, which learns a Markov chain to sample from a conditional distribution, and apply it to protein structure prediction. To scale the model to full-sized, high-dimensional data, such as protein sequences with hundreds of amino acids, we introduce a convolutional architecture, which allows efficient learning across multiple layers of hierarchical representations. Our architecture uniquely focuses on predicting structured low-level labels informed by both low- and high-level representations learned by the model. In our application this corresponds to labeling the secondary structure state of each amino-acid residue. We trained and tested the model on separate sets of non-homologous proteins sharing less than 30% sequence identity. Our model achieves 66.4% Q8 accuracy on the CB513 dataset, better than the previously reported best performance of 64.9% (Wang et al., 2011) for this challenging secondary structure prediction problem.
Submitted 6 March, 2014;
originally announced March 2014.
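The Q8 figure quoted in this abstract is a per-residue accuracy over eight secondary structure classes. A minimal sketch of that metric follows; the function name `q8_accuracy`, the single-letter state strings, and the example sequences are assumptions for illustration and are not taken from the paper or the CB513 dataset.

```python
def q8_accuracy(predicted, observed):
    """Fraction of residues whose predicted 8-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical 10-residue example with two mislabeled positions.
observed = "HHHHEEEETT"
predicted = "HHHGEEEBTT"
score = q8_accuracy(predicted, observed)
print(f"Q8 = {score:.1%}")
```

In practice the score is usually pooled over all residues of all test proteins rather than averaged per protein; which convention the paper uses is not stated in the abstract.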
-
Cellular network entropy as the energy potential in Waddington's differentiation landscape
Authors:
Christopher R. S. Banerji,
Diego Miranda-Saavedra,
Simone Severini,
Martin Widschwendter,
Tariq Enver,
Joseph X. Zhou,
Andrew E. Teschendorff
Abstract:
Differentiation is a key cellular process in normal tissue development that is significantly altered in cancer. Although molecular signatures characterising pluripotency and multipotency exist, there is, as yet, no single quantitative mark of a cellular sample's position in the global differentiation hierarchy. Here we adopt a systems view and consider the sample's network entropy, a measure of signaling pathway promiscuity, computable from a sample's genome-wide expression profile. We demonstrate that network entropy provides a quantitative, in-silico readout of the average undifferentiated state of the profiled cells, recapitulating the known hierarchy of pluripotent, multipotent and differentiated cell types. Network entropy further exhibits dynamic changes in time course differentiation data, in line with a sample's differentiation stage. In disease, network entropy predicts a higher level of cellular plasticity in cancer stem cell populations compared to ordinary cancer cells. Importantly, network entropy also allows identification of key differentiation pathways. Our results are consistent with the view that pluripotency is a statistical property defined at the cellular population level, correlating with intra-sample heterogeneity, and driven by the degree of signaling promiscuity in cells. In summary, network entropy provides a quantitative measure of a cell's undifferentiated state, defining its elevation in Waddington's landscape.
Submitted 26 October, 2013;
originally announced October 2013.
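To make "network entropy computable from an expression profile" concrete, the sketch below computes a normalized local Shannon entropy for one gene in an interaction network. The abstract does not give the formula, so the weighting scheme (product of the expression values of the two interacting genes), the function name `local_entropy`, and the toy data are all assumptions rather than the authors' definition.

```python
import math


def local_entropy(gene, neighbors, expression):
    """Normalized Shannon entropy of a gene's interaction distribution.

    Edge weights are assumed to be the product of the expression values of
    the two interacting genes. Returns a value in [0, 1]; 1 means the gene
    interacts equally with all of its neighbors.
    """
    nbrs = neighbors[gene]
    if len(nbrs) < 2:
        return 0.0
    weights = [expression[gene] * expression[n] for n in nbrs]
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(nbrs))


# Hypothetical toy network: gene "A" interacts with "B", "C" and "D".
neighbors = {"A": ["B", "C", "D"]}
uniform = {"A": 1.0, "B": 2.0, "C": 2.0, "D": 2.0}
skewed = {"A": 1.0, "B": 10.0, "C": 0.1, "D": 0.1}

promiscuous = local_entropy("A", neighbors, uniform)
committed = local_entropy("A", neighbors, skewed)
print(promiscuous, committed)
```

Under this assumed definition, a gene whose neighbors are equally expressed scores close to 1 (promiscuous signaling), while a gene dominated by one highly expressed partner scores much lower, which mirrors the abstract's link between signaling promiscuity and the undifferentiated state.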