-
Bayesian ensemble learning for predicting health outcomes of multipollutant mixtures
Authors:
Yu-Chien Ning,
Xin Zhou,
Francine Laden,
Molin Wang
Abstract:
We introduce the SoftBart approach from Bayesian ensemble learning to estimate the relationship between chronic exposure to multipollutant mixtures and health outcomes in epidemiological research. This approach offers several key advantages over existing methods: (1) it is computationally efficient and well-suited for analyzing large datasets; (2) it is flexible in estimating various correlated nonlinear functions simultaneously; and (3) it accurately identifies active variables within highly correlated multipollutant mixtures. Through simulations, we demonstrate the method's superiority by comparing its accuracy in estimating and quantifying uncertainties for both main and interaction effects with that of the commonly used method, BKMR. Finally, we apply the method to analyze a multipollutant dataset with 10,110 participants from the Nurses' Health Study.
Submitted 23 May, 2025;
originally announced May 2025.
-
HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity
Authors:
Xuejun Sun,
Yiran Song,
Xiaochen Zhou,
Ruilie Cai,
Yu Zhang,
Xinyi Li,
Rui Peng,
Jialiu Xie,
Yuanyuan Yan,
Muyao Tang,
Prem Lakshmanane,
Baiming Zou,
James S. Hagood,
Raymond J. Pickles,
Didong Li,
Fei Zou,
Xiaojing Zheng
Abstract:
Respiratory viral infections pose a global health burden, yet the cellular immune responses driving protection or pathology remain unclear. Natural infection cohorts often lack pre-exposure baseline data and structured temporal sampling. In contrast, inoculation and vaccination trials generate insightful longitudinal transcriptomic data. However, the scattering of these datasets across platforms, along with inconsistent metadata and preprocessing procedures, hinders AI-driven discovery. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready, rigorously curated dataset that integrates 14,136 RNA-seq profiles from 3,178 subjects across 66 studies encompassing over 2.56 million cells. Spanning vaccination, inoculation, and mixed exposures, the dataset includes microarray, bulk RNA-seq, and single-cell RNA-seq data from whole blood, PBMCs, and nasal swabs, sourced from GEO, ImmPort, and ArrayExpress. We harmonized subject-level metadata, standardized outcome measures, applied unified preprocessing pipelines with rigorous quality control, and aligned all data to official gene symbols. To demonstrate the utility of HR-VILAGE-3K3M, we performed predictive modeling of vaccine responders and evaluated batch-effect correction methods. Beyond these initial demonstrations, it supports diverse systems immunology applications and benchmarking of feature selection and transfer learning algorithms. Its scale and heterogeneity also make it ideal for pretraining foundation models of the human immune response and for advancing multimodal learning frameworks. As the largest longitudinal transcriptomic resource for human respiratory viral immunization, it provides an accessible platform for reproducible AI-driven research, accelerating systems immunology and vaccine development against emerging viral threats.
Submitted 19 May, 2025;
originally announced May 2025.
-
AI-powered virtual eye: perspective, challenges and opportunities
Authors:
Yue Wu,
Yibo Guo,
Yulong Yan,
Jiancheng Yang,
Xin Zhou,
Ching-Yu Cheng,
Danli Shi,
Mingguang He
Abstract:
We envision the "virtual eye" as a next-generation, AI-powered platform that uses interconnected foundation models to simulate the eye's intricate structure and biological function across all scales. Advances in AI, imaging, and multiomics provide fertile ground for constructing a universal, high-fidelity digital replica of the human eye. This perspective traces the evolution from early mechanistic and rule-based models to contemporary AI-driven approaches that integrate multimodal, multiscale, dynamic predictive capabilities and embedded feedback mechanisms in a unified model. We propose a development roadmap emphasizing the roles of large-scale multimodal datasets, generative AI, foundation models, agent-based architectures, and interactive interfaces. Despite challenges in interpretability, ethics, data processing, and evaluation, the virtual eye holds the potential to revolutionize personalized ophthalmic care and accelerate research into ocular health and disease.
Submitted 7 May, 2025;
originally announced May 2025.
-
Accelerating Causal Network Discovery of Alzheimer Disease Biomarkers via Scientific Literature-based Retrieval Augmented Generation
Authors:
Xiaofan Zhou,
Liangjie Huang,
Pinyang Cheng,
Wenpen Yin,
Rui Zhang,
Wenrui Hao,
Lu Cheng
Abstract:
The causal relationships between biomarkers are essential for disease diagnosis and medical treatment planning. One notable application is Alzheimer's disease (AD) diagnosis, where certain biomarkers may influence the presence of others, enabling early detection, precise disease staging, targeted treatments, and improved monitoring of disease progression. However, understanding these causal relationships is complex and requires extensive research. Constructing a comprehensive causal network of biomarkers demands significant effort from human experts, who must analyze a vast number of research papers and may be biased in their understanding of disease biomarkers and their relations. This raises an important question: Can advanced large language models (LLMs), such as those utilizing retrieval-augmented generation (RAG), assist in building causal networks of biomarkers for further medical analysis? To explore this, we collected 200 AD-related research papers published over the past 25 years and then integrated scientific literature with RAG to extract AD biomarkers and generate causal relations among them. Given the high-risk nature of medical diagnosis, we applied uncertainty estimation to assess the reliability of the generated causal edges and examined the faithfulness and scientificness of LLM reasoning using both automatic and human evaluation. We find that RAG enhances the ability of LLMs to generate more accurate causal networks from scientific papers. However, the overall performance of LLMs in identifying causal relations among AD biomarkers is still limited. We hope this study will inspire further foundational research on AI-driven causal network discovery for AD biomarkers.
Submitted 1 April, 2025;
originally announced April 2025.
-
An LLM-Driven Multi-Agent Debate System for Mendelian Diseases
Authors:
Xinyang Zhou,
Yongyong Ren,
Qianqian Zhao,
Daoyi Huang,
Xinbo Wang,
Tingting Zhao,
Zhixing Zhu,
Wenyuan He,
Shuyuan Li,
Yan Xu,
Yu Sun,
Yongguo Yu,
Shengnan Wu,
Jian Wang,
Guangjun Yu,
Dake He,
Bo Ban,
Hui Lu
Abstract:
Accurate diagnosis of Mendelian diseases is crucial for precision therapy and assistance in preimplantation genetic diagnosis. However, existing methods often fall short of clinical standards or depend on extensive datasets to build pretrained machine learning models. To address this, we introduce an innovative LLM-driven multi-agent debate system (MD2GPS) with natural language explanations of the diagnostic results. It utilizes a language model to transform results from data-driven and knowledge-driven agents into natural language, then fosters a debate between these two specialized agents. This system has been tested on 1,185 samples across four independent datasets, enhancing the TOP1 accuracy from 42.9% to 66% on average. Additionally, in a challenging cohort of 72 cases, MD2GPS identified potential pathogenic genes in 12 patients, reducing the diagnostic time by 90%. The methods within each module of this multi-agent debate system are also replaceable, facilitating its adaptation for diagnosing and researching other complex diseases.
Submitted 11 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows
Authors:
Xiangxin Zhou,
Yi Xiao,
Haowei Lin,
Xinheng He,
Jiaqi Guan,
Yang Wang,
Qiang Liu,
Feng Zhou,
Liang Wang,
Jianzhu Ma
Abstract:
The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and advancing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome these challenges, we propose to use generative modeling for SBDD that accounts for conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.
Submitted 5 March, 2025;
originally announced March 2025.
-
Dynamical analysis of an HIV infection model including quiescent cells and immune response
Authors:
Ibrahim Nali,
Attila Dénes,
Abdessamad Tridane,
Xueyong Zhou
Abstract:
This research gives a thorough examination of an HIV infection model that includes quiescent cells and immune response dynamics in the host. The model, represented by a system of ordinary differential equations, captures the complex interaction between the host's immune response and viral infection. The study focuses on the model's fundamental aspects, such as equilibrium analysis, computing the basic reproduction number $\mathcal{R}_0$, stability analysis, bifurcation phenomena, numerical simulations, and sensitivity analysis.
The analysis reveals both an infection equilibrium, which indicates the persistence of the illness, and an infection-free equilibrium, which represents disease control possibilities. Using matrix-theoretical approaches, we prove that the infection-free equilibrium is both locally and globally stable for $\mathcal{R}_0 < 1$. For $\mathcal{R}_0 > 1$, the infection equilibrium is shown to be locally asymptotically stable via the Routh--Hurwitz criterion. We also study the uniform persistence of the infection, demonstrating that the infection remains present above a positive threshold under certain conditions. The study also finds a transcritical forward-type bifurcation at $\mathcal{R}_0 = 1$, indicating a critical threshold that affects the system's behavior. The model's temporal dynamics are studied using numerical simulations, and sensitivity analysis identifies the most significant parameters by assessing the effects of parameter changes on system behavior.
Submitted 26 February, 2025;
originally announced March 2025.
-
UniMatch: Universal Matching from Atom to Task for Few-Shot Drug Discovery
Authors:
Ruifeng Li,
Mingqian Li,
Wei Liu,
Yuhua Zhou,
Xiangxin Zhou,
Yuan Yao,
Qiang Zhang,
Hongyang Chen
Abstract:
Drug discovery is crucial for identifying candidate drugs for various diseases. However, its low success rate often results in a scarcity of annotations, posing a few-shot learning problem. Existing methods primarily focus on single-scale features, overlooking the hierarchical molecular structures that determine different molecular properties. To address these issues, we introduce Universal Matching Networks (UniMatch), a dual matching framework that integrates explicit hierarchical molecular matching with implicit task-level matching via meta-learning, bridging multi-level molecular representations and task-level generalization. Specifically, our approach explicitly captures structural features across multiple levels, such as atoms, substructures, and molecules, via hierarchical pooling and matching, facilitating precise molecular representation and comparison. Additionally, we employ a meta-learning strategy for implicit task-level matching, allowing the model to capture shared patterns across tasks and quickly adapt to new ones. This unified matching framework ensures effective molecular alignment while leveraging shared meta-knowledge for fast adaptation. Our experimental results demonstrate that UniMatch outperforms state-of-the-art methods on the MoleculeNet and FS-Mol benchmarks, achieving improvements of 2.87% in AUROC and 6.52% in delta AUPRC. UniMatch also shows excellent generalization ability on the Meta-MolNet benchmark.
Submitted 17 February, 2025;
originally announced February 2025.
-
Group Ligands Docking to Protein Pockets
Authors:
Jiaqi Guan,
Jiahan Li,
Xiangxin Zhou,
Xingang Peng,
Sheng Wang,
Yunan Luo,
Jian Peng,
Jianzhu Ma
Abstract:
Molecular docking is a key task in computational biology that has attracted increasing interest from the machine learning community. While existing methods have achieved success, they generally treat each protein-ligand pair in isolation. Inspired by the biochemical observation that ligands binding to the same target protein tend to adopt similar poses, we propose GroupBind, a novel molecular docking framework that simultaneously considers multiple ligands docking to a protein. This is achieved by introducing an interaction layer for the group of ligands and a triangle attention module for embedding protein-ligand and group-ligand pairs. By integrating our approach with a diffusion-based docking model, we set a new state-of-the-art performance on the PDBBind blind docking benchmark, demonstrating the effectiveness of our proposed molecular docking paradigm.
Submitted 24 January, 2025;
originally announced January 2025.
-
BioTD: an online database of biotoxins
Authors:
Gaoang Wang,
Hang Wu,
Yang Liao,
Zhen Chen,
Qing Zhou,
Wenxing Wang,
Yifei Liu,
Yilin Wang,
Meijing Wu,
Ruiqi Xiang,
Yuntao Yu,
Xi Zhou,
Feng Zhu,
Zhonghua Liu,
Tingjun Hou
Abstract:
Biotoxins, mainly produced by venomous animals, plants and microorganisms, exhibit high physiological activity and unique effects such as lowering blood pressure and analgesia. A number of venom-derived drugs are already available on the market, with many more candidates currently undergoing clinical and laboratory studies. However, drug design resources related to biotoxins are insufficient, with a particular lack of accurate and extensive activity data. To meet this need, we developed the Biotoxins Database (BioTD). BioTD is the largest open-source database for toxins, offering open access to 14,607 data records (8,185 activity records), covering 8,975 toxins sourced from 5,220 references and patents across over 900 species. The activity data in BioTD are categorized into five groups: Activity, Safety, Kinetics, Hemolysis and other physiological indicators. Moreover, BioTD provides data on 986 mutants, refines the whole sequence and signal peptide sequences of toxins, and annotates disulfide bond information. Given the importance of biotoxins and their associated data, this new database is expected to attract broad interest from diverse research fields in drug discovery. BioTD is freely accessible at http://biotoxin.net/.
Submitted 28 December, 2024;
originally announced December 2024.
-
Contextual Representation Anchor Network to Alleviate Selection Bias in Few-Shot Drug Discovery
Authors:
Ruifeng Li,
Wei Liu,
Xiangxin Zhou,
Mingqian Li,
Qiang Zhang,
Hongyang Chen,
Xuemin Lin
Abstract:
In the drug discovery process, the low success rate of drug candidate screening often leads to insufficient labeled data, causing the few-shot learning problem in molecular property prediction. Existing methods for few-shot molecular property prediction overlook the sample selection bias, which arises from non-random sample selection in chemical experiments. This bias in data representativeness leads to suboptimal performance. To overcome this challenge, we present a novel method named Contextual Representation Anchor Network (CRA), where an anchor refers to a cluster center of the representations of molecules and serves as a bridge to transfer enriched contextual knowledge into molecular representations and enhance their expressiveness. CRA introduces a dual-augmentation mechanism that includes context augmentation, which dynamically retrieves analogous unlabeled molecules and captures their task-specific contextual knowledge to enhance the anchors, and anchor augmentation, which leverages the anchors to augment the molecular representations. We evaluate our approach on the MoleculeNet and FS-Mol benchmarks, as well as in domain transfer experiments. The results demonstrate that CRA outperforms the state-of-the-art by 2.60% and 3.28% in AUC and $\Delta$AUC-PR metrics, respectively, and exhibits superior generalization capabilities.
Submitted 29 October, 2024; v1 submitted 27 October, 2024;
originally announced October 2024.
-
Reprogramming Pretrained Target-Specific Diffusion Models for Dual-Target Drug Design
Authors:
Xiangxin Zhou,
Jiaqi Guan,
Yijia Zhang,
Xingang Peng,
Liang Wang,
Jianzhu Ma
Abstract:
Dual-target therapeutic strategies have become a compelling approach and attracted significant attention due to various benefits, such as their potential in overcoming drug resistance in cancer therapy. Considering the tremendous success that deep generative models have achieved in structure-based drug design in recent years, we formulate dual-target drug design as a generative task and curate a novel dataset of potential target pairs based on synergistic drug combinations. We propose to design dual-target drugs with diffusion models that are trained on single-target protein-ligand complex pairs. Specifically, we align two pockets in 3D space with protein-ligand binding priors and build two complex graphs with shared ligand nodes for SE(3)-equivariant composed message passing, based on which we derive a composed drift in both 3D and categorical probability space in the generative process. Our algorithm can effectively transfer the knowledge gained in single-target pretraining to dual-target scenarios in a zero-shot manner. We also repurpose linker design methods as strong baselines for this task. Extensive experiments demonstrate the effectiveness of our method compared with various baselines.
Submitted 26 November, 2024; v1 submitted 27 October, 2024;
originally announced October 2024.
-
ProteinBench: A Holistic Evaluation of Protein Foundation Models
Authors:
Fei Ye,
Zaixiang Zheng,
Dongyu Xue,
Yuning Shen,
Lihao Wang,
Yiming Ma,
Yan Wang,
Xinyou Wang,
Xiangxin Zhou,
Quanquan Gu
Abstract:
Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance in protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations associated with these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we publicly release the evaluation dataset, code, a public leaderboard for further analysis, and a general modular toolkit. We intend for ProteinBench to be a living benchmark for establishing a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.
Submitted 7 October, 2024; v1 submitted 10 September, 2024;
originally announced September 2024.
-
The Effects of Unilateral Slope Loading on Lower Limb Plantar Flexor Muscle EMG Signals in Young Healthy Males
Authors:
Xinyu Zhou,
Gengshang Dong,
Pengxuan Zhang
Abstract:
Different loading modes can significantly affect human gait, posture, and lower limb biomechanics. This study investigated the activity intensity of the soleus muscle in young healthy adult males walking on a slope under unilateral loading. Ten subjects held dumbbells equal to 5% and 10% of their body weight (BW) and walked at a fixed speed on slopes of 5 degrees and 10 degrees, respectively. Changes in the electromyography (EMG) signals of the bilateral soleus muscles were recorded. One-way analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) were used to examine the relationship between load weight, slope angle, and muscle activity intensity. The data provided by this research can help advance the field of lower limb assistive exoskeletons. The results fill a gap in data on unilateral loading on slopes, provide data support for future assistance systems, and promote the formation of relevant datasets, so as to improve the terrain recognition ability of such devices and the movement ability of their wearers.
Submitted 6 September, 2024;
originally announced September 2024.
-
Decomposed Direct Preference Optimization for Structure-Based Drug Design
Authors:
Xiwei Cheng,
Xiangxin Zhou,
Yuwei Yang,
Yu Bao,
Quanquan Gu
Abstract:
Diffusion models have achieved promising results for Structure-Based Drug Design (SBDD). Nevertheless, high-quality protein subpocket and ligand data are relatively scarce, which hinders the models' generation capabilities. Recently, Direct Preference Optimization (DPO) has emerged as a pivotal tool for aligning generative models with human preferences. In this paper, we propose DecompDPO, a structure-based optimization method that aligns diffusion models with pharmaceutical needs using multi-granularity preference pairs. DecompDPO introduces decomposition into the optimization objectives and obtains preference pairs at the molecule or decomposed substructure level based on each objective's decomposability. Additionally, DecompDPO introduces a physics-informed energy term to ensure reasonable molecular conformations in the optimization results. Notably, DecompDPO can be effectively used for two main purposes: (1) fine-tuning pretrained diffusion models for molecule generation across various protein families, and (2) molecular optimization given a specific protein subpocket after generation. Extensive experiments on the CrossDocked2020 benchmark show that DecompDPO significantly improves model performance, achieving up to 95.2% Med. High Affinity and a 36.2% success rate for molecule generation, and 100% Med. High Affinity and a 52.1% success rate for molecular optimization.
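For readers unfamiliar with the Direct Preference Optimization objective mentioned in this abstract, the following is a minimal sketch of the generic pairwise DPO loss for a single preference pair. It is not the decomposed, diffusion-specific objective of DecompDPO, which the abstract does not spell out. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic pairwise DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: log-likelihoods of the preferred ("winner") and
    dispreferred ("loser") samples under the model being tuned.
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    beta: temperature on the implicit reward (hypothetical default).
    """
    # Implicit reward margin: how much more the tuned model favors the
    # winner over the loser, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # With identical log-likelihoods the margin is zero and the loss is log(2).
    print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```

The loss shrinks as the tuned model raises the winner's likelihood relative to the loser's, compared with the reference model. A substructure-level variant would, under this sketch, apply the same loss to log-likelihoods of decomposed parts rather than whole molecules.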
Submitted 27 October, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
EEG-Deformer: A Dense Convolutional Transformer for Brain-computer Interfaces
Authors:
Yi Ding,
Yong Li,
Hao Sun,
Rui Liu,
Chengxuan Tong,
Chenyu Liu,
Xinliang Zhou,
Cuntai Guan
Abstract:
Effectively learning the temporal dynamics in electroencephalogram (EEG) signals is challenging yet essential for decoding brain activities using brain-computer interfaces (BCIs). Although Transformers are popular for their long-term sequential learning ability in the BCI field, most methods combining Transformers with convolutional neural networks (CNNs) fail to capture the coarse-to-fine temporal dynamics of EEG signals. To overcome this limitation, we introduce EEG-Deformer, which incorporates two main novel components into a CNN-Transformer: (1) a Hierarchical Coarse-to-Fine Transformer (HCT) block that integrates a Fine-grained Temporal Learning (FTL) branch into Transformers, effectively discerning coarse-to-fine temporal patterns; and (2) a Dense Information Purification (DIP) module, which utilizes multi-level, purified temporal information to enhance decoding accuracy. Comprehensive experiments on three representative cognitive tasks (cognitive attention, driving fatigue, and mental workload detection) consistently confirm the generalizability of our proposed EEG-Deformer, demonstrating that it either outperforms or performs comparably to existing state-of-the-art methods. Visualization results show that EEG-Deformer learns from neurophysiologically meaningful brain regions for the corresponding cognitive tasks. The source code can be found at https://github.com/yi-ding-cs/EEG-Deformer.
Submitted 29 October, 2024; v1 submitted 25 April, 2024;
originally announced May 2024.
-
Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization
Authors:
Xiangxin Zhou,
Dongyu Xue,
Ruizhe Chen,
Zaixiang Zheng,
Liang Wang,
Quanquan Gu
Abstract:
Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody sequence-structure co-design as an optimization problem towards specific preferences, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on the RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity simultaneously.
Submitted 27 October, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
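The abstract above mentions gradient surgery for resolving conflicts between energy terms. The following is a minimal sketch of one common form of gradient surgery (projecting away the conflicting component, in the style of PCGrad), not the authors' implementation. The function names `project_conflicting` and `combine`, the fixed projection order, and the toy 2-D "attraction" and "repulsion" gradients are all illustrative assumptions.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting(g_i, g_j):
    """Return g_i with its component along g_j removed if the two conflict.

    Two gradients are treated as conflicting when their dot product is negative.
    """
    d = dot(g_i, g_j)
    norm_sq = dot(g_j, g_j)
    if d >= 0 or norm_sq == 0.0:
        return list(g_i)  # no conflict: leave g_i unchanged
    scale = d / norm_sq
    return [a - scale * b for a, b in zip(g_i, g_j)]


def combine(grads):
    """Project each gradient against every other one, then sum the results.

    Assumption: a fixed projection order is used here for reproducibility;
    PCGrad itself visits the other gradients in random order.
    """
    projected = []
    for i, g in enumerate(grads):
        g_new = list(g)
        for j, other in enumerate(grads):
            if i != j:
                g_new = project_conflicting(g_new, other)
        projected.append(g_new)
    return [sum(vals) for vals in zip(*projected)]


# Hypothetical toy gradients of two energy terms that partly oppose each other.
attraction = [1.0, 0.0]
repulsion = [-1.0, 1.0]
update = combine([attraction, repulsion])
print(update)
```

In this toy case the combined update no longer points against either original gradient, which is the property the projection step is meant to provide.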
-
Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule
Authors:
Yi Xiao,
Xiangxin Zhou,
Qiang Liu,
Liang Wang
Abstract:
Artificial intelligence has demonstrated immense potential in scientific research. Within molecular science, it is revolutionizing the traditional computer-aided paradigm, ushering in a new era of deep learning. With recent progress in multimodal learning and natural language processing, an emerging trend aims at building multimodal frameworks to jointly model molecules with textual domain knowledge. In this paper, we present the first systematic survey on multimodal frameworks for molecule research. Specifically, we begin with the development of molecular deep learning and point out the necessity of involving the textual modality. Next, we focus on recent advances in text-molecule alignment methods, categorizing current models into two groups based on their architectures and listing relevant pre-training tasks. Furthermore, we delve into the utilization of large language models and prompting techniques for molecular tasks and present significant applications in drug discovery. Finally, we discuss the limitations in this field and highlight several promising directions for future research.
Submitted 6 March, 2024;
originally announced March 2024.
-
DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization
Authors:
Xiangxin Zhou,
Xiwei Cheng,
Yuwei Yang,
Yu Bao,
Liang Wang,
Quanquan Gu
Abstract:
Recently, 3D generative models have shown promising performance in structure-based drug design by learning to generate ligands given target binding sites. However, modeling only the target-ligand distribution can hardly fulfill one of the main goals in drug discovery: designing novel ligands with desired properties, e.g., high binding affinity and ease of synthesis. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving the de novo design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve this, ligands are decomposed into substructures, which allows fine-grained control and local optimization. Experiments show that DecompOpt can efficiently generate molecules with better properties than strong de novo baselines, and demonstrates great potential in controllable generation tasks.
Submitted 6 March, 2024;
originally announced March 2024.
-
DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design
Authors:
Jiaqi Guan,
Xiangxin Zhou,
Yuwei Yang,
Yu Bao,
Jian Peng,
Jianzhu Ma,
Qiang Liu,
Liang Wang,
Quanquan Gu
Abstract:
Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structure-based drug design methods treat all ligand atoms equally, which ignores the different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff
Submitted 26 February, 2024;
originally announced March 2024.
-
Binding-Adaptive Diffusion Models for Structure-Based Drug Design
Authors:
Zhilin Huang,
Ling Yang,
Zaixi Zhang,
Xiangxin Zhou,
Yu Bao,
Xiawu Zheng,
Yuwei Yang,
Yu Wang,
Wenming Yang
Abstract:
Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models, including diffusion models, have shown great promise for SBDD. However, it is difficult to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract the subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex to augment the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with a cross-hierarchy interaction node to adequately fuse the global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show that BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Avg. Vina Score, while maintaining proper molecular properties. Our code is available at https://github.com/YangLing0818/BindDM
Submitted 14 January, 2024;
originally announced February 2024.
-
Advancing bioinformatics with large language models: components, applications and perspectives
Authors:
Jiajia Liu,
Mengyuan Yang,
Yankai Yu,
Haixia Xu,
Tiangang Wang,
Kang Li,
Xiaobo Zhou
Abstract:
Large language models (LLMs) are a class of artificial intelligence models based on deep learning that perform well in various tasks, especially in natural language processing (NLP). LLMs typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we provide a comprehensive overview of the essential components of LLMs in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
Submitted 31 January, 2025; v1 submitted 8 January, 2024;
originally announced January 2024.
-
Estimation and Inference for High-dimensional Multi-response Growth Curve Model
Authors:
Xin Zhou,
Yin Xia,
Lexin Li
Abstract:
A growth curve model (GCM) aims to characterize how an outcome variable evolves, develops and grows as a function of time, along with other predictors. It provides a particularly useful framework to model growth trends in longitudinal data. However, the estimation and inference of GCM with a large number of response variables face numerous challenges, and remain underdeveloped. In this article, we study the high-dimensional multivariate-response linear GCM, and develop the corresponding estimation and inference procedures. Our proposal is far from a straightforward extension, and involves several innovative components. Specifically, we introduce a Kronecker product structure, which allows us to effectively decompose a very large covariance matrix, and to pool the correlated samples to improve the estimation accuracy. We devise a highly non-trivial multi-step estimation approach to estimate the individual covariance components separately and effectively. We also develop rigorous statistical inference procedures to test both the global effects and the individual effects, and establish the size and power properties, as well as the proper false discovery control. We demonstrate the effectiveness of the new method through both intensive simulations and the analysis of a longitudinal neuroimaging dataset for Alzheimer's disease.
Submitted 27 December, 2023;
originally announced December 2023.
-
Effective connectivity signatures in major depressive disorder: fMRI study using a multi-site dataset
Authors:
Peishan Dai,
Yun Shi,
Tong Xiong,
Xiaoyan Zhou,
Shenghui Liao,
Zhongchao Huang,
Xiaoping Yi,
Bihong T. Chen
Abstract:
Diagnosis of major depressive disorder (MDD) primarily relies on the patient's self-reported symptoms and a clinical evaluation. Effective connectivity (EC) from resting-state functional magnetic resonance imaging (rs-fMRI) analysis can reflect the directionality of connections between brain regions, making it a candidate method to classify MDD. This study used Granger causality analysis to extract EC features from a large multi-site MDD dataset. The ComBat algorithm and multivariate linear regression were used to harmonize site differences and to remove age and sex covariates, respectively. Two-sample t-tests and model-based feature selection methods were used to screen for highly discriminative EC features for MDD, and LightGBM was used to classify MDD. In this large-scale multi-site rs-fMRI dataset, 97 EC features deemed highly discriminative for MDD were identified. In the nested five-fold cross-validation, the best classification model with the 97 EC features achieved accuracy, sensitivity, and specificity of 94.35%, 93.52%, and 95.25%, respectively. In another independent large dataset, which tested the generalization performance of the 97 EC features, the best classification models achieved 94.74%, 90.59%, and 96.75% for accuracy, sensitivity, and specificity, respectively. This work demonstrated that EC had a reasonable discriminative ability and supported the notion of using EC to potentially assist clinical diagnosis of MDD.
Submitted 29 December, 2023; v1 submitted 31 October, 2023;
originally announced October 2023.
-
Digital Twinning of the Human Ventricular Activation Sequence to Clinical 12-lead ECGs and Magnetic Resonance Imaging Using Realistic Purkinje Networks for in Silico Clinical Trials
Authors:
Julia Camps,
Lucas Arantes Berg,
Zhinuo Jenny Wang,
Rafael Sebastian,
Leto Luana Riebel,
Ruben Doste,
Xin Zhou,
Rafael Sachetto,
James Coleman,
Brodie Lawson,
Vicente Grau,
Kevin Burrage,
Alfonso Bueno-Orovio,
Rodrigo Weber,
Blanca Rodriguez
Abstract:
Cardiac in silico clinical trials can virtually assess the safety and efficacy of therapies using human-based modelling and simulation. These technologies can provide mechanistic explanations for clinically observed pathological behaviour. Designing virtual cohorts for in silico trials requires exploiting clinical data to capture the physiological variability in the human population. The clinical characterisation of ventricular activation and the Purkinje network is challenging, especially non-invasively. Our study aims to present a novel digital twinning pipeline that can efficiently generate and integrate Purkinje networks into human multiscale biventricular models based on subject-specific clinical 12-lead electrocardiogram and magnetic resonance recordings. Essential novel features of the pipeline are the human-based Purkinje network generation method, personalisation considering ECG R wave progression as well as QRS morphology, and translation from reduced-order Eikonal models to equivalent biophysically-detailed monodomain ones. We demonstrate ECG simulations in line with clinical data with clinical image-based multiscale models with Purkinje in four control subjects and two hypertrophic cardiomyopathy patients (simulated and clinical QRS complexes with Pearson's correlation coefficients > 0.7). Our methods also considered possible differences in the density of Purkinje myocardial junctions in the Eikonal-based inference as regional conduction velocities. These differences translated into regional coupling effects between Purkinje and myocardial models in the monodomain formulation. In summary, we demonstrate a digital twin pipeline enabling simulations yielding clinically-consistent ECGs with clinical CMR image-based biventricular multiscale models, including personalised Purkinje in healthy and cardiac disease conditions.
Submitted 23 June, 2023;
originally announced June 2023.
-
Sequential Best-Arm Identification with Application to Brain-Computer Interface
Authors:
Xin Zhou,
Botao Hao,
Jian Kang,
Tor Lattimore,
Lexin Li
Abstract:
A brain-computer interface (BCI) is a technology that enables direct communication between the brain and an external device or computer system. It allows individuals to interact with the device using only their thoughts, and holds immense potential for a wide range of applications in medicine, rehabilitation, and human augmentation. An electroencephalogram (EEG) and event-related potential (ERP)-based speller system is a type of BCI that allows users to spell words without using a physical keyboard, but instead by recording and interpreting brain signals under different stimulus presentation paradigms. Conventional non-adaptive paradigms treat each word selection independently, leading to a lengthy learning process. To improve the sampling efficiency, we cast the problem as a sequence of best-arm identification tasks in multi-armed bandits. Leveraging pre-trained large language models (LLMs), we utilize the prior knowledge learned from previous tasks to inform and facilitate subsequent tasks. To do so in a coherent way, we propose a sequential top-two Thompson sampling (STTS) algorithm under the fixed-confidence setting and the fixed-budget setting. We study the theoretical properties of the proposed algorithm, and demonstrate its substantial empirical improvement through both synthetic data analysis and a P300 BCI speller simulator example.
Submitted 17 May, 2023;
originally announced May 2023.
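The abstract above builds on top-two Thompson sampling for best-arm identification. The following is a minimal sketch of the generic single-task, fixed-budget top-two Thompson sampling idea for Bernoulli arms with Beta(1, 1) priors. It is not the STTS algorithm from the paper, which additionally carries LLM-informed priors across a sequence of tasks. The function names, the `prob_leader` parameter, the resampling cap, and the arm means are illustrative assumptions.

```python
import random


def sample_best(alpha, beta, rng):
    """Draw one posterior sample per arm and return the index of the largest."""
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draws)), key=draws.__getitem__)


def top_two_thompson(true_means, budget, prob_leader=0.5, seed=0, max_resample=100):
    """Fixed-budget top-two Thompson sampling with Beta-Bernoulli posteriors."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # 1 + observed successes per arm
    beta = [1.0] * k   # 1 + observed failures per arm
    for _step in range(budget):
        leader = sample_best(alpha, beta, rng)
        arm = leader
        if rng.random() >= prob_leader:
            # Resample until a different arm wins; that arm is the challenger.
            # Assumption: resampling is capped, falling back to the leader.
            for _try in range(max_resample):
                challenger = sample_best(alpha, beta, rng)
                if challenger != leader:
                    arm = challenger
                    break
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
    post_means = [a / (a + b) for a, b in zip(alpha, beta)]
    best = max(range(k), key=post_means.__getitem__)
    return best, alpha, beta


# Hypothetical three-arm problem; arm 2 has the highest success probability.
best, alpha, beta = top_two_thompson([0.2, 0.5, 0.9], budget=2000)
print(best)
```

Playing the challenger part of the time forces the sampler to keep testing the runner-up, which is what distinguishes best-arm identification from plain reward-maximizing Thompson sampling.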
-
Cell Population Growth Kinetics in the Presence of Stochastic Heterogeneity of Cell Phenotype
Authors:
Yue Wang,
Joseph X. Zhou,
Edoardo Pedrini,
Irit Rubin,
May Khalil,
Roberto Taramelli,
Hong Qian,
Sui Huang
Abstract:
Recent studies at individual cell resolution have revealed phenotypic heterogeneity in nominally clonal tumor cell populations. The heterogeneity affects cell growth behaviors, which can result in departure from the idealized uniform exponential growth of the cell population. Here we measured the stochastic time courses of growth of an ensemble of populations of HL60 leukemia cells in cultures, starting with distinct initial cell numbers to capture a departure from the uniform exponential growth model for the initial growth ("take-off"). Although the cultures were derived from the same cell clone, we observed significant variations in the early growth patterns of individual cultures with statistically significant differences in growth dynamics, which could be explained by the presence of inter-converting subpopulations with different growth rates, and which could last for many generations. Based on the hypothesis of the existence of multiple subpopulations, we developed a branching process model that was consistent with the experimental observations.
Submitted 18 October, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Brain informed transfer learning for categorizing construction hazards
Authors:
Xiaoshan Zhou,
Pin-Chao Liao
Abstract:
A transfer learning paradigm is proposed for "knowledge" transfer between the human brain and a convolutional neural network (CNN) for a construction hazard categorization task. Participants' brain activities are recorded using electroencephalogram (EEG) measurements when viewing the same images (target dataset) as the CNN. The CNN is pretrained on the EEG data and then fine-tuned on the construction scene images. The results reveal that the EEG-pretrained CNN achieves a 9% higher accuracy compared with a network with the same architecture but randomly initialized parameters on a three-class classification task. Brain activity from the left frontal cortex exhibits the highest performance gains, thus indicating high-level cognitive processing during hazard recognition. This work is a step toward improving machine learning algorithms by learning from human-brain signals recorded via a commercially available brain-computer interface. More generalized visual recognition systems can be effectively developed based on this approach of keeping the "human in the loop".
Submitted 17 November, 2022;
originally announced November 2022.
-
A privacy-preserving data storage and service framework based on deep learning and blockchain for construction workers' wearable IoT sensors
Authors:
Xiaoshan Zhou,
Pin-Chao Liao
Abstract:
Classifying brain signals collected by wearable Internet of Things (IoT) sensors, especially brain-computer interfaces (BCIs), is one of the fastest-growing areas of research. However, research has mostly ignored the secure storage and privacy protection issues of collected personal neurophysiological data. Therefore, in this article, we try to bridge this gap and propose a secure privacy-preserving protocol for implementing BCI applications. We first transformed brain signals into images and used a generative adversarial network to generate synthetic signals to protect data privacy. Subsequently, we applied the paradigm of transfer learning for signal classification. The proposed method was evaluated in a case study, and the results indicate that real electroencephalogram data augmented with artificially generated samples provide superior classification performance. In addition, we proposed a blockchain-based scheme and developed a prototype on Ethereum, which aims to make storing, querying and sharing personal neurophysiological data and analysis reports secure and privacy-aware. The rights of three main transaction bodies (construction workers, BCI service providers and project managers) are described and the advantages of the proposed system are discussed. We believe this paper provides a well-rounded solution to safeguard private data against cyber-attacks, level the playing field for BCI application developers, and ultimately improve professional well-being in the industry.
Submitted 19 November, 2022;
originally announced November 2022.
-
Network medicine framework reveals generic herb-symptom effectiveness of Traditional Chinese Medicine
Authors:
Xiao Gan,
Zixin Shu,
Xinyan Wang,
Dengying Yan,
Jun Li,
Shany Ofaim,
Réka Albert,
Xiaodong Li,
Baoyan Liu,
Xuezhong Zhou,
Albert-László Barabási
Abstract:
Traditional Chinese medicine (TCM) relies on natural medical products to treat symptoms and diseases. While clinical data have demonstrated the effectiveness of selected TCM-based treatments, the mechanistic root of how TCM herbs treat diseases remains largely unknown. More importantly, current approaches focus on single herbs or prescriptions, missing the high-level general principles of TCM. To uncover the mechanistic nature of TCM on a system level, in this work we establish a generic network medicine framework for TCM from the human protein interactome. Applying our framework reveals a network pattern between symptoms (diseases) and herbs in TCM. We first observe that genes associated with a symptom are not distributed randomly in the interactome, but cluster into localized modules; furthermore, a short network distance between two symptom modules is indicative of the symptoms' co-occurrence and similarity. Next, we show that the network proximity of a herb's targets to a symptom module is predictive of the herb's effectiveness in treating the symptom. We validate our framework with real-world hospital patient data by showing that (1) shorter network distance between symptoms of inpatients correlates with higher relative risk (co-occurrence), and (2) herb-symptom network proximity is indicative of patients' symptom recovery rate after herbal treatment. Finally, we identified novel herb-symptom pairs in which the herb's effectiveness in treating the symptom is predicted by network and confirmed in hospital data, but previously unknown to the TCM community. These predictions highlight our framework's potential in creating herb discovery or repurposing opportunities. In conclusion, network medicine offers a powerful novel platform to understand the mechanism of traditional medicine and to predict novel herbal treatment against diseases.
Submitted 18 July, 2022;
originally announced July 2022.
-
Effect of compositional fluctuation on the survival of bet-hedging species
Authors:
Xiao Zhou,
BingKan Xue
Abstract:
Understanding the coexistence of diverse species in a changing environment is an important problem in community ecology. Bet-hedging is a strategy that helps species survive in such changing environments. However, studies of bet-hedging have often focused on the expected long-term growth rate of the species by itself, neglecting competition with other coexisting species. Here we study the extinction risk of a bet-hedging species in competition with others. We show that there are three contributions to the extinction risk. The first is the usual demographic fluctuation due to stochastic reproduction and selection processes in finite populations. The second, due to the fluctuation of population growth rate caused by environmental changes, may counterintuitively reduce the extinction risk for small populations. Besides those two, we reveal a third contribution, which is unique to bet-hedging species that diversify into multiple phenotypes: The phenotype composition of the population will fluctuate over time, resulting in increased extinction risk. We compare such compositional fluctuation to the demographic and environmental contributions, showing how they have different effects on the extinction risk depending on the population size, generation overlap, and environmental correlation.
Submitted 10 April, 2022;
originally announced April 2022.
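The abstract above refers to the expected long-term growth rate that bet-hedging studies traditionally focus on. The following is a minimal worked sketch of that classical quantity only: for a species splitting its offspring between two phenotypes, the long-term growth rate is the environment-averaged log of the population's per-generation multiplication factor. It does not model the competition or compositional fluctuation studied in the paper. The fitness values, environment probabilities, and function name are illustrative assumptions.

```python
import math

# Hypothetical setup: two environments occurring with these probabilities.
ENV_PROBS = [0.5, 0.5]

# Hypothetical fitness (offspring per individual) of each phenotype per environment.
FITNESS_A = [2.0, 0.5]  # phenotype A thrives in environment 0
FITNESS_B = [0.5, 2.0]  # phenotype B thrives in environment 1


def long_term_growth_rate(frac_a):
    """Expected log growth per generation when a fraction frac_a adopts phenotype A."""
    rate = 0.0
    for p, f_a, f_b in zip(ENV_PROBS, FITNESS_A, FITNESS_B):
        mean_fitness = frac_a * f_a + (1.0 - frac_a) * f_b
        rate += p * math.log(mean_fitness)
    return rate


pure_a = long_term_growth_rate(1.0)  # all offspring are phenotype A
pure_b = long_term_growth_rate(0.0)  # all offspring are phenotype B
hedged = long_term_growth_rate(0.5)  # even split between phenotypes

print(pure_a, pure_b, hedged)
```

With these toy numbers either pure strategy has zero long-term growth, while the even split grows at rate log(1.25) per generation, which illustrates why diversifying phenotypes can pay off when the environment fluctuates.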
-
Evaluation of non-pharmaceutical interventions and optimal strategies for containing the COVID-19 pandemic
Authors:
Xiao Zhou,
Xiaohu Zhang,
Paolo Santi,
Carlo Ratti
Abstract:
Given that multiple new COVID-19 variants are continuously emerging, non-pharmaceutical interventions are still the primary control strategies to curb the further spread of coronavirus. However, implementing strict interventions over extended periods of time inevitably hurts the economy. With an aim to solve this multi-objective decision-making problem, we investigate the underlying associations between policies, mobility patterns, and virus transmission. We further evaluate the relative performance of existing COVID-19 control measures and explore potential optimal strategies that can strike the right balance between public health and socio-economic recovery for individual states in the US. The results highlight the power of state of emergency declarations and wearing face masks and emphasize the necessity of pursuing tailor-made strategies for different states and phases of epidemiological transmission. Our framework enables policymakers to create more refined designs of COVID-19 strategies and can be extended to inform policymakers of any country about best practices in pandemic response.
Submitted 28 February, 2022;
originally announced February 2022.
-
MSA-MIL: A deep residual multiple instance learning model based on multi-scale annotation for classification and visualization of glomerular spikes
Authors:
Yilin Chen,
Ming Li,
Yongfei Wu,
Xueyu Liu,
Fang Hao,
Daoxiang Zhou,
Xiaoshuang Zhou,
Chen Wang
Abstract:
Membranous nephropathy (MN) is a frequent type of adult nephrotic syndrome, which has a high clinical incidence and can cause various complications. In biopsy microscope slides of membranous nephropathy, spike-like projections on the glomerular basement membrane are a prominent feature of MN. However, because a whole biopsy slide contains a large number of glomeruli, and each glomerulus includes many spike lesions, the pathological feature of the spikes is not obvious. It is thus time-consuming for doctors to diagnose glomeruli one by one, and difficult for less experienced pathologists to diagnose them. In this paper, we establish a visualized classification model based on multi-scale annotation multi-instance learning (MSA-MIL) to achieve glomerular classification and spike visualization. The MSA-MIL model mainly involves three parts. First, U-Net is used to extract the region of the glomeruli to ensure that the features learned by the succeeding algorithm are focused inside the glomeruli themselves. Second, we use MIL to train an instance-level classifier combined with the MSA method to enhance the learning ability of the network by adding a location-level labeled reinforced dataset, thereby obtaining an example-level feature representation with rich semantics. Lastly, the predicted scores of each tile in the image are summarized to obtain the glomerular classification, and the classification results of the spikes are visualized using a sliding window method. The experimental results confirm that the proposed MSA-MIL model can effectively and accurately classify normal and spiked glomeruli and visualize the position of spikes in the glomerulus. Therefore, the proposed model can provide a good foundation for assisting clinical doctors in diagnosing glomerular membranous nephropathy.
Submitted 18 July, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Accelerating drug repurposing for COVID-19 via modeling drug mechanism of action with large scale gene-expression profiles
Authors:
Lu Han,
G. C. Shan,
B. F. Chu,
H. Y. Wang,
Z. J. Wang,
S. Q. Gao,
W. X. Zhou
Abstract:
The novel coronavirus disease, named COVID-19, emerged in China in December 2019 and has rapidly spread around the world. It is clearly urgent to fight COVID-19 at a global scale. The development of methods for identifying drug uses based on phenotypic data can improve the efficiency of drug development. However, there are still many difficulties in identifying drug applications based on cell image data. This work reports a state-of-the-art machine learning method to identify drug uses based on the cell image features of 1024 drugs generated in the LINCS program. Because the multi-dimensional features of the images are affected by non-experimental factors, the characteristics of similar drugs vary greatly, and the current number of samples is not enough to use deep learning and other methods for learning optimization. As a consequence, this study uses the supervised ITML algorithm to transform the drug features. The results show that the ITML-transformed features are more conducive to the recognition of drug functions. The analysis of the feature transformation shows that different features play important roles in identifying different drug functions. For the current COVID-19, Chloroquine and Hydroxychloroquine achieve antiviral effects by inhibiting endocytosis, etc., and were classified into the same community. Clomiphene, in the same community, inhibits the entry of Ebola virus, indicating similar MoAs that can be reflected by cell images.
Submitted 5 October, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Protein structure and sequence re-analysis of 2019-nCoV genome does not indicate snakes as its intermediate host or the unique similarity between its spike protein insertions and HIV-1
Authors:
Chengxin Zhang,
Wei Zheng,
Xiaoqiang Huang,
Eric W. Bell,
Xiaogen Zhou,
Yang Zhang
Abstract:
As the infection of 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions shared a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data of coronaviruses, we concluded that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the "novel insertions" observed in the spike protein are naturally evolved from bat coronaviruses.
Submitted 8 February, 2020;
originally announced February 2020.
-
DS-GCNs: Connectome Classification Using Dynamic Spectral Graph Convolution Networks with Assistant Task Training
Authors:
Xiaodan Xing,
Qingfeng Li,
Hao Wei,
Minqing Zhang,
Yiqiang Zhan,
Xiang Sean Zhou,
Zhong Xue,
Feng Shi
Abstract:
Functional Connectivity (FC) matrices measure the regional interactions in the brain and have been widely used in neurological brain disease classification. However, an FC matrix is neither a natural image, which contains shape and texture information, nor a vector of independent features, which makes extracting efficient features from the matrices a challenging problem. A brain network, also called a connectome, naturally forms a graph structure whose nodes are brain regions and whose edges are interregional connectivity. Thus, in this study, we propose novel graph convolutional networks (GCNs) to extract efficient disease-related features from FC matrices. Considering the time-dependent nature of brain activity, we compute dynamic FC matrices with sliding windows and implement a graph-convolution-based LSTM (long short-term memory) layer to process dynamic graphs. Moreover, the demographics of patients are also used to guide the classification. However, unlike conventional methods, in which personal information such as gender and age is added as extra inputs, we argue that this kind of approach may not actually improve the classification performance, because such personal information is usually balanced in the dataset. In this paper, we propose to utilize the demographic information as extra outputs and to share parameters among three networks predicting subject status, gender and age, which serve as assistant tasks. We tested the performance of the proposed architecture on the ADNI II dataset to classify Alzheimer's disease patients from normal controls. The classification accuracy, sensitivity and specificity reach 0.90, 0.92 and 0.89 on the ADNI II dataset.
Submitted 10 December, 2019;
originally announced January 2020.
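The sliding-window step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes Pearson correlation as the connectivity measure and uses toy signals; the function names `pearson` and `dynamic_fc`, the window length, and the stride are illustrative assumptions, not details taken from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def dynamic_fc(series, window, stride):
    """Return one region-by-region correlation matrix per sliding window.

    `series` is a list of per-region time series of equal length
    (hypothetical input layout chosen for this sketch).
    """
    n_time = len(series[0])
    matrices = []
    for start in range(0, n_time - window + 1, stride):
        chunk = [s[start:start + window] for s in series]
        matrices.append([[pearson(a, b) for b in chunk] for a in chunk])
    return matrices

# Toy signals for three "regions" (made-up numbers, for illustration only).
region0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
region1 = [2.0 * v for v in region0]          # perfectly correlated with region0
region2 = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
series = [region0, region1, region2]

matrices = dynamic_fc(series, window=4, stride=2)
print(len(matrices), "windows; first matrix row 0:", matrices[0][0])
```

Each matrix in the resulting sequence could then serve as one time step of a dynamic graph; how the paper builds its graphs from these matrices is not specified in the abstract.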
-
Transparency guided ensemble convolutional neural networks for stratification of pseudoprogression and true progression of glioblastoma multiform
Authors:
Xiaoming Liu,
Michael D. Chan,
Xiaobo Zhou,
Xiaohua Qian
Abstract:
Pseudoprogression (PsP) is an imitation of true tumor progression (TTP) in patients with glioblastoma multiform (GBM). Differentiating them is a challenging and time-consuming task for radiologists. Although deep neural networks can automatically diagnose PsP and TTP, their lack of interpretability is always the Achilles' heel. To overcome this shortcoming and win the trust of physicians, we propose a transparency guided ensemble convolutional neural network to automatically stratify PsP and TTP on magnetic resonance imaging (MRI). A total of 84 patients with GBM are enrolled in the study. First, three typical convolutional neural networks (CNNs) -- VGG, ResNet and DenseNet -- are trained to distinguish PsP and TTP on the dataset. Subsequently, we use the class-specific gradient information from convolutional layers to highlight the important regions in MRI. Radiological experts are then recruited to select the most lesion-relevant layer of each CNN. Finally, the selected layers are utilized to guide the construction of a multi-scale ensemble CNN. The classification accuracy of the presented network is 90.20%, and the specificity improves by more than 20%. The results demonstrate that network transparency and ensembling can enhance the reliability and accuracy of CNNs. The presented network is promising for the diagnosis of PsP and TTP.
Submitted 26 February, 2019;
originally announced February 2019.
-
The Function Transformation Omics - Funomics
Authors:
Yongshuai Jiang,
Jing Xu,
Simeng Hu,
Di Liu,
Linna Zhao,
Xu Zhou
Abstract:
There are no two identical leaves in the world, so how to find effective markers or features to distinguish them is an important issue. Function transformation, such as f(x,y) and f(x,y,z), can transform two, three, or multiple input/observation variables (in biology, these generally refer to the observed/measured values of biomarkers, biological characteristics, or other indicators) into a new output variable (new characteristics or indicators). This gives us a chance to re-cognize objective things or relationships beyond the original measurements. For example, Body Mass Index, which transforms weight and height into a new indicator BMI=x/y^2 (where x is weight and y is height), is commonly used to gauge obesity. Here, we propose a new system, Funomics (Function Transformation Omics), for understanding the world from a different perspective. The Funome can be understood as a set of math functions consisting of basic elementary functions (such as power functions and exponential functions) and basic mathematical operations (such as addition and subtraction). By scanning the whole Funome, researchers can identify some special functions (called handsome functions) which can generate novel important output variables (characteristics or indicators). We also start "the Funome project" to develop novel methods, function libraries and analysis software for Funome studies. The Funome project will accelerate the discovery of new useful indicators or characteristics, will improve the utilization efficiency of directly measured data, and will enhance our ability to understand the world. The analysis tools and data resources of the Funome project will gradually become available at http://www.funome.com.
Submitted 17 August, 2018;
originally announced August 2018.
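The BMI example in the abstract above is a two-variable function transformation, f(x, y) = x / y^2. The sketch below applies a small, hypothetical set of candidate functions to one pair of measurements; the dictionary `candidate_functions` and the helper `transform` are illustrative names, and the candidates other than x/y^2 are made up for the example rather than taken from the Funome project.

```python
# A tiny, hypothetical "function library": each entry maps two measured
# variables (x, y) to one derived indicator.
candidate_functions = {
    "x/y^2": lambda x, y: x / y ** 2,   # BMI when x = weight (kg), y = height (m)
    "x/y": lambda x, y: x / y,
    "x*y": lambda x, y: x * y,
}

def transform(x, y):
    """Apply every candidate function to one pair of measurements."""
    return {name: f(x, y) for name, f in candidate_functions.items()}

# Example measurements (assumed values): 70 kg and 1.75 m.
values = transform(70.0, 1.75)
print(round(values["x/y^2"], 2))  # BMI, approximately 22.86
```

Scanning a larger library would amount to computing such derived variables for every candidate and ranking them by how well they separate the groups of interest; the ranking criterion is not described in the abstract and is left out here.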
-
varbvs: Fast Variable Selection for Large-scale Regression
Authors:
Peter Carbonetto,
Xiang Zhou,
Matthew Stephens
Abstract:
We introduce varbvs, a suite of functions written in R and MATLAB for regression analysis of large-scale data sets using Bayesian variable selection methods. We have developed numerical optimization algorithms based on variational approximation methods that make it feasible to apply Bayesian variable selection to very large data sets. With a focus on examples from genome-wide association studies, we demonstrate that varbvs scales well to data sets with hundreds of thousands of variables and thousands of samples, and has features that facilitate rapid data analyses. Moreover, varbvs allows for extensive model customization, which can be used to incorporate external information into the analysis. We expect that the combination of an easy-to-use interface and robust, scalable algorithms for posterior computation will encourage broader use of Bayesian variable selection in areas of applied statistics and computational biology. The most recent R and MATLAB source code is available for download at Github (https://github.com/pcarbo/varbvs), and the R package can be installed from CRAN (https://cran.r-project.org/package=varbvs).
Submitted 19 September, 2017;
originally announced September 2017.
-
Reducing the uncertainty in the forest volume-to-biomass relationship built from limited field plots
Authors:
Caixia Liu,
Xiaolu Zhou,
Xiangdong Lei,
Huabing Huang,
Changhui Peng,
Xiaoyi Wang,
Jianfeng Sun,
Carl Zhou
Abstract:
The method of biomass estimation based on a volume-to-biomass relationship has been applied in estimating forest biomass conventionally through the mean volume (m3 ha-1). However, few studies have been reported concerning the verification of the volume-biomass equations regressed using field data. The possible bias may result from the volume measurements and extrapolations from sample plots to stands or a unit area. This paper addresses (i) how to verify the volume-biomass equations, and (ii) how to reduce the bias while building these equations. This paper presents an applicable method for verifying the field data using reasonable wood densities, restricting the error in field data processing based on limited field plots, and achieving a better understanding of the uncertainty in building those equations. The verified and improved volume-biomass equations are more reliable and will help to estimate forest carbon sequestration and carbon balance at any large scale.
Submitted 21 February, 2017;
originally announced February 2017.
-
Graphitic C3N4 Sensitized TiO2 Nanotube Layers: A Visible Light Activated Efficient Antimicrobial Platform
Authors:
Jingwen Xu,
Yan Li,
Xuemei Zhou,
Yuzhen Li,
Zhi-Da Gao,
Yan-Yan Song,
Patrik Schmuki
Abstract:
In this work, we introduce a facile procedure to graft a thin graphitic C3N4 (g-C3N4) layer on aligned TiO2 nanotube arrays (TiNT) by a one-step chemical vapor deposition (CVD) approach. This provides a platform to enhance the visible-light response of TiO2 nanotubes for antimicrobial applications. The formed g-C3N4/TiNT binary nanocomposite exhibits excellent bactericidal efficiency against E. coli as a visible-light-activated antibacterial coating.
Submitted 20 October, 2016;
originally announced November 2016.
-
Proofreading of DNA Polymerase: a new kinetic model with higher-order terminal effects
Authors:
Yong-Shun Song,
Yao-Gen Shu,
Xin Zhou,
Zhong-Can Ou-Yang,
Ming Li
Abstract:
The fidelity of DNA replication by DNA polymerase (DNAP) has long been an important issue in biology. While numerous experiments have revealed details of the molecular structure and working mechanism of DNAP, which consists of both a polymerase site and an exonuclease (proofreading) site, there have been quite few theoretical studies on the fidelity issue. The first model which explicitly considered both sites was proposed in the 1970s, and the basic idea was widely accepted by later models. However, none of these models systematically and rigorously investigated the dominant factor in DNAP fidelity, i.e., the higher-order terminal effects through which the polymerization pathway and the proofreading pathway coordinate to achieve high fidelity. In this paper, we propose a new and comprehensive kinetic model of DNAP based on some recent experimental observations, which includes previous models as special cases. We present a rigorous and unified treatment of the corresponding steady-state kinetic equations of any-order terminal effects, and derive analytical expressions for fidelity in terms of kinetic parameters under bio-relevant conditions. These expressions offer new insights into how the higher-order terminal effects contribute substantially to the fidelity in an order-by-order way, and also show that the polymerization-and-proofreading mechanism is dominated by only a few key parameters. We then apply these results to calculate the fidelity of some real DNAPs, which is in good agreement with previous intuitive estimates given by experimentalists.
Submitted 7 May, 2016; v1 submitted 8 March, 2016;
originally announced March 2016.
-
Bayesian Approximate Kernel Regression with Variable Selection
Authors:
Lorin Crawford,
Kris C. Wood,
Xiang Zhou,
Sayan Mukherjee
Abstract:
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant --- for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e. phenotypic prediction) and association mapping (i.e. inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings.
Submitted 9 June, 2017; v1 submitted 5 August, 2015;
originally announced August 2015.
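The abstract above relies on the fact that shift-invariant kernels such as the Gaussian kernel can be approximated via random Fourier bases. The following is a minimal sketch of the standard random Fourier feature construction (cosine features with Gaussian-distributed frequencies), not of the BAKR model itself; the function names, the bandwidth `sigma`, the number of features, and the seed are all assumptions chosen for illustration.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def make_random_fourier_map(dim, n_features, sigma=1.0, seed=0):
    """Build a feature map z such that z(x) . z(y) approximates k(x, y)."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, sigma^-2 I), the spectral density of
    # the Gaussian kernel; phases are uniform on [0, 2*pi].
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map

x = [0.2, -0.5, 1.0]
y = [0.0, 0.3, 0.7]
phi = make_random_fourier_map(dim=3, n_features=4000)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y)
print("exact:", exact, "approximate:", approx)
```

Because the approximate kernel is an ordinary inner product of finite feature vectors, a regression on these features is linear in the feature space, which is what makes a projection back onto the original variables conceivable. The specific projection used by BAKR is not given in the abstract and is not reproduced here.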
-
Relative Stability of Network States in Boolean Network Models of Gene Regulation in Development
Authors:
Joseph Xu Zhou,
Areejit Samal,
Aymeric Fouquier d'Hèrouël,
Nathan D. Price,
Sui Huang
Abstract:
Progress in cell type reprogramming has revived interest in Waddington's concept of the epigenetic landscape. Recently, researchers developed the quasi-potential theory to represent Waddington's landscape. The quasi-potential U(x), derived from interactions in the gene regulatory network (GRN) of a cell, quantifies the relative stability of network states, which determines the effort required for state transitions in a multi-stable dynamical system. However, quasi-potential landscapes, originally developed for continuous systems, are not suitable for discrete-valued networks, which are important tools to study complex systems. In this paper, we provide a framework to quantify the landscape for discrete Boolean networks (BNs). We apply our framework to study pancreas cell differentiation, where an ensemble of BN models is considered based on the structure of a minimal GRN for pancreas development. We impose biologically motivated structural constraints (corresponding to specific types of Boolean functions) and dynamical constraints (corresponding to stable attractor states) to limit the space of BN models for pancreas development. In addition, we enforce a novel functional constraint corresponding to the relative ordering of attractor states in BN models to restrict the space of BN models to the biologically relevant class. We find that BNs with canalyzing/sign-compatible Boolean functions best capture the dynamics of pancreas cell differentiation. This framework can also determine the genes' influence on cell state transitions, and thus can facilitate the rational design of cell reprogramming protocols.
Submitted 12 October, 2015; v1 submitted 23 July, 2014;
originally announced July 2014.
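To make the notion of attractor states in a Boolean network concrete, here is a minimal sketch on a toy two-gene mutual-inhibition network with synchronous updates. The network, the update rule, and the use of basin size as a crude stability proxy are all assumptions made for illustration; they are not the pancreas GRN or the relative-stability measure developed in the paper.

```python
from itertools import product

def step(state):
    """Synchronous update of a toy two-gene toggle: each gene represses the other."""
    a, b = state
    return (int(not b), int(not a))

def find_attractor(state):
    """Follow the trajectory from `state` until it revisits a state,
    then return the states on the resulting cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])

# Count how many initial states flow into each attractor (its basin size).
basins = {}
for initial in product((0, 1), repeat=2):
    attractor = find_attractor(initial)
    basins[attractor] = basins.get(attractor, 0) + 1

for attractor, size in basins.items():
    print(sorted(attractor), "basin size:", size)
```

In this toy network the two fixed points (0, 1) and (1, 0) each attract only themselves, while (0, 0) and (1, 1) alternate in a two-state cycle. Exhaustive enumeration like this only works for small networks, since the state space doubles with every added gene.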
-
Robustly detecting differential expression in RNA sequencing data using observation weights
Authors:
Xiaobei Zhou,
Helen Lindsay,
Mark D. Robinson
Abstract:
A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g., batch effects). Often, these methods include some sort of (sharing of information) across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflects properties of real datasets (e.g., dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: http://imlspenticton.uzh.ch/robinson_lab/edgeR_robust/
Submitted 14 March, 2014; v1 submitted 11 December, 2013;
originally announced December 2013.
-
Manipulate the coiling and uncoiling movements of Lepidoptera proboscis by its conformation optimizing
Authors:
Xiaohua Zhou,
Shengli Zhang
Abstract:
Many kinds of adult Lepidoptera insects possess a long proboscis which is used to suck liquids and can coil and uncoil. Although experiments have revealed qualitatively that the coiling movement is governed by the hydraulic mechanism and the uncoiling movement is due to the musculature and the elasticity, a quantitative investigation is needed to reveal how insects achieve these behaviors accurately. Here a quasi-one-dimensional (Q1D) curvature elastica model is proposed to reveal the mechanism of these behaviors. We find that the functions of the internal stipes muscle and the basal galeal muscle, which are located at the bottom of the proboscis, are to adjust the initial states in the coiling and uncoiling processes, respectively. The function of the internal galeal muscle, which runs along the proboscis, is to adjust the line tension. The knee bend shape is due to the local maximal spontaneous curvature and is an advantage for nectar-feeding butterflies. When there is no knee bend, the proboscis of a fruit-piercing butterfly can easily achieve the piercing movement, which is induced by the increase of internal hydraulic pressure. All of the results are in good agreement with experimental observations. Our study provides a revelatory method to investigate the mechanical behaviors of other 1D biological structures, such as the proboscis of the marine snail and the elephant. Our method and results are also significant for designing bionic devices.
Submitted 6 November, 2013;
originally announced November 2013.
-
Cellular network entropy as the energy potential in Waddington's differentiation landscape
Authors:
Christopher R. S. Banerji,
Diego Miranda-Saavedra,
Simone Severini,
Martin Widschwendter,
Tariq Enver,
Joseph X. Zhou,
Andrew E. Teschendorff
Abstract:
Differentiation is a key cellular process in normal tissue development that is significantly altered in cancer. Although molecular signatures characterising pluripotency and multipotency exist, there is, as yet, no single quantitative mark of a cellular sample's position in the global differentiation hierarchy. Here we adopt a systems view and consider the sample's network entropy, a measure of signaling pathway promiscuity, computable from a sample's genome-wide expression profile. We demonstrate that network entropy provides a quantitative, in-silico, readout of the average undifferentiated state of the profiled cells, recapitulating the known hierarchy of pluripotent, multipotent and differentiated cell types. Network entropy further exhibits dynamic changes in time course differentiation data, and in line with a sample's differentiation stage. In disease, network entropy predicts a higher level of cellular plasticity in cancer stem cell populations compared to ordinary cancer cells. Importantly, network entropy also allows identification of key differentiation pathways. Our results are consistent with the view that pluripotency is a statistical property defined at the cellular population level, correlating with intra-sample heterogeneity, and driven by the degree of signaling promiscuity in cells. In summary, network entropy provides a quantitative measure of a cell's undifferentiated state, defining its elevation in Waddington's landscape.
Submitted 26 October, 2013;
originally announced October 2013.
-
SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads
Authors:
Yinlong Xie,
Gengxiong Wu,
Jingbo Tang,
Ruibang Luo,
Jordan Patterson,
Shanlin Liu,
Weihua Huang,
Guangzhu He,
Shengchang Gu,
Shengkang Li,
Xin Zhou,
Tak-Wah Lam,
Yingrui Li,
Xun Xu,
Gane Ka-Shu Wong,
Jun Wang
Abstract:
Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences for a large number of genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the typically short reads (e.g. 2 x 90 bp paired ends) of this technology, de novo assembly to recover complete or full-length transcript sequences remains an algorithmic challenge. Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on transcriptome datasets from rice and mouse. Using the known transcripts from these well-annotated genomes (sequenced a decade ago) as our benchmark, we assessed how SOAPdenovo-Trans and two other popular software packages handle the practical issues of alternative splicing and variable expression levels. Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution. Availability and Implementation: Source code and user manual are at http://sourceforge.net/projects/soapdenovotrans/ Contact: [email protected] or [email protected]
Submitted 9 August, 2013; v1 submitted 29 May, 2013;
originally announced May 2013.
-
Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies
Authors:
Xiang Zhou,
Matthew Stephens
Abstract:
Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, fitting mvLMMs is computationally non-trivial, and no existing method is computationally practical for performing the likelihood ratio test (LRT) for mvLMMs in GWAS settings with moderate sample size n. The existing software MTMM performs an approximate LRT for two phenotypes, and as we find, its p values can substantially understate the significance of associations. Here, we present novel computationally-efficient algorithms for fitting mvLMMs, and computing the LRT in GWAS settings. After a single initial eigen-decomposition (with complexity O(n^3)) the algorithms i) reduce computational complexity (per iteration of the optimizer) from cubic to linear in n; and ii) in GWAS analyses, reduce per-marker complexity from cubic to quadratic in n. These innovations make it practical to compute the LRT for mvLMMs in GWASs for tens of thousands of samples and a moderate number of phenotypes (~2-10). With simulations, we show that the LRT provides correct control for type I error. With both simulations and real data we find that the LRT is more powerful than the approximate LRT from MTMM, and illustrate the benefits of analyzing more than two phenotypes. The method is implemented in the GEMMA software package, freely available at http://stephenslab.uchicago.edu/software.html
Submitted 11 September, 2013; v1 submitted 19 May, 2013;
originally announced May 2013.
-
Polygenic Modeling with Bayesian Sparse Linear Mixed Models
Authors:
Xiang Zhou,
Peter Carbonetto,
Matthew Stephens
Abstract:
Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given data set one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters, and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html
Submitted 14 November, 2012; v1 submitted 6 September, 2012;
originally announced September 2012.