gghic: A Versatile R Package for Exploring and Visualizing 3D Genome Organization

Authors: Minghao Jiang, Duohui Jing, Jason W. H. Wong

Abstract: Motivation: The three-dimensional (3D) organization of the genome plays a critical role in regulating gene expression and maintaining cellular homeostasis. Disruptions in this spatial organization can result in abnormal chromatin interactions, contributing to the development of various diseases including cancer. Advances in chromosome conformation capture technologies, such as Hi-C, have enabled researchers to study genome architecture at high resolution. However, the efficient visualization and interpretation of these complex datasets remain a major challenge, particularly when integrating genomic annotations and inter-chromosomal interactions. Results: We present gghic, an R package that extends the ggplot2 framework to enable intuitive and customizable visualization of genomic interaction data. gghic introduces novel layers for generating triangular heatmaps of chromatin interactions and annotating them with features such as chromatin loops, topologically associated domains (TADs), gene/transcript models, and data tracks (e.g., ChIP-seq signals). The package supports data from multiple chromosomes, facilitating the exploration of inter-chromosomal interactions. Built to integrate seamlessly with the R/Bioconductor ecosystem, gghic is compatible with widely used genomic data formats, including HiCExperiment and GInteractions objects. We demonstrate the utility of gghic by replicating a published figure showing a translocation event in T-cell acute lymphoblastic leukemia (T-ALL), highlighting its ability to integrate genomic annotations and generate publication-quality figures. Availability and implementation: The R package can be accessed at https://github.com/jasonwong-lab/gghic and is distributed under the GNU General Public License version 3.0.

Submitted 3 December, 2024; originally announced December 2024.

arXiv:2411.10548 [pdf, ps, other]

BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery

Authors: Peter St. John, Dejun Lin, Polina Binder, Malcolm Greaves, Vega Shah, John St. John, Adrian Lange, Patrick Hsu, Rajesh Illango, Arvind Ramanathan, Anima Anandkumar, David H Brookes, Akosua Busia, Abhishaike Mahajan, Stephen Malina, Neha Prasad, Sam Sinai, Lindsay Edwards, Thomas Gaudelet, Cristian Regep, Martin Steinegger, Burkhard Rost, Alexander Brace, Kyle Hippe, Luca Naef, et al. (63 additional authors not shown)

Abstract: Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphics processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.

Submitted 15 November, 2024; originally announced November 2024.

arXiv:2407.09089 [pdf]

Lomics: Generation of Pathways and Gene Sets using Large Language Models for Transcriptomic Analysis

Authors: Chun-Ka Wong, Ali Choo, Eugene C. C. Cheng, Wing-Chun San, Kelvin Chak-Kong Cheng, Yee-Man Lau, Minqing Lin, Fei Li, Wei-Hao Liang, Song-Yan Liao, Kwong-Man Ng, Ivan Fan-Ngai Hung, Hung-Fat Tse, Jason Wing-Hon Wong

Abstract: Interrogation of biological pathways is an integral part of omics data analysis. Large language models (LLMs) enable the generation of custom pathways and gene sets tailored to specific scientific questions. These targeted sets are significantly smaller than traditional pathway enrichment analysis libraries, reducing multiple hypothesis testing and potentially enhancing statistical power. Lomics (Large Language Models for Omics Studies) v1.0 is a Python-based bioinformatics toolkit that streamlines the generation of pathways and gene sets for transcriptomic analysis. It operates in three steps: 1) deriving relevant pathways based on the researcher's scientific question, 2) generating valid gene sets for each pathway, and 3) outputting the results as .GMX files. Lomics also provides explanations for pathway selections. Consistency and accuracy are ensured through iterative processes, JSON format validation, and HUGO Gene Nomenclature Committee (HGNC) gene symbol verification. Lomics serves as a foundation for integrating LLMs into omics research, potentially improving the specificity and efficiency of pathway analysis.

Submitted 12 July, 2024; originally announced July 2024.
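
Illustrative sketch (not part of the listing, and not Lomics code): the abstract above names .GMX files as the output of step 3. The snippet below shows one way such a file could be assembled, assuming the column-oriented GMX layout used by GSEA (one gene set per column, set names in row 1, descriptions in row 2, gene symbols below). The function name `to_gmx`, the `"na"` placeholder description, and the example gene sets are hypothetical.

```python
from itertools import zip_longest


def to_gmx(gene_sets, descriptions=None):
    """Serialise {set_name: [gene, ...]} as tab-delimited GMX text.

    Assumes the GSEA-style layout: one column per gene set, with the
    set name in row 1, a description in row 2, and genes in later rows.
    """
    names = list(gene_sets)
    descriptions = descriptions or {}
    rows = [
        names,
        [descriptions.get(name, "na") for name in names],  # placeholder text
    ]
    # Shorter gene sets are padded with empty cells so every row
    # has one cell per column.
    rows.extend(zip_longest(*(gene_sets[name] for name in names), fillvalue=""))
    return "\n".join("\t".join(row) for row in rows) + "\n"


if __name__ == "__main__":
    # Hypothetical gene sets, for illustration only.
    example = {"PATHWAY_A": ["TP53", "MYC"], "PATHWAY_B": ["EGFR"]}
    print(to_gmx(example), end="")
```

Keeping the serialiser separate from the pathway and gene-set generation steps means the same output code works no matter how the gene sets were produced.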

arXiv:2402.13297 [pdf, other]

Integrating Deep Learning and Synthetic Biology: A Co-Design Approach for Enhancing Gene Expression via N-terminal Coding Sequences

Authors: Zhanglu Yan, Weiran Chu, Yuhua Sheng, Kaiwen Tang, Shida Wang, Yanfeng Liu, Weng-Fai Wong

Abstract: N-terminal coding sequence (NCS) influences gene expression by impacting the translation initiation rate. The NCS optimization problem is to find an NCS that maximizes gene expression. The problem is important in genetic engineering. However, current methods for NCS optimization such as rational design and statistics-guided approaches are labor-intensive and yield only relatively small improvements. This paper introduces a deep learning/synthetic biology co-designed few-shot training workflow for NCS optimization. Our method utilizes k-nearest encoding followed by word2vec to encode the NCS, then performs feature extraction using attention mechanisms, before constructing a time-series network for predicting gene expression intensity, and finally a direct search algorithm identifies the optimal NCS with limited training data. We took green fluorescent protein (GFP) expressed by Bacillus subtilis as a reporting protein of NCSs, and employed the fluorescence enhancement factor as the metric of NCS optimization. Within just six iterative experiments, our model generated an NCS (MLD62) that increased average GFP expression by 5.41-fold, outperforming the state-of-the-art NCS designs. Extending our findings beyond GFP, we showed that our engineered NCS (MLD62) can effectively boost the production of N-acetylneuraminic acid by enhancing the expression of the crucial rate-limiting GNA1 gene, demonstrating its practical utility. We have open-sourced our NCS expression database and experimental procedures for public use.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2302.11669 [pdf, other]

RNA secondary structures: from ab initio prediction to better compression, and back

Authors: Evarista Onokpasa, Sebastian Wild, Prudence W. H. Wong

Abstract: In this paper, we use the biological domain knowledge incorporated into stochastic models for ab initio RNA secondary-structure prediction to improve the state of the art in joint compression of RNA sequence and structure data (Liu et al., BMC Bioinformatics, 2008). Moreover, we show that, conversely, compression ratio can serve as a cheap and robust proxy for comparing the prediction quality of different stochastic models, which may help guide the search for better RNA structure prediction models. Our results build on expert stochastic context-free grammar models of RNA secondary structures (Dowell & Eddy, BMC Bioinformatics, 2004; Nebel & Scheid, Theory in Biosciences, 2011) combined with different (static and adaptive) models for rule probabilities and arithmetic coding. We provide a prototype implementation and an extensive empirical evaluation, where we illustrate how grammar features and probability models affect compression ratios.

Submitted 22 February, 2023; originally announced February 2023.

Comments: paper at Data Compression Conference 2023

arXiv:2212.10653 [pdf, other]

Estimating and Assessing Differential Equation Models with Time-Course Data

Authors: Samuel W. K. Wong, Shihao Yang, S. C. Kou

Abstract: Ordinary differential equation (ODE) models are widely used to describe chemical or biological processes. This article considers the estimation and assessment of such models on the basis of time-course data. Due to experimental limitations, time-course data are often noisy and some components of the system may not be observed. Furthermore, the computational demands of numerical integration have hindered the widespread adoption of time-course analysis using ODEs. To address these challenges, we explore the efficacy of the recently developed MAGI (MAnifold-constrained Gaussian process Inference) method for ODE inference. First, via a range of examples we show that MAGI is capable of inferring the parameters and system trajectories, including unobserved components, with appropriate uncertainty quantification. Second, we illustrate how MAGI can be used to assess and select different ODE models with time-course data based on MAGI's efficient computation of model predictions. Overall, we believe MAGI is a useful method for the analysis of time-course data in the context of ODE models, which bypasses the need for any numerical integration.

Submitted 13 February, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: 26 pages, 8 figures, with code supplement

arXiv:2210.13323 [pdf, other]

A Comparative Study of Compartmental Models for COVID-19 Transmission in Ontario, Canada

Authors: Yuxuan Zhao, Samuel W. K. Wong

Abstract: The number of confirmed COVID-19 cases reached over 1.3 million in Ontario, Canada by June 4, 2022. The continued spread of the virus underlying COVID-19 has been spurred by the emergence of variants since the initial outbreak in December, 2019. Much attention has thus been devoted to tracking and modelling the transmission of COVID-19. Compartmental models are commonly used to mimic epidemic transmission mechanisms and are easy to understand. Their performance in real-world settings, however, needs to be more thoroughly assessed. In this comparative study, we examine five compartmental models -- four existing ones and an extended model that we propose -- and analyze their ability to describe COVID-19 transmission in Ontario from January 2022 to June 2022.

Submitted 24 October, 2022; originally announced October 2022.

Comments: 26 pages, 8 figures

arXiv:2207.03105 [pdf]

doi 10.1088/1361-6560/ac9e3e

Uncertainty-Aware Self-supervised Neural Network for Liver $T_{1ρ}$ Mapping with Relaxation Constraint

Authors: Chaoxing Huang, Yurui Qian, Simon Chun Ho Yu, Jian Hou, Baiyan Jiang, Queenie Chan, Vincent Wai-Sun Wong, Winnie Chiu-Wing Chu, Weitian Chen

Abstract: $T_{1ρ}$ mapping is a promising quantitative MRI technique for the non-invasive assessment of tissue properties. Learning-based approaches can map $T_{1ρ}$ from a reduced number of $T_{1ρ}$-weighted images, but require significant amounts of high-quality training data. Moreover, existing methods do not provide the confidence level of the $T_{1ρ}$ estimation. To address these problems, we proposed a self-supervised learning neural network that learns a $T_{1ρ}$ mapping using the relaxation constraint in the learning process. Epistemic uncertainty and aleatoric uncertainty are modelled for the $T_{1ρ}$ quantification network to provide a Bayesian confidence estimation of the $T_{1ρ}$ mapping. The uncertainty estimation can also regularize the model to prevent it from learning imperfect data. We conducted experiments on $T_{1ρ}$ data collected from 52 patients with non-alcoholic fatty liver disease. The results showed that our method outperformed the existing methods for $T_{1ρ}$ quantification of the liver using as few as two $T_{1ρ}$-weighted images. Our uncertainty estimation provided a feasible way of modelling the confidence of the self-supervised learning based $T_{1ρ}$ estimation, which is consistent with the reality in liver $T_{1ρ}$ imaging.

Submitted 25 October, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

Comments: Provisionally accepted by Physics in Medicine and Biology

arXiv:2206.06159 [pdf]

Moving towards FAIR practices in epidemiological research

Authors: Montserrat Garcia-Closas, Thomas U. Ahearn, Mia M. Gaudet, Amber N. Hurson, Jeya Balaji Balasubramanian, Parichoy Pal Choudhury, Nicole M. Gerlanc, Bhaumik Patel, Daniel Russ, Mustapha Abubakar, Neal D. Freedman, Wendy S. W. Wong, Stephen J. Chanock, Amy Berrington de Gonzalez, Jonas S Almeida

Abstract: Reproducibility and replicability of research findings are central to the scientific integrity of epidemiology. In addition, many research questions require combining data from multiple sources to achieve adequate statistical power. However, barriers related to confidentiality, costs, and incentives often limit the extent and speed of sharing resources, both data and code. Epidemiological practices that follow FAIR principles can address these barriers by making resources (F)indable with the necessary metadata, (A)ccessible to authorized users and (I)nteroperable with other data, to optimize the (R)e-use of resources with appropriate credit to their creators. We provide an overview of these principles and describe approaches for implementation in epidemiology. Increasing degrees of FAIRness can be achieved by moving data and code from on-site locations to the Cloud, using machine-readable and non-proprietary files, and developing open-source code. Adoption of these practices will improve daily work and collaborative analyses, and facilitate compliance with data sharing policies from funders and scientific journals. Achieving a high degree of FAIRness will require funding, training, organizational support, recognition, and incentives for sharing resources. But these costs are amply outweighed by the benefits of making research more reproducible, impactful, and equitable by facilitating the re-use of precious research resources by the scientific community.

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2203.11299 [pdf, other]

A Fundamental Inequality Governing the Rate Coding Response of Sensory Neurons

Authors: Willy Wong

Abstract: A fundamental inequality governing the spike activity of peripheral neurons is derived and tested against auditory data. This inequality states that the steady-state firing rate must lie between the arithmetic and geometric means of the spontaneous and peak activities during adaptation. Implications towards the development of auditory mechanistic models are explored.

Submitted 1 August, 2023; v1 submitted 21 March, 2022; originally announced March 2022.
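
Illustrative note (not part of the listing): the inequality stated in the abstract above can be written out as follows. The symbols $r_{\mathrm{sp}}$ (spontaneous rate), $r_{\mathrm{pk}}$ (peak rate) and $r_{\mathrm{ss}}$ (steady-state rate) are notation assumed here, not taken from the paper, and the abstract does not say whether the bounds are strict. The numerical values are hypothetical.

```latex
% Steady-state rate bounded by the geometric and arithmetic means
% of the spontaneous and peak rates (notation assumed).
\[
  \sqrt{r_{\mathrm{sp}}\, r_{\mathrm{pk}}}
  \;\le\; r_{\mathrm{ss}} \;\le\;
  \frac{r_{\mathrm{sp}} + r_{\mathrm{pk}}}{2}
\]
% Hypothetical example: r_sp = 10 spikes/s, r_pk = 90 spikes/s.
\[
  \sqrt{10 \times 90} = 30
  \;\le\; r_{\mathrm{ss}} \;\le\;
  \frac{10 + 90}{2} = 50
\]
```

By the AM-GM inequality the geometric mean never exceeds the arithmetic mean, so the interval is always non-empty.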

arXiv:2201.07775 [pdf, other]

Monte Carlo sampling of flexible protein structures: an application to the SARS-CoV-2 omicron variant

Authors: Samuel W. K. Wong

Abstract: Proteins can exhibit dynamic structural flexibility as they carry out their functions, especially in binding regions that interact with other molecules. For the key SARS-CoV-2 spike protein that facilitates COVID-19 infection, studies have previously identified several such highly flexible regions with therapeutic importance. However, protein structures available from the Protein Data Bank are presented as static snapshots that may not adequately depict this flexibility, and furthermore these cannot keep pace with new mutations and variants. In this paper we present a sequential Monte Carlo method for broadly sampling the 3-D conformational space of protein structure, according to the Boltzmann distribution of a given energy function. Our approach is distinct from previous sampling methods that focus on finding the lowest-energy conformation for predicting a single stable structure. We exemplify our method on the SARS-CoV-2 omicron variant as an application of timely interest. Our results identify sequence positions 495-508 as a key region where omicron mutations have the most impact on the space of possible conformations, which coincides with the findings of other preliminary studies on the binding properties of the omicron variant.

Submitted 4 February, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: 20 pages, 4 figures

arXiv:2105.08835 [pdf, ps, other]

Conformational variability of loops in the SARS-CoV-2 spike protein

Authors: Samuel W. K. Wong, Zongjun Liu

Abstract: The SARS-CoV-2 spike (S) protein facilitates viral infection, and has been the focus of many structure determination efforts. Its flexible loop regions are known to be involved in protein binding and may adopt multiple conformations. This paper identifies the S protein loops and studies their conformational variability based on the available Protein Data Bank (PDB) structures. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable with multiple substantively distinct conformations based on a cluster analysis. Loop modeling methods were then applied to the S protein loop targets, and the prediction accuracies discussed in relation to the characteristics of the conformational clusters identified. Loops with multiple conformations were found to be challenging to model based on a single structural template.

Submitted 13 October, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 24 pages

arXiv:2104.10878 [pdf, other]

doi 10.3934/math.2022376

Comparing regional and provincial-wide COVID-19 models with physical distancing in British Columbia

Authors: Geoffrey McGregor, Jennifer Tippett, Andy T. S. Wan, Mengxiao Wang, Samuel W. K. Wong

Abstract: We study the effects of physical distancing measures for the spread of COVID-19 in regional areas within British Columbia, using the reported cases of the five provincial Health Authorities. Building on the Bayesian epidemiological model of Anderson et al. (2020), we propose a hierarchical regional Bayesian model with time-varying regional parameters from March to December 2020. In the absence of COVID-19 variants and vaccinations during this period, we examine the regionalized basic reproduction number, modelled prevalence, relative reduction in contact due to physical distancing, and proportion of anticipated cases that have been tested and reported. We observe significant differences between the regional and provincial-wide models and demonstrate the hierarchical regional model can better estimate regional prevalence, especially in rural regions. These results indicate that it can be useful to apply similar regional models to other parts of Canada or other countries.

Submitted 13 November, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: 35 pages, 16 figures

Journal ref: AIMS Mathematics, 2022, 7(4): 6743-6778

arXiv:2101.07494 [pdf, other]

SIR Simulation of COVID-19 Pandemic in Malaysia: Will the Vaccination Program be Effective?

Authors: W. K. Wong, Filbert H. Juwono, Tock H. Chua

Abstract: Since the end of 2019, COVID-19 has significantly affected the lives of people around the world. Towards the end of 2020, several COVID-19 vaccine candidates with relatively high efficacy were reported in the final phase of clinical trials. Vaccines have been considered critical tools for opening up social and economic activities, thereby lessening the impact of this disease on society. This paper presents a simulation of COVID-19 spread using a modified Susceptible-Infected-Removed (SIR) model under vaccine intervention in several localities of Malaysia, i.e. those cities or states with relatively high COVID-19 case counts such as Kuala Lumpur, Penang, Sabah, and Sarawak. The results show that at different vaccine efficacy levels (0.75, 0.85, and 0.95), the curves of active infection vary slightly, indicating that vaccines with efficacy above 0.75 would produce the herd immunity required to level the curves. In addition, the disparity between implementing and not implementing a vaccination program is significant. Simulation results also show that lowering the reproduction number, R0, is necessary to keep the infection curve flat despite vaccination. This is due to the assumption that vaccination is mostly carried out gradually at the assumed fixed rate. The statement is based on our simulation results with two values of R0: 1.1 and 1.2, indicative of reduction of R0 by social distancing. The lower R0 yields a peak amplitude about half the value simulated with R0=1.2. In conclusion, the simulation model suggests a two-pronged strategy to combat the COVID-19 pandemic in Malaysia: vaccination and compliance with the standard operating procedures issued by the World Health Organization (e.g. social distancing).

Submitted 19 January, 2021; originally announced January 2021.
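
Illustrative sketch (not part of the listing): the abstract above describes an SIR simulation with vaccination at a fixed rate and compares R0 values of 1.1 and 1.2. The paper's actual modified model is not given here, so the code below is only a textbook SIR model with a constant-rate vaccination term, integrated by forward Euler. Every parameter value (`gamma`, `vacc_rate`, `i0`, `days`, `dt`) is an assumption chosen for illustration, and the output is not expected to reproduce the paper's figures.

```python
def simulate_sir(r0, gamma=0.1, vacc_rate=0.0, efficacy=0.0,
                 i0=1e-4, days=1000, dt=0.1):
    """Forward-Euler integration of a textbook SIR model (population fractions).

    Susceptibles are moved to the removed class at rate
    vacc_rate * efficacy, a simple stand-in for gradual vaccination.
    Returns the list of (s, i, r) states, one per time step.
    """
    beta = r0 * gamma  # transmission rate implied by R0 = beta / gamma
    s, i, r = 1.0 - i0, i0, 0.0
    trajectory = [(s, i, r)]
    for _ in range(int(days / dt)):
        new_infections = beta * s * i
        recoveries = gamma * i
        immunised = vacc_rate * efficacy * s
        s += dt * (-new_infections - immunised)
        i += dt * (new_infections - recoveries)
        r += dt * (recoveries + immunised)
        trajectory.append((s, i, r))
    return trajectory


def peak_infected(trajectory):
    """Largest infected fraction reached over the simulated period."""
    return max(i for _, i, _ in trajectory)


if __name__ == "__main__":
    for r0 in (1.1, 1.2):
        no_vacc = peak_infected(simulate_sir(r0))
        with_vacc = peak_infected(
            simulate_sir(r0, vacc_rate=0.005, efficacy=0.95))
        print(f"R0={r0}: peak without vaccination {no_vacc:.4f}, "
              f"with vaccination {with_vacc:.4f}")
```

In this toy model a lower R0 and a non-zero vaccination rate each reduce the peak infected fraction, which matches the qualitative direction reported in the abstract. The magnitudes depend entirely on the assumed parameters.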

arXiv:2101.02304 [pdf, other]

Statistical challenges in the analysis of sequence and structure data for the COVID-19 spike protein

Authors: Shiyu He, Samuel W. K. Wong

Abstract: As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein's 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.

Submitted 30 January, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

Comments: 21 pages, 5 figures

arXiv:2005.07550 [pdf, other]

doi 10.6339/JDS.202007_18(3).0017

Assessing the impacts of mutations to the structure of COVID-19 spike protein via sequential Monte Carlo

Authors: Samuel W. K. Wong

Abstract: Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A specific spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However, its structure may continue to evolve over time as a result of mutations. In this paper, we use a data science perspective to study the potential structural impacts due to ongoing mutations in its amino acid sequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sampling method to detect possible changes to the space of low-energy conformations for different amino acid sequences. Such computational approaches can further our understanding of this protein structure and complement laboratory efforts.

Submitted 11 June, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

Comments: 15 pages, 4 figures

Journal ref: Journal of Data Science, 2020, 18(3): 511-525

arXiv:1610.07213 [pdf, other]

Stochastic Modeling and Statistical Inference of Intrinsic Noise in Gene Regulation System via Chemical Master Equation

Authors: Chao Du, Wing Hong Wong

Abstract: Intrinsic noise, the stochastic cell-to-cell fluctuations in mRNAs and proteins, has been observed and shown to play important roles in cellular systems. Owing to recent developments in single-cell-level measurement technology, studies of intrinsic noise are becoming increasingly popular. The chemical master equation (CME) has been used to model the evolution of complex chemical and biological systems since 1940, and often serves as the standard tool for modeling intrinsic noise in gene regulation systems. A CME-based model can capture the discrete, stochastic, and dynamical nature of a gene regulation system, and may offer causal and physical explanations of the observed data at the single-cell level. Nonetheless, the complexity of the CME also poses serious challenges for researchers in proposing practical modeling and inference frameworks. In this article, we review existing work on the modeling and inference of intrinsic noise in gene regulation systems within the framework of the CME. We explore the principles of constructing a CME model for studying a gene regulation system and discuss the popular approximations of the CME. We then study the simulation methods as well as the analytical and numerical approaches that can be used to obtain solutions to a CME model. Finally, we summarize the existing statistical methods that can be used to infer the unknown parameters or structures in a CME model using single-cell-level gene expression data.

Submitted 11 November, 2017; v1 submitted 23 October, 2016; originally announced October 2016.

Comments: 64 pages, 5 figures

arXiv:1307.6445 [pdf, other]

doi 10.1007/s00422-020-00848-4

On the Rate Coding Response of Peripheral Sensory Neurons

Authors: Willy Wong

Abstract: The rate coding response of a single peripheral sensory neuron in the asymptotic, near-equilibrium limit can be derived using information theory, asymptotic Bayesian statistics and a theory of complex systems. Almost no biological knowledge is required. The theoretical expression shows good agreement with spike-frequency adaptation data across different sensory modalities and animal species. The approach permits the discovery of a new neurophysiological equation and shares similarities with statistical physics.

Submitted 10 December, 2020; v1 submitted 24 July, 2013; originally announced July 2013.

arXiv:1207.3137 [pdf, ps, other]

doi 10.1214/13-AOAS645

Learning a nonlinear dynamical system model of gene regulation: A perturbed steady-state approach

Authors: Arwen Vanice Bradley, Ye Henry Li, Bokyung Choi, Wing Hung Wong

Abstract: Biological structure and function depend on complex regulatory interactions between many genes. A wealth of gene expression data is available from high-throughput genome-wide measurement technologies, but effective gene regulatory network inference methods are still needed. Model-based methods founded on quantitative descriptions of gene regulation are among the most promising, but many such methods rely on simple, local models or on ad hoc inference approaches lacking experimental interpretability. We propose an experimental design and develop an associated statistical method for inferring a gene network by learning a standard quantitative, interpretable, predictive, biophysics-based ordinary differential equation model of gene regulation. We fit the model parameters using gene expression measurements from perturbed steady-states of the system, like those following overexpression or knockdown experiments. Although the original model is nonlinear, our design allows us to transform it into a convex optimization problem by restricting attention to steady-states and using the lasso for parameter selection. Here, we describe the model and inference algorithm and apply them to a synthetic six-gene system, demonstrating that the model is detailed and flexible enough to account for activation and repression as well as synergistic and self-regulation, and the algorithm can efficiently and accurately recover the parameters used to generate the data.

Submitted 25 March, 2016; v1 submitted 12 July, 2012; originally announced July 2012.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/13-AOAS645 by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS645

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 3, 1311-1333

arXiv:1106.3211 [pdf, ps, other]

doi 10.1214/10-STS343

Statistical Modeling of RNA-Seq Data

Authors: Julia Salzman, Hui Jiang, Wing Hung Wong

Abstract: Recently, ultra high-throughput sequencing of RNA (RNA-Seq) has been developed as an approach for analysis of gene expression. By obtaining tens or even hundreds of millions of reads of transcribed sequences, an RNA-Seq experiment can offer a comprehensive survey of the population of genes (transcripts) in any sample of interest. This paper introduces a statistical model for estimating isoform abundance from RNA-Seq data and is flexible enough to accommodate both single end and paired end RNA-Seq data and sampling bias along the length of the transcript. Based on the derivation of minimal sufficient statistics for the model, a computationally feasible implementation of the maximum likelihood estimator of the model is provided. Further, it is shown that using paired end RNA-Seq provides more accurate isoform abundance estimates than single end sequencing at fixed sequencing depth. Simulation studies are also given.

Submitted 16 June, 2011; originally announced June 2011.

Comments: Published in Statistical Science (http://www.imstat.org/sts/) at http://dx.doi.org/10.1214/10-STS343 by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS343

Journal ref: Statistical Science 2011, Vol. 26, No. 1, 62-83

arXiv:1105.0126 [pdf]

The surface accessibility of α-bungarotoxin monitored by a novel paramagnetic probe

Authors: Andrea Bernini, Vincenzo Venditti, Ottavia Spiga, Filippo Prischi, Mauro Botta, Angela Pui-Ling Tong, Wing-Tak Wong, Neri Niccolai

Abstract: The surface accessibility of α-bungarotoxin has been investigated by using Gd2L7, a newly designed paramagnetic NMR probe. Signal attenuations induced by Gd2L7 on α-bungarotoxin CαH peaks of 1H-13C HSQC spectra have been analyzed and compared with the ones previously obtained in the presence of GdDTPA-BMA. In spite of the different molecular size and shape, for the two probes a common pathway of approach to the α-bungarotoxin surface can be observed with an equally enhanced access of both GdDTPA-BMA and Gd2L7 towards the protein surface side where the binding site is located. Molecular dynamics simulations suggest that protein backbone flexibility and surface hydration contribute to the observed preferential approach of both gadolinium complexes specifically to the part of the α-bungarotoxin surface which is involved in the interaction with its physiological target, the nicotinic acetylcholine receptor.

Submitted 30 April, 2011; originally announced May 2011.

Comments: 13 pages, 4 figures, preliminary report

arXiv:q-bio/0502033 [pdf, ps, other]

doi 10.1073/pnas.0403790102

The use of oscillatory signals in the study of genetic networks

Authors: Ovidiu Lipan, Wing H. Wong

Abstract: The structure of a genetic network is uncovered by studying its response to external stimuli (input signals). We present a theory of propagation of an input signal through a linear stochastic genetic network. It is found that there are important advantages in using oscillatory signals over step or impulse signals, and that the system may enter into a pure fluctuation resonance for a specific input frequency.

Submitted 23 February, 2005; originally announced February 2005.

Comments: 46 pages, 5 figures. Submitted to PNAS on May 27th, 2004. The paper is under consideration.

Showing 1–22 of 22 results for author: Wong, W