-
The Autoregressive Structural Model for analyzing longitudinal health data of an aging population in China
Authors:
Yazhuo Deng,
David R. Paul,
Audrey Q. Fu
Abstract:
We seek to elucidate the impact of social activity, physical activity and functional health status (factors) on depressive symptoms (outcome) in the China Health and Retirement Longitudinal Study (CHARLS), a multi-year study of aging involving 20,000 participants 45 years of age and older. Although a variety of statistical methods are available for analyzing longitudinal data, modeling the dynamics within a complex system remains a difficult methodological challenge. We develop an Autoregressive Structural Model (ASM) to examine these factors on depressive symptoms while accounting for temporal dependence. The ASM builds on the structural equation model and also consists of two components: a measurement model that connects observations to latent factors, and a structural model that delineates the mechanism among latent factors. Our ASM further incorporates autoregressive dependence into both components for repeated measurements. The results from applying the ASM to the CHARLS data indicate that social and physical activity independently and consistently mitigated depressive symptoms over the course of five years, by mediating through functional health status.
Submitted 4 December, 2019;
originally announced December 2019.
-
Approximate Bayesian inference of directed acyclic graphs in biology with flexible priors on edge states
Authors:
Evan A Martin,
Audrey Qiuyan Fu
Abstract:
Graphical models or networks describe the statistical dependence among multiple variables and are widely used in biology (e.g., gene regulatory networks). Under appropriate assumptions, directed edges may represent causal relationships. A key feature of a biological network is sparsity, defined by how likely an edge is present, of which we often have some knowledge. However, most existing Bayesian methods use priors for the entire graph, making it difficult to specify the level of sparsity. The few methods that use priors on edges estimate the two directions independently; the sum of the two probabilities can exceed 1. Here, we present baycn (BAYesian Causal Network), a novel approximate Bayesian method that represents a graph in terms of three states of edges: the two directions and edge absence, and specifies priors on these edge states. We design a pseudo Bayesian sampling algorithm for efficient inference. We apply baycn to two genomic problems: i) distinguishing direct and indirect target genes of genetic variants, using these variants as instrumental variables, and ii) inferring combinatorial binding of highly-correlated transcription factors in Drosophila. In both cases and in extensive simulations, our method demonstrates much improved accuracy over existing methods for the whole graph and for individual edges.
Submitted 27 November, 2023; v1 submitted 23 September, 2019;
originally announced September 2019.
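
The three-state edge representation described in the abstract above can be illustrated with a small sketch. This is not the baycn implementation or its sampling algorithm. The function names, state labels, and prior values below are hypothetical and chosen only for illustration. The sketch shows why a single prior over the three states (two directions plus absence) always sums to 1, whereas two independently specified direction probabilities need not.

```python
import random

# Hypothetical labels for the three states of one edge between nodes A and B.
EDGE_STATES = ("A->B", "B->A", "absent")


def make_edge_prior(p_forward, p_backward):
    """Build a prior over the three edge states.

    Edge absence receives the remaining probability mass, so the three
    probabilities sum to 1 by construction.
    """
    p_absent = 1.0 - p_forward - p_backward
    if min(p_forward, p_backward, p_absent) < 0:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return dict(zip(EDGE_STATES, (p_forward, p_backward, p_absent)))


def sample_edge_state(prior, rng):
    """Draw one edge state according to the prior."""
    states = list(prior)
    weights = [prior[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


# A sparse prior (assumed values): each direction has probability 0.05.
prior = make_edge_prior(0.05, 0.05)
rng = random.Random(0)
draws = [sample_edge_state(prior, rng) for _ in range(1000)]
```

With this construction, a pair of direction probabilities such as 0.7 and 0.7 is rejected, because the two directions and absence are mutually exclusive states of the same edge.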
-
MRPC: An R package for accurate inference of causal graphs
Authors:
Md. Bahadur Badsha,
Evan A Martin,
Audrey Qiuyan Fu
Abstract:
We present MRPC, an R package that learns causal graphs with improved accuracy over existing packages, such as pcalg and bnlearn. Our algorithm builds on the powerful PC algorithm, the canonical algorithm in computer science for learning directed acyclic graphs. The improvement in accuracy results from online control of the false discovery rate (FDR) that reduces false positive edges, a more accurate approach to identifying v-structures (i.e., $T_1 \rightarrow T_2 \leftarrow T_3$), and robust estimation of the correlation matrix among nodes. For genomic data that contain genotypes and gene expression for each sample, MRPC incorporates the principle of Mendelian randomization to orient the edges. Our package can be applied to continuous and discrete data.
Submitted 5 June, 2018;
originally announced June 2018.
-
Bayesian clustering of replicated time-course gene expression data with weak signals
Authors:
Audrey Qiuyan Fu,
Steven Russell,
Sarah J. Bray,
Simon Tavaré
Abstract:
To identify novel dynamic patterns of gene expression, we develop a statistical method to cluster noisy measurements of gene expression collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on partitions and helps to estimate the number of clusters (or mixture components) from the data. We further tackle two challenges associated with Dirichlet-process prior-based methods. One is efficient sampling. We develop a novel Metropolis-Hastings Markov Chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters. We propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be directly used in cluster assignments, while describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure through simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides novel insights into underlying transcriptional mechanisms in the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT, available on CRAN.
Submitted 28 November, 2013; v1 submitted 18 October, 2012;
originally announced October 2012.
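
The abstract above notes that a Dirichlet-process prior induces a prior distribution on partitions. A minimal sketch of that idea is the Chinese restaurant process, a standard sequential construction of such partitions. This is not the Metropolis-Hastings sampler from the paper or code from the DIRECT package. The function name and the concentration value `alpha` are assumptions made for illustration.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values favor more clusters.
    """
    labels = []
    sizes = []  # sizes[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


# Example: a random partition of 163 items (the number of genes in the abstract),
# with an assumed concentration parameter of 1.0.
labels = sample_crp_partition(163, 1.0, random.Random(1))
n_clusters = len(set(labels))
```

The number of clusters is not fixed in advance. It emerges from the draw, which is the property that lets the number of mixture components be estimated from the data.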
-
Statistical inference of transmission fidelity of DNA methylation patterns over somatic cell divisions in mammals
Authors:
Audrey Qiuyan Fu,
Diane P. Genereux,
Reinhard Stöger,
Charles D. Laird,
Matthew Stephens
Abstract:
We develop Bayesian inference methods for a recently emerging type of epigenetic data to study the transmission fidelity of DNA methylation patterns over cell divisions. The data consist of parent-daughter double-stranded DNA methylation patterns with each pattern coming from a single cell and represented as an unordered pair of binary strings. The data are technically difficult and time-consuming to collect, putting a premium on an efficient inference method. Our aim is to estimate rates for the maintenance and de novo methylation events that gave rise to the observed patterns, while accounting for measurement error. We model data at multiple sites jointly, thus using whole-strand information, and considerably reduce confounding between parameters. We also adopt a hierarchical structure that allows for variation in rates across sites without an explosion in the effective number of parameters. Our context-specific priors capture the expected stationarity, or near-stationarity, of the stochastic process that generated the data analyzed here. This expected stationarity is shown to greatly increase the precision of the estimation. Applying our model to a data set collected at the human FMR1 locus, we find that measurement errors, generally ignored in similar studies, occur at a nontrivial rate (inappropriate bisulfite conversion error: 1.6% with 80% CI: 0.9–2.3%). Accounting for these errors has a substantial impact on estimates of key biological parameters. The estimated average failure of maintenance rate and daughter de novo rate decline from 0.04 to 0.024 and from 0.14 to 0.07, respectively, when errors are accounted for. Our results also provide evidence that de novo events may occur on both parent and daughter strands: the median parent and daughter de novo rates are 0.08 (80% CI: 0.04–0.13) and 0.07 (80% CI: 0.04–0.11), respectively.
Submitted 9 November, 2010;
originally announced November 2010.