-
Generative Models: An Interdisciplinary Perspective
Authors:
Kris Sankaran,
Susan P. Holmes
Abstract:
By linking conceptual theories with observed data, generative models can support reasoning in complex situations. They have come to play a central role both within and beyond statistics, providing the basis for power analysis in molecular biology, theory building in particle physics, and resource allocation in epidemiology, for example. We introduce the probabilistic and computational concepts underlying modern generative models and then analyze how they can be used to inform experimental design, iterative model refinement, goodness-of-fit evaluation, and agent-based simulation. We emphasize a modular view of generative mechanisms and discuss how they can be flexibly recombined in new problem contexts. We provide practical illustrations throughout, and code for reproducing all examples is available at https://github.com/krisrs1128/generative_review. Finally, we observe how research in generative models is currently split across several islands of activity, and we highlight opportunities lying at disciplinary intersections.
Submitted 11 August, 2022;
originally announced August 2022.
-
A Statistical Perspective on the Challenges in Molecular Microbial Biology
Authors:
Pratheepa Jeganathan,
Susan P. Holmes
Abstract:
High throughput sequencing (HTS)-based technology enables identifying and quantifying non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contamination sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but different strains across subjects and environments also make direct differential abundance testing unwieldy. We provide an introduction to a few statistical tools that can overcome some of these difficulties and demonstrate those tools on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.
Submitted 6 March, 2021;
originally announced March 2021.
-
The Block Bootstrap Method for Longitudinal Microbiome Data
Authors:
Pratheepa Jeganathan,
Benjamin J. Callahan,
Diana M. Proctor,
David A. Relman,
Susan P. Holmes
Abstract:
Microbial ecology serves as a foundation for a wide range of scientific and biomedical studies. Rapidly-evolving high-throughput sequencing technology enables the comprehensive search for microbial biomarkers using longitudinal experiments. Such experiments consist of repeated biological observations from each subject over time and are essential in accounting for the high between-subject and within-subject variability.
Unfortunately, many statistical tests based on parametric models rely on correctly specifying the temporal dependence structure, which is unavailable in most microbiome data.
In this paper, we propose an extension of the nonparametric bootstrap method that enables inference on these types of longitudinal data. The proposed moving block bootstrap (MBB) method accounts for within-subject dependency by using overlapping blocks of repeated observations within each subject to draw valid inferences based on approximately pivotal statistics. Our simulation studies show an increase in power compared to merge-by-subject (MBS) strategies. We also show that compared to tests that presume independent samples (PIS), our proposed method reduces false microbial biomarker discovery rates.
We illustrate the MBB method using three different pregnancy datasets and an oral microbiome dataset. We provide an open-source R package, https://github.com/PratheepaJ/bootLong, to make our method accessible and the study in this paper reproducible.
Submitted 29 November, 2018; v1 submitted 6 September, 2018;
originally announced September 2018.
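The resampling idea in the abstract above (overlapping blocks of repeated observations drawn within each subject) can be sketched roughly as follows. This is a minimal illustration in Python, not the bootLong R package's implementation: the function names, the fixed block length, and the dictionary-of-arrays data layout are all assumptions made for this example, and the paper's pivotal-statistic and testing steps are not shown.

```python
import numpy as np


def moving_block_indices(n_obs, block_len, rng):
    """Draw time indices for one subject by sampling overlapping blocks.

    Hypothetical helper: blocks start at any of the n_obs - block_len + 1
    positions, are sampled with replacement, concatenated, and truncated
    to the original series length.
    """
    if not 1 <= block_len <= n_obs:
        raise ValueError("block_len must be between 1 and n_obs")
    n_starts = n_obs - block_len + 1          # number of overlapping blocks
    n_blocks = -(-n_obs // block_len)         # ceil(n_obs / block_len)
    starts = rng.integers(0, n_starts, size=n_blocks)
    # Expand each start into a run of consecutive time indices.
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()
    return idx[:n_obs]


def mbb_resample(series_by_subject, block_len, seed=0):
    """Resample each subject's series separately, keeping blocks intact.

    series_by_subject maps a subject id to a 1-D sequence of repeated
    observations (an assumed layout for this sketch).
    """
    rng = np.random.default_rng(seed)
    resampled = {}
    for subject, values in series_by_subject.items():
        values = np.asarray(values)
        idx = moving_block_indices(len(values), block_len, rng)
        resampled[subject] = values[idx]
    return resampled
```

Because blocks are drawn within a subject and kept contiguous, short-range temporal dependence inside each block is preserved in the resampled series; the block length controls how much of that dependence is retained.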
-
Inference of Dynamic Regimes in the Microbiome
Authors:
Kris Sankaran,
Susan P. Holmes
Abstract:
Many studies have been performed to characterize the dynamics and stability of the microbiome across a range of environmental contexts [Costello et al., 2012, Faust et al., 2015]. For example, it is often of interest to identify time intervals within which certain subsets of taxa have an interesting pattern of behavior. Viewed abstractly, these problems often have a flavor not just of time series modeling but also of regime detection, a problem with a rich history across a variety of applications, including speech recognition [Fox et al., 2011], finance [Lee, 2009], EEG analysis [Camilleri et al., 2014], and geophysics [Weatherley and Mora, 2002]. In spite of the parallels, regime detection methods are rarely used in microbiome analysis, most likely due to the fact that references for these methods are scattered across several literatures, descriptions are inaccessible outside limited research communities, and implementations are difficult to come across.
We distill the core ideas of different regime detection methods, provide example applications, and share reproducible code, making these techniques more accessible to microbiome researchers. We re-analyze data of Dethlefsen and Relman [2011], a study of the effects of antibiotics on the microbiome, using Classification and Regression Trees (CART) [Breiman et al., 1984], Hidden Markov Models (HMMs) [Rabiner and Juang, 1986], Bayesian nonparametric HMMs [Teh and Jordan, 2010, Fox et al., 2008], mixtures of Gaussian Processes (GPs) [Rasmussen and Ghahramani, 2002], switching dynamical systems [Linderman et al., 2016], and multiple changepoint detection [Fan and Mackey, 2015]. Along the way, we summarize each method, its relevance to the microbiome, and the tradeoffs associated with using it. Ultimately, our goal is to describe types of temporal or regime switching structure that can be incorporated into studies of microbiome dynamics.
Submitted 30 November, 2017;
originally announced December 2017.
-
Latent Variable Modeling for the Microbiome
Authors:
Kris Sankaran,
Susan P. Holmes
Abstract:
The human microbiome is a complex ecological system, and describing its structure and function under different environmental conditions is important from both basic scientific and medical perspectives. Viewed through a biostatistical lens, many microbiome analysis goals can be formulated as latent variable modeling problems. However, although probabilistic latent variable models are a cornerstone of modern unsupervised learning, they are rarely applied in the context of microbiome data analysis, in spite of the evolutionary, temporal, and count structure that could be directly incorporated through such models. We explore the application of probabilistic latent variable models to microbiome data, with a focus on Latent Dirichlet Allocation, Nonnegative Matrix Factorization, and Dynamic Unigram models. To develop guidelines for when different methods are appropriate, we perform a simulation study. We further illustrate and compare these techniques using the data of [10], a study on the effects of antibiotics on bacterial community composition. Code and data for all simulations and case studies are available publicly.
Submitted 15 November, 2017; v1 submitted 15 June, 2017;
originally announced June 2017.