Search | arXiv e-print repository

Estimating Fold Changes from Partially Observed Outcomes with Applications in Microbial Metagenomics

Authors: David S Clausen, Sarah Teichman, Amy D Willis

Abstract: We consider the problem of estimating fold-changes in the expected value of a multivariate outcome observed with unknown sample-specific and category-specific perturbations. This challenge arises in high-throughput sequencing studies of the abundance of microbial taxa because microbes are systematically over- and under-detected relative to their true abundances. Our model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of estimators in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated through a meta-analysis of microbial associations with colorectal cancer.

Submitted 14 March, 2025; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: v2 includes clarified exposition, additional examples, expanded simulation study, and supporting theory; Dr Teichman contributed substantially to v2 and is now recognised as a coauthor

arXiv:2204.12733 [pdf, other]

Modeling complex measurement error in microbiome experiments to estimate relative abundances and detection effects

Authors: David S Clausen, Amy D Willis

Abstract: Accurate estimates of microbial species abundances are needed to advance our understanding of the role that microbiomes play in human and environmental health. However, artificially constructed microbiomes demonstrate that intuitive estimators of microbial relative abundances are biased. To address this, we propose a semiparametric method to estimate relative abundances, species detection effects, and/or cross-sample contamination in microbiome experiments. We show that certain experimental designs result in identifiable model parameters, and we present consistent estimators and asymptotically valid inference procedures. Notably, our procedure can estimate relative abundances on the boundary of the simplex. We demonstrate the utility of the method for comparing experimental protocols, removing cross-sample contamination, and estimating species' detectability.

Submitted 14 March, 2025; v1 submitted 27 April, 2022; originally announced April 2022.

Comments: v2 includes detailed identifiability results, a complete proof of weak convergence, additional simulation results, and clarified exposition

arXiv:1904.00117 [pdf, other]

Estimation of cell lineage trees by maximum-likelihood phylogenetics

Authors: Jean Feng, William S DeWitt III, Aaron McKenna, Noah Simon, Amy Willis, Frederick A Matsen IV

Abstract: CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data's structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.

Submitted 29 March, 2019; originally announced April 2019.

arXiv:1902.02776 [pdf, other]

Modeling microbial abundances and dysbiosis with beta-binomial regression

Authors: Bryan D. Martin, Daniela Witten, Amy D. Willis

Abstract: Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.

Submitted 7 February, 2019; originally announced February 2019.

arXiv:1611.03456 [pdf, other]

Uncertainty in phylogenetic tree estimates

Authors: Amy D. Willis, Rayna C. Bell

Abstract: Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy and medicine. Although trees are estimated, their uncertainties are discarded by mathematicians working in tree space. Here we explicitly model the multivariate uncertainty of tree estimates. We consider both the cases where uncertainty information arises extrinsically (through covariate information) and intrinsically (through the tree estimates themselves). The importance of accounting for tree uncertainty in tree space is demonstrated in two case studies. In the first instance, differences between gene trees are small relative to their uncertainties, while in the second, the differences are relatively large. Our main goal is visualization of tree uncertainty, and we demonstrate advantages of our method with respect to reproducibility, speed and preservation of topological differences compared to visualization based on multidimensional scaling. The proposal highlights that phylogenetic trees are estimated in an extremely high-dimensional space, resulting in uncertainty information that cannot be discarded. Most importantly, it is a method that allows biologists to diagnose whether differences between gene trees are biologically meaningful, or due to uncertainty in estimation.

Submitted 12 October, 2017; v1 submitted 10 November, 2016; originally announced November 2016.

Comments: Final version accepted to Journal of Computational and Graphical Statistics

arXiv:1607.08288 [pdf, other]

Confidence sets for phylogenetic trees

Authors: Amy Willis

Abstract: Inferring evolutionary histories (phylogenetic trees) has important applications in biology, criminology and public health. However, phylogenetic trees are complex mathematical objects that reside in a non-Euclidean space, which complicates their analysis. While our mathematical, algorithmic, and probabilistic understanding of phylogenies in their metric space is mature, rigorous inferential infrastructure is as yet undeveloped. In this manuscript we unify recent computational and probabilistic advances to construct tree-valued confidence sets. The procedure accounts for both centre and multiple directions of tree-valued variability. We draw on block replicates to improve testing, identifying the best supported most recent ancestor of the Zika virus, and formally testing the hypothesis that a Floridian dentist with AIDS infected two of his patients with HIV. The method illustrates connections between variability in Euclidean and tree space, opening phylogenetic tree analysis to techniques available in the multivariate Euclidean setting.

Submitted 12 October, 2017; v1 submitted 27 July, 2016; originally announced July 2016.

Comments: Final version accepted to the Journal of the American Statistical Association

MSC Class: 62F03; 62C99; 62H86

arXiv:1605.02082 [pdf, other]

Improved detection of changes in species richness in high-diversity microbial communities

Authors: Amy Willis, John Bunge, Thea Whitman

Abstract: High throughput sequencing (HTS) continues to expand our understanding of microbial communities, despite insufficient sequencing depths to detect all rare taxa. These low abundance taxa are not accounted for in existing methods for detecting changes in species richness. We address this with a new hierarchical model that permits rigorous testing for both heterogeneity and biodiversity changes, and simultaneously improves Type I & II error rates compared to existing methods.

Submitted 9 April, 2016; originally announced May 2016.

Comments: arXiv admin note: text overlap with arXiv:1506.05710

arXiv:1604.02598 [pdf, other]

Species richness estimation with high diversity but spurious singletons

Authors: Amy Willis

Abstract: The presence of uncommon taxa in high-throughput sequenced ecological samples pose challenges to the microbial ecologist, bioinformatician and statistician. It is rarely certain whether these taxa are truly present in the sample or the result of sequencing errors. Unfortunately, alpha-diversity quantification relies on accurate frequency counts, which can rarely be guaranteed. We present a species richness estimation tool which predicts both the number of unobserved taxa and the number of true singletons based on the non-singleton frequency counts. This method can be treated as either inferential (for formally estimating richness) or exploratory (for assessing robustness of the richness estimate to the singleton count). If the estimate, called breakaway_nof1, is comparable to other richness estimators, this provides evidence that the richness estimate is robust to the level of quality control (eg. chimera-checking) employed in pre-processing. The function breakaway_nof1 is freely available from CRAN via the R package breakaway.

Submitted 9 April, 2016; originally announced April 2016.

arXiv:1506.05710 [pdf, other]

Inference for changes in biodiversity

Authors: Amy Willis, John Bunge, Thea Whitman

Abstract: We wish to formally test for changes in the taxonomic diversity of a community, especially in the presence of high latent diversity. Drawing on the meta-analysis literature, we construct a model for diversity that accounts for covariate effects as well as sampling variability. This permits inference for changes in richness with covariates and also a test for homogeneity. We argue that we can use the principles of shrinkage estimation to improve richness estimation in this nonstandard context, which is especially important given the high variance of richness estimators and the increasing abundance of community composition data. We demonstrate the methodology under simulation, in a gut microbiome study (testing for a decrease in richness with antibiotics), and in a soil microbiome study (testing for homogeneity of replicates). We believe that this is the first formal procedure for analyzing changes in species richness.

Submitted 18 June, 2015; originally announced June 2015.

Comments: 23 pages, 4 figures

arXiv:1408.3333 [pdf, other]

Estimating Diversity via Frequency Ratios

Authors: A. Willis, J. Bunge

Abstract: We wish to estimate the total number of classes in a population based on sample counts, especially in the presence of high latent diversity. Drawing on probability theory that characterizes distributions on the integers by ratios of consecutive probabilities, we construct a nonlinear regression model for the ratios of consecutive frequency counts. This allows us to predict the unobserved count and hence estimate the total diversity. We believe that this is the first approach to depart from the classical mixed Poisson model in this problem. Our method is geometrically intuitive and yields good fits to data with reasonable standard errors. It is especially well-suited to analyzing high diversity datasets derived from next-generation sequencing in microbial ecology. We demonstrate the method's performance in this context and via simulation, and we present a dataset for which our method outperforms all competitors.

Submitted 9 December, 2014; v1 submitted 14 August, 2014; originally announced August 2014.

Comments: 17 pages, 1 figure, 4 tables

Showing 1–10 of 10 results for author: Willis, A