Search | arXiv e-print repository

Distances between Extension Spaces of Phylogenetic Trees

Authors: Maria Alejandra Valdez Cabrera, Amy D Willis

Abstract: Phylogenetic trees summarize evolutionary relationships between organisms, and tools to analyze collections of phylogenetic trees enable contrasts between different genes' ancestry. The BHV metric space has enabled the analysis of collections of trees that share a common set of leaves, but many genes are not shared, even between closely related species. BHV extension spaces represent trees with non-identical leaf sets in a common BHV space, but limited analytical tools exist for extension spaces. We define the distance between two phylogenetic trees with non-identical leaf sets as the shortest BHV distance between their extension spaces, and develop a reduced gradient algorithm to compute this distance. We study the scalability of our algorithm and apply it to analyze gene trees spanning multiple domains of life. Our distance and algorithm offer a fully general, interpretable approach to analyzing both ancient and recent evolutionary divergence.

Submitted 28 June, 2024; originally announced July 2024.

arXiv:2402.05231

Estimating Fold Changes from Partially Observed Outcomes with Applications in Microbial Metagenomics

Authors: David S Clausen, Sarah Teichman, Amy D Willis

Abstract: We consider the problem of estimating fold-changes in the expected value of a multivariate outcome observed with unknown sample-specific and category-specific perturbations. This challenge arises in high-throughput sequencing studies of the abundance of microbial taxa because microbes are systematically over- and under-detected relative to their true abundances. Our model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of estimators in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated through a meta-analysis of microbial associations with colorectal cancer.

Submitted 14 March, 2025; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: v2 includes clarified exposition, additional examples, expanded simulation study, and supporting theory; Dr Teichman contributed substantially to v2 and is now recognised as a coauthor

arXiv:2204.12733

Modeling complex measurement error in microbiome experiments to estimate relative abundances and detection effects

Authors: David S Clausen, Amy D Willis

Abstract: Accurate estimates of microbial species abundances are needed to advance our understanding of the role that microbiomes play in human and environmental health. However, artificially constructed microbiomes demonstrate that intuitive estimators of microbial relative abundances are biased. To address this, we propose a semiparametric method to estimate relative abundances, species detection effects, and/or cross-sample contamination in microbiome experiments. We show that certain experimental designs result in identifiable model parameters, and we present consistent estimators and asymptotically valid inference procedures. Notably, our procedure can estimate relative abundances on the boundary of the simplex. We demonstrate the utility of the method for comparing experimental protocols, removing cross-sample contamination, and estimating species' detectability.

Submitted 14 March, 2025; v1 submitted 27 April, 2022; originally announced April 2022.

Comments: v2 includes detailed identifiability results, a complete proof of weak convergence, additional simulation results, and clarified exposition

arXiv:1902.02776

Modeling microbial abundances and dysbiosis with beta-binomial regression

Authors: Bryan D. Martin, Daniela Witten, Amy D. Willis

Abstract: Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.

Submitted 7 February, 2019; originally announced February 2019.

arXiv:1611.03456

Uncertainty in phylogenetic tree estimates

Authors: Amy D. Willis, Rayna C. Bell

Abstract: Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy and medicine. Although trees are estimated, their uncertainties are discarded by mathematicians working in tree space. Here we explicitly model the multivariate uncertainty of tree estimates. We consider both the cases where uncertainty information arises extrinsically (through covariate information) and intrinsically (through the tree estimates themselves). The importance of accounting for tree uncertainty in tree space is demonstrated in two case studies. In the first instance, differences between gene trees are small relative to their uncertainties, while in the second, the differences are relatively large. Our main goal is visualization of tree uncertainty, and we demonstrate advantages of our method with respect to reproducibility, speed and preservation of topological differences compared to visualization based on multidimensional scaling. The proposal highlights that phylogenetic trees are estimated in an extremely high-dimensional space, resulting in uncertainty information that cannot be discarded. Most importantly, it is a method that allows biologists to diagnose whether differences between gene trees are biologically meaningful, or due to uncertainty in estimation.

Submitted 12 October, 2017; v1 submitted 10 November, 2016; originally announced November 2016.

Comments: Final version accepted to Journal of Computational and Graphical Statistics

Showing 1–5 of 5 results for author: Willis, A D