-
Exact and efficient phylodynamic simulation from arbitrarily large populations
Authors:
Michael Celentano,
William S. DeWitt,
Sebastian Prillo,
Yun S. Song
Abstract:
Many biological studies involve inferring the evolutionary history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multi-type birth-death-mutation-sampling model, there exists an equivalent process with complete sampling and no death, a property which we leverage to develop a highly efficient algorithm for simulating trees. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this unprecedented speedup will significantly advance the development of novel inference methods that require extensive training data.
Submitted 10 August, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph
Authors:
Will Dumm,
Mary Barker,
William Howard-Snyder,
William S. DeWitt,
Frederick A. Matsen IV
Abstract:
In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the "history subpartition directed acyclic graph" (or "history sDAG" for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to only represent maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the "skeleton" of a more complete uncertainty quantification.
Submitted 11 October, 2023;
originally announced October 2023.
-
Mean-field interacting multi-type birth-death processes with a view to applications in phylodynamics
Authors:
William S. DeWitt,
Steven N. Evans,
Ella Hiesmayr,
Sebastian Hummel
Abstract:
Multi-type birth-death processes underlie approaches for inferring evolutionary dynamics from phylogenetic trees across biological scales, ranging from deep-time species macroevolution to rapid viral evolution and somatic cellular proliferation. A limitation of current phylogenetic birth-death models is that they require restrictive linearity assumptions that yield tractable message-passing likelihoods, but that also preclude interactions between individuals. Many fundamental evolutionary processes -- such as environmental carrying capacity or frequency-dependent selection -- entail interactions, and may strongly influence the dynamics in some systems. Here, we introduce a multi-type birth-death process in mean-field interaction with an ensemble of replicas of the focal process. We prove that, under quite general conditions, the ensemble's stochastically evolving interaction field converges to a deterministic trajectory in the limit of an infinite ensemble. In this limit, the replicas effectively decouple, and self-consistent interactions appear as nonlinearities in the infinitesimal generator of the focal process. We investigate a special case that is rich enough to model both carrying capacity and frequency-dependent selection while yielding tractable message-passing likelihoods in the context of a phylogenetic birth-death model.
Submitted 31 March, 2024; v1 submitted 12 July, 2023;
originally announced July 2023.
-
Dynamics of B-cell repertoires and emergence of cross-reactive responses in COVID-19 patients with different disease severity
Authors:
Zachary Montague,
Huibin Lv,
Jakub Otwinowski,
William S. DeWitt,
Giulio Isacchini,
Garrick K. Yip,
Wilson W. Ng,
Owen Tak-Yin Tsang,
Meng Yuan,
Hejun Liu,
Ian A. Wilson,
J. S. Malik Peiris,
Nicholas C. Wu,
Armita Nourmohammad,
Chris Ka Pun Mok
Abstract:
COVID-19 patients show varying severity of the disease ranging from asymptomatic to requiring intensive care. Although a number of SARS-CoV-2 specific monoclonal antibodies have been identified, we still lack an understanding of the overall landscape of B-cell receptor (BCR) repertoires in COVID-19 patients. Here, we used high-throughput sequencing of bulk and plasma B-cells collected over multiple time points during infection to characterize signatures of B-cell response to SARS-CoV-2 in 19 patients. Using principled statistical approaches, we determined differential features of BCRs associated with different disease severity. We identified 38 significantly expanded clonal lineages shared among patients as candidates for specific responses to SARS-CoV-2. Using single-cell sequencing, we verified reactivity of BCRs shared among individuals to SARS-CoV-2 epitopes. Moreover, we identified natural emergence of a BCR with cross-reactivity to SARS-CoV-1 and SARS-CoV-2 in a number of patients. Our results provide important insights for development of rational therapies and vaccines against COVID-19.
Submitted 5 April, 2021; v1 submitted 13 July, 2020;
originally announced July 2020.
-
Estimation of cell lineage trees by maximum-likelihood phylogenetics
Authors:
Jean Feng,
William S DeWitt III,
Aaron McKenna,
Noah Simon,
Amy Willis,
Frederick A Matsen IV
Abstract:
CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data's structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.
Submitted 29 March, 2019;
originally announced April 2019.
-
Using genotype abundance to improve phylogenetic inference
Authors:
William S. DeWitt III,
Luka Mesin,
Gabriel D. Victora,
Vladimir N. Minin,
Frederick A. Matsen IV
Abstract:
Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this paper, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is substantially improved by doing so. Our method is validated with extensive simulations and an experimental single-cell lineage tracing study of germinal center B cell receptor affinity maturation.
Submitted 5 April, 2018; v1 submitted 29 August, 2017;
originally announced August 2017.
-
Replicate immunosequencing as a robust probe of B cell repertoire diversity
Authors:
William DeWitt,
Paul Lindau,
Thomas Snyder,
Marissa Vignali,
Ryan Emerson,
Harlan Robins
Abstract:
Fundamental to quantitative characterization of the B cell receptor repertoire is clonal diversity - the number of distinct somatically recombined receptors present in the repertoire and their relative abundances, defining the search space available for immune response. This study synthesizes flow cytometry and immunosequencing to study memory and naive B cells from the peripheral blood of three adults. A combinatorial experimental design was employed, constituting a sample abundance probe robust to amplification stochasticity, a crucial quantitative advance over previous sequencing studies of diversity. These data are leveraged to interrogate repertoire diversity, motivating an extension of a canonical diversity model in ecology and corpus linguistics. Maximum likelihood diversity estimates are provided for memory and naive B cell repertoires. Both evince domination by rare clones and regimes of power law scaling in abundance. Memory clones have more disparate repertoire abundances than naive clones, and most naive clones undergo no proliferation prior to antigen recognition.
Submitted 1 October, 2014;
originally announced October 2014.