-
MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data
Authors:
Andrew J. Blumberg,
Mathieu Carriere,
Michael A. Mandell,
Raul Rabadan,
Soledad Villar
Abstract:
Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
Submitted 20 February, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
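The recursive scheme described in the MREC abstract above (partition, match the partitions, then recurse inside each matched pair) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `median_split` is a hypothetical partitioner, `brute_force_match` stands in for the expensive black-box matcher, partitions are matched through their centroids, and the two point clouds are assumed to have the same size.

```python
from itertools import permutations
from math import dist


def brute_force_match(xs, ys):
    """Stand-in black-box matcher: exhaustive optimal assignment (small inputs only)."""
    if len(xs) > len(ys):
        # Match the smaller set into the larger one, then flip the pairs back.
        return [(i, j) for j, i in brute_force_match(ys, xs)]
    best = min(
        permutations(range(len(ys)), len(xs)),
        key=lambda p: sum(dist(xs[i], ys[j]) for i, j in enumerate(p)),
    )
    return list(enumerate(best))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def median_split(indices, points):
    """Hypothetical partitioner: split the indices in two after sorting the points."""
    order = sorted(indices, key=lambda i: points[i])
    mid = len(order) // 2
    return [order[:mid], order[mid:]]


def mrec_sketch(X, Y, xi=None, yi=None, threshold=4):
    """Recursively match X to Y; returns (index in X, index in Y) pairs."""
    xi = list(range(len(X))) if xi is None else xi
    yi = list(range(len(Y))) if yi is None else yi
    # Base case: the sets are small enough for the black-box matcher.
    if min(len(xi), len(yi)) <= threshold:
        pairs = brute_force_match([X[i] for i in xi], [Y[j] for j in yi])
        return [(xi[a], yi[b]) for a, b in pairs]
    # Partition both sets, then match the partitions via their centroids.
    px, py = median_split(xi, X), median_split(yi, Y)
    cx = [centroid([X[i] for i in part]) for part in px]
    cy = [centroid([Y[j] for j in part]) for part in py]
    matches = []
    for a, b in brute_force_match(cx, cy):
        # Recurse inside each pair of matched partitions.
        matches += mrec_sketch(X, Y, px[a], py[b], threshold)
    return matches
```

Calling `mrec_sketch(X, Y)` on two lists of coordinate tuples returns index pairs. The black-box matcher only ever sees at most `threshold` points (or two centroids) at a time, which is the point of the decomposition.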
-
Quasi-universality in single-cell sequencing data
Authors:
Luis Aparicio,
Mykola Bordyuh,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
Submitted 5 October, 2018;
originally announced October 2018.
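The idea in the abstract above of separating a bulk of eigenvalues that follow a universal distribution from a small deviating fraction can be sketched as follows. The abstract does not name the distribution, so this sketch assumes the Marchenko-Pastur law for the eigenvalues of a sample covariance matrix with independent noise entries of variance `sigma2`. The function names and the simple edge test are illustrative assumptions, not the authors' procedure.

```python
from math import sqrt


def marchenko_pastur_edges(n_samples, n_features, sigma2=1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk (assumed noise model)."""
    gamma = n_features / n_samples
    lower = sigma2 * (1 - sqrt(gamma)) ** 2
    upper = sigma2 * (1 + sqrt(gamma)) ** 2
    return lower, upper


def split_spectrum(eigenvalues, n_samples, n_features, sigma2=1.0):
    """Separate eigenvalues inside the assumed noise bulk from those outside it."""
    lower, upper = marchenko_pastur_edges(n_samples, n_features, sigma2)
    bulk = [ev for ev in eigenvalues if lower <= ev <= upper]
    outliers = [ev for ev in eigenvalues if ev < lower or ev > upper]
    return bulk, outliers
```

Under this assumption, eigenvalues outside the bulk edges are only candidates for signal. The abstract notes that some deviating directions are artifacts of sparsity, and a plain edge test like this one does not detect those.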
-
Genomic data analysis in tree spaces
Authors:
Sakellarios Zairis,
Hossein Khiabanian,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
Recently, an elegant approach in phylogenetics was introduced by Billera-Holmes-Vogtmann that allows a systematic comparison of different evolutionary histories using the metric geometry of tree spaces. In many problem settings one encounters heavily populated phylogenetic trees, where the large number of leaves encumbers visualization and analysis in the relevant evolutionary moduli spaces. To address this issue, we introduce tree dimensionality reduction, a structured approach to reducing large phylogenetic trees to a distribution of smaller trees. We prove a stability theorem ensuring that small perturbations of the large trees are taken to small perturbations of the resulting distributions.
We then present a series of four biologically motivated applications to the analysis of genomic data, spanning cancer and infectious disease. The first quantifies how chemotherapy can disrupt the evolution of common leukemias. The second examines a link between geometric information and the histologic grade in relapsed gliomas, where longer relapse branches were specific to high grade glioma. The third concerns genetic stability of xenograft models of cancer, where heterogeneity at the single cell level increased with later mouse passages. The last studies genetic diversity in seasonal influenza A virus. We apply tree dimensionality reduction to 24 years of longitudinally collected H3N2 hemagglutinin sequences, generating distributions of smaller trees spanning between three and five seasons. A negative correlation is observed between the influenza vaccine effectiveness during a season and the variance of the distributions produced using preceding seasons' sequence data. We also show how tree distributions relate to antigenic clusters and choice of influenza vaccine. Our formalism exposes links between viral genomic data and clinical observables such as vaccine selection and efficacy.
Submitted 25 July, 2016;
originally announced July 2016.
-
Moduli Spaces of Phylogenetic Trees Describing Tumor Evolutionary Patterns
Authors:
Sakellarios Zairis,
Hossein Khiabanian,
Andrew J. Blumberg,
Raul Rabadan
Abstract:
Cancers follow a clonal Darwinian evolution, with fitter subclones replacing more quiescent cells, ultimately giving rise to macroscopic disease. High-throughput genomics provides the opportunity to investigate these processes and determine specific genetic alterations driving disease progression. Genomic sampling of a patient's cancer provides a molecular history, represented by a phylogenetic tree. Cohorts of patients represent a forest of related phylogenetic structures. To extract clinically relevant information, one must represent and statistically compare these collections of trees. We propose a framework based on an application of the work by Billera, Holmes and Vogtmann on phylogenetic tree spaces to the case of unrooted trees of intra-individual cancer tissue samples. We observe that these tree spaces are globally nonpositively curved, allowing for statistical inference on populations of patient histories. A projective tree space is introduced, permitting visualizations of aggregate evolutionary behavior. Published data from three types of human malignancies are explored within our framework.
Submitted 3 October, 2014;
originally announced October 2014.