-
A simple and flexible test of sample exchangeability with applications to statistical genomics
Authors:
Alan J. Aw,
Jeffrey P. Spence,
Yun S. Song
Abstract:
In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).
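The permutation idea behind a test like this can be sketched in a few lines. This is a minimal illustration, not the flintyR or flintyPy implementation. It makes three assumptions of its own: the data are a binary samples-by-features matrix, the features are treated as independent, and the test statistic is the variance of pairwise Hamming distances between samples. The function names `pairwise_distance_variance` and `permutation_pvalue` are hypothetical, and the large-sample asymptotics mentioned in the abstract are not reproduced here.

```python
import numpy as np


def pairwise_distance_variance(X):
    """Variance of Hamming distances over all pairs of rows of a 0/1 matrix."""
    X = np.asarray(X)
    gram = X @ X.T                      # shared 1-entries for each pair of rows
    ones_per_row = X.sum(axis=1)
    # For binary rows: d(i, j) = |x_i| + |x_j| - 2 * <x_i, x_j>
    dist = ones_per_row[:, None] + ones_per_row[None, :] - 2 * gram
    upper = np.triu_indices(X.shape[0], k=1)
    return dist[upper].var()


def permutation_pvalue(X, n_perm=200, seed=0):
    """Monte Carlo p-value for the null that rows are exchangeable,
    assuming (hypothetically) that the columns are mutually independent."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    observed = pairwise_distance_variance(X)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle each column independently of the others; under the null
        # this leaves the joint distribution of the matrix unchanged.
        X_perm = rng.permuted(X, axis=0)
        if pairwise_distance_variance(X_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

A sample made of two sharply separated groups inflates the variance of pairwise distances well beyond what independent column shuffles produce, so the sketch returns a small p-value for it.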
Submitted 30 August, 2023; v1 submitted 30 September, 2021;
originally announced September 2021.
-
Flexible mean field variational inference using mixtures of non-overlapping exponential families
Authors:
Jeffrey P. Spence
Abstract:
Sparse models are desirable for many applications across diverse domains as they can perform automatic variable selection, aid interpretability, and provide regularization. When fitting sparse models in a Bayesian framework, however, analytically obtaining a posterior distribution over the parameters of interest is intractable for all but the simplest cases. As a result practitioners must rely on either sampling algorithms such as Markov chain Monte Carlo or variational methods to obtain an approximate posterior. Mean field variational inference is a particularly simple and popular framework that is often amenable to analytically deriving closed-form parameter updates. When all distributions in the model are members of exponential families and are conditionally conjugate, optimization schemes can often be derived by hand. Yet, I show that using standard mean field variational inference can fail to produce sensible results for models with sparsity-inducing priors, such as the spike-and-slab. Fortunately, such pathological behavior can be remedied as I show that mixtures of exponential family distributions with non-overlapping support form an exponential family. In particular, any mixture of a diffuse exponential family and a point mass at zero to model sparsity forms an exponential family. Furthermore, specific choices of these distributions maintain conditional conjugacy. I use two applications to motivate these results: one from statistical genetics that has connections to generalized least squares with a spike-and-slab prior on the regression coefficients; and sparse probabilistic principal component analysis. The theoretical results presented here are broadly applicable beyond these two examples.
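The statement that a point mass at zero mixed with a diffuse exponential family is itself an exponential family can be checked directly. The following is a sketch under assumed notation that is not taken from the paper: $π$ is the mixing weight on the point mass, the diffuse component has density $h(x)\exp(η^\top T(x) - A(η))$ and places no mass at zero, and densities are taken with respect to the sum of a point mass at zero and the diffuse component's base measure.

```latex
\begin{align*}
p(x) &= \pi^{\mathbf{1}[x=0]}
        \Bigl((1-\pi)\,h(x)\,e^{\eta^\top T(x)-A(\eta)}\Bigr)^{\mathbf{1}[x\neq 0]} \\
     &= h(x)^{\mathbf{1}[x\neq 0]}
        \exp\Bigl(\mathbf{1}[x=0]\,\bigl[\log\tfrac{\pi}{1-\pi}+A(\eta)\bigr]
        + \eta^\top\,\mathbf{1}[x\neq 0]\,T(x)
        - \bigl[A(\eta)-\log(1-\pi)\bigr]\Bigr)
\end{align*}
```

The second line uses $\mathbf{1}[x\neq 0] = 1 - \mathbf{1}[x=0]$. Under these assumptions the mixture has sufficient statistics $(\mathbf{1}[x=0],\ \mathbf{1}[x\neq 0]\,T(x))$, natural parameters $(\log\tfrac{π}{1-π} + A(η),\ η)$, and log-partition function $A(η) - \log(1-π)$.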
Submitted 13 October, 2020;
originally announced October 2020.
-
Two-Locus Likelihoods under Variable Population Size and Fine-Scale Recombination Rate Estimation
Authors:
John A. Kamm,
Jeffrey P. Spence,
Jeffrey Chan,
Yun S. Song
Abstract:
Two-locus sampling probabilities have played a central role in devising an efficient composite likelihood method for estimating fine-scale recombination rates. Due to mathematical and computational challenges, these sampling probabilities are typically computed under the unrealistic assumption of a constant population size, and simulation studies have shown that resulting recombination rate estimates can be severely biased in certain cases of historical population size changes. To alleviate this problem, we develop here new methods to compute the sampling probability for variable population size functions that are piecewise constant. Our main theoretical result, implemented in a new software package called LDpop, is a novel formula for the sampling probability that can be evaluated by numerically exponentiating a large but sparse matrix. This formula can handle moderate sample sizes ($n \leq 50$) and demographic size histories with a large number of epochs ($\mathcal{D} \geq 64$). In addition, LDpop implements an approximate formula for the sampling probability that is reasonably accurate and scales to hundreds in sample size ($n \geq 256$). Finally, LDpop includes an importance sampler for the posterior distribution of two-locus genealogies, based on a new result for the optimal proposal distribution in the variable-size setting. Using our methods, we study how a sharp population bottleneck followed by rapid growth affects the correlation between partially linked sites. Then, through an extensive simulation study, we show that accounting for population size changes under such a demographic model leads to substantial improvements in fine-scale recombination rate estimation. LDpop is freely available for download at https://github.com/popgenmethods/ldpop
Submitted 10 April, 2016; v1 submitted 20 October, 2015;
originally announced October 2015.
-
The site frequency spectrum for general coalescents
Authors:
Jeffrey P. Spence,
John A. Kamm,
Yun S. Song
Abstract:
General genealogical processes such as $Λ$- and $Ξ$-coalescents, which respectively model multiple and simultaneous mergers, have important applications in studying marine species, strong positive selection, recurrent selective sweeps, strong bottlenecks, large sample sizes, and so on. Recently, there has been significant progress in developing useful inference tools for such general models. In particular, inference methods based on the site frequency spectrum (SFS) have received noticeable attention. Here, we derive a new formula for the expected SFS for general $Λ$- and $Ξ$-coalescents, which leads to an efficient algorithm. For time-homogeneous coalescents, the runtime of our algorithm for computing the expected SFS is $O(n^2)$, where $n$ is the sample size. This is a factor of $n^2$ faster than the state-of-the-art method. Furthermore, in contrast to existing methods, our method generalizes to time-inhomogeneous $Λ$- and $Ξ$-coalescents with measures that factorize as $Λ(dx)/ζ(t)$ and $Ξ(dx)/ζ(t)$, respectively, where $ζ$ denotes a strictly positive function of time. The runtime of our algorithm in this setting is $O(n^3)$. We also obtain general theoretical results for the identifiability of the $Λ$ measure when $ζ$ is a constant function, as well as for the identifiability of the function $ζ$ under a fixed $Ξ$ measure.
Submitted 11 February, 2016; v1 submitted 19 October, 2015;
originally announced October 2015.