-
SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based GxE Tests in Biobank Data
Authors:
Jocelyn T. Chi,
Ilse C. F. Ipsen,
Tzu-Hung Hsiao,
Ching-Heng Lin,
Li-San Wang,
Wan-Ping Lee,
Tzu-Pin Lu,
Jung-Ying Tzeng
Abstract:
The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, which are a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to permit GxE VC tests for biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original GxE VC tests without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of $10^5$, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate SEAGLE's performance through extensive simulations. We illustrate its utility by conducting genome-wide gene-based GxE analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.
Submitted 14 May, 2021; v1 submitted 7 May, 2021;
originally announced May 2021.
-
Probabilistic Iterative Methods for Linear Systems
Authors:
Jon Cockayne,
Ilse C. F. Ipsen,
Chris J. Oates,
Tim W. Reid
Abstract:
This paper presents a probabilistic perspective on iterative methods for approximating the solution $\mathbf{x}_* \in \mathbb{R}^d$ of a nonsingular linear system $\mathbf{A} \mathbf{x}_* = \mathbf{b}$. In this approach, a standard iterative method on $\mathbb{R}^d$ is lifted to act on the space of probability distributions $\mathcal{P}(\mathbb{R}^d)$. Classically, an iterative method produces a sequence $\mathbf{x}_m$ of approximations that converge to $\mathbf{x}_*$. The output of the iterative methods proposed in this paper is, instead, a sequence of probability distributions $\mu_m \in \mathcal{P}(\mathbb{R}^d)$. The distributional output provides both a "best guess" for $\mathbf{x}_*$, for example as the mean of $\mu_m$, and probabilistic uncertainty quantification for the value of $\mathbf{x}_*$ when it has not been exactly determined. Theoretical analysis is provided in the prototypical case of a stationary linear iterative method. In this setting we characterise both the rate of contraction of $\mu_m$ to an atomic measure on $\mathbf{x}_*$ and the nature of the uncertainty quantification being provided. We conclude with an empirical illustration that highlights the insight into solution uncertainty that can be provided by probabilistic iterative methods.
Submitted 11 January, 2021; v1 submitted 23 December, 2020;
originally announced December 2020.
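As a rough illustration of the setting in this abstract, the sketch below runs a classical stationary linear iteration (Jacobi) on a small system and then applies the same iteration to a cloud of random starting points. Treating the cloud as samples from an initial distribution that is pushed through the iteration is an assumption made here for illustration, not the paper's construction; the matrix, right-hand side, and all function names are hypothetical.

```python
import random

def jacobi_step(A, b, x):
    """One Jacobi update: x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]

def iterate(A, b, x0, num_iters):
    """Apply num_iters Jacobi updates starting from x0."""
    x = list(x0)
    for _ in range(num_iters):
        x = jacobi_step(A, b, x)
    return x

# Hypothetical, strictly diagonally dominant example with solution x_* = [1, 2].
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
x_star = [1.0, 2.0]

# Classical view: a single sequence of iterates converging to x_*.
x_final = iterate(A, b, [0.0, 0.0], 50)

# Assumed sample-based view: push an ensemble of starting points through
# the same iteration and watch its spread around x_* shrink.
def spread(points):
    return max(abs(p[0] - x_star[0]) + abs(p[1] - x_star[1]) for p in points)

rng = random.Random(0)
ensemble = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
spread_before = spread(ensemble)
ensemble = [iterate(A, b, p, 20) for p in ensemble]
spread_after = spread(ensemble)
```

For this particular matrix two Jacobi steps shrink every error by a factor of 10, so the ensemble collapses toward the solution geometrically, which is the kind of contraction the abstract refers to.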
-
Probabilistic Linear Solvers: A Unifying View
Authors:
Simon Bartels,
Jon Cockayne,
Ilse C. F. Ipsen,
Philipp Hennig
Abstract:
Several recent works have developed a new, probabilistic interpretation for numerical algorithms solving linear systems in which the solution is inferred in a Bayesian framework, either directly or by inferring the unknown action of the matrix inverse. These approaches have typically focused on replicating the behavior of the conjugate gradient method as a prototypical iterative method. In this work, surprisingly general conditions for equivalence of these disparate methods are presented. We also describe connections between probabilistic linear solvers and projection methods for linear systems, providing a probabilistic interpretation of a far more general class of iterative methods. In particular, this provides such an interpretation of the generalised minimum residual method. A probabilistic view of preconditioning is also introduced. These developments unify the literature on probabilistic linear solvers and provide foundational connections to the literature on iterative solvers for linear systems.
Submitted 17 October, 2018; v1 submitted 8 October, 2018;
originally announced October 2018.
-
A Projector-Based Approach to Quantifying Total and Excess Uncertainties for Sketched Linear Regression
Authors:
Jocelyn T. Chi,
Ilse C. F. Ipsen
Abstract:
Linear regression is a classic method of data analysis. In recent years, sketching -- a method of dimension reduction using random sampling, random projections, or both -- has gained popularity as an effective computational approximation when the number of observations greatly exceeds the number of variables. In this paper, we address the following question: How does sketching affect the statistical properties of the solution and key quantities derived from it?
To answer this question, we present a projector-based approach to sketched linear regression that is exact and that requires minimal assumptions on the sketching matrix. Therefore, downstream analyses hold exactly and generally for all sketching schemes. Additionally, a projector-based approach enables derivation of key quantities from classic linear regression that account for the combined model- and algorithm-induced uncertainties. We demonstrate the usefulness of a projector-based approach in quantifying and enabling insight on excess uncertainties and bias-variance decompositions for sketched linear regression. Finally, we demonstrate how the insights from our projector-based analyses can be used to produce practical sketching diagnostics to aid the design of judicious sketching schemes.
Submitted 3 August, 2020; v1 submitted 17 August, 2018;
originally announced August 2018.
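To make the notion of sketching in this abstract concrete, here is a minimal sketch of one sketching scheme (uniform row sampling without replacement) applied to a one-variable, no-intercept regression. It does not implement the paper's projector-based analysis; the synthetic data, the model, and the function names are assumptions chosen for illustration.

```python
import random

def least_squares_slope(xs, ys):
    """Least squares slope for the no-intercept model y = beta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sketched_slope(xs, ys, sketch_size, rng):
    """Sketch by uniform row sampling, then solve the smaller problem."""
    rows = rng.sample(range(len(xs)), sketch_size)
    return least_squares_slope([xs[i] for i in rows], [ys[i] for i in rows])

# Hypothetical synthetic data: many observations, one variable.
rng = random.Random(0)
n, beta_true = 10000, 2.0
xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
ys = [beta_true * x + rng.gauss(0.0, 0.1) for x in xs]

beta_full = least_squares_slope(xs, ys)         # uses all n rows
beta_sketch = sketched_slope(xs, ys, 500, rng)  # uses 500 sampled rows
```

The gap between `beta_sketch` and `beta_full` comes from the algorithm (the sampling), while the gap between `beta_full` and `beta_true` comes from the noise in the model; the abstract concerns the combination of these two sources of uncertainty.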
-
Efficient Computation of Gaussian Likelihoods for Stationary Markov Random Field Models
Authors:
Joseph Guinness,
Ilse C. F. Ipsen
Abstract:
Rue and Held (2005) proposed a method for efficiently computing the Gaussian likelihood for stationary Markov random field models, when the data locations fall on a complete regular grid, and the model has no additive error term. The calculations rely on the availability of the covariances. We prove a theorem giving the rate of convergence of a spectral method of computing the covariances, establishing that the error decays faster than any polynomial in the size of the computing grid. We extend the exact likelihood calculations to the case of non-rectangular domains and missing values on the interior of the grid and to the case when an additive uncorrelated error term (nugget) is present in the model. We also give an alternative formulation of the likelihood that has a smaller memory burden, parts of which can be computed in parallel. We show in simulations that using the exact likelihood can give far better parameter estimates than using standard Markov random field approximations. Having access to the exact likelihood allows for model comparisons via likelihood ratios on large datasets, so as an application of the methods, we compare several state-of-the-art methods for large spatial datasets on an aerosol optical thickness dataset. We find that simple block independent likelihood and composite likelihood methods outperform stochastic partial differential equation approximations both in computation time and in returning parameter estimates that nearly maximize the likelihood.
Submitted 12 December, 2019; v1 submitted 30 May, 2015;
originally announced June 2015.
-
Randomized Approximation of the Gram Matrix: Exact Computation and Probabilistic Bounds
Authors:
John T. Holodnak,
Ilse C. F. Ipsen
Abstract:
Given a real matrix $A$ with $n$ columns, the problem is to approximate the Gram product $AA^T$ by $c \ll n$ weighted outer products of columns of $A$. Necessary and sufficient conditions for the exact computation of $AA^T$ (in exact arithmetic) from $c \geq \mathrm{rank}(A)$ columns depend on the right singular vector matrix of $A$. For a Monte-Carlo matrix multiplication algorithm by Drineas et al. that samples outer products, we present probabilistic bounds for the 2-norm relative error due to randomization. The bounds depend on the stable rank or the rank of $A$, but not on the matrix dimensions. Numerical experiments illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension. We also derive bounds for the smallest singular value and the condition number of matrices obtained by sampling rows from orthonormal matrices.
Submitted 15 May, 2014; v1 submitted 5 October, 2013;
originally announced October 2013.
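The sampling idea in this abstract can be sketched as follows: draw c column indices with replacement and sum the corresponding outer products, each weighted by 1/(c p_j), so that the sum is an unbiased estimate of AA^T. The abstract does not state the sampling probabilities; the choice p_j proportional to the squared column norm is an assumption made here, and the test matrix and function names are hypothetical.

```python
import random

def gram(A):
    """Exact Gram product A A^T for a matrix stored as a list of rows."""
    m, n = len(A), len(A[0])
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
            for i in range(m)]

def sampled_gram(A, c, rng):
    """Approximate A A^T by c weighted outer products of sampled columns."""
    m, n = len(A), len(A[0])
    # Assumed probabilities: p_j = ||a_j||^2 / ||A||_F^2.
    col_sq_norms = [sum(A[i][j] ** 2 for i in range(m)) for j in range(n)]
    total = sum(col_sq_norms)
    probs = [s / total for s in col_sq_norms]
    picks = rng.choices(range(n), weights=probs, k=c)  # with replacement
    X = [[0.0] * m for _ in range(m)]
    for j in picks:
        w = 1.0 / (c * probs[j])  # weight that makes the estimate unbiased
        for r in range(m):
            for s in range(m):
                X[r][s] += w * A[r][j] * A[s][j]
    return X

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

rng = random.Random(0)
A = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(3)]  # 3 x 200
G = gram(A)
X = sampled_gram(A, 20, rng)

# A matrix with identical columns: every sampled outer product is the same,
# so the weighted sum reproduces the Gram product up to rounding.
B = [[1.0] * 5, [2.0] * 5]
X_B = sampled_gram(B, 3, rng)
G_B = gram(B)
```

With these probabilities each sampled term contributes exactly ||A||_F^2 / c to the trace, so `trace(X)` matches `trace(G)` up to rounding even though the individual entries of `X` are random.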