-
Scalable Fitting Methods for Multivariate Gaussian Additive Models with Covariate-dependent Covariance Matrices
Authors:
Vincenzo Gioia,
Matteo Fasiolo,
Ruggero Bellio,
Simon N. Wood
Abstract:
We propose efficient computational methods to fit multivariate Gaussian additive models, where the mean vector and the covariance matrix are allowed to vary with covariates, in an empirical Bayes framework. To guarantee the positive-definiteness of the covariance matrix, we model the elements of an unconstrained parametrisation matrix, focussing particularly on the modified Cholesky decomposition and the matrix logarithm. A key computational challenge arises from the fact that, for the model class considered here, the number of parameters increases quadratically with the dimension of the response vector. Hence, here we discuss how to achieve fast computation and low memory footprint in moderately high dimensions, by exploiting parsimonious model structures and sparse derivative systems, and by employing block-oriented computational methods. Methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM, while the code for reproducing the results in this paper is available at https://github.com/VinGioia90/SACM.
Submitted 4 April, 2025;
originally announced April 2025.
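The abstract above mentions an unconstrained parametrisation based on the modified Cholesky decomposition. As a minimal sketch, assuming the common form $Σ^{-1} = T^\top D^{-2} T$ with $T$ unit lower triangular and $D^2$ diagonal with entries parametrised on the log scale, the following shows why any real-valued parameter vector yields a positive-definite matrix. The function names are hypothetical and are not taken from the SCM package.

```python
import math

def n_linear_predictors(d):
    # Assumption: d mean elements plus d*(d+1)/2 covariance elements,
    # which grows quadratically in the response dimension d.
    return d + d * (d + 1) // 2

def precision_from_mcd(log_d2, t_lower):
    """Build the precision matrix P = T^T D^{-2} T.

    log_d2  : d unconstrained reals, the logs of the diagonal of D^2
    t_lower : d*(d-1)/2 unconstrained reals filling the strictly lower
              triangle of T row by row (T has ones on its diagonal)
    """
    d = len(log_d2)
    if len(t_lower) != d * (d - 1) // 2:
        raise ValueError("t_lower must have d*(d-1)/2 entries")
    # Unit lower-triangular factor T.
    T = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    pos = 0
    for i in range(1, d):
        for j in range(i):
            T[i][j] = t_lower[pos]
            pos += 1
    # exp() keeps every diagonal entry of D^{-2} strictly positive.
    inv_d2 = [math.exp(-v) for v in log_d2]
    # P[i][j] = sum_r T[r][i] * inv_d2[r] * T[r][j]
    return [
        [sum(T[r][i] * inv_d2[r] * T[r][j] for r in range(d)) for j in range(d)]
        for i in range(d)
    ]

P = precision_from_mcd([0.0, 0.0], [2.0])
print(P)
```

Because $T$ is invertible and $D^{-2}$ has positive diagonal entries, the product is symmetric positive definite for any input values, so no constraints need to be imposed when the parameters are modelled via covariates.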
-
When Composite Likelihood Meets Stochastic Approximation
Authors:
Giuseppe Alfonzetti,
Ruggero Bellio,
Yunxiao Chen,
Irini Moustaki
Abstract:
A composite likelihood is an inference function derived by multiplying a set of likelihood components. This approach provides a flexible framework for drawing inference when the likelihood function of a statistical model is computationally intractable. While composite likelihood has computational advantages, it can still be demanding when dealing with numerous likelihood components and a large sample size. This paper tackles this challenge by employing an approximation of the conventional composite likelihood estimator, which is derived from an optimization procedure relying on stochastic gradients. This novel estimator is shown to be asymptotically normally distributed around the true parameter. In particular, depending on the relative divergence rates of the sample size and the number of iterations of the optimization, the variance of the limiting distribution is shown to combine two sources of uncertainty: the sampling variability of the data and the optimization noise, with the latter depending on the sampling distribution used to construct the stochastic gradients. The advantages of the proposed framework are illustrated through simulation studies on two working examples: an Ising model for binary data and a gamma frailty model for count data. Finally, a real-data application is presented, showing its effectiveness in a large-scale mental health survey.
Submitted 9 December, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
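The idea of optimizing a composite likelihood with stochastic gradients, described in the abstract above, can be illustrated on a toy problem. The sketch below is an assumption-laden example and not the estimator of the paper: each observation has $p$ components, every component is given a working $N(θ, 1)$ marginal likelihood, and each iteration samples one observation and one component uniformly at random before taking a Robbins-Monro step of size $1/t$. The model, the sampling scheme and the function name are all hypothetical.

```python
import random

def sgd_composite_mean(data, n_iter, seed=0):
    """Stochastic-gradient maximisation of a toy composite log-likelihood.

    data   : list of observations, each a list of p component values
    n_iter : number of stochastic gradient iterations
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    theta = 0.0
    for t in range(1, n_iter + 1):
        i = rng.randrange(n)  # sample one observation
        k = rng.randrange(p)  # sample one likelihood component
        # Gradient in theta of log N(y_ik; theta, 1) is (y_ik - theta).
        grad = data[i][k] - theta
        theta += grad / t     # decreasing step size 1/t
    return theta

# Simulated data: 200 observations with 5 components each, true mean 2.
gen = random.Random(1)
data = [[2.0 + gen.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
grand_mean = sum(sum(row) for row in data) / (200 * 5)
print(sgd_composite_mean(data, n_iter=5000), grand_mean)
```

In this toy case the full composite likelihood estimator is the grand mean, and the stochastic estimate differs from it only through optimization noise, which shrinks as the number of iterations grows. This mirrors, in a very simplified way, the two sources of uncertainty named in the abstract.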
-
Consistent and Scalable Composite Likelihood Estimation of Probit Models with Crossed Random Effects
Authors:
Ruggero Bellio,
Swarnadip Ghosh,
Art B. Owen,
Cristiano Varin
Abstract:
Estimation of crossed random effects models commonly requires computational costs that grow faster than linearly in the sample size $N$, often as fast as $\Omega(N^{3/2})$, making them unsuitable for large data sets. For non-Gaussian responses, integrating out the random effects to get a marginal likelihood brings significant challenges, especially for high dimensional integrals where the Laplace approximation might not be accurate. We develop a composite likelihood approach to probit models that replaces the crossed random effects model with some hierarchical models that require only one-dimensional integrals. We show how to consistently estimate the crossed effects model parameters from the hierarchical model fits. We find that the computation scales linearly in the sample size. We illustrate the method on about five million observations from Stitch Fix where the crossed effects formulation would require an integral of dimension larger than $700{,}000$.
Submitted 29 April, 2025; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain
Authors:
V. Gioia,
M. Fasiolo,
J. Browell,
R. Bellio
Abstract:
Forecasts of regional electricity net-demand, consumption minus embedded generation, are an essential input for reliable and economic power system operation, and energy trading. While such forecasts are typically performed region by region, operations such as managing power flows require spatially coherent joint forecasts, which account for cross-regional dependencies. Here, we forecast the joint distribution of net-demand across the 14 regions constituting Great Britain's electricity network. Joint modelling is complicated by the fact that the net-demand variability within each region, and the dependencies between regions, vary with temporal, socio-economic and weather-related factors. We accommodate these characteristics by proposing a multivariate Gaussian model based on a modified Cholesky parametrisation, which allows us to model each unconstrained parameter via an additive model. Given that the number of model parameters and covariates is large, we adopt a semi-automated approach to model selection, based on gradient boosting. In addition to comparing the forecasting performance of several versions of the proposed model with that of two non-Gaussian copula-based models, we visually explore the model output to interpret how the covariates affect net-demand variability and dependencies.
The code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.7315105, while methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM.
Submitted 17 April, 2024; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Separable spatio-temporal kriging for fast virtual sensing
Authors:
M. Lambardi di San Miniato,
R. Bellio,
L. Grassetti,
P. Vidoni
Abstract:
Environmental monitoring is a task that requires approximating system-wide information from limited sensor readings. Under the proximity principle, an environmental monitoring system can be based on the virtual sensing logic and then rely on distance-based prediction methods, such as $k$-nearest-neighbors, inverse distance weighted regression and spatio-temporal kriging. The last one is cumbersome with large datasets, but we show that a suitable separability assumption reduces its computational cost to a greater extent than previously considered. Only spatial interpolation needs to be performed in a centralized way, while forecasting can be delegated to each sensor. This simplification is mostly related to the fact that two separate models are involved, one in the time domain and one in the space domain. Either of the two models can be replaced without re-estimating the other under a composite likelihood approach. Moreover, the use of convenient spatial and temporal models eases computation. We show that this perspective on kriging makes virtual sensing feasible even in the case of tall datasets.
Submitted 31 March, 2022;
originally announced March 2022.
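The computational benefit of a separability assumption, mentioned in the abstract above, can be seen from standard Kronecker product identities. The block below is a generic illustration and not necessarily the derivation used in the paper; the symbols $Σ_T$ (an $n_T \times n_T$ temporal covariance), $Σ_S$ (an $n_S \times n_S$ spatial covariance) and the ordering of the Kronecker factors are assumptions introduced here.

```latex
\[
\Sigma = \Sigma_T \otimes \Sigma_S
\quad\Longrightarrow\quad
\Sigma^{-1} = \Sigma_T^{-1} \otimes \Sigma_S^{-1},
\qquad
\det \Sigma = (\det \Sigma_T)^{n_S}\,(\det \Sigma_S)^{n_T}.
\]
```

Under these identities, dense factorisations are only needed for the two small matrices, at a cost of order $n_T^3 + n_S^3$, rather than for the full $n_T n_S \times n_T n_S$ covariance, at a cost of order $n_T^3 n_S^3$.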
-
Bayesian Multi-study Factor Analysis for High-throughput Biological Data
Authors:
Roberta De Vito,
Ruggero Bellio,
Lorenzo Trippa,
Giovanni Parmigiani
Abstract:
This paper presents a new modeling strategy for joint unsupervised analysis of multiple high-throughput biological studies. As in Multi-study Factor Analysis, our goals are to identify both common factors shared across studies and study-specific factors. Our approach is motivated by the growing body of high-throughput studies in biomedical research, as exemplified by the comprehensive set of expression data on breast tumors considered in our case study. To handle high-dimensional studies, we extend Multi-study Factor Analysis using a Bayesian approach that imposes sparsity. Specifically, we generalize the sparse Bayesian infinite factor model to multiple studies. We also devise novel solutions for the identification of the loading matrices: we recover the loading matrices of interest ex-post, by adapting the orthogonal Procrustes approach. Computationally, we propose an efficient and fast Gibbs sampling approach. Through an extensive simulation analysis, we show that the proposed approach performs very well in a range of different scenarios, and outperforms standard factor analysis in all the scenarios, identifying replicable signal in unsupervised genomic applications. Our analysis of breast cancer gene expression across seven studies identified replicable gene patterns, clearly related to well-known breast cancer pathways. An R package is implemented and available on GitHub.
Submitted 26 June, 2018;
originally announced June 2018.
-
Multi-study Factor Analysis
Authors:
Roberta De Vito,
Ruggero Bellio,
Lorenzo Trippa,
Giovanni Parmigiani
Abstract:
We introduce a novel class of factor analysis methodologies for the joint analysis of multiple studies. The goal is to separately identify and estimate 1) common factors shared across multiple studies, and 2) study-specific factors. We develop a fast Expectation Conditional-Maximization algorithm for parameter estimation and we provide a procedure for choosing the common and specific factors. We present simulations evaluating the performance of the method and we illustrate it by applying it to gene expression data in ovarian cancer. In both cases, we clarify the benefits of a joint analysis compared to the standard factor analysis. We hope to have provided a valuable tool to accelerate the pace at which we can combine unsupervised analysis across multiple studies, and understand the cross-study reproducibility of signal in multivariate data.
Submitted 26 June, 2018; v1 submitted 19 November, 2016;
originally announced November 2016.