-
dynoGP: Deep Gaussian Processes for dynamic system identification
Authors:
Alessio Benavoli,
Dario Piga,
Marco Forgione,
Marco Zaffalon
Abstract:
In this work, we present a novel approach to system identification for dynamical systems, based on a specific class of Deep Gaussian Processes (Deep GPs). These models are constructed by interconnecting linear dynamic GPs (equivalent to stochastic linear time-invariant dynamical systems) and static GPs (to model static nonlinearities). Our approach combines the strengths of data-driven methods, such as those based on neural network architectures, with the ability to output a probability distribution. This offers a more comprehensive framework for system identification that includes uncertainty quantification. Using both simulated and real-world data, we demonstrate the effectiveness of the proposed approach.
Submitted 8 February, 2025;
originally announced February 2025.
-
A Note on Bayesian Networks with Latent Root Variables
Authors:
Marco Zaffalon,
Alessandro Antonucci
Abstract:
We characterise the likelihood function computed from a Bayesian network with latent variables as root nodes. We show that the marginal distribution over the remaining, manifest, variables also factorises as a Bayesian network, which we call empirical. A dataset of observations of the manifest variables allows us to quantify the parameters of the empirical Bayesian net. We prove that (i) the likelihood of such a dataset from the original Bayesian network is dominated by the global maximum of the likelihood from the empirical one; and that (ii) such a maximum is attained if and only if the parameters of the Bayesian network are consistent with those of the empirical model.
Submitted 26 February, 2024;
originally announced February 2024.
-
Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources
Authors:
Marco Zaffalon,
Alessandro Antonucci,
Rafael Cabañas,
David Huber
Abstract:
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to approximate the bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can address the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness of our approach, while hinting at the benefits of fusing heterogeneous data sources to get informative outcomes in case of partial identifiability.
Submitted 31 July, 2023;
originally announced July 2023.
-
Learning to Bound Counterfactual Inference from Observational, Biased and Randomised Data
Authors:
Marco Zaffalon,
Alessandro Antonucci,
David Huber,
Rafael Cabañas
Abstract:
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to compute approximate bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can solve the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness and accuracy of our approach, while hinting at the benefits of integrating heterogeneous data to get informative bounds in case of partial identifiability.
Submitted 16 March, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
Bounding Counterfactuals under Selection Bias
Authors:
Marco Zaffalon,
Alessandro Antonucci,
Rafael Cabañas,
David Huber,
Dario Azzimonti
Abstract:
Causal analysis may be affected by selection bias, which is defined as the systematic exclusion of data from a certain subpopulation. Previous work in this area focused on the derivation of identifiability conditions. We propose instead a first algorithm to address both identifiable and unidentifiable queries. We prove that, in spite of the missingness induced by the selection bias, the likelihood of the available data is unimodal. This enables us to use the causal expectation-maximisation scheme to obtain the values of causal queries in the identifiable case, and to compute bounds otherwise. Experiments demonstrate the approach to be practically viable. Theoretical convergence characterisations are provided.
Submitted 26 July, 2022;
originally announced August 2022.
-
Correlated Product of Experts for Sparse Gaussian Process Regression
Authors:
Manuel Schürch,
Dario Azzimonti,
Alessio Benavoli,
Marco Zaffalon
Abstract:
Gaussian processes (GPs) are an important tool in machine learning and statistics, with applications ranging from the social and natural sciences to engineering. They constitute a powerful kernelized non-parametric method with well-calibrated uncertainty estimates; however, off-the-shelf GP inference procedures are limited to datasets with several thousand data points because of their cubic computational complexity. For this reason, many sparse GP techniques have been developed over the past years. In this paper, we focus on GP regression tasks and propose a new approach based on aggregating predictions from several local and correlated experts. The degree of correlation between the experts can range from independent to fully correlated. The individual predictions of the experts are aggregated taking into account their correlation, resulting in consistent uncertainty estimates. Our method recovers the independent Product of Experts, sparse GP and full GP in the limiting cases. The presented framework can deal with a general kernel function and multiple variables, and has a time and space complexity that is linear in the number of experts and data samples, which makes our approach highly scalable. We demonstrate superior performance, in a time vs. accuracy sense, of our proposed method against state-of-the-art GP approximation methods on synthetic as well as several real-world datasets with deterministic and stochastic optimization.
Submitted 17 December, 2021;
originally announced December 2021.
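As a point of reference for the independent limiting case mentioned in the abstract above, the sketch below aggregates Gaussian predictions from independent experts by multiplying their densities, which is the standard independent Product of Experts rule. It is not the correlated aggregation proposed in the paper; the function name `poe_aggregate` and the toy numbers are assumptions made purely for illustration.

```python
import numpy as np

def poe_aggregate(means, variances):
    """Combine independent Gaussian expert predictions for one test point.

    Multiplying Gaussian densities adds their precisions; the aggregated
    mean is the precision-weighted average of the expert means.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    precisions = 1.0 / variances
    agg_var = 1.0 / precisions.sum()
    agg_mean = agg_var * (precisions * means).sum()
    return agg_mean, agg_var

# Two hypothetical experts that disagree on the mean but are equally confident.
mean, var = poe_aggregate([0.0, 2.0], [1.0, 1.0])
print(mean, var)
```

Because the experts are treated as independent, the aggregated variance shrinks with every expert added, even if the experts share information; accounting for correlation between experts is one way to avoid that overconfidence.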
-
Time series forecasting with Gaussian Processes needs priors
Authors:
Giorgio Corani,
Alessio Benavoli,
Marco Zaffalon
Abstract:
Automatic forecasting is the task of receiving a time series and returning a forecast for the next time steps without any human intervention. Gaussian Processes (GPs) are a powerful tool for modeling time series, but so far there are no competitive approaches for automatic forecasting based on GPs. We propose practical solutions to two problems: automatic selection of the optimal kernel and reliable estimation of the hyperparameters. We propose a fixed composition of kernels, which contains the components needed to model most time series: linear trend, periodic patterns, and another flexible kernel for modeling the non-linear trend. Not all components are necessary to model each time series; during training the unnecessary components are automatically made irrelevant via automatic relevance determination (ARD). We moreover assign priors to the hyperparameters, in order to keep the inference within a plausible range; we design such priors through an empirical Bayes approach. We present results on many time series of different types; our GP model is more accurate than state-of-the-art time series models. Thanks to the priors, a single restart is enough to estimate the hyperparameters; hence the model is also fast to train.
Submitted 21 June, 2021; v1 submitted 17 September, 2020;
originally announced September 2020.
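The fixed kernel composition described in the abstract above can be illustrated with a small additive kernel. This is only a sketch: the abstract does not state the exact kernels or parameterisation, so the choice of a linear, a periodic and an RBF component, all function names, and all hyperparameter values below are assumptions.

```python
import numpy as np

def linear_kernel(t1, t2, variance=1.0):
    # Models a linear trend.
    return variance * np.outer(t1, t2)

def periodic_kernel(t1, t2, variance=1.0, lengthscale=1.0, period=1.0):
    # Models repeating patterns with the given period.
    d = np.abs(t1[:, None] - t2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def rbf_kernel(t1, t2, variance=1.0, lengthscale=1.0):
    # Flexible component standing in for the non-linear trend.
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def composite_kernel(t1, t2, params):
    # Additive composition: each component has its own variance, so a
    # variance driven towards zero effectively removes that component.
    return (linear_kernel(t1, t2, **params["linear"])
            + periodic_kernel(t1, t2, **params["periodic"])
            + rbf_kernel(t1, t2, **params["rbf"]))

t = np.linspace(0.0, 2.0, 5)
params = {
    "linear": {"variance": 0.5},
    "periodic": {"variance": 1.0, "lengthscale": 1.0, "period": 1.0},
    "rbf": {"variance": 0.3, "lengthscale": 0.7},
}
K = composite_kernel(t, t, params)
print(K.shape)
```

In this additive form, a component whose variance is estimated close to zero contributes almost nothing to the covariance, which is one way to read the abstract's remark that unnecessary components are made irrelevant during training.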
-
Orthogonally Decoupled Variational Fourier Features
Authors:
Dario Azzimonti,
Manuel Schürch,
Alessio Benavoli,
Marco Zaffalon
Abstract:
Sparse inducing points have long been a standard method to fit Gaussian processes to big data. In the last few years, spectral methods that exploit approximations of the covariance kernel have been shown to be competitive. In this work we exploit a recently introduced orthogonally decoupled variational basis to combine spectral methods and sparse inducing point methods. We show that the method is competitive with the state of the art on synthetic and real-world data.
Submitted 13 July, 2020;
originally announced July 2020.
-
Reconciling Hierarchical Forecasts via Bayes' Rule
Authors:
Giorgio Corani,
Dario Azzimonti,
João P. S. C. Augusto,
Marco Zaffalon
Abstract:
We present a novel approach for reconciling hierarchical forecasts, based on Bayes' rule. We define a prior distribution for the bottom time series of the hierarchy, based on the bottom base forecasts. Then we update their distribution via Bayes' rule, based on the base forecasts for the upper time series. Under the Gaussian assumption, we derive the updating in closed form. We derive two algorithms, which differ in the independence assumptions they make. We discuss their relation with the MinT reconciliation algorithm and with the Kalman filter, and we compare them experimentally.
Submitted 22 June, 2020; v1 submitted 7 June, 2019;
originally announced June 2019.
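The closed-form Gaussian update mentioned in the abstract above can be sketched as a standard linear-Gaussian conditioning step. The modelling choice here is an assumption, not taken from the paper: the upper base forecasts are treated as noisy observations of sums of the bottom series, with a summing matrix `A`. The function name `reconcile` and all numbers are hypothetical.

```python
import numpy as np

def reconcile(b_hat, cov_b, u_hat, cov_u, A):
    """Update a Gaussian prior on the bottom series with upper forecasts.

    Prior:       b ~ N(b_hat, cov_b)
    Observation: u_hat = A b + noise, noise ~ N(0, cov_u)
    Returns the posterior mean and covariance of b.
    """
    S = A @ cov_b @ A.T + cov_u            # covariance of the innovation
    K = np.linalg.solve(S, A @ cov_b).T    # gain: cov_b A^T S^{-1}
    post_mean = b_hat + K @ (u_hat - A @ b_hat)
    post_cov = cov_b - K @ A @ cov_b
    return post_mean, post_cov

# Toy hierarchy: one total series that is the sum of two bottom series.
A = np.array([[1.0, 1.0]])
b_hat = np.array([10.0, 20.0])   # bottom base forecasts
cov_b = np.eye(2)                # assumed forecast error covariance (bottom)
u_hat = np.array([33.0])         # base forecast for the total
cov_u = np.array([[1.0]])        # assumed forecast error variance (total)

post_mean, post_cov = reconcile(b_hat, cov_b, u_hat, cov_u, A)
print(post_mean)
print(post_cov)
```

In this toy case the bottom forecasts sum to 30 while the total is forecast at 33, so the update moves the bottom forecasts from (10, 20) to (11, 21). The gain `K` has the same form as the Kalman filter gain, which is consistent with the connection the abstract points to.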
-
Recursive Estimation for Sparse Gaussian Process Regression
Authors:
Manuel Schürch,
Dario Azzimonti,
Alessio Benavoli,
Marco Zaffalon
Abstract:
Gaussian Processes (GPs) are powerful kernelized methods for non-parametric regression used in many applications. However, their use is limited to a few thousand training samples due to their cubic time complexity. In order to scale GPs to larger datasets, several sparse approximations based on so-called inducing points have been proposed in the literature. In this work we investigate the connection between a general class of sparse inducing point GP regression methods and Bayesian recursive estimation, which enables Kalman-filter-like updating for online learning. The majority of previous work has focused on the batch setting, in particular for learning the model parameters and the position of the inducing points; here we focus instead on training with mini-batches. By exploiting the Kalman filter formulation, we propose a novel approach that estimates such parameters by recursively propagating the analytical gradients of the posterior over mini-batches of the data. Compared to state-of-the-art methods, our method keeps analytic updates for the mean and covariance of the posterior, thus drastically reducing the size of the optimization problem. We show that our method achieves faster convergence and superior performance compared to state-of-the-art sequential Gaussian Process regression on synthetic GP as well as real-world data with up to a million data samples.
Submitted 22 June, 2020; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Hierarchical Multinomial-Dirichlet model for the estimation of conditional probability tables
Authors:
L. Azzimonti,
G. Corani,
M. Zaffalon
Abstract:
We present a novel approach for estimating conditional probability tables, based on a joint, rather than independent, estimate of the conditional distributions belonging to the same table. We derive exact analytical expressions for the estimators and we analyse their properties both analytically and via simulation. We then apply this method to the estimation of parameters in a Bayesian network. Given the structure of the network, the proposed approach better estimates the joint distribution and significantly improves the classification performance with respect to traditional approaches.
Submitted 23 August, 2017;
originally announced August 2017.
-
Entropy-based Pruning for Learning Bayesian Networks using BIC
Authors:
Cassio P. de Campos,
Mauro Scanagatta,
Giorgio Corani,
Marco Zaffalon
Abstract:
For decomposable score-based structure learning of Bayesian networks, existing approaches first compute a collection of candidate parent sets for each variable and then optimize over this collection by choosing one parent set for each variable without creating directed cycles while maximizing the total score. We target the task of constructing the collection of candidate parent sets when the score of choice is the Bayesian Information Criterion (BIC). We provide new non-trivial results that can be used to prune the search space of candidate parent sets of each node. We analyze how these new results relate to previous ideas in the literature both theoretically and empirically. We show in experiments with UCI data sets that gains can be significant. Since the new pruning rules are easy to implement and have low computational costs, they can be promptly integrated into all state-of-the-art methods for structure learning of Bayesian networks.
Submitted 19 July, 2017;
originally announced July 2017.
-
Statistical comparison of classifiers through Bayesian hierarchical modelling
Authors:
Giorgio Corani,
Alessio Benavoli,
Janez Demšar,
Francesca Mangili,
Marco Zaffalon
Abstract:
Usually one compares the accuracy of two competing classifiers via null hypothesis significance tests (NHST). Yet NHST suffers from important shortcomings, which can be overcome by switching to Bayesian hypothesis testing. We propose a Bayesian hierarchical model which jointly analyzes the cross-validation results obtained by two classifiers on multiple data sets. It returns the posterior probability of the accuracies of the two classifiers being practically equivalent or significantly different. A further strength of the hierarchical model is that, by jointly analyzing the results obtained on all data sets, it reduces the estimation error compared to the usual approach of averaging the cross-validation results obtained on a given data set.
Submitted 22 November, 2016; v1 submitted 28 September, 2016;
originally announced September 2016.
-
Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis
Authors:
Alessio Benavoli,
Giorgio Corani,
Janez Demsar,
Marco Zaffalon
Abstract:
The machine learning community adopted the use of null hypothesis significance testing (NHST) in order to ensure the statistical validity of results. Many scientific fields however realized the shortcomings of frequentist reasoning and in the most radical cases even banned its use in publications. We should do the same: just as we have embraced the Bayesian paradigm in the development of new machine learning methods, so we should also use it in the analysis of our own results. We argue for abandonment of NHST by exposing its fallacies and, more importantly, offer better - more sound and useful - alternatives for it.
Submitted 15 July, 2017; v1 submitted 14 June, 2016;
originally announced June 2016.
-
State Space representation of non-stationary Gaussian Processes
Authors:
Alessio Benavoli,
Marco Zaffalon
Abstract:
The state space (SS) representation of Gaussian processes (GPs) has recently gained a lot of interest. The main reason is that it allows GP-based inferences to be computed in $O(n)$ time, where $n$ is the number of observations. This implementation makes GPs suitable for Big Data. For this reason, it is important to provide an SS representation of the most important kernels used in machine learning. The aim of this paper is to show how to exploit the transient behaviour of SS models to map non-stationary kernels to SS models.
Submitted 7 January, 2016;
originally announced January 2016.
-
Imprecise Dirichlet Process with application to the hypothesis test on the probability that X<Y
Authors:
Alessio Benavoli,
Francesca Mangili,
Fabrizio Ruggeri,
Marco Zaffalon
Abstract:
The Dirichlet process (DP) is one of the most popular Bayesian nonparametric models. An open problem with the DP is how to choose its infinite dimensional parameter (base measure) in case of lack of prior information. In this work we present the Imprecise DP (IDP) -- a prior near-ignorance DP-based model that does not require any choice of this probability measure. It consists of a class of DPs obtained by letting the normalized base measure of the DP vary in the set of all probability measures. We discuss the tight connections of this approach with Bayesian robustness and in particular prior near-ignorance modeling via sets of probabilities. We use this model to perform a Bayesian hypothesis test on the probability P(X<Y). We study the theoretical properties of the IDP test (e.g., asymptotic consistency), and compare it with the frequentist Mann-Whitney-Wilcoxon rank test that is commonly employed as a test on P(X<Y). In particular we will show that our method is more robust, in the sense that it is able to isolate instances in which the aforementioned test is virtually guessing at random.
Submitted 20 February, 2014; v1 submitted 12 February, 2014;
originally announced February 2014.
-
Solving Limited Memory Influence Diagrams
Authors:
Denis Deratani Mauá,
Cassio Polpo de Campos,
Marco Zaffalon
Abstract:
We present a new algorithm for exactly solving decision making problems represented as influence diagrams. We do not require the usual assumptions of no forgetting and regularity; this allows us to solve problems with simultaneous decisions and limited information. The algorithm is empirically shown to outperform a state-of-the-art algorithm on randomly generated problems of up to 150 variables and $10^{64}$ solutions. We show that the problem is NP-hard even if the underlying graph structure of the problem has small treewidth and the variables take on a bounded number of states, but that a fully polynomial time approximation scheme exists for these cases. Moreover, we show that the bound on the number of states is a necessary condition for any efficient approximation scheme.
Submitted 9 September, 2011; v1 submitted 8 September, 2011;
originally announced September 2011.
-
Epistemic irrelevance in credal nets: the case of imprecise Markov trees
Authors:
Gert de Cooman,
Filip Hermans,
Alessandro Antonucci,
Marco Zaffalon
Abstract:
We focus on credal nets, which are graphical models that generalise Bayesian nets to imprecise probability. We replace the notion of strong independence commonly used in credal nets with the weaker notion of epistemic irrelevance, which is arguably more suited for a behavioural theory of probability. Focusing on directed trees, we show how to combine the given local uncertainty models in the nodes of the graph into a global model, and we use this to construct and justify an exact message-passing algorithm that computes updated beliefs for a variable in the tree. The algorithm, which is linear in the number of nodes, is formulated entirely in terms of coherent lower previsions, and is shown to satisfy a number of rationality requirements. We supply examples of the algorithm's operation, and report an application to on-line character recognition that illustrates the advantages of our approach for prediction. We comment on the perspectives, opened by the availability, for the first time, of a truly efficient algorithm based on epistemic irrelevance.
Submitted 15 August, 2010;
originally announced August 2010.