Combining Climate Models using Bayesian Regression Trees and Random Paths

Authors: John C. Yannotty, Thomas J. Santner, Bo Li, Matthew T. Pratola

Abstract: Climate models, also known as general circulation models (GCMs), are essential tools for climate studies. Each climate model may have varying accuracy across the input domain, but no single model is uniformly better than the others. One strategy for improving climate model prediction performance is to integrate multiple model outputs using input-dependent weights. Along with this concept, weight functions modeled using Bayesian Additive Regression Trees (BART) were recently shown to be useful for integrating multiple Effective Field Theories in nuclear physics applications. However, a restriction of this approach is that the weights could only be modeled as piecewise constant functions. To smoothly integrate multiple climate models, we propose a new tree-based model, Random Path BART (RPBART), that incorporates random path assignments into the BART model to produce smooth weight functions and smooth predictions of the physical system, all in a matrix-free formulation. The smoothness feature of RPBART requires a more complex prior specification, for which we introduce a semivariogram to guide its hyperparameter selection. This approach is easy to interpret, computationally cheap, and avoids an expensive cross-validation study. Finally, we propose a posterior projection technique to enable detailed analysis of the fitted posterior weight functions. This allows us to identify a sparse set of climate models that can largely recover the underlying system within a given spatial region, as well as to quantify model discrepancy within the model set under consideration. Our method is demonstrated on an ensemble of 8 GCMs modeling the average monthly surface temperature.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 52 pages, 18 figures

arXiv:2306.00361 [pdf, other]

Sharded Bayesian Additive Regression Trees

Authors: Hengrui Luo, Matthew T. Pratola

Abstract: In this paper we develop the randomized Sharded Bayesian Additive Regression Trees (SBT) model. We introduce a randomization auxiliary variable and a sharding tree to decide partitioning of data, and fit each partition component to a sub-model using Bayesian Additive Regression Trees (BART). By observing that the optimal design of a sharding tree can determine optimal sharding for sub-models on a product space, we introduce an intersection tree structure to completely specify both the sharding and modeling using only tree structures. In addition to experiments, we also derive the theoretical optimal weights for minimizing posterior contractions and prove the worst-case complexity of SBT.

Submitted 1 June, 2023; originally announced June 2023.

Comments: 46 pages, 10 figures (Appendix included)

MSC Class: 62F15; 62G08 ACM Class: G.3

arXiv:2304.03809 [pdf, other]

Estimating Shapley Effects in Big-Data Emulation and Regression Settings using Bayesian Additive Regression Trees

Authors: Akira Horiguchi, Matthew T. Pratola

Abstract: Shapley effects are a particularly interpretable approach to assessing how a function depends on its various inputs. The existing literature contains various estimators for this class of sensitivity indices in the context of nonparametric regression where the function is observed with noise, but there does not seem to be an estimator that is computationally tractable for input dimensions in the hundreds scale. This article provides such an estimator that is computationally tractable on this scale. The estimator uses a metamodel-based approach by first fitting a Bayesian Additive Regression Trees model which is then used to compute Shapley-effect estimates. This article also establishes a theoretical guarantee of posterior consistency on a large function class for this Shapley-effect estimator. Finally, this paper explores the performance of these Shapley-effect estimators on four different test functions for various input dimensions, including $p=500$.

Submitted 23 May, 2025; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: 32 pages, 11 figures, 2 tables

arXiv:2301.02296 [pdf, other]

Model Mixing Using Bayesian Additive Regression Trees

Authors: John C. Yannotty, Thomas J. Santner, Richard J. Furnstahl, Matthew T. Pratola

Abstract: In modern computer experiment applications, one often encounters the situation where various models of a physical system are considered, each implemented as a simulator on a computer. An important question in such a setting is determining the best simulator, or the best combination of simulators, to use for prediction and inference. Bayesian model averaging (BMA) and stacking are two statistical approaches used to account for model uncertainty by aggregating a set of predictions through a simple linear combination or weighted average. Bayesian model mixing (BMM) extends these ideas to capture the localized behavior of each simulator by defining input-dependent weights. One possibility is to define the relationship between inputs and the weight functions using a flexible non-parametric model that learns the local strengths and weaknesses of each simulator. This paper proposes a BMM model based on Bayesian Additive Regression Trees (BART). The proposed methodology is applied to combine predictions from Effective Field Theories (EFTs) associated with a motivating nuclear physics application.

Submitted 5 May, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: 33 pages, 6 figures, additional supplementary material can be found at https://github.com/jcyannotty/OpenBT

arXiv:2203.14102 [pdf, other]

Influential Observations in Bayesian Regression Tree Models

Authors: Matthew T. Pratola, Edward I. George, Robert E. McCulloch

Abstract: BCART (Bayesian Classification and Regression Trees) and BART (Bayesian Additive Regression Trees) are popular Bayesian regression models widely applicable in modern regression problems. Their popularity is intimately tied to the ability to flexibly model complex responses depending on high-dimensional inputs while simultaneously being able to quantify uncertainties. This ability to quantify uncertainties is key, as it allows researchers to perform appropriate inferential analyses in settings that have generally been too difficult to handle using the Bayesian approach. However, surprisingly little work has been done to evaluate the sensitivity of these modern regression models to violations of modeling assumptions. In particular, we will consider influential observations, which one would reasonably imagine to be common, or at least a concern, in the big-data setting. In this paper, we consider both the problem of detecting influential observations and adjusting predictions to not be unduly affected by such potentially problematic data. We consider three detection diagnostics for Bayesian tree models, one an analogue of Cook's distance and the others taking the form of a divergence measure and a conditional predictive density metric, and then propose an importance sampling algorithm to re-weight previously sampled posterior draws so as to remove the effects of influential data in a computationally efficient manner. Finally, our methods are demonstrated on real-world data where blind application of the models can lead to poor predictions and inference.

Submitted 17 May, 2023; v1 submitted 26 March, 2022; originally announced March 2022.

arXiv:2107.07313 [pdf, other]

doi 10.1080/00949655.2022.2119972

The Taxicab Sampler: MCMC for Discrete Spaces with Application to Tree Models

Authors: Vincent Geels, Matthew Pratola, Radu Herbei

Abstract: Motivated by the problem of exploring discrete but very complex state spaces in Bayesian models, we propose a novel Markov Chain Monte Carlo search algorithm: the taxicab sampler. We describe the construction of this sampler and discuss how its interpretation and usage differ from those of standard Metropolis-Hastings as well as the related Hamming ball sampler. The proposed sampling algorithm is then shown to demonstrate substantial improvement in computation time without any loss of efficiency relative to a naïve Metropolis-Hastings search in a motivating Bayesian regression tree count model, in which we leverage the discrete state space assumption to construct a novel likelihood function that allows for flexibly describing different mean-variance relationships while preserving parameter interpretability compared to existing likelihood functions for count data.

Submitted 16 February, 2022; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: Expanded simulation study example in Supplementary Materials and updated related Figure 2; updated Section 2 introduction and Section 2.1; added additional references in introduction section

arXiv:2101.02558 [pdf, other]

Using BART to Perform Pareto Optimization and Quantify its Uncertainties

Authors: Akira Horiguchi, Thomas J. Santner, Ying Sun, Matthew T. Pratola

Abstract: Techniques to reduce the energy burden of an industrial ecosystem often require solving a multiobjective optimization problem. However, collecting experimental data can often be either expensive or time-consuming. In such cases, statistical methods can be helpful. This article proposes Pareto Front (PF) and Pareto Set (PS) estimation methods using Bayesian Additive Regression Trees (BART), which is a non-parametric model whose assumptions are typically less restrictive than popular alternatives, such as Gaussian Processes (GPs). These less restrictive assumptions allow BART to handle scenarios (e.g. high-dimensional input spaces, nonsmooth responses, large datasets) that GPs find difficult. The performance of our BART-based method is compared to a GP-based method using analytic test functions, demonstrating convincing advantages. Finally, our BART-based methodology is applied to a motivating engineering problem. Supplementary materials, which include a theorem proof, algorithms, and R code, for this article are available online.

Submitted 3 September, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: 27 pages, 8 figures, submitted to Industry 4.0 special issue of Technometrics journal

arXiv:2005.13622 [pdf, other]

Assessing variable activity for Bayesian regression trees

Authors: Akira Horiguchi, Matthew T. Pratola, Thomas J. Santner

Abstract: Bayesian Additive Regression Trees (BART) are non-parametric models that can capture complex exogenous variable effects. In any regression problem, it is often of interest to learn which variables are most active. Variable activity in BART is usually measured by counting the number of times a tree splits for each variable. Such one-way counts have the advantage of fast computations. Despite their convenience, one-way counts have several issues. They are statistically unjustified, cannot distinguish between main effects and interaction effects, and become inflated when measuring interaction effects. An alternative method well-established in the literature is Sobol' indices, a variance-based global sensitivity analysis technique. However, these indices often require Monte Carlo integration, which can be computationally expensive. This paper provides analytic expressions for Sobol' indices for BART posterior samples. These expressions are easy to interpret and are computationally feasible. Furthermore, we will show a fascinating connection between first-order (main-effects) Sobol' indices and one-way counts. We also introduce a novel ranking method, and use this to demonstrate that the proposed indices preserve the Sobol'-based rank order of variable importance. Finally, we compare these methods using analytic test functions and the En-ROADS climate impacts simulator.

Submitted 14 September, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

Comments: 46 pages, 8 figures, submitted to the special issue "Recent Advances in Sensitivity Analysis of Model Outputs" in the Reliability Engineering and Safety System journal

arXiv:1904.09339 [pdf, other]

Continuous-Time Birth-Death MCMC for Bayesian Regression Tree Models

Authors: Reza Mohammadi, Matthew Pratola, Maurits Kaptein

Abstract: Decision trees are flexible models that are well suited for many statistical regression problems. In a Bayesian framework for regression trees, Markov Chain Monte Carlo (MCMC) search algorithms are required to generate samples of tree models according to their posterior probabilities. The critical component of such an MCMC algorithm is to construct good Metropolis-Hastings steps for updating the tree topology. However, such algorithms frequently suffer from local mode stickiness and poor mixing. As a result, the algorithms are slow to converge. Hitherto, authors have primarily used discrete-time birth/death mechanisms for Bayesian (sums of) regression tree models to explore the model space. These algorithms are efficient only if the acceptance rate is high, which is not always the case. Here we overcome this issue by developing a new search algorithm which is based on a continuous-time birth-death Markov process. This search algorithm explores the model space by jumping between parameter spaces corresponding to different tree structures. In the proposed algorithm, the moves between models are always accepted, which can dramatically improve the convergence and mixing properties of the MCMC algorithm. We provide theoretical support of the algorithm for Bayesian regression tree models and demonstrate its performance.

Submitted 26 October, 2020; v1 submitted 19 April, 2019; originally announced April 2019.

Comments: Published at http://jmlr.org/papers/v21/19-307 in the Journal of Machine Learning Research (https://www.jmlr.org)

Journal ref: Journal of Machine Learning Research 2020, Vol. 21, No. 201, 1-26

arXiv:1804.02089 [pdf, other]

Optimal Design Emulators: A Point Process Approach

Authors: Matthew T. Pratola, C. Devon Lin, Peter F. Craigmile

Abstract: Design of experiments is a fundamental topic in applied statistics with a long history. Yet its application is often limited by the complexity and costliness of constructing experimental designs, which involve searching a high-dimensional input space and evaluating computationally expensive criterion functions. In this work, we introduce a novel approach to the challenging design problem. We will take a probabilistic view of the problem by representing the optimal design as being one element (or a subset of elements) of a probability space. Given a suitable distribution on this space, a generative point process can be specified from which stochastic design realizations can be drawn. In particular, we describe a scenario where the classical entropy-optimal design for Gaussian Process regression coincides with the mode of a particular point process. We conclude by outlining an algorithm for drawing such design realizations, its extension to sequential designs, and applying the techniques developed to constructing designs for Stochastic Gradient Descent and Gaussian process regression.

Submitted 26 March, 2022; v1 submitted 5 April, 2018; originally announced April 2018.

arXiv:1709.07542 [pdf, other]

Heteroscedastic BART Using Multiplicative Regression Trees

Authors: Matthew Pratola, Hugh Chipman, Edward George, Robert McCulloch

Abstract: BART (Bayesian Additive Regression Trees) has become increasingly popular as a flexible and scalable nonparametric regression approach for modern applied statistics problems. For the practitioner dealing with large and complex nonlinear response surfaces, its advantages include a matrix-free formulation and the lack of a requirement to prespecify a confining regression basis. Although flexible in fitting the mean, BART has been limited by its reliance on a constant variance error model. This homoscedastic assumption is unrealistic in many applications. Alleviating this limitation, we propose HBART, a nonparametric heteroscedastic elaboration of BART. In BART, the mean function is modeled with a sum of trees, each of which determines an additive contribution to the mean. In HBART, the variance function is further modeled with a product of trees, each of which determines a multiplicative contribution to the variance. Like the mean model, this flexible, multidimensional variance model is entirely nonparametric with no need for the prespecification of a confining basis. Moreover, with this enhancement, HBART can provide insights into the potential relationships of the predictors with both the mean and the variance. Practical implementations of HBART with revealing new diagnostic plots are demonstrated with simulated and real data on used car prices, fishing catch production and alcohol consumption.

Submitted 9 July, 2018; v1 submitted 21 September, 2017; originally announced September 2017.

arXiv:1312.1895 [pdf]

Efficient Metropolis-Hastings Proposal Mechanisms for Bayesian Regression Tree Models

Authors: M. T. Pratola

Abstract: Bayesian regression trees are flexible non-parametric models that are well suited to many modern statistical regression problems. Many such tree models have been proposed, from the simple single-tree model to more complex tree ensembles. Their non-parametric formulation allows for effective and efficient modeling of datasets exhibiting complex non-linear relationships between the model predictors and observations. However, the mixing behavior of the Markov Chain Monte Carlo (MCMC) sampler is sometimes poor. This is because the proposals in the sampler are typically local alterations of the tree structure, such as the birth/death of leaf nodes, which does not allow for efficient traversal of the model space. This poor mixing can lead to inferential problems, such as under-representing uncertainty. In this paper, we develop novel proposal mechanisms for efficient sampling. The first is a rule perturbation proposal while the second we call tree rotation. The perturbation proposal can be seen as an efficient variation of the change proposal found in existing literature. The novel tree rotation proposal is simple to implement as it only requires local changes to the regression tree structure, yet it efficiently traverses disparate regions of the model space along contours of equal probability. When combined with the classical birth/death proposal, the resulting MCMC sampler exhibits good acceptance rates and properly represents model uncertainty in the posterior samples. We implement this sampling algorithm in the Bayesian Additive Regression Tree (BART) model and demonstrate its effectiveness on a prediction problem from computer experiments and a test function where structural tree variability is needed to fully explore the posterior.

Submitted 6 December, 2013; originally announced December 2013.

arXiv:1309.1906 [pdf]

Parallel Bayesian Additive Regression Trees

Authors: Matthew T. Pratola, Hugh A. Chipman, James R. Gattiker, David M. Higdon, Robert McCulloch, William N. Rust

Abstract: Bayesian Additive Regression Trees (BART) is a Bayesian approach to flexible non-linear regression which has been shown to be competitive with the best modern predictive methods such as those based on bagging and boosting. BART offers some advantages. For example, the stochastic search Markov Chain Monte Carlo (MCMC) algorithm can provide a more complete search of the model space and variation across MCMC draws can capture the level of uncertainty in the usual Bayesian way. The BART prior is robust in that reasonable results are typically obtained with a default prior specification. However, the publicly available implementation of the BART algorithm in the R package BayesTree is not fast enough to be considered interactive with over a thousand observations, and is unlikely to even run with 50,000 to 100,000 observations. In this paper we show how the BART algorithm may be modified and then computed using single program, multiple data (SPMD) parallel computation implemented using the Message Passing Interface (MPI) library. The approach scales nearly linearly in the number of processor cores, enabling the practitioner to perform statistical inference on massive datasets. Our approach can also handle datasets too massive to fit on any single data repository.

Submitted 7 September, 2013; originally announced September 2013.

arXiv:1204.3547 [pdf, other]

Computer Model Calibration using the Ensemble Kalman Filter

Authors: Dave Higdon, Matt Pratola, James Gattiker, Earl Lawrence, Salman Habib, Katrin Heitmann, Steve Price, Charles Jackson, Michael Tobis

Abstract: The ensemble Kalman filter (EnKF) (Evensen, 2009) has proven effective in quantifying uncertainty in a number of challenging dynamic, state estimation, or data assimilation, problems such as weather forecasting and ocean modeling. In these problems a high-dimensional state parameter is successively updated based on recurring physical observations, with the aid of a computationally demanding forward model that propagates the state from one time step to the next. More recently, the EnKF has proven effective in history matching in the petroleum engineering community (Evensen, 2009; Oliver and Chen, 2010). Such applications typically involve estimating large numbers of parameters, describing an oil reservoir, using data from production history that accumulate over time. Such history matching problems are especially challenging examples of computer model calibration since they involve a large number of model parameters as well as a computationally demanding forward model. More generally, computer model calibration combines physical observations with a computational model - a computer model - to estimate unknown parameters in the computer model. This paper explores how the EnKF can be used in computer model calibration problems, comparing it to other more common approaches, considering applications in climate and cosmology.

Submitted 23 April, 2012; v1 submitted 16 April, 2012; originally announced April 2012.

Comments: 20 pages; 11 figures

Report number: LA-UR-12-20660

Showing 1–14 of 14 results for author: Pratola, M