-
Evaluation of Model-Based PM$_{2.5}$ Estimates for Exposure Assessment During Wildfire Smoke Episodes in the Western U.S.
Authors:
Ellen M. Considine,
Jiayuan Hao,
Priyanka deSouza,
Danielle Braun,
Colleen E. Reid,
Rachel C. Nethery
Abstract:
Investigating the health impacts of wildfire smoke requires data on people's exposure to fine particulate matter (PM$_{2.5}$) across space and time. In recent years, it has become common to use machine learning models to fill gaps in monitoring data. However, it remains unclear how well these models capture spikes in PM$_{2.5}$ during and across wildfire events. Here, we evaluate the accuracy of two high-coverage, high-resolution machine learning-derived PM$_{2.5}$ data sets created by Di et al. (2021) and Reid et al. (2021). In general, the Reid estimates are more accurate than the Di estimates when compared to independent validation data from mobile smoke monitors deployed by the US Forest Service. However, both models tend to severely under-predict PM$_{2.5}$ on high-pollution days. Our findings complement other recent studies calling for increased air pollution monitoring in the western US and support the inclusion of wildfire-specific monitoring observations and predictor variables in model-based estimates of PM$_{2.5}$. Lastly, we call for more rigorous error quantification of machine learning-derived exposure data sets, with special attention to extreme events.
Submitted 9 January, 2023; v1 submitted 3 September, 2022;
originally announced September 2022.
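The kind of error quantification this abstract calls for can be illustrated with a minimal sketch that reports RMSE and mean bias separately for low- and high-concentration observations, so that under-prediction on smoky days is not hidden by the more numerous clean days. This is not the authors' code: the function names, the 35 µg/m³ cutoff, and the sample values are all assumptions made for illustration.

```python
import math


def error_summary(observed, predicted):
    """Return count, RMSE, and mean bias (predicted minus observed)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    bias = sum(errors) / len(errors)
    return {"n": len(errors), "rmse": rmse, "bias": bias}


def stratified_error(observed, predicted, threshold):
    """Summarize error for observations below and at/above a threshold."""
    pairs = list(zip(observed, predicted))
    strata = {
        "low": [(o, p) for o, p in pairs if o < threshold],
        "high": [(o, p) for o, p in pairs if o >= threshold],
    }
    summary = {}
    for name, members in strata.items():
        if members:  # skip empty strata instead of dividing by zero
            obs, pred = zip(*members)
            summary[name] = error_summary(list(obs), list(pred))
    return summary


# Made-up monitor observations and model estimates (µg/m^3), not paper data.
obs = [5.0, 8.0, 12.0, 60.0, 150.0]
pred = [6.0, 7.0, 13.0, 40.0, 90.0]

# The 35.0 cutoff is an illustrative assumption, not a value from the paper.
result = stratified_error(obs, pred, threshold=35.0)
for name, stats in result.items():
    print(name, stats)
```

A negative bias in the "high" stratum alongside a near-zero bias in the "low" stratum would be the pattern the abstract describes as under-prediction on high-pollution days.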
-
Treeging
Authors:
Gregory L. Watson,
Michael Jerrett,
Colleen E. Reid,
Donatello Telesca
Abstract:
Treeging combines the flexible mean structure of regression trees with the covariance-based prediction strategy of kriging into the base learner of an ensemble prediction algorithm. In so doing, it combines the strengths of the two primary types of spatial and space-time prediction models: (1) models with flexible mean structures (often machine learning algorithms) that assume independently distributed data, and (2) kriging or Gaussian Process (GP) prediction models with rich covariance structures but simple mean structures. We investigate the predictive accuracy of treeging across a thorough and widely varied battery of spatial and space-time simulation scenarios, comparing it to ordinary kriging, random forest, and ensembles of ordinary kriging base learners. Treeging performs well across the board, whereas kriging suffers when dependence is weak or in the presence of spurious covariates, and random forest suffers when the covariates are less informative. Treeging also outperforms these competitors in predicting atmospheric pollutants (ozone and PM$_{2.5}$) in several case studies. We examine sensitivity to tuning parameters (number of base learners and training data sampling proportion), finding they follow the familiar intuition of their random forest counterparts. We include a discussion of scalability, noting that any covariance approximation techniques that expedite kriging (GP) may be similarly applied to expedite treeging.
Submitted 3 October, 2021;
originally announced October 2021.
-
Prediction & Model Evaluation for Space-Time Data
Authors:
Gregory L. Watson,
Colleen E. Reid,
Michael Jerrett,
Donatello Telesca
Abstract:
Evaluation metrics for prediction error, model selection and model averaging on space-time data are understudied and poorly understood. The absence of independent replication makes prediction ambiguous as a concept and renders evaluation procedures developed for independent data inappropriate for most space-time prediction problems. Motivated by air pollution data collected during California wildfires in 2008, this manuscript attempts a formalization of the true prediction error associated with spatial interpolation. We investigate a variety of cross-validation (CV) procedures employing both simulations and case studies to provide insight into the nature of the estimand targeted by alternative data partition strategies. Consistent with recent best practice, we find that location-based cross-validation is appropriate for estimating spatial interpolation error as in our analysis of the California wildfire data. Interestingly, commonly held notions of bias-variance trade-off of CV fold size do not trivially apply to dependent data, and we recommend leave-one-location-out (LOLO) CV as the preferred prediction error metric for spatial interpolation.
Submitted 4 November, 2022; v1 submitted 27 December, 2020;
originally announced December 2020.
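Leave-one-location-out (LOLO) cross-validation, as recommended in the abstract above, can be sketched as follows: every observation from one monitoring location is held out together, the model is fit on the remaining locations, and the errors on the held-out location are recorded. This is a minimal illustration rather than the authors' implementation. The names `lolo_splits`, `lolo_rmse`, and `mean_predictor` are hypothetical, the training-mean predictor is only a placeholder model, and pooling squared errors across locations (rather than averaging per-location scores) is an assumption.

```python
import math


def lolo_splits(locations):
    """Yield (held_out_location, train_idx, test_idx) per unique location."""
    for held_out in sorted(set(locations)):
        test_idx = [i for i, loc in enumerate(locations) if loc == held_out]
        train_idx = [i for i, loc in enumerate(locations) if loc != held_out]
        yield held_out, train_idx, test_idx


def lolo_rmse(locations, y, fit_predict):
    """Pool squared errors over all held-out locations and return the RMSE.

    fit_predict(train_idx, test_idx) must return one prediction per test
    index, using only the training rows to fit.
    """
    squared_errors = []
    for _, train_idx, test_idx in lolo_splits(locations):
        preds = fit_predict(train_idx, test_idx)
        squared_errors.extend((y[i] - p) ** 2 for i, p in zip(test_idx, preds))
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up data: five observations at three monitoring locations.
locations = ["A", "A", "B", "B", "C"]
y = [1.0, 3.0, 5.0, 7.0, 9.0]


def mean_predictor(train_idx, test_idx):
    """Placeholder model: predict the training mean at every test row."""
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    return [train_mean for _ in test_idx]


splits = list(lolo_splits(locations))
rmse = lolo_rmse(locations, y, mean_predictor)
print(f"LOLO CV RMSE: {rmse:.4f}")
```

Because no observation from the held-out location appears in the training rows, the resulting error reflects prediction at an unmonitored site, which is the spatial interpolation setting the abstract targets.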