-
An Anytime Valid Test for Complete Spatial Randomness
Authors:
Vaidehi Dixit,
Christopher K. Wikle,
Scott H. Holan
Abstract:
A relevant question when analyzing spatial point patterns is that of spatial randomness. More specifically, before any model can be fit to a point pattern, a first step is to test the data for departures from complete spatial randomness (CSR). Traditional techniques employ distance-based or quadrat-count-based methods to test for CSR based on batched data. In this paper, we consider the practical scenario of testing for CSR when the data are available sequentially (i.e., online). We present a sequential testing methodology called the PRe-process that is based on e-values and is a fast, efficient and nonparametric method. Simulation experiments with the truth departing from CSR in two different scenarios show that the method is effective in capturing inhomogeneity over time. Two real data illustrations considering lung cancer cases in the Chorley-Ribble area, England from 1974 to 1983 and locations of earthquakes in the state of Oklahoma, USA from 2000 to 2011 demonstrate the utility of the PRe-process in sequential testing of CSR.
Submitted 3 April, 2025;
originally announced April 2025.
-
Inference for Log-Gaussian Cox Point Processes using Bayesian Deep Learning: Application to Human Oral Microbiome Image Data
Authors:
Shuwan Wang,
Christopher K. Wikle,
Athanasios C. Micheas,
Jessica L. Mark Welch,
Jacqueline R. Starr,
Kyu Ha Lee
Abstract:
It is common in nature to see aggregation of objects in space. Exploring the mechanism associated with the locations of such clustered observations can be essential to understanding the phenomenon, such as the source of spatial heterogeneity, or comparison to other event generating processes in the same domain. Log-Gaussian Cox processes (LGCPs) represent an important class of models for quantifying aggregation in a spatial point pattern. However, implementing likelihood-based Bayesian inference for such models presents many computational challenges, particularly in high dimensions. In this paper, we propose a novel likelihood-free inference approach for LGCPs using the recently developed BayesFlow approach, where invertible neural networks are employed to approximate the posterior distribution of the parameters of interest. BayesFlow is a neural simulation-based method based on "amortized" posterior estimation. That is, after an initial training procedure, fast feed-forward operations allow rapid posterior inference for any data within the same model family. Comprehensive numerical studies validate the reliability of the framework and show that BayesFlow achieves substantial computational gain in repeated application, especially for two-dimensional LGCPs. We demonstrate the utility and robustness of the method by applying it to two distinct oral microbial biofilm images.
Submitted 18 March, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Capturing Extreme Events in Turbulence using an Extreme Variational Autoencoder (xVAE)
Authors:
Likun Zhang,
Kiran Bhaganagar,
Christopher K. Wikle
Abstract:
Turbulent flow fields are characterized by extreme events that are statistically intermittent and carry a significant amount of energy and physical importance. To emulate these flows, we introduce the extreme variational Autoencoder (xVAE), which embeds a max-infinitely divisible process with heavy-tailed distributions into a standard VAE framework, enabling accurate modeling of extreme events. xVAEs are neural network models that reduce system dimensionality by learning non-linear latent representations of data. We demonstrate the effectiveness of xVAE in large-eddy simulation data of wildland fire plumes, where intense heat release and complex plume-atmosphere interactions generate extreme turbulence. Comparisons with the commonly used Proper Orthogonal Decomposition (POD) modes show that xVAE is more robust in capturing extreme values and provides a powerful uncertainty quantification framework using variational Bayes. Additionally, xVAE enables analysis of the so-called copulas of fields to assess risks associated with rare events while rigorously accounting for uncertainty, such as simultaneous exceedances of high thresholds across multiple locations. The proposed approach provides a new direction for studying realistic turbulent flows, such as high-speed aerodynamics, space propulsion, and atmospheric and oceanic systems that are characterized by extreme events.
Submitted 7 February, 2025;
originally announced February 2025.
-
Echo State Networks for Spatio-Temporal Area-Level Data
Authors:
Zhenhua Wang,
Scott H. Holan,
Christopher K. Wikle
Abstract:
Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can be extremely useful for policymakers to develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model's computational efficiency during training. We demonstrate the effectiveness of our approach using Eurostat's tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts.
Submitted 14 October, 2024;
originally announced October 2024.
-
Incorporating Asymmetric Loss for Real Estate Prediction with Area-level Spatial Data
Authors:
Vaidehi Dixit,
Scott H. Holan,
Christopher K. Wikle
Abstract:
We investigate two asymmetric loss functions, namely LINEX loss and power divergence loss for optimal spatial prediction with area-level data. With our motivation arising from the real estate industry, namely in real estate valuation, we use the Zillow Home Value Index (ZHVI) for county-level values to show the change in prediction when the loss is different (asymmetric) from a traditional squared error loss (symmetric) function. Additionally, we discuss the importance of choosing the asymmetry parameter, and propose a solution to this choice for a general asymmetric loss function. Since the focus is on area-level data predictions, we propose the methodology in the context of conditionally autoregressive (CAR) models. We conclude that choice of the loss functions for spatial area-level predictions can play a crucial role, and is heavily driven by the choice of parameters in the respective loss.
Submitted 12 October, 2024;
originally announced October 2024.
-
Uncertainty-enabled machine learning for emulation of regional sea-level change caused by the Antarctic Ice Sheet
Authors:
Myungsoo Yoo,
Giri Gopalan,
Matthew J. Hoffman,
Sophie Coulson,
Holly Kyeore Han,
Christopher K. Wikle,
Trevor Hillebrand
Abstract:
Projecting sea-level change in various climate-change scenarios typically involves running forward simulations of the Earth's gravitational, rotational and deformational (GRD) response to ice mass change, which requires high computational cost and time. Here we build neural-network emulators of sea-level change at 27 coastal locations, due to the GRD effects associated with future Antarctic Ice Sheet mass change over the 21st century. The emulators are based on datasets produced using a numerical solver for the static sea-level equation and published ISMIP6-2100 ice-sheet model simulations referenced in the IPCC AR6 report. We show that the neural-network emulators have an accuracy that is competitive with baseline machine learning emulators. In order to quantify uncertainty, we derive well-calibrated prediction intervals for simulated sea-level change via a linear regression postprocessing technique that uses (nonlinear) machine learning model outputs, a technique that has previously been applied to numerical climate models. We also demonstrate substantial gains in computational efficiency: a feedforward neural-network emulator exhibits on the order of 100 times speedup in comparison to the numerical sea-level equation solver that is used for training.
Submitted 21 June, 2024;
originally announced June 2024.
-
A Criterion for Aggregation Error for Multivariate Spatial Data
Authors:
Ranadeep Daw,
Jonathan R. Bradley,
Christopher K. Wikle,
Scott H. Holan
Abstract:
The criterion for aggregation error (CAGE) is an important metric that aims to measure errors that arise in multiscale (or multi-resolution) spatial data, referred to as the modifiable areal unit problem (MAUP) and the ecological fallacy. Specifically, CAGE is a measure of between-scale variance of eigenvectors in a Karhunen-Loève expansion (KLE), motivated by a theoretical result, referred to as the "null-MAUP-theorem," that states that the MAUP/ecological fallacy are not present when this variance is zero. CAGE was originally developed for univariate spatial data, but it has been applied to multivariate spatial data without the development of a null-MAUP-theorem in the multivariate spatial setting. To fill this gap, we provide theoretical justification for a multivariate CAGE (MVCAGE), which includes multiscale multivariate extensions of the KLE, Mercer's theorem, and the null-MAUP-theorem. Additionally, we provide technical results that demonstrate that the MVCAGE is preferable to spatial-only CAGE, and extend basis functions commonly used to compute CAGE to the multivariate spatial setting. Empirical results are provided to demonstrate the use of MVCAGE for uncertainty quantification and regionalization.
Submitted 10 February, 2025; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Calibrated Forecasts of Quasi-Periodic Climate Processes with Deep Echo State Networks and Penalized Quantile Regression
Authors:
Matthew Bonas,
Christopher K. Wikle,
Stefano Castruccio
Abstract:
Among the most relevant processes in the Earth system for human habitability are quasi-periodic, ocean-driven multi-year events whose dynamics are currently incompletely characterized by physical models, and hence poorly predictable. This work aims at showing how 1) data-driven, stochastic machine learning approaches provide an affordable yet flexible means to forecast these processes; 2) the associated uncertainty can be properly calibrated with fast ensemble-based approaches. While the methodology introduced and discussed in this work pertains to synoptic scale events, the principle of augmenting incomplete or highly sensitive physical systems with data-driven models to improve predictability is far more general and can be extended to environmental problems of any scale in time or space.
Submitted 8 August, 2023;
originally announced August 2023.
-
Flexible and efficient emulation of spatial extremes processes via variational autoencoders
Authors:
Likun Zhang,
Xiaoyu Ma,
Christopher K. Wikle,
Raphaël Huser
Abstract:
Many real-world processes have complex tail dependence structures that cannot be characterized using classical Gaussian processes. More flexible spatial extremes models exhibit appealing extremal dependence properties but are often exceedingly prohibitive to fit and simulate from in high dimensions. In this paper, we aim to push the boundaries on computation and modeling of high-dimensional spatial extremes via integrating a new spatial extremes model that has flexible and non-stationary dependence properties in the encoding-decoding structure of a variational autoencoder called the XVAE. The XVAE can emulate spatial observations and produce outputs that have the same statistical properties as the inputs, especially in the tail. Our approach also provides a novel way of making fast inference with complex extreme-value processes. Through extensive simulation studies, we show that our XVAE is substantially more time-efficient than traditional Bayesian inference while outperforming many spatial extremes models with a stationary dependence structure. Lastly, we analyze a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea, which includes 30 years of daily measurements at 16703 grid cells. We demonstrate how to use XVAE to identify regions susceptible to marine heatwaves under climate change and examine the spatial and temporal variability of the extremal dependence structure.
Submitted 18 December, 2024; v1 submitted 16 July, 2023;
originally announced July 2023.
-
Bayesian Ensemble Echo State Networks for Enhancing Binary Stochastic Cellular Automata
Authors:
Nicholas Grieshop,
Christopher K. Wikle
Abstract:
Binary spatio-temporal data are common in many application areas. Such data can be considered from many perspectives, including via deterministic or stochastic cellular automata, where local rules govern the transition probabilities that describe the evolution of the 0 and 1 states across space and time. One implementation of a stochastic cellular automaton for such data is with a spatio-temporal generalized linear model (or mixed model), with the local rule covariates being included in the transformed mean response. However, in real-world applications, we seldom have a complete understanding of the local rules and it is helpful to augment the transformed linear predictor with a latent spatio-temporal dynamic process. Here, we demonstrate for the first time that an echo state network (ESN) latent process can be used to enhance the local rule covariates. We implement this in a hierarchical Bayesian framework with regularized horseshoe priors on the ESN output weight matrices, which extends the ESN literature as well. Finally, we gain added expressiveness from the ESNs by considering an ensemble of ESN reservoirs, which we accommodate through model averaging. This is also new to the ESN literature. We demonstrate our methodology on a simulated process in which we assume we do not know all of the local CA rules, as well as a fire evolution data set, and data describing the spread of raccoon rabies in Connecticut, USA.
Submitted 7 June, 2023;
originally announced June 2023.
-
Data-Driven Modeling of Wildfire Spread with Stochastic Cellular Automata and Latent Spatio-Temporal Dynamics
Authors:
Nicholas Grieshop,
Christopher K. Wikle
Abstract:
We propose a Bayesian stochastic cellular automata modeling approach to model the spread of wildfires with uncertainty quantification. The model considers a dynamic neighborhood structure that allows neighbor states to inform transition probabilities in a multistate categorical model. Additional spatial information is captured by the use of a temporally evolving latent spatio-temporal dynamic process linked to the original spatial domain by spatial basis functions. The Bayesian construction allows for uncertainty quantification associated with each of the predicted fire states. The approach is applied to a heavily instrumented controlled burn.
Submitted 5 June, 2023;
originally announced June 2023.
-
Using Echo State Networks to Inform Physical Models for Fire Front Propagation
Authors:
Myungsoo Yoo,
Christopher K. Wikle
Abstract:
Wildfires can be devastating, causing significant damage to property, ecosystem disruption, and loss of life. Forecasting the evolution of wildfire boundaries is essential to real-time wildfire management. To this end, substantial attention in the wildfire literature has focused on the level set method, which effectively represents complicated boundaries and their change over time. Nevertheless, most of these approaches rely on heavily parameterized formulas for spread and fail to account for the uncertainty in the forecast. The rapid evolution of large wildfires and inhomogeneous environmental conditions across the domain of interest (e.g., varying land cover, fire-induced winds) give rise to a need for a model that enables efficient data-driven learning of fire spread and allows uncertainty quantification. Here, we present a novel hybrid model that nests an echo state network to learn nonlinear spatio-temporal evolving velocities (speed in the normal direction) within a physically-based level set model framework. This model is computationally efficient and includes calibrated uncertainty quantification. We show the forecasting performance of our model with simulations and two real data sets - the Haypress and Thomas megafires that started in California (USA) in 2017.
Submitted 9 February, 2023;
originally announced February 2023.
-
Bayesian Hierarchical Models For Multi-type Survey Data Using Spatially Correlated Covariates Measured With Error
Authors:
Saikat Nandy,
Scott H. Holan,
Jonathan R. Bradley,
Christopher K. Wikle
Abstract:
We introduce Bayesian hierarchical models for predicting high-dimensional tabular survey data which can be distributed from one or multiple classes of distributions (e.g., Gaussian, Poisson, Binomial, etc.). We adopt a Bayesian implementation of a Hierarchical Generalized Transformation (HGT) model to deal with the non-conjugacy of non-Gaussian data models when estimated using a Latent Gaussian Process (LGP) model. Survey data are usually prone to a high degree of sampling error, and we use covariates that are prone to measurement error as well as those free of any such error. A classical measurement error component is defined to deal with the sampling error in the covariates. The proposed models can be high-dimensional and we employ the notion of basis function expansions to provide an effective approach to dimension reduction. The HGT component lends flexibility to our model to incorporate multi-type response datasets under a unified latent process model framework. To demonstrate the applicability of our methodology, we provide the results from simulation studies and data applications arising from a dataset consisting of the U.S. Census Bureau's American Community Survey (ACS) 5-year period estimates of the total population count under the poverty threshold and the ACS 5-year period estimates of median housing costs at the county level across multiple states in the USA.
Submitted 17 November, 2022;
originally announced November 2022.
-
REDS: Random Ensemble Deep Spatial prediction
Authors:
Ranadeep Daw,
Christopher K. Wikle
Abstract:
There has been a great deal of recent interest in the development of spatial prediction algorithms for very large datasets and/or prediction domains. These methods have primarily been developed in the spatial statistics community, but there has been growing interest in the machine learning community for such methods, primarily driven by the success of deep Gaussian process regression approaches and deep convolutional neural networks. These methods are often computationally expensive to train and implement and consequently, there has been a resurgence of interest in random projections and deep learning models based on random weights -- so called reservoir computing methods. Here, we combine several of these ideas to develop the Random Ensemble Deep Spatial (REDS) approach to predict spatial data. The procedure uses random Fourier features as inputs to an extreme learning machine (a deep neural model with random weights), and with calibrated ensembles of outputs from this model based on different random weights, it provides a simple uncertainty quantification. The REDS method is demonstrated on simulated data and on a classic large satellite data set.
Submitted 8 November, 2022;
originally announced November 2022.
-
A Bayesian Spatio-Temporal Level Set Dynamic Model and Application to Fire Front Propagation
Authors:
Myungsoo Yoo,
Christopher K. Wikle
Abstract:
Intense wildfires impact nature, humans, and society, causing catastrophic damage to property and the ecosystem, as well as the loss of life. Forecasting wildfire front propagation is essential in order to support fire fighting efforts and plan evacuations. The level set method has been widely used to analyze the change in surfaces, shapes, and boundaries. In particular, a signed distance function used in level set methods can readily be interpreted to represent complicated boundaries and their changes in time. While there is substantial literature on the level set method in wildfire applications, these implementations have relied on a heavily parameterized formula for the rate of spread. These implementations have not typically considered uncertainty quantification or incorporated data-driven learning. Here, we present a Bayesian spatio-temporal dynamic model based on level sets, which can be utilized for forecasting the boundary of interest in the presence of uncertain data and lack of knowledge about the boundary velocity. The methodology relies on both a mechanistically-motivated dynamic model for level sets and a stochastic spatio-temporal dynamic model for the front velocity. We show the effectiveness of our method via simulation and with forecasting the fire front boundary evolution of two classic California megafires - the 2017-2018 Thomas fire and the 2017 Haypress fire.
Submitted 26 October, 2022;
originally announced October 2022.
-
A Review of Data-Driven Discovery for Dynamic Systems
Authors:
Joshua S. North,
Christopher K. Wikle,
Erin M. Schliep
Abstract:
Many real-world scientific processes are governed by complex nonlinear dynamic systems that can be represented by differential equations. Recently, there has been increased interest in learning, or discovering, the forms of the equations driving these complex nonlinear dynamic systems using data-driven approaches. In this paper we review the current literature on data-driven discovery for dynamic systems. We provide a categorization of the different approaches for data-driven discovery and a unified mathematical framework to show the relationship between the approaches. Importantly, we discuss the role of statistics in the data-driven discovery field, describe a possible approach by which the problem can be cast in a statistical framework, and provide avenues for future work.
Submitted 19 October, 2022;
originally announced October 2022.
-
A Bayesian Approach for Spatio-Temporal Data-Driven Dynamic Equation Discovery
Authors:
Joshua S. North,
Christopher K. Wikle,
Erin M. Schliep
Abstract:
Differential equations based on physical principles are used to represent complex dynamic systems in all fields of science and engineering. Through repeated use in both academia and industry, these equations have been shown to represent real-world dynamics well. Since the true dynamics of these complex systems are generally unknown, learning the governing equations can improve our understanding of the mechanisms driving the systems. Here, we develop a Bayesian approach to data-driven discovery of non-linear spatio-temporal dynamic equations. Our approach can accommodate measurement noise and missing data, both of which are common in real-world data, and accounts for parameter uncertainty. The proposed framework is illustrated using three simulated systems with varying amounts of observational uncertainty and missing data and applied to a real-world system to infer the temporal evolution of the vorticity of the streamfunction.
Submitted 6 September, 2022;
originally announced September 2022.
-
Statistical Deep Learning for Spatial and Spatio-Temporal Data
Authors:
Christopher K. Wikle,
Andrew Zammit-Mangion
Abstract:
Deep neural network models have become ubiquitous in recent years, and have been applied to nearly all areas of science, engineering, and industry. These models are particularly useful for data that have strong dependencies in space (e.g., images) and time (e.g., sequences). Indeed, deep models have also been extensively used by the statistical community to model spatial and spatio-temporal data through, for example, the use of multi-level Bayesian hierarchical models and deep Gaussian processes. In this review, we first present an overview of traditional statistical and machine learning perspectives for modeling spatial and spatio-temporal data, and then focus on a variety of hybrid models that have recently been developed for latent process, data, and parameter specifications. These hybrid models integrate statistical modeling ideas with deep neural network models in order to take advantage of the strengths of each modeling paradigm. We conclude by giving an overview of computational technologies that have proven useful for these hybrid models, and with a brief discussion on future research directions.
Submitted 5 June, 2022;
originally announced June 2022.
-
A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Environmental Data
Authors:
Shirley Rojas-Salazar,
Erin M. Schliep,
Christopher K. Wikle,
Emily H. Stanley,
Stephen R. Carpenter,
Noah R. Lottig
Abstract:
Environmental time series data observed at high frequencies can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM). HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data can violate the conditional independence assumption of the HSMM. We apply the model to high-frequency data collected by an instrumented buoy in Lake Mendota. We model the phycocyanin concentration, which is used in aquatic systems to estimate the relative abundance of blue-green algae, and identify important time-varying effects associated with the duration in each state.
Submitted 21 September, 2021;
originally announced September 2021.
-
Correcting spatial Gaussian process parameter and prediction variance estimation under informative sampling
Authors:
Erin M. Schliep,
Christopher K. Wikle,
Ranadeep Daw
Abstract:
Informative sampling designs can impact spatial prediction, or kriging, in two important ways. First, the sampling design can bias spatial covariance parameter estimation, which in turn can bias spatial kriging estimates. Second, even with unbiased estimates of the spatial covariance parameters, since the kriging variance is a function of the observation locations, these estimates will vary based on the sample and overestimate the population-based estimates. In this work, we develop a weighted composite likelihood approach to improve spatial covariance parameter estimation under informative sampling designs. Then, given these parameter estimates, we propose three approaches to quantify the effects of the sampling design on the variance estimates in spatial prediction. These results can be used to make informed decisions for population-based inference. We illustrate our approaches using a comprehensive simulation study. Then, we apply our methods to perform spatial prediction on nitrate concentration in wells located throughout central California.
Submitted 27 August, 2021;
originally announced August 2021.
-
A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Data from Wearable Devices
Authors:
Shirley Rojas-Salazar,
Erin M. Schliep,
Christopher K. Wikle,
Matthew Hawkey
Abstract:
Data collected by wearable devices in sports provide valuable information about an athlete's behavior such as their activity, performance, and ability. These time series data can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM) for varied purposes including activity recognition and event detection. HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates derived from the wearable device and observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data from wearable devices can violate the conditional independence assumption of the HSMM. We apply the model to wearable device data collected on a soccer referee in a Major League Soccer game. We model the referee's physiological response to the game demands and identify important time-varying effects of these demands associated with the duration in each state.
Submitted 20 October, 2020;
originally announced October 2020.
-
A higher-order singular value decomposition tensor emulator for spatio-temporal simulators
Authors:
Giri Gopalan,
Christopher K. Wikle
Abstract:
We introduce methodology to construct an emulator for environmental and ecological spatio-temporal processes that uses the higher-order singular value decomposition (HOSVD) as an extension of singular value decomposition (SVD) approaches to emulation. Some important advantages of the method are that it allows for the use of a combination of supervised learning methods (e.g., random forests and Gaussian process regression) and also allows for the prediction of process values at spatial locations and time points that were not used in the training sample. The method is demonstrated with two applications: the first is a periodic solution to a shallow ice approximation partial differential equation from glaciology, and the second is an agent-based model of collective animal movement. In both cases, we demonstrate the value of combining different machine learning models for accurate emulation. In addition, in the agent-based model case we demonstrate the ability of the tensor emulator to successfully capture individual behavior in space and time. We demonstrate via a real data example the ability to perform Bayesian inference in order to learn parameters governing collective animal behavior.
Submitted 12 July, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Bayesian Inverse Reinforcement Learning for Collective Animal Movement
Authors:
Toryn L. J. Schafer,
Christopher K. Wikle,
Mevin B. Hooten
Abstract:
Agent-based methods allow for defining simple rules that generate complex group behaviors. The governing rules of such models are typically set a priori and parameters are tuned from observed behavior trajectories. Instead of making simplifying assumptions across all anticipated scenarios, inverse reinforcement learning provides inference on the short-term (local) rules governing long-term behavior policies by using properties of a Markov decision process. We use the computationally efficient linearly-solvable Markov decision process to learn the local rules governing collective movement for a simulation of the self-propelled particle (SPP) model and a data application for a captive guppy population. The estimation of the behavioral decision costs is done in a Bayesian framework with basis function smoothing. We recover the true costs in the SPP simulation and find that the guppies value collective movement more than targeted movement toward shelter.
Submitted 11 June, 2022; v1 submitted 8 September, 2020;
originally announced September 2020.
-
On the spatial and temporal shift in the archetypal seasonal temperature cycle as driven by annual and semi-annual harmonics
Authors:
Joshua S. North,
Erin M. Schliep,
Christopher K. Wikle
Abstract:
Statistical methods are required to evaluate and quantify the uncertainty in environmental processes, such as land and sea surface temperature, in a changing climate. Typically, annual harmonics are used to characterize the variation in the seasonal temperature cycle. However, an often overlooked feature of the climate seasonal cycle is the semi-annual harmonic, which can account for a significant portion of the variance of the seasonal cycle and varies in amplitude and phase across space. Together, the spatial variation in the annual and semi-annual harmonics can play an important role in driving processes that are tied to seasonality (e.g., ecological and agricultural processes). We propose a multivariate spatio-temporal model to quantify the spatial and temporal change in minimum and maximum temperature seasonal cycles as a function of the annual and semi-annual harmonics. Our approach captures spatial dependence, temporal dynamics, and multivariate dependence of these harmonics through spatially and temporally varying coefficients. We apply the model to minimum and maximum temperature over North America for the years 1979 to 2018. Formal model inference within the Bayesian paradigm enables the identification of regions experiencing significant changes in minimum and maximum temperature seasonal cycles due to the relative effects of changes in the two harmonics.
Submitted 15 March, 2020;
originally announced March 2020.
-
Deep Integro-Difference Equation Models for Spatio-Temporal Forecasting
Authors:
Andrew Zammit-Mangion,
Christopher K. Wikle
Abstract:
Integro-difference equation (IDE) models describe the conditional dependence between the spatial process at a future time point and the process at the present time point through an integral operator. Nonlinearity or temporal dependence in the dynamics is often captured by allowing the operator parameters to vary temporally, or by re-fitting a model with a temporally-invariant linear operator in a sliding window. Both procedures tend to be excellent for prediction purposes over small time horizons, but are generally time-consuming and, crucially, do not provide a global prior model for the temporally-varying dynamics that is realistic. Here, we tackle these two issues by using a deep convolution neural network (CNN) in a hierarchical statistical IDE framework, where the CNN is designed to extract process dynamics from the process' most recent behaviour. Once the CNN is fitted, probabilistic forecasting can be done extremely quickly online using an ensemble Kalman filter with no requirement for repeated parameter estimation. We conduct an experiment where we train the model using 13 years of daily sea-surface temperature data in the North Atlantic Ocean. Forecasts are seen to be accurate and calibrated. A key advantage of our approach is that the CNN provides a global prior model for the dynamics that is realistic, interpretable, and computationally efficient. We show the versatility of the approach by successfully producing 10-minute nowcasts of weather radar reflectivities in Sydney using the same model that was trained on daily sea-surface temperature data in the North Atlantic Ocean.
Submitted 27 January, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
A Bayesian Markov model with Pólya-Gamma sampling for estimating individual behavior transition probabilities from accelerometer classifications
Authors:
Toryn L. J. Schafer,
Christopher K. Wikle,
Jay A. VonBank,
Bart M. Ballard,
Mitch D. Weegman
Abstract:
The use of accelerometers in wildlife tracking provides a fine-scale data source for understanding animal behavior and decision-making. Current methods in movement ecology focus on behavior as a driver of movement mechanisms. Our Markov model is a flexible and efficient method for inference related to effects on behavior that considers dependence between current and past behaviors. We applied this model to behavior data from six greater white-fronted geese (Anser albifrons frontalis) during spring migration in mid-continent North America and considered likely drivers of behavior, including habitat, weather and time of day effects. We modeled the transitions between flying, feeding, stationary and walking behavior states using a first-order Bayesian Markov model. We introduced Pólya-Gamma latent variables for automatic sampling of the covariate coefficients from the posterior distribution and we calculated the odds ratios from the posterior samples. Our model provides a unifying framework for including both acceleration and Global Positioning System data. We found significant differences in behavioral transition rates among habitat types, diurnal behavior and behavioral changes due to weather. Our model provides straightforward inference of behavioral time allocation across used habitats, which is not amenable in activity budget or resource selection frameworks.
Submitted 19 May, 2020; v1 submitted 7 August, 2019;
originally announced August 2019.
-
Spatio-Temporal Change of Support Modeling with R
Authors:
Andrew M. Raim,
Scott H. Holan,
Jonathan R. Bradley,
Christopher K. Wikle
Abstract:
Spatio-temporal change of support methods are designed for statistical analysis on spatial and temporal domains which can differ from those of the observed data. Previous work introduced a parsimonious class of Bayesian hierarchical spatio-temporal models, which we refer to as STCOS, for the case of Gaussian outcomes. Application of STCOS methodology from this literature requires a level of proficiency with spatio-temporal methods and statistical computing which may be a hurdle for potential users. The present work seeks to bridge this gap by guiding readers through STCOS computations. We focus on the R computing environment because of its popularity, free availability, and high quality contributed packages. The stcos package is introduced to facilitate computations for the STCOS model. A motivating application is the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that measures key socioeconomic and demographic variables for various populations in the United States. The STCOS methodology offers a principled approach to compute model-based estimates and associated measures of uncertainty for ACS variables on customized geographies and/or time periods. We present a detailed case study with ACS data as a guide for change of support analysis in R, and as a foundation which can be customized to other applications.
Submitted 2 July, 2020; v1 submitted 26 April, 2019;
originally announced April 2019.
-
Comparison of Deep Neural Networks and Deep Hierarchical Models for Spatio-Temporal Data
Authors:
Christopher K. Wikle
Abstract:
Spatio-temporal data are ubiquitous in the agricultural, ecological, and environmental sciences, and their study is important for understanding and predicting a wide variety of processes. One of the difficulties with modeling spatial processes that change in time is the complexity of the dependence structures that must describe how such a process varies, and the presence of high-dimensional complex data sets and large prediction domains. It is particularly challenging to specify parameterizations for nonlinear dynamic spatio-temporal models (DSTMs) that are simultaneously useful scientifically and efficient computationally. Statisticians have developed deep hierarchical models that can accommodate process complexity as well as the uncertainties in the predictions and inference. However, these models can be expensive and are typically application specific. On the other hand, the machine learning community has developed alternative "deep learning" approaches for nonlinear spatio-temporal modeling. These models are flexible yet are typically not implemented in a probabilistic framework. The two paradigms have many things in common and suggest hybrid approaches that can benefit from elements of each framework. This overview paper presents a brief introduction to the deep hierarchical DSTM (DH-DSTM) framework, and deep models in machine learning, culminating with the deep neural DSTM (DN-DSTM). Recent approaches that combine elements from DH-DSTMs and echo state network DN-DSTMs are presented as illustrations.
Submitted 21 February, 2019;
originally announced February 2019.
-
Spatio-Temporal Models for Big Multinomial Data using the Conditional Multivariate Logit-Beta Distribution
Authors:
Jonathan R. Bradley,
Christopher K. Wikle,
Scott H. Holan
Abstract:
We introduce a Bayesian approach for analyzing high-dimensional multinomial data that are referenced over space and time. In particular, the proportions associated with multinomial data are assumed to have a logit link to a latent spatio-temporal mixed effects model. This strategy allows for covariances that are nonstationary in both space and time, asymmetric, and parsimonious. We also introduce the use of the conditional multivariate logit-beta distribution into the dependent multinomial data setting, which leads to conjugate full-conditional distributions for use in a collapsed Gibbs sampler. We refer to this model as the multinomial spatio-temporal mixed effects model (MN-STM). Additionally, we provide methodological developments including: the derivation of the associated full-conditional distributions, a relationship with a latent Gaussian process model, and the stability of the non-stationary vector autoregressive model. We illustrate the MN-STM through simulations and through a demonstration with public-use Quarterly Workforce Indicators (QWI) data from the Longitudinal Employer Household Dynamics (LEHD) program of the U.S. Census Bureau.
Submitted 9 December, 2018;
originally announced December 2018.
-
A Hierarchical Spatio-Temporal Statistical Model Motivated by Glaciology
Authors:
Giri Gopalan,
Birgir Hrafnkelsson,
Christopher K. Wikle,
Håvard Rue,
Guðfinna Aðalgeirsdóttir,
Alexander H. Jarosch,
Finnur Pálsson
Abstract:
In this paper, we extend and analyze a Bayesian hierarchical spatio-temporal model for physical systems. A novelty is to model the discrepancy between the output of a computer simulator for a physical process and the actual process values with a multivariate random walk. For computational efficiency, linear algebra for bandwidth limited matrices is utilized, and first-order emulator inference allows for the fast emulation of a numerical partial differential equation (PDE) solver. A test scenario from a physical system motivated by glaciology is used to examine the speed and accuracy of the computational methods used, in addition to the viability of modeling assumptions. We conclude by discussing how the model and associated methodology can be applied in other physical contexts besides glaciology.
Submitted 4 June, 2019; v1 submitted 20 November, 2018;
originally announced November 2018.
-
Deep Echo State Networks with Uncertainty Quantification for Spatio-Temporal Forecasting
Authors:
Patrick L. McDermott,
Christopher K. Wikle
Abstract:
Long-lead forecasting for spatio-temporal systems can often entail complex nonlinear dynamics that are difficult to specify a priori. Current statistical methodologies for modeling these processes are often highly parameterized and thus challenging to implement from a computational perspective. One potential parsimonious solution to this problem is a method from the dynamical systems and engineering literature referred to as an echo state network (ESN). ESN models use so-called reservoir computing to efficiently compute recurrent neural network (RNN) forecasts. Moreover, so-called "deep" models have recently been shown to be successful at predicting high-dimensional complex nonlinear processes, particularly those with multiple spatial and temporal scales of variability (such as we often find in spatio-temporal environmental data). Here we introduce a deep ensemble ESN (D-EESN) model. We present two versions of this model for spatio-temporal processes that both produce forecasts and associated measures of uncertainty. The first approach utilizes a bootstrap ensemble framework and the second is developed within a hierarchical Bayesian framework (BD-EESN). This more general hierarchical Bayesian framework naturally accommodates non-Gaussian data types and multiple levels of uncertainties. The methodology is first applied to a data set simulated from a novel non-Gaussian multiscale Lorenz-96 dynamical system simulation model and then to a long-lead United States (U.S.) soil moisture forecasting application.
Submitted 3 September, 2018; v1 submitted 27 June, 2018;
originally announced June 2018.
-
Interpolating Population Distributions using Public-use Data: An Application to Income Segregation using American Community Survey Data
Authors:
Matthew Simpson,
Scott H. Holan,
Christopher K. Wikle,
Jonathan R. Bradley
Abstract:
Income segregation measures the extent to which households choose to live near other households with similar incomes. Sociologists theorize that income segregation can exacerbate the impacts of income inequality, and have developed indices to measure it at the metro area level, including the information theory index introduced in \citet{reardon2011income}, and the divergence index presented in \citet{roberto2015divergence}. To study their differences, we construct both indices using recent American Community Survey (ACS) estimates of features of the income distribution. Since the elimination of the decennial census long form, methods of computing these estimates must be updated to use ACS estimates and account for survey error. We propose a model-based method to interpolate estimates of features of the income distribution that accounts for this error. This method improves on previous approaches by allowing for the use of more types of estimates, and by providing uncertainty quantification. We apply this method to estimate U.S. census tract-level income distributions using ACS tabulations, and in turn use these to construct both income segregation indices. We find major differences between the two indices in the relative ranking of metro areas, as well as differences in how both indices correlate with the Gini index.
Submitted 23 November, 2021; v1 submitted 7 February, 2018;
originally announced February 2018.
-
Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data
Authors:
Patrick L. McDermott,
Christopher K. Wikle
Abstract:
Recurrent neural networks (RNNs) are nonlinear dynamical models commonly used in the machine learning and dynamical systems literature to represent complex dynamical or sequential relationships between variables. More recently, as deep learning models have become more common, RNNs have been used to forecast increasingly complicated systems. Dynamical spatio-temporal processes represent a class of complex systems that can potentially benefit from these types of models. Although the RNN literature is expansive and highly developed, uncertainty quantification is often ignored. Even when considered, the uncertainty is generally quantified without the use of a rigorous framework, such as a fully Bayesian setting. Here we attempt to quantify uncertainty in a more formal framework while maintaining the forecast accuracy that makes these models appealing, by presenting a Bayesian RNN model for nonlinear spatio-temporal forecasting. Additionally, we make simple modifications to the basic RNN to help accommodate the unique nature of nonlinear spatio-temporal data. The proposed model is applied to a Lorenz simulation and two real-world nonlinear spatio-temporal forecasting applications.
Submitted 6 February, 2018; v1 submitted 2 November, 2017;
originally announced November 2017.
-
An Ensemble Quadratic Echo State Network for Nonlinear Spatio-Temporal Forecasting
Authors:
Patrick L. McDermott,
Christopher K. Wikle
Abstract:
Spatio-temporal data and processes are prevalent across a wide variety of scientific disciplines. These processes are often characterized by nonlinear time dynamics that include interactions across multiple scales of spatial and temporal variability. The data sets associated with many of these processes are increasing in size due to advances in automated data measurement, management, and numerical simulator output. Nonlinear spatio-temporal models have only recently seen interest in statistics, but there are many classes of such models in the engineering and geophysical sciences. Traditionally, these models are more heuristic than those that have been presented in the statistics literature, but are often intuitive and quite efficient computationally. We show here that with fairly simple, but important, enhancements, the echo state network (ESN) machine learning approach can be used to generate long-lead forecasts of nonlinear spatio-temporal processes, with reasonable uncertainty quantification, and at only a fraction of the computational expense of traditional parametric nonlinear spatio-temporal models.
Submitted 16 August, 2017;
originally announced August 2017.
-
Ensemble Kalman methods for high-dimensional hierarchical dynamic space-time models
Authors:
Matthias Katzfuss,
Jonathan R. Stroud,
Christopher K. Wikle
Abstract:
We propose a new class of filtering and smoothing methods for inference in high-dimensional, nonlinear, non-Gaussian, spatio-temporal state-space models. The main idea is to combine the ensemble Kalman filter and smoother, developed in the geophysics literature, with state-space algorithms from the statistics literature. Our algorithms address a variety of estimation scenarios, including on-line and off-line state and parameter estimation. We take a Bayesian perspective, for which the goal is to generate samples from the joint posterior distribution of states and parameters. The key benefit of our approach is the use of ensemble Kalman methods for dimension reduction, which allows inference for high-dimensional state vectors. We compare our methods to existing ones, including ensemble Kalman filters, particle filters, and particle MCMC. Using a real data example of cloud motion and data simulated under a number of nonlinear and non-Gaussian scenarios, we show that our approaches outperform these existing methods.
Submitted 8 August, 2018; v1 submitted 23 April, 2017;
originally announced April 2017.
-
Bayesian Hierarchical Models with Conjugate Full-Conditional Distributions for Dependent Data from the Natural Exponential Family
Authors:
Jonathan R. Bradley,
Scott H. Holan,
Christopher K. Wikle
Abstract:
We introduce a Bayesian approach for analyzing (possibly) high-dimensional dependent data that are distributed according to a member from the natural exponential family of distributions. This problem requires extensive methodological advancements, as jointly modeling high-dimensional dependent data leads to the so-called "big n problem." The computational complexity of the "big n problem" is further exacerbated when allowing for non-Gaussian data models, as is the case here. Thus, we develop new computationally efficient distribution theory for this setting. In particular, we introduce the "conjugate multivariate distribution," which is motivated by the univariate distribution introduced in Diaconis and Ylvisaker (1979). Furthermore, we provide substantial theoretical and methodological development including: results regarding conditional distributions, an asymptotic relationship with the multivariate normal distribution, conjugate prior distributions, and full-conditional distributions for a Gibbs sampler. To demonstrate the wide-applicability of the proposed methodology, we provide two simulation studies and three applications based on an epidemiology dataset, a federal statistics dataset, and an environmental dataset, respectively.
Submitted 17 April, 2019; v1 submitted 25 January, 2017;
originally announced January 2017.
-
A Hierarchical Spatio-Temporal Analog Forecasting Model for Count Data
Authors:
Patrick L. McDermott,
Christopher K. Wikle,
Joshua Millspaugh
Abstract:
1. Analog forecasting has been successful at producing robust forecasts for a variety of ecological and physical processes. Analog forecasting is a mechanism-free nonlinear method that forecasts a system forward in time by examining how past states deemed similar to the current state moved forward. Previous work on analog forecasting has typically been presented in an empirical or heuristic context, as opposed to a formal statistical context. 2. The model presented here extends the model-based analog method of McDermott and Wikle (2016) by placing analog forecasting within a fully hierarchical statistical framework. In particular, a Bayesian hierarchical spatial-temporal Poisson analog forecasting model is formulated. 3. In comparison to a Poisson Bayesian hierarchical model with a latent dynamical spatio-temporal process, the hierarchical analog model consistently produced more accurate forecasts. By using a Bayesian approach, the hierarchical analog model is able to quantify rigorously the uncertainty associated with forecasts. 4. Forecasting waterfowl settling patterns in the northwestern United States and Canada is conducted by applying the hierarchical analog model to a breeding population survey dataset. Sea Surface Temperature (SST) in the Pacific ocean is used to help identify potential analogs for the waterfowl settling patterns.
Submitted 16 January, 2017;
originally announced January 2017.
-
A Bayesian adaptive ensemble Kalman filter for sequential state and parameter estimation
Authors:
Jonathan R. Stroud,
Matthias Katzfuss,
Christopher K. Wikle
Abstract:
This paper proposes new methodology for sequential state and parameter estimation within the ensemble Kalman filter. The method is fully Bayesian and propagates the joint posterior density of states and parameters over time. In order to implement the method we consider two representations of the marginal posterior distribution of the parameters: a grid-based approach and a Gaussian approximation. Contrary to existing algorithms, the new method explicitly accounts for parameter uncertainty and provides a formal way to combine information about the parameters from data at different time periods. The method is illustrated and compared to existing approaches using simulated and real data.
Submitted 11 November, 2016;
originally announced November 2016.
-
Computationally Efficient Distribution Theory for Bayesian Inference of High-Dimensional Dependent Count-Valued Data
Authors:
Jonathan R. Bradley,
Scott H. Holan,
Christopher K. Wikle
Abstract:
We introduce a Bayesian approach for multivariate spatio-temporal prediction for high-dimensional count-valued data. Our primary interest is when there are possibly millions of data points referenced over different variables, geographic regions, and times. This problem requires extensive methodological advancements, as jointly modeling correlated data of this size leads to the so-called "big n problem." The computational complexity of prediction in this setting is further exacerbated by acknowledging that count-valued data are naturally non-Gaussian. Thus, we develop a new computationally efficient distribution theory for this setting. In particular, we introduce a multivariate log-gamma distribution and provide substantial theoretical development including: results regarding conditional distributions, marginal distributions, an asymptotic relationship with the multivariate normal distribution, and full-conditional distributions for a Gibbs sampler. To incorporate dependence between variables, regions, and time points, a multivariate spatio-temporal mixed effects model (MSTM) is used. The results in this manuscript are extremely general, and can be used for data that exhibit fewer sources of dependency than what we consider (e.g., multivariate, spatial-only, or spatio-temporal-only data). Hence, the implications of our modeling framework may have a large impact on the general problem of jointly modeling correlated count-valued data. We show the effectiveness of our approach through a simulation study. Additionally, we demonstrate our proposed methodology with an important application analyzing data obtained from the Longitudinal Employer-Household Dynamics (LEHD) program, which is administered by the U.S. Census Bureau.
Submitted 22 December, 2015;
originally announced December 2015.
-
Spatio-Temporal Change of Support with Application to American Community Survey Multi-Year Period Estimates
Authors:
Jonathan R. Bradley,
Christopher K. Wikle,
Scott H. Holan
Abstract:
We present hierarchical Bayesian methodology to perform spatio-temporal change of support (COS) for survey data with Gaussian sampling errors. This methodology is motivated by the American Community Survey (ACS), which is an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. The ACS has published 1-year, 3-year, and 5-year period-estimates, and margins of errors, for demographic and socio-economic variables recorded over predefined geographies. The spatio-temporal COS methodology considered here provides data users with a way to estimate ACS variables on customized geographies and time periods, while accounting for sampling errors. Additionally, 3-year ACS period estimates are to be discontinued, and this methodology can provide predictions of ACS variables for 3-year periods given the available period estimates. The methodology is based on a spatio-temporal mixed effects model with a low-dimensional spatio-temporal basis function representation, which provides multi-resolution estimates through basis function aggregation in space and time. This methodology includes a novel parameterization that uses a target dynamical process and recently proposed parsimonious Moran's I propagator structures. Our approach is demonstrated through two applications using public-use ACS estimates, and is shown to produce good predictions on a holdout set of 3-year period estimates.
Submitted 24 August, 2015; v1 submitted 6 August, 2015;
originally announced August 2015.
-
Generating Partially Synthetic Geocoded Public Use Data with Decreased Disclosure Risk Using Differential Smoothing
Authors:
Harrison Quick,
Scott H. Holan,
Christopher K. Wikle
Abstract:
When collecting geocoded confidential data with the intent to disseminate, agencies often resort to altering the geographies prior to making data publicly available due to data privacy obligations. An alternative to releasing aggregated and/or perturbed data is to release multiply-imputed synthetic data, where sensitive values are replaced with draws from statistical models designed to capture important distributional features in the collected data. One issue that has received relatively little attention, however, is how to handle spatially outlying observations in the collected data, as common spatial models often have a tendency to overfit these observations. The goal of this work is to bring this issue to the forefront and propose a solution, which we refer to as "differential smoothing." After implementing our method on simulated data, highlighting the effectiveness of our approach under various scenarios, we illustrate the framework using data consisting of sale prices of homes in San Francisco.
Submitted 20 July, 2015;
originally announced July 2015.
-
A Model-Based Approach for Analog Spatio-Temporal Dynamic Forecasting
Authors:
Patrick L. McDermott,
Christopher K. Wikle
Abstract:
Analog forecasting has been applied in a variety of fields for predicting future states of complex nonlinear systems that require flexible forecasting methods. Past analog methods have almost exclusively been used in an empirical framework without the structure of a model-based approach. We propose a Bayesian model framework for analog forecasting, building upon previous analog methods but accounting for parameter uncertainty. Thus, unlike traditional analog forecasting methods, the use of Bayesian modeling allows one to rigorously quantify uncertainty to obtain realistic posterior predictive distributions. The model is applied to the long-lead time forecasting of mid-May averaged soil moisture anomalies in Iowa over a high-resolution grid of spatial locations. Sea Surface Temperature (SST) is used to find past time periods with similar trajectories to the current pre-forecast period. The analog model is developed on projection coefficients from a basis expansion of the soil moisture and SST fields. Separate models are constructed for locations falling in each Iowa Crop Reporting District (CRD) and the forecasting ability of the proposed model is compared against a variety of alternative methods and metrics.
Submitted 12 February, 2016; v1 submitted 19 June, 2015;
originally announced June 2015.
-
Bayesian binomial mixture models for estimating abundance in ecological monitoring studies
Authors:
Guohui Wu,
Scott H. Holan,
Charles H. Nilon,
Christopher K. Wikle
Abstract:
Investigation of species abundance has become a vital component of many ecological monitoring studies. The primary objective of these studies is to understand how specific species are distributed across the study domain, as well as quantification of the sampling efficiency for detecting these species. To achieve these goals, preselected locations are sampled during scheduled visits, in which the number of species observed at each location is recorded. This results in spatially referenced replicated count data that are often unbalanced in structure and exhibit overdispersion. Motivated by the Baltimore Ecosystem Study, we propose Bayesian hierarchical binomial mixture models, including Binomial Conway-Maxwell Poisson (Bin-CMP) mixture models, that formally account for varying levels of spatial dispersion. Our proposed models also allow for variable selection of model covariates and grouping of dispersion parameters through the implementation of reversible jump Markov chain Monte Carlo methodology. Finally, using demographic covariates from the American Community Survey, we demonstrate the effectiveness of our approach through estimation of abundance for the American Robin (Turdus migratorius) in the Baltimore Ecosystem Study.
Submitted 11 May, 2015;
originally announced May 2015.
-
Multivariate spatio-temporal models for high-dimensional areal data with application to Longitudinal Employer-Household Dynamics
Authors:
Jonathan R. Bradley,
Scott H. Holan,
Christopher K. Wikle
Abstract:
Many data sources report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate spatio-temporal dependencies. Additionally, many multivariate spatio-temporal areal data sets are extremely high dimensional, which leads to practical issues when formulating statistical models. For example, we analyze Quarterly Workforce Indicators (QWI) published by the US Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) program. QWIs are available by different variables, regions, and time points, resulting in millions of tabulations. Despite their already expansive coverage, by adopting a fully Bayesian framework, the scope of the QWIs can be extended to provide estimates of missing values along with associated measures of uncertainty. Motivated by the LEHD, and other applications in federal statistics, we introduce the multivariate spatio-temporal mixed effects model (MSTM), which can be used to efficiently model high-dimensional multivariate spatio-temporal areal data sets. The proposed MSTM extends the notion of Moran's I basis functions to the multivariate spatio-temporal setting. This extension leads to several methodological contributions, including extremely effective dimension reduction, a dynamic linear model for multivariate spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model.
Submitted 29 January, 2016; v1 submitted 3 March, 2015;
originally announced March 2015.
-
Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error
Authors:
Jonathan R. Bradley,
Christopher K. Wikle,
Scott H. Holan
Abstract:
The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By "regionalization" we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE we draw a connection between spatial aggregation error and a new multiscale representation of the Karhunen-Loeve (K-L) expansion. This relationship between CAGE and the multiscale K-L expansion leads to illuminating theoretical developments including: connections between spatial aggregation error, squared prediction error, spatial variance, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds.
Submitted 10 December, 2015; v1 submitted 6 February, 2015;
originally announced February 2015.
-
Bayesian Lattice Filters for Time-Varying Autoregression and Time-Frequency Analysis
Authors:
Wen-Hsi Yang,
Scott H. Holan,
Christopher K. Wikle
Abstract:
Modeling nonstationary processes is of paramount importance to many scientific disciplines including environmental science, ecology, and finance, among others. Consequently, flexible methodology that provides accurate estimation across a wide range of processes is a subject of ongoing interest. We propose a novel approach to model-based time-frequency estimation using time-varying autoregressive models. In this context, we take a fully Bayesian approach and allow both the autoregressive coefficients and innovation variance to vary over time. Importantly, our estimation method uses the lattice filter and is cast within the partial autocorrelation domain. The marginal posterior distributions are of standard form and, as a convenient by-product of our estimation method, our approach avoids undesirable matrix inversions. As such, estimation is extremely computationally efficient and stable. To illustrate the effectiveness of our approach, we conduct a comprehensive simulation study that compares our method with other competing methods and find that, in most cases, our approach performs better in terms of average squared error between the estimated and true time-varying spectral density. Lastly, we demonstrate our methodology through three modeling applications; namely, insect communication signals, environmental data (wind components), and macroeconomic data (US gross domestic product (GDP) and consumption).
Submitted 12 August, 2014;
originally announced August 2014.
-
Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography
Authors:
Harrison Quick,
Scott H. Holan,
Christopher K. Wikle,
Jerome P. Reiter
Abstract:
Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina.
Submitted 29 July, 2014;
originally announced July 2014.
-
Mixed Effects Modeling for Areal Data that Exhibit Multivariate-Spatio-Temporal Dependencies
Authors:
Jonathan R. Bradley,
Scott H. Holan,
Christopher K. Wikle
Abstract:
There are many data sources available that report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate-spatio-temporal dependencies. As such, we introduce the multivariate-spatio-temporal mixed effects model (MSTM) to analyze areal data with multivariate-spatio-temporal dependencies. The proposed MSTM extends the notion of Moran's I basis functions to the multivariate-spatio-temporal setting. This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate-spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. Several examples are used to demonstrate that the MSTM provides an extremely viable solution to many important problems found in different and distinct corners of the spatio-temporal statistics literature including: modeling nonseparable and nonstationary covariances, combining data from multiple repeated surveys, and analyzing massive multivariate-spatio-temporal datasets.
Submitted 4 September, 2014; v1 submitted 28 July, 2014;
originally announced July 2014.
-
Bayesian Spatial Change of Support for Count-Valued Survey Data
Authors:
Jonathan R. Bradley,
Christopher K. Wikle,
Scott H. Holan
Abstract:
We introduce Bayesian spatial change of support methodology for count-valued survey data with known survey variances. Our proposed methodology is motivated by the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. Specifically, the ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of errors, for published demographic and socio-economic variables recorded over predefined geographies within the United States. Despite the availability of these predefined geographies it is often of interest to data users to specify customized user-defined spatial supports. In particular, it is useful to estimate demographic variables defined on "new" spatial supports in "real-time." This problem is known as spatial change of support (COS), which is typically performed under the assumption that the data follows a Gaussian distribution. However, count-valued survey data is naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey-data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides us with the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time." We demonstrate the effectiveness of our approach through a simulated example as well as through an analysis using public-use ACS data.
Submitted 28 October, 2014; v1 submitted 28 May, 2014;
originally announced May 2014.
-
Bayesian Semiparametric Hierarchical Empirical Likelihood Spatial Models
Authors:
Aaron T. Porter,
Scott H. Holan,
Christopher K. Wikle
Abstract:
We introduce a general hierarchical Bayesian framework that incorporates a flexible nonparametric data model specification through the use of empirical likelihood methodology, which we term semiparametric hierarchical empirical likelihood (SHEL) models. Although general dependence structures can be readily accommodated, we focus on spatial modeling, a relatively underdeveloped area in the empirical likelihood literature. Importantly, the models we develop naturally accommodate spatial association on irregular lattices and irregularly spaced point-referenced data. We illustrate our proposed framework by means of a simulation study and through three real data examples. First, we develop a spatial Fay-Herriot model in the SHEL framework and apply it to the problem of small area estimation in the American Community Survey. Next, we illustrate the SHEL model in the context of areal data (on an irregular lattice) through the North Carolina sudden infant death syndrome (SIDS) dataset. Finally, we analyze a point-referenced dataset from the North American Breeding Bird survey that considers dove counts for the state of Missouri. In all cases, we demonstrate superior performance of our model, in terms of mean squared prediction error, over standard parametric analyses.
Submitted 15 May, 2014;
originally announced May 2014.