-
Analyzing trends in precipitation patterns using Hidden Markov model stochastic weather generators
Authors:
Christopher J. Paciorek
Abstract:
We develop a flexible spline-based Bayesian hidden Markov model stochastic weather generator to statistically model daily precipitation over time by season at individual locations. The model naturally accounts for missing data (considered missing at random), avoiding potential sensitivity from systematic missingness patterns or from using arbitrary cutoffs to deal with missingness when computing metrics on daily precipitation data. The fitted model can then be used for inference about trends in arbitrary measures of precipitation behavior, either by multiple imputation of the missing data followed by frequentist analysis or by simulation from the Bayesian posterior predictive distribution. We show that the model fits the data well, including a variety of multi-day characteristics, indicating fidelity to the autocorrelation structure of the data. Using three stations from the western United States, we develop case studies in which we assess trends in various aspects of precipitation (such as dry spell length and precipitation intensity), finding only limited evidence of trends in certain seasons based on the use of Sen's slope as a nonparametric measure of trend. In future work, we plan to apply the method to the complete set of GHCN stations in selected regions to systematically assess the evidence for trends.
Submitted 18 July, 2022;
originally announced July 2022.
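The abstract above uses Sen's slope as its nonparametric trend measure. As a rough illustration only (not the authors' code), Sen's slope is the median of all pairwise slopes between observations. The function name `sens_slope` and the yearly values below are made up for this sketch.

```python
from itertools import combinations
from statistics import median


def sens_slope(times, values):
    """Sen's slope: median of (v_j - v_i) / (t_j - t_i) over all pairs i < j.

    Hypothetical helper for illustration; pairs with equal times are skipped.
    """
    points = list(zip(times, values))
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(points, 2)
        if t2 != t1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct times")
    return median(slopes)


# Made-up yearly summary values (e.g. a dry spell length metric), not real data.
years = [2000, 2001, 2002, 2003, 2004]
metric = [10.0, 12.0, 11.0, 15.0, 14.0]
slope = sens_slope(years, metric)
print(f"Sen's slope: {slope:.2f} units per year")
```

Because the estimate is a median, a single extreme year changes it far less than it would change a least-squares slope.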
-
A numerically stable online implementation and exploration of WAIC through variations of the predictive density, using NIMBLE
Authors:
Joshua E. Hug,
Christopher J. Paciorek
Abstract:
We go through the process of crafting a robust and numerically stable online algorithm for the computation of the Watanabe-Akaike information criterion (WAIC). We implement this algorithm in the NIMBLE software. The implementation is performed in an online manner and does not require the storage in memory of the complete samples from the posterior distribution. This algorithm allows the user to specify a specific form of the predictive density to be used in the computation of WAIC, in order to cater to specific prediction goals. We then comment and explore via simulations the use of different forms of the predictive density in the context of different predictive goals. We find that when using marginalized predictive densities, WAIC is sensitive to the grouping of the observations into a joint density.
Submitted 24 June, 2021;
originally announced June 2021.
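The online WAIC computation described in the abstract above can be sketched roughly as follows. This is a minimal Python illustration, not the NIMBLE implementation. It assumes the usual pointwise definition, WAIC = -2 (lppd - pWAIC), with pWAIC taken as the posterior variance of the pointwise log-likelihood, and it assumes finite log-likelihood values. The class name `OnlineWAIC` and its methods are hypothetical.

```python
import math


class OnlineWAIC:
    """Accumulate WAIC one posterior draw at a time, without storing the draws.

    Hypothetical sketch: per observation, keeps a streaming log-sum-exp of the
    log-likelihood (for lppd) and Welford running moments (for pWAIC).
    """

    def __init__(self, n_obs):
        self.n_samples = 0
        self.log_sum = [-math.inf] * n_obs  # log(sum over draws of exp(loglik))
        self.mean = [0.0] * n_obs           # running mean of loglik
        self.m2 = [0.0] * n_obs             # running sum of squared deviations

    def update(self, loglik):
        """Add one posterior draw; loglik[i] = log p(y_i | theta), assumed finite."""
        self.n_samples += 1
        for i, ll in enumerate(loglik):
            # Streaming log-sum-exp: rescale by the running maximum.
            prev = self.log_sum[i]
            top = max(prev, ll)
            self.log_sum[i] = top + math.log(math.exp(prev - top) + math.exp(ll - top))
            # Welford update for the variance of the log-likelihood.
            delta = ll - self.mean[i]
            self.mean[i] += delta / self.n_samples
            self.m2[i] += delta * (ll - self.mean[i])

    def result(self):
        if self.n_samples < 2:
            raise ValueError("need at least two posterior draws")
        s = self.n_samples
        lppd = sum(ls - math.log(s) for ls in self.log_sum)
        p_waic = sum(m2 / (s - 1) for m2 in self.m2)
        return {"lppd": lppd, "p_waic": p_waic, "waic": -2.0 * (lppd - p_waic)}


# Made-up log-likelihoods: three posterior draws, two observations.
acc = OnlineWAIC(n_obs=2)
for draw in ([-1.0, -2.0], [-1.5, -2.5], [-0.5, -3.0]):
    acc.update(draw)
print(acc.result())
```

Working on the log scale keeps very negative log-likelihoods from underflowing to zero when they are exponentiated, and the running moments mean memory use does not grow with the number of posterior draws.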
-
Computational strategies and estimation performance with Bayesian semiparametric Item Response Theory models
Authors:
Sally Paganin,
Christopher J. Paciorek,
Claudia Wehrhahn,
Abel Rodriguez,
Sophia Rabe-Hesketh,
Perry de Valpine
Abstract:
Item response theory (IRT) models typically rely on a normality assumption for subject-specific latent traits, which is often unrealistic in practice. Semiparametric extensions based on Dirichlet process mixtures offer a more flexible representation of the unknown distribution of the latent trait. However, the use of such models in the IRT literature has been extremely limited, in good part because of the lack of comprehensive studies and accessible software tools. This paper provides guidance for practitioners on semiparametric IRT models and their implementation. In particular, we rely on NIMBLE, a flexible software system for hierarchical models that enables the use of Dirichlet process mixtures. We highlight efficient sampling strategies for model estimation and compare inferential results under parametric and semiparametric models.
Submitted 10 August, 2022; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Detected changes in precipitation extremes at their native scales derived from in situ measurements
Authors:
Mark D. Risser,
Christopher J. Paciorek,
Travis A. O'Brien,
Michael F. Wehner,
William D. Collins
Abstract:
The gridding of daily accumulated precipitation -- especially extremes -- from ground-based station observations is problematic due to the fractal nature of precipitation, and therefore estimates of long period return values and their changes based on such gridded daily data sets are generally underestimated. In this paper, we characterize high-resolution changes in observed extreme precipitation from 1950 to 2017 for the contiguous United States (CONUS) based on in situ measurements only. Our analysis utilizes spatial statistical methods that allow us to derive gridded estimates that do not smooth extreme daily measurements and are consistent with statistics from the original station data while increasing the resulting signal to noise ratio. Furthermore, we use a robust statistical technique to identify significant pointwise changes in the climatology of extreme precipitation while carefully controlling the rate of false positives. We present and discuss seasonal changes in the statistics of extreme precipitation: the largest and most spatially-coherent pointwise changes are in fall (SON), with approximately 33% of CONUS exhibiting significant changes (in an absolute sense). Other seasons display very few meaningful pointwise changes (in either a relative or absolute sense), illustrating the difficulty in detecting pointwise changes in extreme precipitation based on in situ measurements. While our main result involves seasonal changes, we also present and discuss annual changes in the statistics of extreme precipitation. In this paper we only seek to detect changes over time and leave attribution of the underlying causes of these changes for future work.
Submitted 13 August, 2019; v1 submitted 15 February, 2019;
originally announced February 2019.
-
A probabilistic gridded product for daily precipitation extremes over the United States
Authors:
Mark D. Risser,
Christopher J. Paciorek,
Michael F. Wehner,
Travis A. O'Brien,
William D. Collins
Abstract:
Gridded data products, for example interpolated daily measurements of precipitation from weather stations, are commonly used as a convenient substitute for direct observations because these products provide a spatially and temporally continuous and complete source of data. However, when the goal is to characterize climatological features of extreme precipitation over a spatial domain (e.g., a map of return values) at the native spatial scales of these phenomena, then gridded products may lead to incorrect conclusions because daily precipitation is a fractal field and hence any smoothing technique will dampen local extremes. To address this issue, we create a new "probabilistic" gridded product specifically designed to characterize the climatological properties of extreme precipitation by applying spatial statistical analyses to daily measurements of precipitation from the GHCN over CONUS. The essence of our method is to first estimate the climatology of extreme precipitation based on station data and then use a data-driven statistical approach to interpolate these estimates to a fine grid. We argue that our method yields an improved characterization of the climatology within a grid cell because the probabilistic behavior of extreme precipitation is much better behaved (i.e., smoother) than daily weather. Furthermore, the spatial smoothing innate to our approach significantly increases the signal-to-noise ratio in the estimated extreme statistics relative to an analysis without smoothing. Finally, by deriving a data-driven approach for translating extreme statistics to a spatially complete grid, the methodology outlined in this paper resolves the issue of how to properly compare station data with output from earth system models. We conclude the paper by comparing our probabilistic gridded product with a standard extreme value analysis of the Livneh gridded daily precipitation product.
Submitted 2 January, 2019; v1 submitted 11 July, 2018;
originally announced July 2018.
-
Quantifying statistical uncertainty in the attribution of human influence on severe weather
Authors:
Christopher J. Paciorek,
Dáithí A. Stone,
Michael F. Wehner
Abstract:
Event attribution in the context of climate change seeks to understand the role of anthropogenic greenhouse gas emissions on extreme weather events, either specific events or classes of events. A common approach to event attribution uses climate model output under factual (real-world) and counterfactual (world that might have been without anthropogenic greenhouse gas emissions) scenarios to estimate the probabilities of the event of interest under the two scenarios. Event attribution is then quantified by the ratio of the two probabilities. While this approach has been applied many times in the last 15 years, the statistical techniques used to estimate the risk ratio based on climate model ensembles have not drawn on the full set of methods available in the statistical literature and have in some cases used and interpreted the bootstrap method in non-standard ways. We present a precise frequentist statistical framework for quantifying the effect of sampling uncertainty on estimation of the risk ratio, propose the use of statistical methods that are new to event attribution, and evaluate a variety of methods using statistical simulations. We conclude that existing statistical methods not yet in use for event attribution have several advantages over the widely-used bootstrap, including better statistical performance in repeated samples and robustness to small estimated probabilities. Software for using the methods is available through the climextRemes package available for R or Python. While we focus on frequentist statistical methods, Bayesian methods are likely to be particularly useful when considering sources of uncertainty beyond sampling uncertainty.
Submitted 3 February, 2018; v1 submitted 11 June, 2017;
originally announced June 2017.
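The risk ratio in the abstract above is the event probability under the factual scenario divided by the probability under the counterfactual scenario. The sketch below estimates it from ensemble exceedance counts and attaches a standard Wald interval on the log scale (the Katz log method). That interval is chosen here only for illustration; it is not necessarily one of the methods the paper recommends, and it is not the climextRemes API. The function name and the counts are made up.

```python
import math


def risk_ratio_wald(events_f, n_f, events_c, n_c, z=1.96):
    """Estimate RR = p_factual / p_counterfactual with a log-scale Wald interval.

    Hypothetical helper: events_* are counts of ensemble members in which the
    event occurred, n_* are ensemble sizes. z=1.96 gives roughly 95% coverage.
    """
    if events_f == 0 or events_c == 0:
        # The log-scale standard error is undefined with a zero count;
        # other interval methods are needed in that case.
        raise ValueError("Wald log-scale interval needs nonzero event counts")
    rr = (events_f / n_f) / (events_c / n_c)
    se_log = math.sqrt(1 / events_f - 1 / n_f + 1 / events_c - 1 / n_c)
    return rr, rr * math.exp(-z * se_log), rr * math.exp(z * se_log)


# Made-up ensemble counts: 40 of 400 factual runs and 10 of 400
# counterfactual runs exceed the event threshold.
rr, lower, upper = risk_ratio_wald(40, 400, 10, 400)
print(f"RR = {rr:.2f}, approx. 95% interval ({lower:.2f}, {upper:.2f})")
```

The zero-count guard marks where this simple interval breaks down, which is the small-probability regime the abstract highlights as a weakness of some existing approaches.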
-
Spatially-Dependent Multiple Testing Under Model Misspecification, with Application to Detection of Anthropogenic Influence on Extreme Climate Events
Authors:
Mark D. Risser,
Christopher J. Paciorek,
Daithi Stone
Abstract:
The Weather Risk Attribution Forecast (WRAF) is a forecasting tool that uses output from global climate models to make simultaneous attribution statements about whether and how greenhouse gas emissions have contributed to extreme weather across the globe. However, in conducting a large number of simultaneous hypothesis tests, the WRAF is prone to identifying false "discoveries." A common technique for addressing this multiple testing problem is to adjust the procedure in a way that controls the proportion of true null hypotheses that are incorrectly rejected, or the false discovery rate (FDR). Unfortunately, generic FDR procedures suffer from low power when the hypotheses are dependent, and techniques designed to account for dependence are sensitive to misspecification of the underlying statistical model. In this paper, we develop a Bayesian decision theoretic approach for dependent multiple testing and a nonparametric hierarchical statistical model that flexibly controls false discovery and is robust to model misspecification. We illustrate the robustness of our procedure to model error with a simulation study, using a framework that accounts for generic spatial dependence and allows the practitioner to flexibly specify the decision criteria. Finally, we apply our procedure to several seasonal forecasts and discuss implementation for the WRAF workflow.
Submitted 14 November, 2017; v1 submitted 29 March, 2017;
originally announced March 2017.
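For context on the "generic FDR procedures" that the abstract above contrasts with its Bayesian approach, the sketch below shows the Benjamini-Hochberg step-up procedure, a widely used generic method. It is offered as a baseline illustration only and is not the paper's method. The function name and p-values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at target FDR level q.

    Hypothetical helper: returns a list of booleans, True where the
    corresponding null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= q * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values, e.g. one test per grid cell.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p_vals, q=0.05)
print(sum(decisions), "of", len(p_vals), "hypotheses rejected")
```

The procedure uses only the ranked p-values and ignores any spatial structure among the tests, which is the kind of information the abstract's dependence-aware approach is designed to exploit.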
-
Sequential Monte Carlo Methods in the nimble R Package
Authors:
Nicholas Michaud,
Perry de Valpine,
Daniel Turek,
Christopher J. Paciorek,
Dao Nguyen
Abstract:
nimble is an R package for constructing algorithms and conducting inference on hierarchical models. The nimble package provides a unique combination of flexible model specification and the ability to program model-generic algorithms. Specifically, the package allows users to code models in the BUGS language, and it allows users to write algorithms that can be applied to any appropriate model. In this paper, we introduce nimble's capabilities for state-space model analysis using sequential Monte Carlo (SMC) techniques. We first provide an overview of state-space models and commonly-used SMC algorithms. We then describe how to build a state-space model and conduct inference using existing SMC algorithms within nimble. SMC algorithms within nimble currently include the bootstrap filter, auxiliary particle filter, ensemble Kalman filter, IF2 method of iterated filtering, and a particle MCMC sampler. These algorithms can be run in R or compiled into C++ for more efficient execution. Examples of applying SMC algorithms to linear autoregressive models and a stochastic volatility model are provided. Finally, we give an overview of how model-generic algorithms are coded within nimble by providing code for a simple SMC algorithm. This illustrates how users can easily extend nimble's SMC methods in high-level code.
Submitted 4 March, 2020; v1 submitted 17 March, 2017;
originally announced March 2017.
-
Quantifying the effect of interannual ocean variability on the attribution of extreme climate events to human influence
Authors:
Mark D. Risser,
Daithi A. Stone,
Christopher J. Paciorek,
Michael F. Wehner,
Oliver Angelil
Abstract:
In recent years, the climate change research community has become highly interested in describing the anthropogenic influence on extreme weather events, commonly termed "event attribution." Limitations in the observational record and in computational resources motivate the use of uncoupled, atmosphere/land-only climate models with prescribed ocean conditions run over a short period, leading up to and including an event of interest. In this approach, large ensembles of high-resolution simulations can be generated under factual observed conditions and counterfactual conditions that might have been observed in the absence of human interference; these can be used to estimate the change in probability of the given event due to anthropogenic influence. However, using a prescribed ocean state ignores the possibility that estimates of attributable risk might be a function of the ocean state. Thus, the uncertainty in attributable risk is likely underestimated, implying an over-confidence in anthropogenic influence.
In this work, we estimate the year-to-year variability in calculations of the anthropogenic contribution to extreme weather based on large ensembles of atmospheric model simulations. Our results both quantify the magnitude of year-to-year variability and categorize the degree to which conclusions of attributable risk are qualitatively affected. The methodology is illustrated by exploring extreme temperature and precipitation events for the northwest coast of South America and northern-central Siberia; we also provide results for regions around the globe. While it remains preferable to perform a full multi-year analysis, the results presented here can serve as an indication of where and when attribution researchers should be concerned about the use of atmosphere-only simulations.
Submitted 28 September, 2016; v1 submitted 28 June, 2016;
originally announced June 2016.
-
Quantile-based bias correction and uncertainty quantification of extreme event attribution statements
Authors:
Soyoung Jeon,
Christopher J. Paciorek,
Michael F. Wehner
Abstract:
Extreme event attribution characterizes how anthropogenic climate change may have influenced the probability and magnitude of selected individual extreme weather and climate events. Attribution statements often involve quantification of the fraction of attributable risk (FAR) or the risk ratio (RR) and associated confidence intervals. Many such analyses use climate model output to characterize extreme event behavior with and without anthropogenic influence. However, such climate models may have biases in their representation of extreme events. To account for discrepancies in the probabilities of extreme events between observational datasets and model datasets, we demonstrate an appropriate rescaling of the model output based on the quantiles of the datasets to estimate an adjusted risk ratio. Our methodology accounts for various components of uncertainty in estimation of the risk ratio. In particular, we present an approach to construct a one-sided confidence interval on the lower bound of the risk ratio when the estimated risk ratio is infinity. We demonstrate the methodology using the summer 2011 central US heatwave and output from the Community Earth System Model. In this example, we find that the lower bound of the risk ratio is relatively insensitive to the magnitude and probability of the actual event.
Submitted 12 February, 2016;
originally announced February 2016.
-
Efficient Markov Chain Monte Carlo Sampling for Hierarchical Hidden Markov Models
Authors:
Daniel Turek,
Perry de Valpine,
Christopher J. Paciorek
Abstract:
Traditional Markov chain Monte Carlo (MCMC) sampling of hidden Markov models (HMMs) involves latent states underlying an imperfect observation process, and generates posterior samples for top-level parameters concurrently with nuisance latent variables. When potentially many HMMs are embedded within a hierarchical model, this can result in prohibitively long MCMC runtimes. We study combinations of existing methods, which are shown to vastly improve computational efficiency for these hierarchical models while maintaining the modeling flexibility provided by embedded HMMs. The methods include discrete filtering of the HMM likelihood to remove latent states, reduced data representations, and a novel procedure for dynamic block sampling of posterior dimensions. The first two methods have been used in isolation in existing application-specific software, but are not generally available for incorporation in arbitrary model structures. Using the NIMBLE package for R, we develop and test combined computational approaches using three examples from ecological capture-recapture, although our methods are generally applicable to any embedded discrete HMMs. These combinations provide several orders of magnitude improvement in MCMC sampling efficiency, defined as the rate of generating effectively independent posterior samples. In addition to being computationally significant for this class of hierarchical models, this result underscores the potential for vast improvements to MCMC sampling efficiency which can result from combinations of known algorithms.
Submitted 11 January, 2016;
originally announced January 2016.
-
Statistically-estimated tree composition for the northeastern United States at the time of Euro-American settlement
Authors:
Christopher J. Paciorek,
Simon J. Goring,
Andrew L. Thurman,
Charles V. Cogbill,
John W. Williams,
David J. Mladenoff,
Jody A. Peters,
Jun Zhu,
Jason S. McLachlan
Abstract:
We present a gridded 8 km-resolution data product of the estimated composition of tree taxa at the time of Euro-American settlement of the northeastern United States and the statistical methodology used to produce the product from trees recorded by land surveyors. Composition is defined as the proportion of stems larger than approximately 20 cm diameter at breast height for 22 tree taxa, generally at the genus level. The data come from settlement-era public survey records that are transcribed and then aggregated spatially, giving count data. The domain is divided into two regions, eastern (Maine to Ohio) and midwestern (Indiana to Minnesota). Public Land Survey point data in the midwestern region (ca. 0.8-km resolution) are aggregated to a regular 8 km grid, while data in the eastern region, from Town Proprietor Surveys, are aggregated at the township level in irregularly-shaped local administrative units. The product is based on a Bayesian statistical model fit to the count data that estimates composition on a regular 8 km grid across the entire domain. The statistical model is designed to handle data from both the regular grid and the irregularly-shaped townships and allows us to estimate composition at locations with no data and to smooth over noise caused by limited counts in locations with data. The model also allows us to quantify uncertainty in our composition estimates, making the product suitable for applications employing data assimilation. We expect this data product to be useful for understanding the state of vegetation in the northeastern United States prior to large-scale Euro-American settlement. In addition to specific regional questions, the data product can also serve as a baseline against which to investigate how forests and ecosystems change after intensive settlement. The data product is available at the NIS data portal as version 1.0.
Submitted 3 April, 2016; v1 submitted 29 August, 2015;
originally announced August 2015.
-
Programming with models: writing statistical algorithms for general model structures with NIMBLE
Authors:
Perry de Valpine,
Daniel Turek,
Christopher J. Paciorek,
Clifford Anderson-Bergman,
Duncan Temple Lang,
Rastislav Bodik
Abstract:
We describe NIMBLE, a system for programming statistical algorithms for general model structures within R. NIMBLE is designed to meet three challenges: flexible model specification, a language for programming algorithms that can use different models, and a balance between high-level programmability and execution efficiency. For model specification, NIMBLE extends the BUGS language and creates model objects, which can manipulate variables, calculate log probability values, generate simulations, and query the relationships among variables. For algorithm programming, NIMBLE provides functions that operate with model objects using two stages of evaluation. The first stage allows specialization of a function to a particular model and/or nodes, such as creating a Metropolis-Hastings sampler for a particular block of nodes. The second stage allows repeated execution of computations using the results of the first stage. To achieve efficient second-stage computation, NIMBLE compiles models and functions via C++, using the Eigen library for linear algebra, and provides the user with an interface to compiled objects. The NIMBLE language represents a compilable domain-specific language (DSL) embedded within R. This paper provides an overview of the design and rationale for NIMBLE along with illustrative examples including importance sampling, Markov chain Monte Carlo (MCMC) and Monte Carlo expectation maximization (MCEM).
Submitted 12 April, 2016; v1 submitted 19 May, 2015;
originally announced May 2015.
-
Automated Parameter Blocking for Efficient Markov-Chain Monte Carlo Sampling
Authors:
Daniel Turek,
Perry de Valpine,
Christopher J. Paciorek,
Clifford Anderson-Bergman
Abstract:
Markov chain Monte Carlo (MCMC) sampling is an important and commonly used tool for the analysis of hierarchical models. Nevertheless, practitioners generally have two options for MCMC: utilize existing software that generates a black-box "one size fits all" algorithm, or the challenging (and time consuming) task of implementing a problem-specific MCMC algorithm. Either choice may result in inefficient sampling, and hence researchers have become accustomed to MCMC runtimes on the order of days (or longer) for large models. We propose an automated procedure to determine an efficient MCMC algorithm for a given model and computing platform. Our procedure dynamically determines blocks of parameters for joint sampling that result in efficient sampling of the entire model. We test this procedure using a diverse suite of example models, and observe non-trivial improvements in MCMC efficiency for many models. Our procedure is the first attempt at such, and may be generalized to a broader space of MCMC algorithms. Our results suggest that substantive improvements in MCMC efficiency may be practically realized using our automated blocking procedure, or variants thereof, which warrants additional study and application.
Submitted 18 March, 2015;
originally announced March 2015.
-
Nonlinear predictive latent process models for integrating spatio-temporal exposure data from multiple sources
Authors:
Nikolay Bliznyuk,
Christopher J. Paciorek,
Joel Schwartz,
Brent Coull
Abstract:
Spatio-temporal prediction of levels of an environmental exposure is an important problem in environmental epidemiology. Our work is motivated by multiple studies on the spatio-temporal distribution of mobile source, or traffic related, particles in the greater Boston area. When multiple sources of exposure information are available, a joint model that pools information across sources maximizes data coverage over both space and time, thereby reducing the prediction error. We consider a Bayesian hierarchical framework in which a joint model consists of a set of submodels, one for each data source, and a model for the latent process that serves to relate the submodels to one another. If a submodel depends on the latent process nonlinearly, inference using standard MCMC techniques can be computationally prohibitive. The implications are particularly severe when the data for each submodel are aggregated at different temporal scales. To make such problems tractable, we linearize the nonlinear components with respect to the latent process and induce sparsity in the covariance matrix of the latent process using compactly supported covariance functions. We propose an efficient MCMC scheme that takes advantage of these approximations. We use our model to address a temporal change of support problem whereby interest focuses on pooling daily and multiday black carbon readings in order to maximize the spatial coverage of the study region.
Submitted 13 November, 2014;
originally announced November 2014.
-
Bayesian Estimation of Population-Level Trends in Measures of Health Status
Authors:
Mariel M. Finucane,
Christopher J. Paciorek,
Goodarz Danaei,
Majid Ezzati
Abstract:
Improving health worldwide will require rigorous quantification of population-level trends in health status. However, global-level surveys are not available, forcing researchers to rely on fragmentary country-specific data of varying quality. We present a Bayesian model that systematically combines disparate data to make country-, region- and global-level estimates of time trends in important health indicators. The model allows for time and age nonlinearity, and it borrows strength in time, age, covariates, and within and across regional country clusters to make estimates where data are sparse. The Bayesian approach allows us to account for uncertainty from the various aspects of missingness as well as sampling and parameter uncertainty. MCMC sampling allows for inference in a high-dimensional, constrained parameter space, while providing posterior draws that allow straightforward inference on the wide variety of functionals of interest. Here we use blood pressure as an example health metric. High blood pressure is the leading risk factor for cardiovascular disease, the leading cause of death worldwide. The results highlight a risk transition, with decreasing blood pressure in high-income regions and increasing levels in many lower-income regions.
Submitted 19 May, 2014;
originally announced May 2014.
-
Parallelizing Gaussian Process Calculations in R
Authors:
Christopher J. Paciorek,
Benjamin Lipshitz,
Wei Zhuo,
Prabhat,
Cari G. Kaufman,
Rollin C. Thomas
Abstract:
We consider parallel computation for Gaussian process calculations to overcome computational and memory constraints on the size of datasets that can be analyzed. Using a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory), we implement the core linear algebra operations used in spatial statistics and Gaussian process regression in an R package called bigGP that relies on C and MPI. The approach divides the matrix into blocks such that the computational load is balanced across processes while communication between processes is limited. The package provides an API enabling R programmers to implement Gaussian process-based methods by using the distributed linear algebra operations without any C or MPI coding. We illustrate the approach and software by analyzing an astrophysics dataset with n=67,275 observations.
Submitted 21 May, 2013;
originally announced May 2013.
-
Semiparametric Bayesian Density Estimation with Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition
Authors:
Mariel M. Finucane,
Christopher J. Paciorek,
Gretchen A. Stevens,
Majid Ezzati
Abstract:
Undernutrition, resulting in restricted growth, and quantified here using height-for-age z-scores, is an important contributor to childhood morbidity and mortality. Since all levels of mild, moderate and severe undernutrition are of clinical and public health importance, it is of interest to estimate the shape of the z-scores' distributions.
We present a finite normal mixture model that uses data on 4.3 million children to make annual country-specific estimates of these distributions for under-5-year-old children in the world's 141 low- and middle-income countries between 1985 and 2011. We incorporate both individual-level data when available, as well as aggregated summary statistics from studies whose individual-level data could not be obtained. We place a hierarchical Bayesian probit stick-breaking model on the mixture weights. The model allows for nonlinear changes in time, and it borrows strength in time, in covariates, and within and across regional country clusters to make estimates where data are uncertain, sparse, or missing.
This work addresses three important problems that often arise in the fields of public health surveillance and global health monitoring. First, data are always incomplete. Second, different data sources commonly use different reporting metrics. Last, distributions, and especially their tails, are often of substantive interest.
Submitted 28 June, 2014; v1 submitted 22 January, 2013;
originally announced January 2013.
-
Measurement error in two-stage analyses, with application to air pollution epidemiology
Authors:
Adam A. Szpiro,
Christopher J. Paciorek
Abstract:
Public health researchers often estimate health effects of exposures (e.g., pollution, diet, lifestyle) that cannot be directly measured for study subjects. A common strategy in environmental epidemiology is to use a first-stage (exposure) model to estimate the exposure based on covariates and/or spatio-temporal proximity and to use predictions from the exposure model as the covariate of interest in the second-stage (health) model. This induces a complex form of measurement error. We propose an analytical framework and methodology that is robust to misspecification of the first-stage model and provides valid inference for the second-stage model parameter of interest.
We decompose the measurement error into components analogous to classical and Berkson error and characterize properties of the estimator in the second-stage model if the first-stage model predictions are plugged in without correction. Specifically, we derive conditions for compatibility between the first- and second-stage models that guarantee consistency (and have direct and important real-world design implications), and we derive an asymptotic estimate of finite-sample bias when the compatibility conditions are satisfied. We propose a methodology that (1) corrects for finite-sample bias and (2) correctly estimates standard errors. We demonstrate the utility of our methodology in simulations and an example from air pollution epidemiology.
Submitted 30 June, 2013; v1 submitted 27 October, 2012;
originally announced October 2012.
-
Spatial models for point and areal data using Markov random fields on a fine grid
Authors:
Christopher J. Paciorek
Abstract:
I consider the use of Markov random fields (MRFs) on a fine grid to represent latent spatial processes when modeling point-level and areal data, including situations with spatial misalignment. Point observations are related to the grid cell in which they reside, while areal observations are related to the (approximate) integral over the latent process within the area of interest. I review several approaches to specifying the neighborhood structure for constructing the MRF precision matrix, presenting results comparing these MRF representations analytically, in simulations, and in two examples. The results provide practical guidance for choosing a spatial process representation and highlight the importance of this choice. In particular, the results demonstrate that, and explain why, standard CAR models can behave strangely for point-level data. They show that various neighborhood weighting approaches based on higher-order neighbors that have been suggested for MRF models do not produce smooth fields, which raises doubts about their utility. Finally, they indicate that an MRF that approximates a thin plate spline compares favorably to standard CAR models and to kriging under many circumstances.
Submitted 6 April, 2013; v1 submitted 26 April, 2012;
originally announced April 2012.
-
The Importance of Scale for Spatial-Confounding Bias and Precision of Spatial Regression Estimators
Authors:
Christopher J. Paciorek
Abstract:
Residuals in regression models are often spatially correlated. Prominent examples include studies in environmental epidemiology to understand the chronic health effects of pollutants. I consider the effects of residual spatial structure on the bias and precision of regression coefficients, developing a simple framework in which to understand the key issues and derive informative analytic results. When unmeasured confounding introduces spatial structure into the residuals, regression models with spatial random effects and closely-related models such as kriging and penalized splines are biased, even when the residual variance components are known. Analytic and simulation results show how the bias depends on the spatial scales of the covariate and the residual: one can reduce bias by fitting a spatial model only when there is variation in the covariate at a scale smaller than the scale of the unmeasured confounding. I also discuss how the scales of the residual and the covariate affect efficiency and uncertainty estimation when the residuals are independent of the covariate. In an application on the association between black carbon particulate matter air pollution and birth weight, controlling for large-scale spatial variation appears to reduce bias from unmeasured confounders, while increasing uncertainty in the estimated pollution effect.
Submitted 4 November, 2010;
originally announced November 2010.
-
Combining spatial information sources while accounting for systematic errors in proxies
Authors:
Christopher J. Paciorek
Abstract:
Environmental research increasingly uses high-dimensional remote sensing and numerical model output to help fill space-time gaps between traditional observations. Such output is often a noisy proxy for the process of interest. Thus one needs to separate and assess the signal and noise (often called discrepancy) in the proxy given complicated spatio-temporal dependencies. Here I extend a popular two-likelihood hierarchical model using a more flexible representation for the discrepancy. I employ the little-used Markov random field approximation to a thin plate spline, which can capture small-scale discrepancy in a computationally efficient manner while better modeling smooth processes than standard conditional auto-regressive models. The increased flexibility reduces identifiability, but the lack of identifiability is inherent in the scientific context. I model particulate matter air pollution using satellite aerosol and atmospheric model output proxies. The estimated discrepancies occur at a variety of spatial scales, with small-scale discrepancy particularly important. The examples indicate little predictive improvement over modeling the observations alone. Similarly, in simulations with an informative proxy, the presence of discrepancy and resulting identifiability issues prevent improvement in prediction. The results highlight but do not resolve the critical question of how best to use proxy information while minimizing the potential for proxy-induced error.
Submitted 13 September, 2011; v1 submitted 12 August, 2010;
originally announced August 2010.
-
Practical large-scale spatio-temporal modeling of particulate matter concentrations
Authors:
Christopher J. Paciorek,
Jeff D. Yanosky,
Robin C. Puett,
Francine Laden,
Helen H. Suh
Abstract:
The last two decades have seen intense scientific and regulatory interest in the health effects of particulate matter (PM). Influential epidemiological studies that characterize chronic exposure of individuals rely on monitoring data that are sparse in space and time, so they often assign the same exposure to participants in large geographic areas and across time. We estimate monthly PM during 1988--2002 in a large spatial domain for use in studying health effects in the Nurses' Health Study. We develop a conceptually simple spatio-temporal model that uses a rich set of covariates. The model is used to estimate concentrations of $PM_{10}$ for the full time period and $PM_{2.5}$ for a subset of the period. For the earlier part of the period, 1988--1998, few $PM_{2.5}$ monitors were operating, so we develop a simple extension to the model that represents $PM_{2.5}$ conditionally on $PM_{10}$ model predictions. In the epidemiological analysis, model predictions of $PM_{10}$ are more strongly associated with health effects than when using simpler approaches to estimate exposure. Our modeling approach supports the application in estimating both fine-scale and large-scale spatial heterogeneity and capturing space--time interaction through the use of monthly-varying spatial surfaces. At the same time, the model is computationally feasible, implementable with standard software, and readily understandable to the scientific audience. Despite simplifying assumptions, the model has good predictive performance and uncertainty characterization.
Submitted 8 June, 2009;
originally announced June 2009.