-
Causal Inference with Double/Debiased Machine Learning for Evaluating the Health Effects of Multiple Mismeasured Pollutants
Authors:
Gang Xu,
Xin Zhou,
Molin Wang,
Boya Zhang,
Wenhao Jiang,
Francine Laden,
Helen H. Suh,
Adam A. Szpiro,
Donna Spiegelman,
Zuoheng Wang
Abstract:
One way to quantify exposure to air pollution and its constituents in epidemiologic studies is to use an individual's nearest monitor. This strategy results in potential inaccuracy in the actual personal exposure, introducing bias in estimating the health effects of air pollution and its constituents, especially when evaluating the causal effects of correlated multi-pollutant constituents measured with correlated error. This paper addresses estimation and inference for the causal effect of one constituent in the presence of other PM2.5 constituents, accounting for measurement error and correlations. We used a linear regression calibration model, fitted with generalized estimating equations in an external validation study, and extended a double/debiased machine learning (DML) approach to correct for measurement error and estimate the effect of interest in the main study. We demonstrated that the DML estimator with regression calibration is consistent and derived its asymptotic variance. Simulations showed that the proposed estimator reduced bias and attained nominal coverage probability across most simulation settings. We applied this method to assess the causal effects of PM2.5 constituents on cognitive function in the Nurses' Health Study and identified two PM2.5 constituents, Br and Mn, that showed a negative causal effect on cognitive function after measurement error correction.
Submitted 21 September, 2024;
originally announced October 2024.
-
Random Spatial Forests
Authors:
Travis Hee Wai,
Michael T. Young,
Adam A. Szpiro
Abstract:
We introduce random spatial forests, a method of bagging regression trees allowing for spatial correlation. Our main contribution is the development of a computationally efficient tree building algorithm which selects each split of the tree adjusting for spatial correlation. We evaluate two different approaches for estimation of random spatial forests, a pseudo-likelihood approach combining random forests with kriging and a non-parametric version for a general class of spatial smoothers. We show improved prediction accuracy of our method compared to existing two-step approaches combining random forests and kriging across a range of numerical simulations and demonstrate its performance on elemental carbon, organic carbon, silicon, and sulfur measurements across the continental United States from 2009-2010.
Submitted 22 July, 2020; v1 submitted 29 May, 2020;
originally announced June 2020.
-
Spatial Matrix Completion for Spatially-Misaligned and High-Dimensional Air Pollution Data
Authors:
Phuong T. Vu,
Adam A. Szpiro,
Noah Simon
Abstract:
In health-pollution cohort studies, accurate predictions of pollutant concentrations at new locations are needed, since the locations of fixed monitoring sites and study participants are often spatially misaligned. For multi-pollutant data, principal component analysis (PCA) is often incorporated to obtain a low-rank (LR) structure of the data prior to spatial prediction. Recently developed predictive PCA modifies the traditional algorithm to improve the overall predictive performance by leveraging both LR and spatial structures within the data. However, predictive PCA requires complete data or an initial imputation step. Nonparametric imputation techniques that do not account for spatial information may distort the underlying structure of the data and thus further reduce the predictive performance. We propose a convex optimization problem inspired by the LR matrix completion framework and develop a proximal algorithm to solve it. Missing data are imputed and handled concurrently within the algorithm, which eliminates the need for a separate imputation step. We show that our algorithm has a low computational burden and leads to reliable predictive performance as the severity of missing data increases.
Submitted 21 January, 2022; v1 submitted 11 April, 2020;
originally announced April 2020.
-
Selecting a Scale for Spatial Confounding Adjustment
Authors:
Joshua P. Keller,
Adam A. Szpiro
Abstract:
Unmeasured, spatially-structured factors can confound associations between spatial environmental exposures and health outcomes. Adding flexible splines to a regression model is a simple approach for spatial confounding adjustment, but the spline degrees of freedom do not provide an easily interpretable spatial scale. We describe a method for quantifying the extent of spatial confounding adjustment in terms of the Euclidean distance at which variation is removed. We develop this approach for confounding adjustment with splines and using Fourier and wavelet filtering. We demonstrate differences in the spatial scales these bases can represent and provide a comparison of methods for selecting the amount of confounding adjustment. We find the best performance for selecting the amount of adjustment using an information criterion evaluated on an outcome model without exposure. We apply this method to spatial adjustment in an analysis of particulate matter and blood pressure in a cohort of United States women.
Submitted 24 September, 2019;
originally announced September 2019.
-
Probabilistic Predictive Principal Component Analysis for Spatially-Misaligned and High-Dimensional Air Pollution Data with Missing Observations
Authors:
Phuong T. Vu,
Timothy V. Larson,
Adam A. Szpiro
Abstract:
Accurate predictions of pollutant concentrations at new locations are often of interest in air pollution studies on fine particulate matter (PM$_{2.5}$), in which data are usually not measured at all study locations. PM$_{2.5}$ is also a mixture of many different chemical components. Principal component analysis (PCA) can be incorporated to obtain lower-dimensional representative scores of such multi-pollutant data. Spatial prediction can then be used to estimate these scores at new locations. Recently developed predictive PCA modifies the traditional PCA algorithm to obtain scores with spatial structures that can be well predicted at unmeasured locations. However, these approaches require complete data, whereas multi-pollutant data tend to have complex missing patterns in practice. We propose probabilistic versions of predictive PCA which allow for flexible model-based imputation that can account for spatial information and subsequently improve the overall predictive performance.
Submitted 8 December, 2019; v1 submitted 1 May, 2019;
originally announced May 2019.
-
National PM2.5 and NO2 Exposure Models for China Based on Land Use Regression, Satellite Measurements, and Universal Kriging
Authors:
Hao Xu,
Matthew J. Bechle,
Meng Wang,
Adam A. Szpiro,
Sverre Vedal,
Yuqi Bai,
Julian D. Marshall
Abstract:
Outdoor air pollution is a major killer worldwide and the fourth largest contributor to the burden of disease in China. China is the most populous country in the world and also has the largest number of air pollution deaths per year, yet the spatial resolution of existing national air pollution estimates for China is generally relatively low. We address this knowledge gap by developing and evaluating national empirical models for China incorporating land-use regression (LUR), satellite measurements, and universal kriging (UK). We test the resulting models in several ways, including (1) comparing models developed using forward stepwise regression vs. partial least squares (PLS) regression, (2) comparing models developed with and without satellite measurements, and with and without UK, and (3) 10-fold cross-validation (CV), leave-one-province-out (LOPO) CV, and leave-one-city-out (LOCO) CV. Satellite data and kriging are complementary in making predictions more accurate: kriging improved the models in well-sampled areas; satellite data substantially improved performance at locations far away from monitors. Stepwise forward selection performs similarly to PLS in 10-fold CV, but better than PLS in LOPO-CV. Our best models employ forward selection and UK, with 10-fold CV R2 of 0.89 (for both 2014 and 2015) for PM2.5 and of 0.73 (year-2014) and 0.78 (year-2015) for NO2. Population-weighted concentrations during 2014-2015 decreased for PM2.5 (58.7 μg/m3 to 52.3 μg/m3) and NO2 (29.6 μg/m3 to 26.8 μg/m3). We produced the first high-resolution national LUR models for annual-average concentrations in China. Models were applied on a 1 km grid to support future research. In 2015, more than 80% of the Chinese population lived in areas that exceeded the Chinese national PM2.5 standard, 35 μg/m3. Results here will be publicly available and may be useful for environmental health research.
Submitted 28 August, 2018;
originally announced August 2018.
-
A novel principal component analysis for spatially-misaligned multivariate air pollution data
Authors:
Roman A. Jandarov,
Lianne A. Sheppard,
Paul D. Sampson,
Adam A. Szpiro
Abstract:
We propose novel methods for predictive (sparse) PCA with spatially misaligned data. These methods identify principal component loading vectors that explain as much variability in the observed data as possible, while also ensuring the corresponding principal component scores can be predicted accurately by means of spatial statistics at locations where air pollution measurements are not available. This will make it possible to identify important mixtures of air pollutants and to quantify their health effects in cohort studies, where currently available methods cannot be used. We demonstrate the utility of predictive (sparse) PCA in simulated data and apply the approach to annual averages of particulate matter speciation data from national Environmental Protection Agency (EPA) regulatory monitors.
Submitted 3 September, 2015;
originally announced September 2015.
-
Reduced-rank spatio-temporal modeling of air pollution concentrations in the Multi-Ethnic Study of Atherosclerosis and Air Pollution
Authors:
Casey Olives,
Lianne Sheppard,
Johan Lindström,
Paul D. Sampson,
Joel D. Kaufman,
Adam A. Szpiro
Abstract:
There is growing evidence in the epidemiologic literature of the relationship between air pollution and adverse health outcomes. Prediction of individual air pollution exposure in the Environmental Protection Agency (EPA) funded Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air) study relies on a flexible spatio-temporal prediction model that integrates land-use regression with kriging to account for spatial dependence in pollutant concentrations. Temporal variability is captured using temporal trends estimated via modified singular value decomposition and temporally varying spatial residuals. This model utilizes monitoring data from existing regulatory networks and supplementary MESA Air monitoring data to predict concentrations for individual cohort members. In general, spatio-temporal models are limited in their efficacy for large data sets due to computational intractability. We develop reduced-rank versions of the MESA Air spatio-temporal model. To do so, we apply low-rank kriging to account for spatial variation in the mean process and discuss the limitations of this approach. As an alternative, we represent spatial variation using thin plate regression splines (TPRS). We compare the performance of the outlined models using EPA and MESA Air monitoring data for predicting concentrations of oxides of nitrogen (NO$_x$), a pollutant of primary interest in MESA Air, in the Los Angeles metropolitan area via cross-validated $R^2$. Our findings suggest that use of reduced-rank models can improve computational efficiency in certain cases. Low-rank kriging and TPRS were competitive across the formulations considered, although TPRS appeared to be more robust in some settings.
Submitted 3 February, 2015;
originally announced February 2015.
-
Measurement error in two-stage analyses, with application to air pollution epidemiology
Authors:
Adam A. Szpiro,
Christopher J. Paciorek
Abstract:
Public health researchers often estimate health effects of exposures (e.g., pollution, diet, lifestyle) that cannot be directly measured for study subjects. A common strategy in environmental epidemiology is to use a first-stage (exposure) model to estimate the exposure based on covariates and/or spatio-temporal proximity and to use predictions from the exposure model as the covariate of interest in the second-stage (health) model. This induces a complex form of measurement error. We propose an analytical framework and methodology that is robust to misspecification of the first-stage model and provides valid inference for the second-stage model parameter of interest.
We decompose the measurement error into components analogous to classical and Berkson error and characterize properties of the estimator in the second-stage model if the first-stage model predictions are plugged in without correction. Specifically, we derive conditions for compatibility between the first- and second-stage models that guarantee consistency (and have direct and important real-world design implications), and we derive an asymptotic estimate of finite-sample bias when the compatibility conditions are satisfied. We propose a methodology that (1) corrects for finite-sample bias and (2) correctly estimates standard errors. We demonstrate the utility of our methodology in simulations and an example from air pollution epidemiology.
Submitted 30 June, 2013; v1 submitted 27 October, 2012;
originally announced October 2012.
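The distinction between classical-like and Berkson-like error mentioned in the abstract above can be illustrated with a small simulation. This is only a textbook sketch, not the authors' methodology: the helper `ols_slope`, the sample size, the unit variances, and the true slope `beta = 1.0` are all assumptions chosen for illustration. Under these assumptions, theory says classical error attenuates the naive slope by var(X) / (var(X) + var(U)), while pure Berkson error leaves it approximately unbiased.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 20000
beta = 1.0  # hypothetical true health effect

# Classical error: the observed exposure W equals the true exposure X plus
# independent noise U. Regressing Y on W attenuates the slope toward zero.
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [beta * xi + rng.gauss(0, 1) for xi in x_true]
w_classical = [xi + rng.gauss(0, 1) for xi in x_true]
slope_classical = ols_slope(w_classical, y)

# Berkson error: the true exposure X equals the assigned exposure W plus
# independent noise U. Regressing Y on W keeps the slope roughly unbiased,
# at the cost of extra residual variance.
w_berkson = [rng.gauss(0, 1) for _ in range(n)]
x_berkson = [wi + rng.gauss(0, 1) for wi in w_berkson]
y_berkson = [beta * xi + rng.gauss(0, 1) for xi in x_berkson]
slope_berkson = ols_slope(w_berkson, y_berkson)

print("classical-error slope:", round(slope_classical, 3))
print("Berkson-error slope:  ", round(slope_berkson, 3))
```

With equal variances for X and U, the theoretical attenuation factor in the classical case is 0.5, so the first slope should land near half of `beta` and the second near `beta` itself.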
-
Model-robust regression and a Bayesian "sandwich" estimator
Authors:
Adam A. Szpiro,
Kenneth M. Rice,
Thomas Lumley
Abstract:
We present a new Bayesian approach to model-robust linear regression that leads to uncertainty estimates with the same robustness properties as the Huber-White sandwich estimator. The sandwich estimator is known to provide asymptotically correct frequentist inference, even when standard modeling assumptions such as linearity and homoscedasticity in the data-generating mechanism are violated. Our derivation provides a compelling Bayesian justification for using this simple and popular tool, and it also clarifies what is being estimated when the data-generating mechanism is not linear. We demonstrate the applicability of our approach using a simulation study and health care cost data from an evaluation of the Washington State Basic Health Plan.
Submitted 7 January, 2011;
originally announced January 2011.
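For readers unfamiliar with the frequentist Huber-White sandwich estimator that the abstract above refers to, the following is a minimal sketch for simple linear regression with one covariate. It shows the standard HC0 form of the slope variance next to the usual model-based variance; it does not implement the Bayesian derivation in the paper. The function name `ols_slope_with_variances` and the simulated heteroscedastic data are assumptions made for illustration.

```python
import random


def ols_slope_with_variances(x, y):
    """Fit y = a + b*x by least squares and return
    (slope, model-based variance of slope, HC0 sandwich variance of slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Model-based variance assumes constant residual variance.
    model_var = sum(e * e for e in resid) / (n - 2) / sxx
    # Sandwich (HC0): "bread" 1/sxx on both sides, "meat" sum of (dx_i * e_i)^2.
    sandwich_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    return slope, model_var, sandwich_var


# Hypothetical data whose noise standard deviation grows with x, so the
# homoscedasticity assumption behind the model-based variance is violated.
rng = random.Random(0)
x = [rng.uniform(0, 3) for _ in range(2000)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.2 + xi) for xi in x]

slope, model_var, sandwich_var = ols_slope_with_variances(x, y)
print("slope:", round(slope, 3))
print("model-based SE:", round(model_var ** 0.5, 4))
print("sandwich SE:   ", round(sandwich_var ** 0.5, 4))
```

The two standard errors coincide asymptotically when the residual variance is constant; they diverge when it is not, which is the setting where the sandwich form is typically preferred.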