-
AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment
Authors:
Vishal Nedungadi,
Muhammad Akhtar Munir,
Marc Rußwurm,
Ron Sarafian,
Ioannis N. Athanasiadis,
Yinon Rudich,
Fahad Shahbaz Khan,
Salman Khan
Abstract:
Air pollution remains a leading global health risk, exacerbated by rapid industrialization and urbanization, and contributes significantly to morbidity and mortality rates. In this paper, we introduce AirCast, a novel multi-variable air pollution forecasting model that combines weather and air quality variables. AirCast employs a multi-task head architecture that simultaneously forecasts atmospheric conditions and pollutant concentrations, improving its understanding of how weather patterns affect air quality. Predicting extreme pollution events is challenging because they occur rarely in historical data, resulting in a heavy-tailed distribution of pollution levels. To address this, we propose a novel Frequency-weighted Mean Absolute Error (fMAE) loss, adapted from the class-balanced loss for regression tasks. Informed by domain knowledge, we investigate the selection of key variables known to influence pollution levels. Additionally, we align existing weather and chemical datasets across spatial and temporal dimensions. AirCast's integrated approach, combining multi-task learning, frequency-weighted loss, and domain-informed variable selection, enables more accurate pollution forecasts. Our source code and models are made public here (https://github.com/vishalned/AirCast.git).
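The abstract does not spell out the fMAE formula, so the following is only a rough sketch of what a frequency-weighted MAE adapted from the class-balanced loss might look like. The function name `frequency_weighted_mae`, the equal-width binning of targets, the `n_bins` and `beta` parameters, and the weight normalization are all assumptions made for illustration, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def frequency_weighted_mae(y_true, y_pred, n_bins=10, beta=0.999):
    """Hypothetical frequency-weighted MAE: rare target ranges get larger weights."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: discretize continuous targets into equal-width bins.
    edges = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    # Use interior edges only, so bin ids fall in 0..n_bins-1.
    bin_ids = np.digitize(y_true, edges[1:-1])
    counts = np.bincount(bin_ids, minlength=n_bins)
    # Class-balanced idea: weight each sample by the inverse of the
    # "effective number" of samples in its bin, (1 - beta^n) / (1 - beta).
    effective_num = (1.0 - np.power(beta, counts[bin_ids])) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Assumption: rescale weights to mean 1 so the loss stays on the MAE scale.
    weights = weights / weights.mean()
    return float(np.mean(weights * np.abs(y_true - y_pred)))

if __name__ == "__main__":
    # Nine common low readings and one rare extreme event.
    y = np.array([1.0] * 9 + [100.0])
    pred = y.copy()
    pred[-1] -= 10.0  # error only on the rare extreme value
    print("plain MAE:", np.mean(np.abs(y - pred)))
    print("weighted MAE:", frequency_weighted_mae(y, pred))
```

In this toy setup, an error on the rare extreme reading is penalized more heavily than plain MAE would penalize it, which is the behaviour the abstract motivates for heavy-tailed pollution data.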
Submitted 25 February, 2025;
originally announced February 2025.
-
Optimal-Design Domain-Adaptation for Exposure Prediction in Two-Stage Epidemiological Studies
Authors:
Ron Sarafian,
Itai Kloog,
Jonathan D. Rosenblatt
Abstract:
In the first stage of a two-stage study, the researcher uses a statistical model to impute the unobserved exposures. In the second stage, imputed exposures serve as covariates in epidemiological models. Imputation errors in the first stage operate as measurement errors in the second stage, and thus bias exposure effect estimates. This study aims to improve the estimation of exposure effects by sharing information between the first and second stages. At the heart of our estimator is the observation that not all second-stage observations are equally important to impute. We thus borrow ideas from optimal experimental design theory to identify individuals of higher importance. We then improve the imputation of these individuals using ideas from the machine-learning literature on domain adaptation. Our simulations confirm that the exposure effect estimates are more accurate than those of the current best practice. An empirical demonstration yields smaller estimates of the PM effect on hyperglycemia risk, with tighter confidence bands. Sharing information between environmental scientists and epidemiologists improves health effect estimates. Our estimator is a principled approach for harnessing this information exchange, and may be applied to any two-stage study.
Submitted 15 July, 2021;
originally announced July 2021.
-
Gaussian Markov Random Fields versus Linear Mixed Models for satellite-based PM2.5 assessment: Evidence from the Northeastern USA
Authors:
Ron Sarafian,
Itai Kloog,
Allan C. Just,
Jonathan D. Rosenblatt
Abstract:
Studying the effects of air pollution on health is a key area in environmental epidemiology. Accurate estimation of air pollution effects requires spatio-temporally resolved datasets of air pollution, especially fine particulate matter (PM). Satellite-based technology has greatly enhanced the ability to provide PM assessments in locations where direct measurement is impossible.
Indirect PM measurement is a statistical prediction problem. The spatio-temporal statistical literature offers various predictive models, in particular Gaussian Random Fields (GRFs) and Linear Mixed Models (LMMs). GRFs emphasize the spatio-temporal structure in the data, but are computationally demanding to fit. LMMs are computationally easier to fit, but require some adaptation to deal with space and time.
Recent advances in the spatio-temporal statistical literature propose to alleviate the computational burden of GRFs by approximating them with Gaussian Markov Random Fields (GMRFs). Since LMMs and GMRFs are both computationally feasible, the question arises: which is statistically better? We show that, despite the great popularity of LMMs in environmental monitoring and pollution assessment, LMMs are statistically inferior to GMRFs for measuring PM in the Northeastern USA.
Submitted 22 February, 2019;
originally announced February 2019.