-
Robust Model-based Inference for Non-Probability Samples
Authors:
Ali Rafei,
Michael R. Elliott,
Carol A. C. Flannagan
Abstract:
With the ubiquitous availability of unstructured data, growing attention is being paid to how to adjust for selection bias in such non-probability samples. The majority of the robust estimators proposed in prior literature are either fully or partially design-based, which may lead to inefficient estimates if outlying (pseudo-)weights are present. In addition, correctly reflecting the uncertainty of the adjusted estimator remains a challenge when the available reference survey has a complex sample design. This article proposes a fully model-based method for inference using non-probability samples where the goal is to predict the outcome variable for all population units. We employ a Bayesian bootstrap method with Rubin's combining rules to derive the adjusted point and interval estimates. Using Gaussian process regression, our method allows for kernel matching between the non-probability sample units and population units based on the estimated selection propensities when the outcome model is misspecified. The repeated sampling properties of our method are evaluated through two Monte Carlo simulation studies. Finally, we apply it to a real-world non-probability sample with the aim of estimating crash-attributed injury rates in different body regions in the United States.
Submitted 7 April, 2022;
originally announced April 2022.
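As a rough illustration of the combining step named in the abstract above, the sketch below pools replicate-level estimates with Rubin's combining rules: the pooled point estimate is the mean of the replicate estimates, and the total variance is the average within-replicate variance plus (1 + 1/M) times the between-replicate variance. The function name `rubin_combine` and the input values are hypothetical, and the abstract does not say how the authors obtain the within-replicate variances, so this should not be read as their implementation.

```python
from statistics import mean, variance


def rubin_combine(estimates, within_variances):
    """Pool M point estimates and their variances with Rubin's combining rules."""
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two estimates, each with a variance")
    q_bar = mean(estimates)          # pooled point estimate
    w_bar = mean(within_variances)   # average within-replicate variance
    b = variance(estimates)          # between-replicate variance (divisor m - 1)
    total = w_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical replicate estimates and variances, for illustration only.
point, total_var = rubin_combine([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

An interval estimate would then be built from `point` and the square root of `total_var`, with a reference distribution chosen to suit the number of replicates.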
-
Robust and Efficient Bayesian Inference for Non-Probability Samples
Authors:
Ali Rafei,
Michael R. Elliott,
Carol A. C. Flannagan
Abstract:
The declining response rates in probability surveys along with the widespread availability of unstructured data have led to growing research into non-probability samples. Existing robust approaches are not well-developed for non-Gaussian outcomes and may perform poorly in the presence of influential pseudo-weights. Furthermore, their variance estimators lack a unified framework and often rely on asymptotic theory. To address these gaps, we propose an alternative Bayesian approach using a partially linear Gaussian process regression that utilizes a prediction model with a flexible function of the pseudo-inclusion probabilities to impute the outcome variable for the reference survey. By efficiency, we mean not only computational scalability but also superiority with respect to variance. We also show that Gaussian process regression behaves as a kernel matching technique based on the estimated propensity scores, which yields double robustness and lowers sensitivity to influential pseudo-weights. Using the simulated posterior predictive distribution, one can directly quantify the uncertainty of the proposed estimator and derive associated $95\%$ credible intervals. We assess the repeated sampling properties of our method in two simulation studies. The application of this study deals with modeling count data with varying exposures under a non-probability sample setting.
Submitted 27 March, 2022;
originally announced March 2022.
-
Robust Bayesian Inference for Big Data: Combining Sensor-based Records with Traditional Survey Data
Authors:
Ali Rafei,
Carol A. C. Flannagan,
Brady T. West,
Michael R. Elliott
Abstract:
Big Data often presents as massive non-probability samples. Not only is the selection mechanism often unknown, but larger data volume amplifies the relative contribution of selection bias to total error. Existing bias adjustment approaches assume that the conditional mean structures have been correctly specified for the selection indicator or key substantive measures. In the presence of a reference probability sample, these methods rely on a pseudo-likelihood method to account for the sampling weights of the reference sample, which is parametric in nature. Under a Bayesian framework, handling the sampling weights is an even bigger hurdle. To further protect against model misspecification, we expand the idea of double robustness such that more flexible non-parametric methods, as well as Bayesian models, can be used for prediction. In particular, we employ Bayesian additive regression trees, which not only capture non-linear associations automatically but also permit direct quantification of the uncertainty of point estimates through their posterior predictive draws. We apply our method to sensor-based naturalistic driving data from the second Strategic Highway Research Program using the 2017 National Household Travel Survey as a benchmark.
Submitted 26 March, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Multitasking additional-to-driving: Prevalence, structure, and associated risk in SHRP2 naturalistic driving data
Authors:
András Bálint,
Carol A. C. Flannagan,
Andrew Leslie,
Sheila Klauer,
Feng Guo,
Marco Dozza
Abstract:
This paper 1) analyzes the extent to which drivers engage in multitasking additional-to-driving (MAD) under various conditions, 2) specifies odds ratios (ORs) of crashing associated with MAD compared to no task engagement, and 3) explores the structure of MAD, based on data from the Second Strategic Highway Research Program Naturalistic Driving Study (SHRP2 NDS). Sensitivity analysis in which secondary tasks were re-defined by grouping similar tasks was performed to investigate the extent to which ORs are affected by the specific task definitions in SHRP2. A novel visual representation of multitasking was developed to show which secondary tasks co-occur frequently and which ones do not. MAD occurs in 11% of control driving segments, 22% of crashes and near-crashes (CNC), 26% of Level 1-3 crashes and 39% of rear-end striking crashes, and 9%, 16%, 17% and 28% respectively for the same event types if MAD is defined in terms of general task groups. The most common co-occurrences of secondary tasks vary substantially among event types; for example, 'Passenger in adjacent seat - interaction' and 'Other non-specific internal eye glance' tend to co-occur in CNC but tend not to co-occur in control driving segments. The odds ratios of MAD compared to driving without any secondary task and the corresponding 95% confidence intervals are 2.38 (2.17-2.61) for CNC, 3.72 (3.11-4.45) for Level 1-3 crashes and 8.48 (5.11-14.07) for rear-end striking crashes. The corresponding ORs using general task groups to define MAD are slightly lower at 2.00 (1.80-2.21) for CNC, 3.03 (2.48-3.69) for Level 1-3 crashes and 6.94 (4.04-11.94) for rear-end striking crashes. The results confirm that independently of whether secondary tasks are defined according to SHRP2 or general task groups, the reduction of driving performance from MAD observed in simulator studies is manifested in real-world crashes as well.
Submitted 2 February, 2020;
originally announced February 2020.
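For readers unfamiliar with how an odds ratio and its 95% confidence interval relate to event counts, the sketch below computes a crude odds ratio with a Wald interval on the log scale from a single 2x2 table. This is only an assumption about the general form of such a calculation: the abstract above does not state which estimation method the authors used, the function name `odds_ratio_ci` is hypothetical, and the counts are invented for illustration rather than taken from SHRP2.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald interval from a 2x2 table.

    a: events with the exposure      b: controls with the exposure
    c: events without the exposure   d: controls without the exposure
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Invented counts, for illustration only.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
```

Because the interval is symmetric on the log scale, the point estimate is the geometric mean of the two limits, which is why reported intervals such as 8.48 (5.11-14.07) look lopsided on the original scale.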
-
Accounting for selection bias due to death in estimating the effect of wealth shock on cognition for the Health and Retirement Study
Authors:
Yaoyuan Vincent Tan,
Carol A. C. Flannagan,
Lindsay R. Pool,
Michael R. Elliott
Abstract:
The Health and Retirement Study is a longitudinal study of US adults enrolled at age 50 and older. We were interested in investigating the effect of a sudden large decline in wealth on the cognitive score of subjects. Our analysis was complicated by the lack of randomization, confounding by indication, and the death of a substantial fraction of the sample and population during follow-up, which leads to some of our outcomes being censored. Common methods for handling these problems, such as marginal structural models, may not be appropriate because they upweight subjects who are more likely to die in order to obtain a population that over time resembles the one that would have been obtained in the absence of death. We propose a refined approach that compares the treatment effect among subjects who would survive under both sets of treatment regimes being considered. We do so by viewing this as a large missing data problem and imputing the survival status and outcomes of the counterfactual. To improve the robustness of our imputation, we used a modified version of the penalized spline of propensity methods in treatment comparisons approach. We found that our proposed method worked well in various simulation scenarios and our data analysis.
Submitted 20 December, 2018;
originally announced December 2018.
-
"Robust-squared" Imputation Models Using BART
Authors:
Yaoyuan V. Tan,
Carol A. C. Flannagan,
Michael R. Elliott
Abstract:
Examples of "doubly robust" estimators for missing data include augmented inverse probability weighting (AIPWT) models (Robins et al., 1994) and penalized splines of propensity prediction (PSPP) models (Zhang and Little, 2009). Doubly-robust estimators have the property that, if either the response propensity or the mean is modeled correctly, a consistent estimator of the population mean is obtained. However, doubly-robust estimators can perform poorly when modest misspecification is present in both models (Kang and Schafer, 2007). Here we consider extensions of the AIPWT and PSPP models that use Bayesian Additive Regression Trees (BART; Chipman et al., 2010) to provide highly robust propensity and mean model estimation. We term these "robust-squared" in the sense that the propensity score, the means, or both can be estimated with minimal model misspecification, and applied to the doubly-robust estimator. We consider their behavior via simulations where propensities and/or mean models are misspecified. We apply our proposed method to impute missing instantaneous velocity (delta-v) values from the 2014 National Automotive Sampling System Crashworthiness Data System dataset and missing Blood Alcohol Concentration values from the 2015 Fatality Analysis Reporting System dataset. We found that BART applied to PSPP and AIPWT provides a more robust and efficient estimate compared to PSPP and AIPWT, with the BART-estimated propensity score combined with PSPP providing the most efficient estimator with close to nominal coverage.
Submitted 9 January, 2018;
originally announced January 2018.
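To make the double-robustness property described in the abstract above concrete, a common textbook form of the augmented inverse probability weighted estimator of a population mean is written out below. The notation is an assumption introduced here and is not taken from the paper: $r_i$ is the response indicator, $\hat{\pi}(x_i)$ the estimated response propensity, and $\hat{m}(x_i)$ the estimated mean of $y_i$ given $x_i$. The paper's exact formulation may differ.

```latex
\hat{\mu}_{\mathrm{AIPW}}
  = \frac{1}{n} \sum_{i=1}^{n}
    \left[ \hat{m}(x_i)
         + \frac{r_i}{\hat{\pi}(x_i)} \bigl( y_i - \hat{m}(x_i) \bigr) \right]
```

If $\hat{m}$ is correct, the weighted residual term averages to zero whatever $\hat{\pi}$ is. If $\hat{\pi}$ is correct, $r_i / \hat{\pi}(x_i)$ has conditional mean one, so errors in $\hat{m}$ cancel on average. Either condition alone is enough for consistency, under the usual missing-at-random assumption.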
-
Predicting human-driving behavior to help driverless vehicles drive: random intercept Bayesian Additive Regression Trees
Authors:
Yaoyuan Vincent Tan,
Carol A. C. Flannagan,
Michael R. Elliott
Abstract:
The development of driverless vehicles has spurred the need to predict human driving behavior to facilitate interaction between driverless and human-driven vehicles. Predicting human driving movements can be challenging, and poor prediction models can lead to accidents between driverless and human-driven vehicles. We used the vehicle speed obtained from a naturalistic driving dataset to predict whether a human-driven vehicle would stop before executing a left turn. In a preliminary analysis, we found that BART produced less variable and higher AUC values compared to a variety of other state-of-the-art binary prediction methods. However, BART assumes independent observations, whereas our dataset consists of multiple observations clustered by driver. Although methods extending BART to clustered or longitudinal data are available, they lack readily available software and can only be applied to clustered continuous outcomes. We extended BART to handle correlated binary observations by adding a random intercept and used a simulation study to determine bias, root mean squared error, 95% coverage, and average length of the 95% credible interval in a correlated data setting. We then applied our random intercept BART model to our clustered dataset and found substantial improvements in prediction performance compared to BART and random intercept linear logistic regression.
Submitted 1 May, 2017; v1 submitted 23 September, 2016;
originally announced September 2016.