-
Commentary on Guyll et al. (2023): Misuse of Statistical Method Results in Highly Biased Interpretation of Forensic Evidence
Authors:
Michael Rosenblum,
Elizabeth T. Chin,
Elizabeth L. Ogburn,
Akihiko Nishimura,
Daniel Westreich,
Abhirup Datta,
Susan Vanderplas,
Maria Cuellar,
William C. Thompson
Abstract:
Since the National Academy of Sciences released its report outlining paths for improving reliability, standards, and policies in the forensic sciences (NAS, 2009), there has been heightened interest in evaluating and improving scientific validity within forensic science disciplines. Guyll et al. (2023) seek to evaluate the validity of forensic cartridge-case comparisons. However, they make a serious statistical error that leads to highly inflated claims about the probability that a cartridge case from a crime scene was fired from a reference gun, typically a gun found in the possession of a defendant. It is urgent to address this error since these claims, which are generally biased against defendants, are being presented by the prosecution in an ongoing homicide case where the defendant faces the possibility of a lengthy prison sentence (DC Superior Court, 2023).
Submitted 8 September, 2023;
originally announced September 2023.
-
Positivity: Identifiability and Estimability
Authors:
Paul N Zivich,
Stephen R Cole,
Daniel Westreich
Abstract:
Positivity, the assumption that every unique combination of confounding variables that occurs in a population has a non-zero probability of an action, can be further delineated as deterministic positivity and stochastic positivity. Here, we revisit this distinction, examine its relation to nonparametric identifiability and estimability, and discuss how to address violations of positivity assumptions. Finally, we relate positivity to recent interest in machine learning, as well as the limitations of data-adaptive algorithms for causal inference. Positivity may often be overlooked, but it remains important for inference.
Submitted 11 July, 2022;
originally announced July 2022.
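As a rough illustration of the positivity assumption described in the abstract above, the following minimal sketch tabulates, for each observed combination of confounders, the empirical share of units that received a binary action, and flags combinations where one action level never occurs. The function names and the toy records are hypothetical and are not taken from the paper. An empty cell in a finite sample does not by itself say whether the violation is deterministic or stochastic.

```python
from collections import defaultdict


def empirical_action_probs(records):
    """Return {stratum: share of units with action == 1}.

    `records` is an iterable of (stratum, action) pairs, where `stratum`
    is any hashable combination of confounder values and `action` is 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_with_action, n_total]
    for stratum, action in records:
        counts[stratum][0] += action
        counts[stratum][1] += 1
    return {s: treated / total for s, (treated, total) in counts.items()}


def positivity_violations(records):
    """List strata in which one of the two action levels is never observed."""
    probs = empirical_action_probs(records)
    return sorted(s for s, p in probs.items() if p == 0.0 or p == 1.0)


# Hypothetical toy data: stratum label, then action indicator.
records = [("a", 1), ("a", 0), ("b", 0), ("b", 0), ("c", 1)]
probs = empirical_action_probs(records)
flagged = positivity_violations(records)
print(probs)
print(flagged)
```

Strata flagged this way are candidates for closer inspection only; deciding how to handle them depends on subject-matter knowledge about why the cell is empty.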
-
Variable selection for transportability
Authors:
Megha L. Mehrotra,
M. Maria Glymour,
Elvin Geng,
Daniel Westreich,
David V. Glidden
Abstract:
Transportability provides a principled framework to address the problem of applying study results to new populations. Here, we consider the problem of selecting variables to include in transport estimators. We provide a brief overview of the transportability framework and illustrate that while selection diagrams are a vital first step in variable selection, these graphs alone identify a sufficient but not strictly necessary set of variables for generating an unbiased transport estimate. Next, we conduct a simulation experiment assessing the impact of including unnecessary variables on the performance of the parametric g-computation transport estimator. Our results highlight that the types of variables included can affect the bias, variance, and mean squared error of the estimates. We find that addition of variables that are not causes of the outcome but whose distributions differ between the source and target populations can increase the variance and mean squared error of the transported estimates. On the other hand, inclusion of variables that are causes of the outcome (regardless of whether they modify the causal contrast of interest or differ in distribution between the populations) reduces the variance of the estimates without increasing the bias. Finally, exclusion of variables that cause the outcome but do not modify the causal contrast of interest does not increase bias. These findings suggest that variable selection approaches for transport should prioritize identifying and including all causes of the outcome in the study population rather than focusing on variables whose distribution may differ between the study sample and target population.
Submitted 9 December, 2019;
originally announced December 2019.
-
Super learning in the SAS system
Authors:
Alexander P. Keil,
Daniel Westreich,
Jessie K Edwards,
Stephen R Cole
Abstract:
Background and objective: Stacking is an ensemble machine learning method that averages predictions from multiple other algorithms, such as generalized linear models and regression trees. An implementation of stacking, called super learning, has been developed as a general approach to supervised learning and has seen frequent usage, in part due to the availability of an R package. We develop super learning in the SAS software system using a new macro, and demonstrate its performance relative to the R package.
Methods: Following previous work using the R SuperLearner package, we assess the performance of super learning in a number of domains. We compare the R package with the new SAS macro in a small set of simulations assessing curve fitting in a predictive model, as well as in a set of 14 publicly available datasets to assess cross-validated accuracy.
Results: Across the simulated data and the publicly available data, the SAS macro performed similarly to the R package, despite a different set of potential algorithms available natively in R and SAS.
Conclusions: Our super learner macro performs as well as the R package at a number of tasks. Further, by extending the macro to include the use of R packages, the macro can leverage both the robust, enterprise-oriented procedures in SAS and the nimble, cutting-edge packages in R. In the spirit of ensemble learning, this macro extends the potential library of algorithms beyond a single software system and provides a simple avenue into machine learning in SAS.
Submitted 31 July, 2019; v1 submitted 21 May, 2018;
originally announced May 2018.
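To make the stacking idea in the abstract above concrete, here is a minimal sketch in plain Python. It is neither the SAS macro nor the R SuperLearner package. It assumes just two hypothetical candidate learners (`fit_mean` and `fit_line`), obtains out-of-fold predictions with a simple k-fold split, and picks a convex weight by grid search on cross-validated mean squared error. The grid search is a simplification chosen for brevity.

```python
def fit_mean(xs, ys):
    """Candidate learner 1: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate learner 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def cv_predictions(xs, ys, learners, k=5):
    """Out-of-fold predictions: preds[j][i] is learner j's prediction for unit i."""
    n = len(xs)
    preds = [[0.0] * n for _ in learners]
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for j, fit in enumerate(learners):
            model = fit(tx, ty)
            for i in test:
                preds[j][i] = model(xs[i])
    return preds


def stack_two(xs, ys, learners, k=5, grid=101):
    """Return (weight on learners[0], stacked predictor) for exactly two learners."""
    p0, p1 = cv_predictions(xs, ys, learners, k)

    def cv_mse(w):
        return sum((w * a + (1 - w) * b - y) ** 2
                   for a, b, y in zip(p0, p1, ys)) / len(ys)

    # Choose the convex weight that minimizes cross-validated squared error.
    best_w = min((i / (grid - 1) for i in range(grid)), key=cv_mse)
    # Refit both learners on all data and combine them with the chosen weight.
    fitted = [fit(xs, ys) for fit in learners]
    return best_w, lambda x: best_w * fitted[0](x) + (1 - best_w) * fitted[1](x)


# Hypothetical toy data that are exactly linear, so the line should get all the weight.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
weight_on_mean, predict = stack_two(xs, ys, [fit_mean, fit_line])
print(weight_on_mean, predict(25))
```

With more than two candidate learners the weight search would need a constrained optimizer rather than a one-dimensional grid, which is one reason to rely on an established implementation in practice.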