-
Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods
Authors:
Brian D. Williamson,
Chloe Krakauer,
Eric Johnson,
Susan Gruber,
Bryan E. Shepherd,
Mark J. van der Laan,
Thomas Lumley,
Hana Lee,
Jose J. Hernandez Munoz,
Fengyu Zhao,
Sarah K. Dutcher,
Rishi Desai,
Gregory E. Simon,
Susan M. Shortreed,
Jennifer C. Nelson,
Pamela A. Shaw
Abstract:
In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and therefore missing on a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare anti-depressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.
Submitted 19 December, 2024;
originally announced December 2024.
-
Euclid preparation. LIII. LensMC, weak lensing cosmic shear measurement with forward modelling and Markov Chain Monte Carlo sampling
Authors:
Euclid Collaboration,
G. Congedo,
L. Miller,
A. N. Taylor,
N. Cross,
C. A. J. Duncan,
T. Kitching,
N. Martinet,
S. Matthew,
T. Schrabback,
M. Tewes,
N. Welikala,
N. Aghanim,
A. Amara,
S. Andreon,
N. Auricchio,
M. Baldi,
S. Bardelli,
R. Bender,
C. Bodendorf,
D. Bonino,
E. Branchini,
M. Brescia,
J. Brinchmann,
S. Camera
, et al. (217 additional authors not shown)
Abstract:
LensMC is a weak lensing shear measurement method developed for Euclid and Stage-IV surveys. It is based on forward modelling in order to deal with convolution by a point spread function (PSF) with comparable size to many galaxies; sampling the posterior distribution of galaxy parameters via Markov Chain Monte Carlo; and marginalisation over nuisance parameters for each of the 1.5 billion galaxies observed by Euclid. We quantified the scientific performance through high-fidelity images based on the Euclid Flagship simulations and emulation of the Euclid VIS images; realistic clustering with a mean surface number density of 250 arcmin$^{-2}$ ($I_{\rm E}<29.5$) for galaxies, and 6 arcmin$^{-2}$ ($I_{\rm E}<26$) for stars; and a diffraction-limited chromatic PSF with a full width at half maximum of $0.^{\!\prime\prime}2$ and spatial variation across the field of view. LensMC measured objects with a density of 90 arcmin$^{-2}$ ($I_{\rm E}<26.5$) in 4500 deg$^2$. The total shear bias was broken down into measurement (our main focus here) and selection effects (which will be addressed elsewhere). We found measurement multiplicative and additive biases of $m_1=(-3.6\pm0.2)\times10^{-3}$, $m_2=(-4.3\pm0.2)\times10^{-3}$, $c_1=(-1.78\pm0.03)\times10^{-4}$, $c_2=(0.09\pm0.03)\times10^{-4}$; a large detection bias with a multiplicative component of $1.2\times10^{-2}$ and an additive component of $-3\times10^{-4}$; and a measurement PSF leakage of $α_1=(-9\pm3)\times10^{-4}$ and $α_2=(2\pm3)\times10^{-4}$. When model bias is suppressed, the obtained measurement biases are close to Euclid requirement and largely dominated by undetected faint galaxies ($-5\times10^{-3}$). Although significant, model bias will be straightforward to calibrate given the weak sensitivity. LensMC is publicly available at https://gitlab.com/gcongedo/LensMC
Submitted 2 December, 2024; v1 submitted 1 May, 2024;
originally announced May 2024.
-
A General Framework for Equivariant Neural Networks on Reductive Lie Groups
Authors:
Ilyes Batatia,
Mario Geiger,
Jose Munoz,
Tess Smidt,
Lior Silberman,
Christoph Ortner
Abstract:
Reductive Lie Groups, such as the orthogonal groups, the Lorentz group, or the unitary groups, play essential roles across scientific fields as diverse as high energy physics, quantum mechanics, quantum chromodynamics, molecular dynamics, computer vision, and imaging. In this paper, we present a general Equivariant Neural Network architecture capable of respecting the symmetries of the finite-dimensional representations of any reductive Lie Group G. Our approach generalizes the successful ACE and MACE architectures for atomistic point clouds to any data equivariant to a reductive Lie group action. We also introduce the lie-nn software library, which provides all the necessary tools to develop and implement such general G-equivariant neural networks. It implements routines for the reduction of generic tensor products of representations into irreducible representations, making it easy to apply our architecture to a wide range of problems and groups. The generality and performance of our approach are demonstrated by applying it to the tasks of top quark decay tagging (Lorentz group) and shape recognition (orthogonal group).
Submitted 31 May, 2023;
originally announced June 2023.
-
Multiple imputation of incomplete multilevel data using Heckman selection models
Authors:
Johanna Muñoz,
Matthias Egger,
Orestis Efthimiou,
Vincent Audigier,
Valentijn M. T. de Jong,
Thomas P. A. Debray
Abstract:
Missing data is a common problem in medical research, and is commonly addressed using multiple imputation. Although traditional imputation methods allow for valid statistical inference when data are missing at random (MAR), their implementation is problematic when the presence of missingness depends on unobserved variables, i.e. the data are missing not at random (MNAR). Unfortunately, this MNAR situation is rather common in observational studies, registries and other sources of real-world data. While several imputation methods have been proposed for addressing individual studies when data are MNAR, their application and validity in large datasets with multilevel structure remain unclear. We therefore explored the consequences of MNAR data in hierarchical data in depth, and proposed a novel multilevel imputation method for common missing patterns in clustered datasets. This method is based on the principles of Heckman selection models and adopts a two-stage meta-analysis approach to impute binary and continuous variables that may be outcomes or predictors and that are systematically or sporadically missing. After evaluating the proposed imputation model in simulated scenarios, we illustrate its use in a cross-sectional community survey to estimate the prevalence of malaria parasitemia in children aged 2-10 years in five subregions in Uganda.
Submitted 12 January, 2023;
originally announced January 2023.
-
Quality control, data cleaning, imputation
Authors:
Dawei Liu,
Hanne I. Oberman,
Johanna Muñoz,
Jeroen Hoogland,
Thomas P. A. Debray
Abstract:
This chapter addresses important steps during the quality assurance and control of RWD, with particular emphasis on the identification and handling of missing values. A gentle introduction is provided on common statistical and machine learning methods for imputation. We discuss the main strengths and weaknesses of each method, and compare their performance in a literature review. We motivate why the imputation of RWD may require additional efforts to avoid bias, and highlight recent advances that account for informative missingness and repeated observations. Finally, we introduce alternative methods to address incomplete data without the need for imputation.
Submitted 29 October, 2021;
originally announced October 2021.