-
A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies
Authors:
Shirley V Wang,
Georg Hahn,
Sushama Kattinakere Sreedhara,
Mufaddal Mahesri,
Haritha S. Pillai,
Rajendra Aldis,
Joyce Lii,
Sarah K. Dutcher,
Rhoda Eniafe,
Jamal T. Jones,
Keewan Kim,
Jiwei He,
Hana Lee,
Sengwee Toh,
Rishi J Desai,
Jie Yang
Abstract:
Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
Submitted 25 July, 2025;
originally announced July 2025.
-
A cautionary note for plasmode simulation studies in the setting of causal inference
Authors:
Pamela A Shaw,
Susan Gruber,
Brian D. Williamson,
Rishi Desai,
Susan M. Shortreed,
Chloe Krakauer,
Jennifer C. Nelson,
Mark J. van der Laan
Abstract:
Plasmode simulation has become an important tool for evaluating the operating characteristics of different statistical methods in complex settings, such as pharmacoepidemiological studies of treatment effectiveness using electronic health records (EHR) data. These studies provide insight into how estimator performance is impacted by challenges including rare events, small sample size, etc., that can indicate which among a set of methods performs best in a real-world dataset. Plasmode simulation combines data resampled from a real-world dataset with synthetic data to generate a known truth for an estimand in realistic data. There are different potential plasmode strategies currently in use. We compare two popular plasmode simulation frameworks. We provide numerical evidence and a theoretical result, which shows that one of these frameworks can cause certain estimators to incorrectly appear overly biased with lower than nominal confidence interval coverage. Detailed simulation studies using both synthetic and real-world EHR data demonstrate that these pitfalls remain at large sample sizes and when analyzing data from a randomized controlled trial. We conclude with guidance for the choice of a plasmode simulation approach that maintains good theoretical properties to allow a fair evaluation of statistical methods while also maintaining the desired similarity to real data.
Submitted 15 April, 2025;
originally announced April 2025.
-
Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods
Authors:
Brian D. Williamson,
Chloe Krakauer,
Eric Johnson,
Susan Gruber,
Bryan E. Shepherd,
Mark J. van der Laan,
Thomas Lumley,
Hana Lee,
Jose J. Hernandez Munoz,
Fengyu Zhao,
Sarah K. Dutcher,
Rishi Desai,
Gregory E. Simon,
Susan M. Shortreed,
Jennifer C. Nelson,
Pamela A. Shaw
Abstract:
In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and therefore missing on a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare anti-depressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.
Submitted 19 December, 2024;
originally announced December 2024.
-
High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Authors:
Janick Weberpals,
Pamela A. Shaw,
Kueiyu Joshua Lin,
Richard Wyss,
Joseph M Plasek,
Li Zhou,
Kerry Ngan,
Thomas DeRamus,
Sudha R. Raman,
Bradley G. Hammill,
Hana Lee,
Sengwee Toh,
John G. Connolly,
Kimberly J. Dandreo,
Fang Tian,
Wei Liu,
Jie Li,
José J. Hernández-Muñoz,
Sebastian Schneeweiss,
Rishi J. Desai
Abstract:
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
Submitted 17 May, 2024;
originally announced May 2024.
-
Oracle-Efficient Differentially Private Learning with Public Data
Authors:
Adam Block,
Mark Bun,
Rathin Desai,
Abhishek Shetty,
Steven Wu
Abstract:
Due to statistical lower bounds on the learnability of many function classes under privacy constraints, there has been recent interest in leveraging public data to improve the performance of private learning algorithms. In this model, algorithms must always guarantee differential privacy with respect to the private samples while also ensuring learning guarantees when the private data distribution is sufficiently close to that of the public data. Previous work has demonstrated that when sufficient public, unlabelled data is available, private learning can be made statistically tractable, but the resulting algorithms have all been computationally inefficient. In this work, we present the first computationally efficient algorithms to provably leverage public data to learn privately whenever a function class is learnable non-privately, where our notion of computational efficiency is with respect to the number of calls to an optimization oracle for the function class. In addition to this general result, we provide specialized algorithms with improved sample complexities in the special cases when the function class is convex or when the task is binary classification.
Submitted 13 February, 2024;
originally announced February 2024.
-
Topological inference on brain networks across subtypes of post-stroke aphasia
Authors:
Yuan Wang,
Jian Yin,
Rutvik H. Desai
Abstract:
Persistent homology (PH) characterizes the shape of brain networks through the persistence features. Group comparison of persistence features from brain networks can be challenging as they are inherently heterogeneous. A recent scale-space representation of persistence diagram (PD) through heat diffusion reparameterizes using the finite number of Fourier coefficients with respect to the Laplace-Beltrami (LB) eigenfunction expansion of the domain, which provides a powerful vectorized algebraic representation for group comparisons of PDs. In this study, we advance a transposition-based permutation test for comparing multiple groups of PDs through the heat-diffusion estimates of the PDs. We evaluate the empirical performance of the spectral transposition test in capturing within- and between-group similarity and dissimilarity with respect to statistical variation of topological noise and hole location. We also illustrate how the method extends naturally into a clustering scheme by subtyping individuals with post-stroke aphasia through the PDs of their resting-state functional brain networks.
Submitted 2 November, 2023;
originally announced November 2023.
-
Explaining human body responses in random vibration: Effect of motion direction, sitting posture, and anthropometry
Authors:
M. M. Cvetković,
R. Desai,
K. N. de Winkel,
G. Papaioannou,
R. Happee
Abstract:
This study investigates the effects of anthropometric attributes, biological sex, and posture on translational body kinematic responses in translational vibrations. In total, 35 participants were recruited. Perturbations were applied on a standard car seat using a motion-based platform with 0.1 to 12.0 Hz random noise signals, with 0.3 m/s² rms acceleration, for 60 seconds. Multiple linear regression models (three basic models and one advanced model, including interactions between predictors) were created to determine the most influential predictors of peak translational gains in the frequency domain per body segment (pelvis, trunk, and head). The models introduced experimentally manipulated factors (motion direction, posture, measured anthropometric attributes, and biological sex) as predictors. Effects of included predictors on the model fit were estimated. Basic linear regression models could explain over 70% of peak body segments' kinematic body response (where the adjusted R² was 0.728). The inclusion of additional predictors (posture, body height and weight, and biological sex) did enhance the model fit, but not significantly (adjusted R² was 0.730). The multiple stepwise linear regression, including interactions between predictors, accounted for the data well with an adjusted R² of 0.907. The present study shows that perturbation direction and body segment kinematics are crucial factors influencing peak translational gains. Besides the body segments' response, perturbation direction was the strongest predictor. Adopted postures and biological sex do not significantly affect kinematic responses.
Submitted 21 June, 2023;
originally announced June 2023.
-
Network-based Statistics Distinguish Anomic and Broca Aphasia
Authors:
Xingpei Zhao,
Nicholas Riccardi,
Rutvik H. Desai,
Dirk-Bart den Ouden,
Julius Fridriksson,
Yuan Wang
Abstract:
Aphasia is a speech-language impairment commonly caused by damage to the left hemisphere. Due to the complexity of speech-language processing, the neural mechanisms that underpin various symptoms between different types of aphasia are still not fully understood. We used the network-based statistic method to identify distinct subnetwork(s) of connections differentiating the resting-state functional networks of the anomic and Broca groups. We identified one such subnetwork that mainly involved the brain regions in the premotor, primary motor, primary auditory, and primary sensory cortices in both hemispheres. The majority of connections in the subnetwork were weaker in the Broca group than in the anomic group. The network properties of the subnetwork were examined through complex network measures, which indicated that the regions in the superior temporal gyrus and auditory cortex bilaterally exhibit intensive interaction, and primary motor, premotor and primary sensory cortices in the left hemisphere play an important role in information flow and overall communication efficiency. These findings underlie articulatory difficulties and reduced repetition performance in Broca aphasia, which are rarely observed in anomic aphasia. This research provides novel insights into the resting-state brain network differences between groups of individuals with anomic and Broca aphasia. We identified a subnetwork of connections, rather than isolated connections, that statistically differentiates the resting-state brain networks of the two groups, in comparison with standard lesion symptom mapping results that yield isolated connections.
Submitted 17 February, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
Authors:
Dan Hendrycks,
Steven Basart,
Norman Mu,
Saurav Kadavath,
Frank Wang,
Evan Dorundo,
Rahul Desai,
Tyler Zhu,
Samyak Parajuli,
Mike Guo,
Dawn Song,
Jacob Steinhardt,
Justin Gilmer
Abstract:
We introduce four new real-world distribution shift datasets consisting of changes in image style, image blurriness, geographic location, camera operation, and more. With our new datasets, we take stock of previously proposed methods for improving out-of-distribution robustness and put them to the test. We find that using larger models and artificial data augmentations can improve robustness on real-world distribution shifts, contrary to claims in prior work. We find improvements in artificial robustness benchmarks can transfer to real-world distribution shifts, contrary to claims in prior work. Motivated by our observation that data augmentations can help with real-world distribution shifts, we also introduce a new data augmentation method which advances the state-of-the-art and outperforms models pretrained with 1000 times more labeled data. Overall we find that some methods consistently help with distribution shifts in texture and local image statistics, but these methods do not help with some other distribution shifts like geographic changes. Our results show that future research must study multiple distribution shifts simultaneously, as we demonstrate that no evaluated method consistently improves robustness.
Submitted 24 July, 2021; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Efficient Statistics for Sparse Graphical Models from Truncated Samples
Authors:
Arnab Bhattacharyya,
Rathin Desai,
Sai Ganesh Nagarajan,
Ioannis Panageas
Abstract:
In this paper, we study high-dimensional estimation from truncated samples. We focus on two fundamental and classical problems: (i) inference of sparse Gaussian graphical models and (ii) support recovery of sparse linear models.
(i) For Gaussian graphical models, suppose $d$-dimensional samples ${\bf x}$ are generated from a Gaussian $\mathcal{N}(μ,Σ)$ and observed only if they belong to a subset $S \subseteq \mathbb{R}^d$. We show that $μ$ and $Σ$ can be estimated with error $ε$ in the Frobenius norm, using $\tilde{O}\left(\frac{\textrm{nz}(Σ^{-1})}{ε^2}\right)$ samples from a truncated $\mathcal{N}(μ,Σ)$ and having access to a membership oracle for $S$. The set $S$ is assumed to have non-trivial measure under the unknown distribution but is otherwise arbitrary.
(ii) For sparse linear regression, suppose samples $({\bf x},y)$ are generated where $y = {\bf x}^\top{Ω^*} + \mathcal{N}(0,1)$ and $({\bf x}, y)$ is seen only if $y$ belongs to a truncation set $S \subseteq \mathbb{R}$. We consider the case that $Ω^*$ is sparse with a support set of size $k$. Our main result is to establish precise conditions on the problem dimension $d$, the support size $k$, the number of observations $n$, and properties of the samples and the truncation that are sufficient to recover the support of $Ω^*$. Specifically, we show that under some mild assumptions, only $O(k^2 \log d)$ samples are needed to estimate $Ω^*$ in the $\ell_\infty$-norm up to a bounded error.
For both problems, our estimator minimizes the sum of the finite population negative log-likelihood function and an $\ell_1$-regularization term.
Submitted 17 June, 2020;
originally announced June 2020.
-
Quantifying Error in the Presence of Confounders for Causal Inference
Authors:
Rathin Desai,
Amit Sharma
Abstract:
Estimating average causal effect (ACE) is useful whenever we want to know the effect of an intervention on a given outcome. In the absence of a randomized experiment, many methods such as stratification and inverse propensity weighting have been proposed to estimate ACE. However, it is hard to know which method is optimal for a given dataset or which hyperparameters to use for a chosen method. To this end, we provide a framework to characterize the loss of a causal inference method against the true ACE, by framing causal inference as a representation learning problem. We show that many popular methods, including back-door methods, can be considered as weighting or representation learning algorithms, and provide general error bounds for their causal estimates. In addition, we consider the case when unobserved variables can confound the causal estimate and extend proposed bounds using principles of robust statistics, considering confounding as contamination under the Huber contamination model. These bounds are also estimable; as an example, we provide empirical bounds for the Inverse Propensity Weighting (IPW) estimator and show how the bounds can be used to optimize the threshold of clipping extreme propensity scores. Our work provides a new way to reason about competing estimators, and opens up the potential of deriving new methods by minimizing the proposed error bounds.
Submitted 10 July, 2019;
originally announced July 2019.