-
A Unified Framework for Causal Estimand Selection
Authors:
Martha Barnard,
Jared D. Huling,
Julian Wolfson
Abstract:
Estimating the causal effect of a treatment or health policy with observational data can be challenging due to an imbalance of and a lack of overlap between treated and control covariate distributions. In the presence of limited overlap, researchers choose between 1) methods (e.g., inverse probability weighting) that imply traditional estimands but whose estimators are at risk of considerable bias and variance; and 2) methods (e.g., overlap weighting) which imply a different estimand, thereby modifying the target population to reduce variance. We propose a framework for navigating the tradeoffs between variance and bias due to imbalance and lack of overlap and the targeting of the estimand of scientific interest. We introduce a bias decomposition that encapsulates bias due to 1) the statistical bias of the estimator; and 2) estimand mismatch, i.e., deviation from the population of interest. We propose two design-based metrics and an estimand selection procedure that help illustrate the tradeoffs between these sources of bias and variance of the resulting estimators. Our procedure allows analysts to incorporate their domain-specific preference for preservation of the original research population versus reduction of statistical bias. We demonstrate how to select an estimand based on these preferences with an application to right heart catheterization data.
Submitted 20 March, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Adjacency Matrix Decomposition Clustering for Human Activity Data
Authors:
Martha Barnard,
Yingling Fan,
Julian Wolfson
Abstract:
Mobile apps and wearable devices accurately and continuously measure human activity; patterns within this data can provide a wealth of information applicable to fields such as transportation and health. Despite the potential utility of this data, there has been limited development of analysis methods for sequences of daily activities. In this paper, we propose a novel clustering method and cluster evaluation metric for human activity data that leverages an adjacency matrix representation to cluster the data without the calculation of a distance matrix. Our technique is substantially faster than conventional methods based on computing pairwise distances via sequence alignment algorithms and also enhances interpretability of results. We compare our method to distance-based hierarchical clustering and nTreeClus through simulation studies and an application to data collected by Daynamica, an app that turns sensor data into a daily summary of a user's activities. Among days that contain a large portion of time spent at home, our method distinguishes days that also contain multiple hours of travel or other activities, while both comparison methods fail to identify these patterns. We further identify which day patterns classified by our method are associated with higher concern for contracting COVID-19 with implications for public health messaging.
Submitted 12 September, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Impact of COVID-19 Policies and Misinformation on Social Unrest
Authors:
Martha Barnard,
Radhika Iyer,
Sara Y. Del Valle,
Ashlynn R. Daughton
Abstract:
The novel coronavirus disease (COVID-19) pandemic has impacted every corner of the earth, disrupting governments and leading to socioeconomic instability. This crisis has prompted questions surrounding how different sectors of society interact and influence each other during times of change and stress. Given the unprecedented economic and societal impacts of this pandemic, many new data sources have become available, allowing us to quantitatively explore these associations. Understanding these relationships can help us better prepare for future disasters and mitigate the impacts. Here, we focus on the interplay between social unrest (protests), health outcomes, public health orders, and misinformation in eight countries of Western Europe and four regions of the United States. We created 1-3 week forecasts of both a binary protest metric for identifying times of high protest activity and the overall protest counts over time. We found that for all regions, except Belgium, at least one feature from our various data streams was predictive of protests. However, the accuracy of the protest forecasts varied by country: for roughly half of the countries analyzed, our forecasts outperformed a naïve model. These mixed results demonstrate the potential of diverse data streams to predict a topic as volatile as protests as well as the difficulties of predicting a situation that is as rapidly evolving as a pandemic.
Submitted 7 October, 2021;
originally announced October 2021.
-
Estimating influenza incidence using search query deceptiveness and generalized ridge regression
Authors:
Reid Priedhorsky,
Ashlynn R. Daughton,
Martha Barnard,
Fiona O'Connell,
Dave Osthus
Abstract:
Seasonal influenza is a sometimes surprisingly impactful disease, causing thousands of deaths per year along with much additional morbidity. Timely knowledge of the outbreak state is valuable for managing an effective response. The current state of the art is to gather this knowledge using in-person patient contact. While accurate, this is time-consuming and expensive. This has motivated inquiry into new approaches using internet activity traces, based on the theory that lay observations of health status lead to informative features in internet data.
These approaches risk being deceived by activity traces having a coincidental, rather than informative, relationship to disease incidence; to our knowledge, this risk has not yet been quantitatively explored. We evaluated both simulated and real activity traces of varying deceptiveness for influenza incidence estimation using linear regression.
We found that deceptiveness knowledge does reduce error in such estimates, that it may help automatically selected features perform as well as or better than features that require human curation, and that a semantic distance measure derived from the Wikipedia article category tree serves as a useful proxy for deceptiveness. This suggests that disease incidence estimation models should incorporate not only data about how internet features map to incidence but also additional data to estimate feature deceptiveness. By doing so, we may gain one more step along the path to accurate, reliable disease incidence estimation using internet data. This capability would improve public health by decreasing the cost and increasing the timeliness of such estimates.
Submitted 11 January, 2019;
originally announced January 2019.