-
Investigating the Relationship Between Physical Activity and Tailored Behavior Change Messaging: Connecting Contextual Bandit with Large Language Models
Authors:
Haochen Song,
Dominik Hofer,
Rania Islambouli,
Laura Hawkins,
Ananya Bhattacharjee,
Meredith Franklin,
Joseph Jay Williams
Abstract:
Machine learning approaches, such as contextual multi-armed bandit (cMAB) algorithms, offer a promising strategy to reduce sedentary behavior by delivering personalized interventions to encourage physical activity. However, cMAB algorithms typically require large participant samples to learn effectively and may overlook key psychological factors that are not explicitly encoded in the model. In this study, we propose a hybrid approach that combines cMAB for selecting intervention types with large language models (LLMs) to personalize message content. We evaluate four intervention types: behavioral self-monitoring, gain-framed, loss-framed, and social comparison, each delivered as a motivational message aimed at increasing motivation for physical activity and daily step count. Message content is further personalized using dynamic contextual factors including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over a seven-day trial, participants receive daily messages assigned by one of four models: cMAB alone, LLM alone, combined cMAB with LLM personalization (cMABxLLM), or equal randomization (RCT). Outcomes include daily step count and message acceptance, assessed via ecological momentary assessments (EMAs). We apply a causal inference framework to evaluate the effects of each model. Our findings offer new insights into the complementary roles of LLM-based personalization and cMAB adaptation in promoting physical activity through personalized behavioral messaging.
Submitted 12 June, 2025; v1 submitted 8 June, 2025;
originally announced June 2025.
-
Adaptive Experiments Under High-Dimensional and Data Sparse Settings: Applications for Educational Platforms
Authors:
Haochen Song,
Ilya Musabirov,
Ananya Bhattacharjee,
Audrey Durand,
Meredith Franklin,
Anna Rafferty,
Joseph Jay Williams
Abstract:
In online educational platforms, adaptive experiment designs play a critical role in personalizing learning pathways, instructional sequencing, and content recommendations. Traditional adaptive policies, such as Thompson Sampling, struggle with scalability in high-dimensional and sparse settings, such as when there are a large number of treatments (arms) and limited resources, such as funding, time, and a student sample size constrained by the classroom. Furthermore, the issue of under-exploration in large-scale educational interventions can lead to suboptimal learning recommendations. To address these challenges, we build upon the concept of lenient regret, which tolerates limited suboptimal selections to enhance exploratory learning, and propose a framework for determining the feasible number of treatments given a sample size. We illustrate these ideas with a case study in online educational learnersourcing examples, where adaptive algorithms dynamically allocate peer-crafted interventions to other students during active recall exercises. Our proposed Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS) algorithm enhances the efficiency of treatment allocation by adjusting sampling weights to balance exploration and exploitation in data-sparse environments. We present comparative evaluations of WAPTS across various sample sizes (N=50, 300, 1000) and treatment conditions, demonstrating its ability to mitigate under-exploration while optimizing learning outcomes.
Submitted 24 February, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
Best arm identification in rare events
Authors:
Anirban Bhattacharjee,
Sushant Vijayan,
Sandeep K Juneja
Abstract:
We consider the best arm identification problem in the stochastic multi-armed bandit framework where each arm has a tiny probability of realizing large rewards while with overwhelming probability the reward is zero. A key application of this framework is in online advertising where click rates of advertisements could be a fraction of a single percent and final conversion to sales, while highly profitable, may again be a small fraction of the click rates. Lately, algorithms for BAI problems have been developed that minimise sample complexity while providing statistical guarantees on the correct arm selection. As we observe, these algorithms can be computationally prohibitive. We exploit the fact that the reward process for each arm is well approximated by a Compound Poisson process to arrive at algorithms that are faster, with a small increase in sample complexity. We analyze the problem in an asymptotic regime as rarity of reward occurrence reduces to zero, and reward amounts increase to infinity. This helps illustrate the benefits of the proposed algorithm. It also sheds light on the underlying structure of the optimal BAI algorithms in the rare event setting.
Submitted 14 March, 2023;
originally announced March 2023.
-
Handling Missingness Value on Jointly Measured Time-Course and Time-to-event Data
Authors:
Gajendra K. Vishwakarma,
Atanu Bhattacharjee,
Souvik Banerjee
Abstract:
The joint modeling technique is a recent advancement in effectively analyzing the longitudinal history of patients together with the occurrence of an associated event of interest. This procedure has been successfully implemented in biomarker studies to examine patients with the occurrence of a tumor. One of the typical problems that influences the necessary inference is the presence of missing values in the longitudinal responses as well as in the covariates. Missingness is very common due to the dropout of patients from the study. This article presents an effective and detailed way to handle missing values in the covariates and the response variable. This study discusses the effect of different multiple imputation techniques on the inferences of joint modeling implemented on imputed datasets. A simulation study is carried out to replicate the complex data structures and conveniently perform our analysis to show its efficacy in terms of parameter estimation. This analysis is further illustrated with the longitudinal and survival outcomes of a biomarker study, using code written in the R programming language.
Submitted 7 January, 2021;
originally announced January 2021.
-
A modified risk detection approach of biomarkers by frailty effect on multiple time to event data
Authors:
Atanu Bhattacharjee,
Gajendra K. Vishwakarma,
Souvik Banerjee
Abstract:
Multiple indications of disease progression can be found in a cancer patient, such as loco-regional relapse, distant metastasis, and death. Early identification of these indications is necessary to change the treatment strategy. Biomarkers play an essential role in this aspect. The survival chance of a patient depends on the biomarker, and the treatment strategy differs accordingly; for example, the survival prediction for breast cancer patients diagnosed with HER2-positive status is different from that for patients with HER2-negative status. This results in a different treatment strategy. So, the heterogeneity of the biomarker statuses or levels should be taken into consideration while modelling the survival outcome. This heterogeneity factor, which is often unobserved, is called frailty. When multiple indications are present simultaneously, the scenario becomes more complex as only one of them can occur, which will censor the occurrence of other events. Incorporating independent frailties of each biomarker status for every cause of indications will not depict the complete picture of heterogeneity. The events indicating cancer progression are likely to be inter-related. So, the correlation should be incorporated through the frailties of different events. In our study, we considered a multiple events or risks model with a heterogeneity component. Based on the estimated variance of the frailty, the threshold levels of a biomarker are utilised as an early detection tool for disease progression or death. An additive-gamma frailty model is considered to account for the correlation between different frailty components, and estimation of parameters is performed using the Expectation-Maximization algorithm. Using an extensive algorithm implemented in R, we have obtained the threshold levels of activity of a biomarker in a multiple events scenario.
Submitted 22 July, 2021; v1 submitted 26 November, 2020;
originally announced December 2020.
-
Nowcasting Growth using Google Trends Data: A Bayesian Structural Time Series Model
Authors:
David Kohns,
Arnab Bhattacharjee
Abstract:
This paper investigates the benefits of internet search data in the form of Google Trends for nowcasting real U.S. GDP growth in real time through the lens of mixed frequency Bayesian Structural Time Series (BSTS) models. We augment and enhance both model and methodology to make these more amenable to nowcasting with a large number of potential covariates. Specifically, we allow shrinking state variances towards zero to avoid overfitting, extend the SSVS (spike and slab variable selection) prior to the more flexible normal-inverse-gamma prior which stays agnostic about the underlying model size, as well as adapt the horseshoe prior to the BSTS. The application to nowcasting GDP growth as well as a simulation study demonstrate that the horseshoe prior BSTS improves markedly upon the SSVS and the original BSTS model, with the largest gains in dense data-generating processes. Our application also shows that a large dimensional set of search terms is able to improve nowcasts early in a specific quarter before other macroeconomic data become available. Search terms with high inclusion probability have good economic interpretation, reflecting leading signals of economic anxiety and wealth effects.
Submitted 15 May, 2022; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Classification Algorithm for High Dimensional Protein Markers in Time-course Data
Authors:
Souvik Banerjee,
Gajendra K. Vishwakarma,
Atanu Bhattacharjee
Abstract:
Identification of biomarkers is an emerging area in oncology. In this article, we develop an efficient statistical procedure for classification of protein markers according to their effect on cancer progression. A high-dimensional time-course dataset of protein markers for 80 patients motivates the development of the model. We obtain the optimal threshold values for markers using the Cox proportional hazard model. The optimal threshold value is defined as the level of a marker having maximum impact on cancer progression. The classification was validated by comparing random components using both proportional hazard and accelerated failure time frailty models. The study elucidates the application of two separate joint modeling techniques, using an autoregressive-type model and a mixed effect model for time-course data and a proportional hazard model for survival data, with proper utilization of Bayesian methodology. Also, a prognostic score has been developed on the basis of a few selected genes, with application to patients. The complete analysis is performed using R code. This study facilitates the identification of relevant biomarkers from a set of markers.
Submitted 7 January, 2020; v1 submitted 30 July, 2019;
originally announced July 2019.