-
Automated Survey Collection with LLM-based Conversational Agents
Authors:
Kurmanbek Kaiyrbekov,
Nicholas J Dobbins,
Sean D Mooney
Abstract:
Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data, however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs).
Materials and Methods: Our framewo…
▽ More
Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data, however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs).
Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants consisting of 5 native and 3 non-native english speakers and administered 40 surveys. We evaluated the correctness of LLM-generated conversation transcripts, accuracy of survey responses inferred by GPT-4o and overall participant experience.
Results: Survey responses were successfully extracted by GPT-4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction.
Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Using agent-based models and EXplainable Artificial Intelligence (XAI) to simulate social behaviors and policy intervention scenarios: A case study of private well users in Ireland
Authors:
Rabia Asghar,
Simon Mooney,
Eoin O Neill,
Paul Hynds
Abstract:
Around 50 percent of Irelands rural population relies on unregulated private wells vulnerable to agricultural runoff and untreated wastewater. High national rates of Shiga toxin-producing Escherichia coli (STEC) and other waterborne illnesses have been linked to well water exposure. Periodic well testing is essential for public health, yet the lack of government incentives places the financial bur…
▽ More
Around 50 percent of Irelands rural population relies on unregulated private wells vulnerable to agricultural runoff and untreated wastewater. High national rates of Shiga toxin-producing Escherichia coli (STEC) and other waterborne illnesses have been linked to well water exposure. Periodic well testing is essential for public health, yet the lack of government incentives places the financial burden on households. Understanding environmental, cognitive, and material factors influencing well-testing behavior is critical.
This study employs Agent-Based Modeling (ABM) to simulate policy interventions based on national survey data. The ABM framework, designed for private well-testing behavior, integrates a Deep Q-network reinforcement learning model and Explainable AI (XAI) for decision-making insights. Key features were selected using Recursive Feature Elimination (RFE) with 10-fold cross-validation, while SHAP (Shapley Additive Explanations) provided further interpretability for policy recommendations.
Fourteen policy scenarios were tested. The most effective, Free Well Testing plus Communication Campaign, increased participation to 435 out of 561 agents, from a baseline of approximately 5 percent, with rapid behavioral adaptation. Free Well Testing plus Regulation also performed well, with 433 out of 561 agents initiating well testing. Free testing alone raised participation to over 75 percent, with some agents testing multiple times annually. Scenarios with free well testing achieved faster learning efficiency, converging in 1000 episodes, while others took 2000 episodes, indicating slower adaptation.
This research demonstrates the value of ABM and XAI in public health policy, providing a framework for evaluating behavioral interventions in environmental health.
△ Less
Submitted 8 February, 2025;
originally announced February 2025.
-
A Pilot Study of Sidewalk Equity in Seattle Using Crowdsourced Sidewalk Assessment Data
Authors:
Chu Li,
Lisa Orii,
Mikey Saugstad,
Stephen J. Mooney,
Yochai Eisenberg,
Delphine Labbé,
Joy Hammel,
Jon E. Froehlich
Abstract:
We examine the potential of using large-scale open crowdsourced sidewalk data from Project Sidewalk to study the distribution and condition of sidewalks in Seattle, WA. While potentially noisier than professionally gathered sidewalk datasets, crowdsourced data enables large, cross-regional studies that would be otherwise expensive and difficult to manage. As an initial case study, we examine spati…
▽ More
We examine the potential of using large-scale open crowdsourced sidewalk data from Project Sidewalk to study the distribution and condition of sidewalks in Seattle, WA. While potentially noisier than professionally gathered sidewalk datasets, crowdsourced data enables large, cross-regional studies that would be otherwise expensive and difficult to manage. As an initial case study, we examine spatial patterns of sidewalk quality in Seattle and their relationship to racial diversity, income level, built density, and transit modes. We close with a reflection on our approach, key limitations, and opportunities for future work.
△ Less
Submitted 5 October, 2022;
originally announced November 2022.
-
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models
Authors:
Chao Yan,
Yao Yan,
Zhiyu Wan,
Ziqi Zhang,
Larsson Omberg,
Justin Guinney,
Sean D. Mooney,
Bradley A. Malin
Abstract:
Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic ass…
▽ More
Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records (EHRs) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
△ Less
Submitted 1 August, 2022;
originally announced August 2022.
-
The NLP Sandbox: an efficient model-to-data system to enable federated and unbiased evaluation of clinical NLP models
Authors:
Yao Yan,
Thomas Yu,
Kathleen Muenzen,
Sijia Liu,
Connor Boyle,
George Koslowski,
Jiaxin Zheng,
Nicholas Dobbins,
Clement Essien,
Hongfang Liu,
Larsson Omberg,
Meliha Yestigen,
Bradley Taylor,
James A Eddy,
Justin Guinney,
Sean Mooney,
Thomas Schaffter
Abstract:
Objective The evaluation of natural language processing (NLP) models for clinical text de-identification relies on the availability of clinical notes, which is often restricted due to privacy concerns. The NLP Sandbox is an approach for alleviating the lack of data and evaluation frameworks for NLP models by adopting a federated, model-to-data approach. This enables unbiased federated model evalua…
▽ More
Objective The evaluation of natural language processing (NLP) models for clinical text de-identification relies on the availability of clinical notes, which is often restricted due to privacy concerns. The NLP Sandbox is an approach for alleviating the lack of data and evaluation frameworks for NLP models by adopting a federated, model-to-data approach. This enables unbiased federated model evaluation without the need for sharing sensitive data from multiple institutions. Materials and Methods We leveraged the Synapse collaborative framework, containerization software, and OpenAPI generator to build the NLP Sandbox (nlpsandbox.io). We evaluated two state-of-the-art NLP de-identification focused annotation models, Philter and NeuroNER, using data from three institutions. We further validated model performance using data from an external validation site. Results We demonstrated the usefulness of the NLP Sandbox through de-identification clinical model evaluation. The external developer was able to incorporate their model into the NLP Sandbox template and provide user experience feedback. Discussion We demonstrated the feasibility of using the NLP Sandbox to conduct a multi-site evaluation of clinical text de-identification models without the sharing of data. Standardized model and data schemas enable smooth model transfer and implementation. To generalize the NLP Sandbox, work is required on the part of data owners and model developers to develop suitable and standardized schemas and to adapt their data or model to fit the schemas. Conclusions The NLP Sandbox lowers the barrier to utilizing clinical data for NLP model evaluation and facilitates federated, multi-site, unbiased evaluation of NLP models.
△ Less
Submitted 28 June, 2022;
originally announced June 2022.
-
Indicators of retention in remote digital health studies: A cross-study evaluation of 100,000 participants
Authors:
Abhishek Pratap,
Elias Chaibub Neto,
Phil Snyder,
Carl Stepnowsky,
Noémie Elhadad,
Daniel Grant,
Matthew H. Mohebbi,
Sean Mooney,
Christine Suver,
John Wilbanks,
Lara Mangravite,
Patrick Heagerty,
Pat Arean,
Larsson Omberg
Abstract:
Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representativeness of the collected data including generaliza…
▽ More
Digital technologies such as smartphones are transforming the way scientists conduct biomedical research using real-world data. Several remotely-conducted studies have recruited thousands of participants over a span of a few months. Unfortunately, these studies are hampered by substantial participant attrition, calling into question the representativeness of the collected data including generalizability of findings from these studies. We report the challenges in retention and recruitment in eight remote digital health studies comprising over 100,000 participants who participated for more than 850,000 days, completing close to 3.5 million remote health evaluations. Survival modeling surfaced several factors significantly associated(P < 1e-16) with increase in median retention time i) Clinician referral(increase of 40 days), ii) Effect of compensation (22 days), iii) Clinical conditions of interest to the study (7 days) and iv) Older adults(4 days). Additionally, four distinct patterns of daily app usage behavior that were also associated(P < 1e-10) with participant demographics were identified. Most studies were not able to recruit a representative sample, either demographically or regionally. Combined together these findings can help inform recruitment and retention strategies to enable equitable participation of populations in future digital health research.
△ Less
Submitted 2 October, 2019;
originally announced October 2019.
-
Context-Aware Systems for Sequential Item Recommendation
Authors:
Moin Nadeem,
Dustin Stansbury,
Shane Mooney
Abstract:
Quizlet is the most popular online learning tool in the United States, and is used by over 2/3 of high school students, and 1/2 of college students. With more than 95% of Quizlet users reporting improved grades as a result, the platform has become the de-facto tool used in millions of classrooms. In this paper, we explore the task of recommending suitable content for a student to study, given thei…
▽ More
Quizlet is the most popular online learning tool in the United States, and is used by over 2/3 of high school students, and 1/2 of college students. With more than 95% of Quizlet users reporting improved grades as a result, the platform has become the de-facto tool used in millions of classrooms. In this paper, we explore the task of recommending suitable content for a student to study, given their prior interests, as well as what their peers are studying. We propose a novel approach, i.e. Neural Educational Recommendation Engine (NERE), to recommend educational content by leveraging student behaviors rather than ratings. We have found that this approach better captures social factors that are more aligned with learning. NERE is based on a recurrent neural network that includes collaborative and content-based approaches for recommendation, and takes into account any particular student's speed, mastery, and experience to recommend the appropriate task. We train NERE by jointly learning the user embeddings and content embeddings, and attempt to predict the content embedding for the final timestamp. We also develop a confidence estimator for our neural network, which is a crucial requirement for productionizing this model. We apply NERE to Quizlet's proprietary dataset, and present our results. We achieved an R^2 score of 0.81 in the content embedding space, and a recall score of 54% on our 100 nearest neighbors. This vastly exceeds the recall@100 score of 12% that a standard matrix-factorization approach provides. We conclude with a discussion on how NERE will be deployed, and position our work as one of the first educational recommender systems for the K-12 space.
△ Less
Submitted 4 April, 2019; v1 submitted 20 September, 2018;
originally announced September 2018.
-
Predicting Recall Probability to Adaptively Prioritize Study
Authors:
Shane Mooney,
Karen Sun,
Eric Bomgardner
Abstract:
Students have a limited time to study and are typically ineffective at allocating study time. Machine-directed study strategies that identify which items need reinforcement and dictate the spacing of repetition have been shown to help students optimize mastery (Mozer & Lindsey 2017). The large volume of research on this matter is typically conducted in constructed experimental settings with fixed…
▽ More
Students have a limited time to study and are typically ineffective at allocating study time. Machine-directed study strategies that identify which items need reinforcement and dictate the spacing of repetition have been shown to help students optimize mastery (Mozer & Lindsey 2017). The large volume of research on this matter is typically conducted in constructed experimental settings with fixed instruction, content, and scheduling; in contrast, we aim to develop methods that can address any demographic, subject matter, or study schedule. We show two methods that model item-specific recall probability for use in a discrepancy-reduction instruction strategy. The first model predicts item recall probability using a multiple logistic regression (MLR) model based on previous answer correctness and temporal spacing of study. Prompted by literature suggesting that forgetting is better modeled by the power law than an exponential decay (Wickelgren 1974), we compare the MLR approach with a Recurrent Power Law (RPL) model which adaptively fits a forgetting curve. We then discuss the performance of these models against study datasets comprised of millions of answers and show that the RPL approach is more accurate and flexible than the MLR model. Finally, we give an overview of promising future approaches to knowledge modeling.
△ Less
Submitted 28 February, 2018;
originally announced March 2018.