-
Conversations with AI Chatbots Increase Short-Term Vaccine Intentions But Do Not Outperform Standard Public Health Messaging
Authors:
Neil K. R. Sehgal,
Sunny Rai,
Manuel Tonneau,
Anish K. Agarwal,
Joseph Cappella,
Melanie Kornides,
Lyle Ungar,
Alison Buttenheim,
Sharath Chandra Guntuku
Abstract:
Large language model (LLM) based chatbots show promise in persuasive communication, but existing studies often rely on weak controls or focus on belief change rather than behavioral intentions or outcomes. This pre-registered multi-country (US, Canada, UK) randomized controlled trial involving 930 vaccine-hesitant parents evaluated brief (three-minute) multi-turn conversations with LLM-based chatb…
▽ More
Large language model (LLM) based chatbots show promise in persuasive communication, but existing studies often rely on weak controls or focus on belief change rather than behavioral intentions or outcomes. This pre-registered multi-country (US, Canada, UK) randomized controlled trial involving 930 vaccine-hesitant parents evaluated brief (three-minute) multi-turn conversations with LLM-based chatbots against standard public health messaging approaches for increasing human papillomavirus (HPV) vaccine intentions for their children. Participants were randomly assigned to: (1) a weak control (no message), (2) a strong control reflecting the standard of care (reading official public health materials), or (3 and 4) one of two chatbot conditions. One chatbot was prompted to deliver short, conversational responses, while the other used the model's default output style (longer with bullet points). While chatbot interactions significantly increased self-reported vaccination intent (by 7.1-10.3 points on a 100-point scale) compared to no message, they did not outperform standard public health materials, with the conversational chatbot performing significantly worse. Additionally, while the short-term effects of chatbot interactions faded during a 15-day follow-up, the effects of public health material persisted through a 45-day follow-up relative to no message. These findings suggest that while LLMs can effectively shift vaccination intentions in the short-term, their incremental value over existing public health communications is questionable, offering a more tempered view of their persuasive capabilities and highlighting the importance of integrating AI-driven tools alongside, rather than replacing, current public health strategies.
△ Less
Submitted 26 June, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Authors:
Jabez Magomere,
Emanuele La Malfa,
Manuel Tonneau,
Ashkan Kazemi,
Scott Hale
Abstract:
Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a pertur…
▽ More
Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a perturbation framework that generates valid and natural claim variations, enabling us to assess the robustness of a wide-range of sentence embedding models in a multi-stage retrieval pipeline and evaluate the effectiveness of various mitigation approaches. Our evaluation reveals that standard embedding models exhibit notable performance drops on edited claims, while LLM-distilled embedding models offer improved robustness at a higher computational cost. Although a strong reranker helps to reduce the performance drop, it cannot fully compensate for first-stage retrieval gaps. To address these retrieval gaps, we evaluate train- and inference-time mitigation approaches, demonstrating that they can improve in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation. Code and data are available at https://github.com/JabezNzomo99/claim-matching-robustness.
△ Less
Submitted 5 June, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Authors:
Manuel Tonneau,
Diyi Liu,
Niyati Malhotra,
Scott A. Hale,
Samuel P. Fraiberger,
Victor Orozco-Olvera,
Paul Röttger
Abstract:
To address the global challenge of online hate speech, prior research has developed detection models to flag such content on social media. However, due to systematic biases in evaluation datasets, the real-world effectiveness of these models remains unclear, particularly across geographies. We introduce HateDay, the first global hate speech dataset representative of social media settings, construc…
▽ More
To address the global challenge of online hate speech, prior research has developed detection models to flag such content on social media. However, due to systematic biases in evaluation datasets, the real-world effectiveness of these models remains unclear, particularly across geographies. We introduce HateDay, the first global hate speech dataset representative of social media settings, constructed from a random sample of all tweets posted on September 21, 2022 and covering eight languages and four English-speaking countries. Using HateDay, we uncover substantial variation in the prevalence and composition of hate speech across languages and regions. We show that evaluations on academic datasets greatly overestimate real-world detection performance, which we find is very low, especially for non-European languages. Our analysis identifies key drivers of this gap, including models' difficulty to distinguish hate from offensive speech and a mismatch between the target groups emphasized in academic datasets and those most frequently targeted in real-world settings. We argue that poor model performance makes public models ill-suited for automatic hate speech moderation and find that high moderation rates are only achievable with substantial human oversight. Our results underscore the need to evaluate detection systems on data that reflects the complexity and diversity of real-world social media.
△ Less
Submitted 3 June, 2025; v1 submitted 23 November, 2024;
originally announced November 2024.
-
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
Authors:
Manuel Tonneau,
Diyi Liu,
Samuel Fraiberger,
Ralph Schroeder,
Scott A. Hale,
Paul Röttger
Abstract:
Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We cond…
▽ More
Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages and confirm past findings on their English-language bias, but also show that this bias has been steadily decreasing in the past few years. For three geographically-widespread languages -- English, Arabic and Spanish -- we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., US and UK for English) relative to their prominence in both the broader social media population and the general population speaking these languages. Based on these findings, we formulate recommendations for the creation of future HS datasets.
△ Less
Submitted 19 May, 2025; v1 submitted 27 April, 2024;
originally announced April 2024.
-
NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data
Authors:
Manuel Tonneau,
Pedro Vitor Quinta de Castro,
Karim Lasri,
Ibrahim Farouq,
Lakshminarayanan Subramanian,
Victor Orozco-Olvera,
Samuel P. Fraiberger
Abstract:
To address the global issue of online hate, hate speech detection (HSD) systems are typically developed on datasets from the United States, thereby failing to generalize to English dialects from the Majority World. Furthermore, HSD models are often evaluated on non-representative samples, raising concerns about overestimating model performance in real-world settings. In this work, we introduce Nai…
▽ More
To address the global issue of online hate, hate speech detection (HSD) systems are typically developed on datasets from the United States, thereby failing to generalize to English dialects from the Majority World. Furthermore, HSD models are often evaluated on non-representative samples, raising concerns about overestimating model performance in real-world settings. In this work, we introduce NaijaHate, the first dataset annotated for HSD which contains a representative sample of Nigerian tweets. We demonstrate that HSD evaluated on biased datasets traditionally used in the literature consistently overestimates real-world performance by at least two-fold. We then propose NaijaXLM-T, a pretrained model tailored to the Nigerian Twitter context, and establish the key role played by domain-adaptive pretraining and finetuning in maximizing HSD performance. Finally, owing to the modest performance of HSD systems in real-world conditions, we find that content moderators would need to review about ten thousand Nigerian tweets flagged as hateful daily to moderate 60% of all hateful content, highlighting the challenges of moderating hate speech at scale as social media usage continues to grow globally. Taken together, these results pave the way towards robust HSD systems and a better protection of social media users from hateful content in low-resource settings.
△ Less
Submitted 24 June, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Indian-BhED: A Dataset for Measuring India-Centric Biases in Large Language Models
Authors:
Khyati Khandelwal,
Manuel Tonneau,
Andrew M. Bean,
Hannah Rose Kirk,
Scott A. Hale
Abstract:
Large Language Models (LLMs), now used daily by millions, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an In…
▽ More
Large Language Models (LLMs), now used daily by millions, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame through Indian-BhED, a first of its kind dataset, containing stereotypical and anti-stereotypical examples in the context of caste and religious stereotypes in India. We find that the majority of LLMs tested have a strong propensity to output stereotypes in the Indian context, especially when compared to axes of bias traditionally studied in the Western context, such as gender and race. Notably, we find that GPT-2, GPT-2 Large, and GPT 3.5 have a particularly high propensity for preferring stereotypical outputs as a percent of all sentences for the axes of caste (63-79%) and religion (69-72%). We finally investigate potential causes for such harmful behaviour in LLMs, and posit intervention techniques to reduce both stereotypical and anti-stereotypical biases. The findings of this work highlight the need for including more diverse voices when researching fairness in AI and evaluating LLMs.
△ Less
Submitted 9 August, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Multilingual Detection of Personal Employment Status on Twitter
Authors:
Manuel Tonneau,
Dhaval Adjodah,
João Palotti,
Nir Grinberg,
Samuel Fraiberger
Abstract:
Detecting disclosures of individuals' employment status on social media can provide valuable information to match job seekers with suitable vacancies, offer social protection, or measure labor market flows. However, identifying such personal disclosures is a challenging task due to their rarity in a sea of social media content and the variety of linguistic forms used to describe them. Here, we exa…
▽ More
Detecting disclosures of individuals' employment status on social media can provide valuable information to match job seekers with suitable vacancies, offer social protection, or measure labor market flows. However, identifying such personal disclosures is a challenging task due to their rarity in a sea of social media content and the variety of linguistic forms used to describe them. Here, we examine three Active Learning (AL) strategies in real-world settings of extreme class imbalance, and identify five types of disclosures about individuals' employment status (e.g. job loss) in three languages using BERT-based classification models. Our findings show that, even under extreme imbalance settings, a small number of AL iterations is sufficient to obtain large and significant gains in precision, recall, and diversity of results compared to a supervised baseline with the same number of labels. We also find that no AL strategy consistently outperforms the rest. Qualitative analysis suggests that AL helps focus the attention mechanism of BERT on core terms and adjust the boundaries of semantic expansion, highlighting the importance of interpretable models to provide greater control and visibility into this dynamic learning process.
△ Less
Submitted 17 March, 2022;
originally announced March 2022.