-
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Authors:
Genta Indra Winata,
David Anugraha,
Emmy Liu,
Alham Fikri Aji,
Shou-Yi Hung,
Aditya Parashar,
Patrick Amadeus Irawan,
Ruochen Zhang,
Zheng-Xin Yong,
Jan Christian Blaise Cruz,
Niklas Muennighoff,
Seungone Kim,
Hanyang Zhao,
Sudipta Kar,
Kezia Erina Suryoraharjo,
M. Farid Adilazuarda,
En-Shiun Annie Lee,
Ayu Purwarianti,
Derry Tanti Wijaya,
Monojit Choudhury
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
Submitted 3 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform
Authors:
Hayoung Jung,
Shravika Mittal,
Ananya Aatreya,
Navreet Kaur,
Munmun De Choudhury,
Tanushree Mitra
Abstract:
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid use disorder (OUD), a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while it is estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
Submitted 30 May, 2025;
originally announced June 2025.
-
REDDIX-NET: A Novel Dataset and Benchmark for Moderating Online Explicit Services
Authors:
MSVPJ Sathvik,
Manan Roy Choudhury,
Rishita Agarwal,
Sathwik Narkedimilli,
Vivek Gupta
Abstract:
The rise of online platforms has enabled covert illicit activities, including online prostitution, that pose challenges for detection and regulation. In this study, we introduce REDDIX-NET, a novel benchmark dataset specifically designed for moderating online sexual services and going beyond traditional NSFW filters. The dataset is derived from thousands of web-scraped NSFW posts on Reddit and categorizes users into six behavioral classes reflecting different service offerings and user intentions. We evaluate the classification performance of state-of-the-art large language models (GPT-4, Llama 3.3-70B-Instruct, Gemini 1.5 Flash, Mistral 8x7B, Qwen 2.5 Turbo, Claude 3.5 Haiku) using advanced quantitative metrics, finding promising results with models like GPT-4 and Gemini 1.5 Flash. Beyond classification, we conduct sentiment and comment analysis, leveraging LLM- and PLM-based approaches and metadata extraction to uncover behavioral and temporal patterns. These analyses reveal peak engagement times and distinct user interaction styles across categories. Our findings provide critical insights into AI-driven moderation and enforcement, offering a scalable framework for platforms to combat online prostitution and associated harms.
Submitted 29 May, 2025;
originally announced May 2025.
-
Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations
Authors:
Mohit Chandra,
Siddharth Sriraman,
Harneet Singh Khanuja,
Yiqiao Jin,
Munmun De Choudhury
Abstract:
Limited access to mental healthcare, extended wait times, and the increasing capabilities of Large Language Models (LLMs) have led individuals to turn to LLMs to fulfill their mental health needs. However, the multi-turn mental health conversation capabilities of LLMs remain under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with the patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations, and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle at advanced diagnostic capabilities, with an average score of 31%. Additionally, we observed variation in model performance based on the patient's persona and a performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset, and an evaluation framework for assessing LLMs in multi-turn mental health conversations.
Submitted 28 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Communication Styles and Reader Preferences of LLM and Human Experts in Explaining Health Information
Authors:
Jiawei Zhou,
Kritika Venkatachalam,
Minje Choi,
Koustuv Saha,
Munmun De Choudhury
Abstract:
With the wide adoption of large language models (LLMs) in information assistance, it is essential to examine their alignment with human communication styles and values. We situate this study within the context of fact-checking health information, given the critical challenge of rectifying conceptions and building trust. Recent studies have explored the potential of LLMs for health communication, but style differences between LLMs and human experts and associated reader perceptions remain under-explored. In this light, our study evaluates the communication styles of LLMs, focusing on how their explanations differ from those of humans in three core components of health communication: information, sender, and receiver. We compiled a dataset of 1498 health misinformation explanations from authoritative fact-checking organizations and generated LLM responses to inaccurate health information. Drawing from health communication theory, we evaluate communication styles across three key dimensions: information linguistic features, sender persuasive strategies, and receiver value alignments. We further assessed human perceptions through a blinded evaluation with 99 participants. Our findings reveal that LLM-generated articles showed significantly lower scores in persuasive strategies, certainty expressions, and alignment with social values and moral foundations. However, human evaluation demonstrated a strong preference for LLM content, with over 60% of responses favoring LLM articles for clarity, completeness, and persuasiveness. Our results suggest that LLMs' structured approach to presenting information may be more effective at engaging readers despite scoring lower on traditional measures of quality in fact-checking and health communication.
Submitted 12 May, 2025;
originally announced May 2025.
-
Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind
Authors:
Mouad Abrini,
Omri Abend,
Dina Acklin,
Henny Admoni,
Gregor Aichinger,
Nitay Alon,
Zahra Ashktorab,
Ashish Atreja,
Moises Auron,
Alexander Aufreiter,
Raghav Awasthi,
Soumya Banerjee,
Joe M. Barnby,
Rhea Basappa,
Severin Bergsmann,
Djallel Bouneffouf,
Patrick Callaghan,
Marc Cavazza,
Thierry Chaminade,
Sonia Chernova,
Mohamed Chetouan,
Moumita Choudhury,
Axel Cleeremans,
Jacek B. Cywinski,
Fabio Cuzzolin
, et al. (83 additional authors not shown)
Abstract:
This volume includes a selection of papers presented at the Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2025 in Philadelphia US on 3rd March 2025. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.
Submitted 28 April, 2025;
originally announced May 2025.
-
Constraints on the state of the IGM at $z\sim 8-10$ using redshifted 21-cm observations with LOFAR
Authors:
R. Ghara,
S. Zaroubi,
B. Ciardi,
G. Mellema,
S. K. Giri,
F. G. Mertens,
M. Mevius,
L. V. E. Koopmans,
I. T. Iliev,
A. Acharya,
S. A. Brackenhoff,
E. Ceccotti,
K. Chege,
I. Georgiev,
S. Ghosh,
I. Hothi,
C. Höfer,
Q. Ma,
S. Munshi,
A. R. Offringa,
A. K. Shaw,
V. N. Pandey,
S. Yatawatta,
M. Choudhury
Abstract:
The power spectra of the redshifted 21-cm signal from the Epoch of Reionization (EoR) contain information about the ionization and thermal states of the intergalactic medium (IGM), and depend on the properties of the EoR sources. Recently, Mertens et al. (2025) analysed 10 nights of LOFAR high-band data and estimated upper limits on the 21-cm power spectrum at redshifts 8.3, 9.1 and 10.1. Here we use these upper limit results to constrain the properties of the IGM at those redshifts. We focus on the properties of the ionized and heated regions where the temperature is larger than that of the CMB. We model the 21-cm power spectrum with the code GRIZZLY, and use a Bayesian inference framework to explore the source parameters for uniform priors on their ranges. The framework also provides information about the IGM properties in the form of derived parameters. In a model which includes a radio background in excess of the CMB, the 95 (68) per cent credible intervals of disfavoured models at redshift 9.1 for the chosen priors correspond to IGM states with averaged ionization and heated fraction below 0.46 ($\lesssim 0.05$), an average gas temperature below 44 K (4 K), and a characteristic size of the heated region $\lesssim 14 ~h^{-1} ~\mathrm{Mpc}$ ($\lesssim 3 ~h^{-1} ~\mathrm{Mpc}$). The 68 per cent credible interval suggests an excess radio background which is more than 100 per cent of the CMB at 1.42 GHz, while the 95 per cent credible interval of the radio background efficiency parameter spans the entire prior range. The behaviour of the credible intervals is similar at all redshifts. The models disfavoured by the LOFAR upper limits are extreme ones, as they are mainly driven by rare and large ionized or heated regions.
Submitted 1 May, 2025;
originally announced May 2025.
-
Modeling of Experimentally Observed Two-Dimensional Precursor Solitons in a Dusty Plasma by the forced Kadomtsev-Petviashvili Equation
Authors:
Ajaz Mir,
Pintu Bandyopadhyay,
Madhurima Choudhury,
Krishan Kumar,
Abhijit Sen
Abstract:
We compare model solutions of a forced Kadomtsev-Petviashvili (fKP) equation with experimental observations of dust acoustic precursor solitons excited by a supersonically moving charged cylindrical object in a dusty plasma medium. The fKP equation is derived from a three-fluid-Poisson model of the dusty plasma using the reductive perturbation technique and numerically solved for parameters close to the experimental investigations of cylindrical precursor solitons. The fKP model solutions show excellent agreement with the experimental results in reproducing the prominent geometric features of the two-dimensional solitons and closely matching the quantitative values of their velocities, amplitudes, and temporal evolutions. Our findings suggest that the fKP equation can serve as a very realistic model to investigate the dynamics of precursor solitons and can be usefully employed in practical applications such as space debris detection and tracking techniques that are based on observing/predicting nonlinear plasma excitations induced by the debris in the ionosphere.
Submitted 23 April, 2025;
originally announced April 2025.
-
Exposure to Content Written by Large Language Models Can Reduce Stigma Around Opioid Use Disorder in Online Communities
Authors:
Shravika Mittal,
Darshi Shah,
Shin Won Do,
Mai ElSherief,
Tanushree Mitra,
Munmun De Choudhury
Abstract:
Widespread stigma, in both offline and online spaces, acts as a barrier to harm reduction efforts in the context of opioid use disorder (OUD). This stigma is prominently directed towards clinically approved medications for addiction treatment (MAT), people with the condition, and the condition itself. Given the potential of artificial intelligence-based technologies in promoting health equity and facilitating empathic conversations, this work examines whether large language models (LLMs) can help abate OUD-related stigma in online communities. To answer this, we conducted a series of pre-registered randomized controlled experiments, where participants read LLM-generated, human-written, or no responses to help-seeking OUD-related content in online communities. The experiment was conducted under two setups, i.e., participants read the responses either once (N = 2,141), or repeatedly for 14 days (N = 107). We found that participants reported the least stigmatized attitudes toward MAT after consuming LLM-generated responses under both setups. This study offers insights into strategies that can foster inclusive online discourse on OUD; for example, our findings suggest that LLMs can be used as an education-based intervention to promote positive attitudes and increase people's propensity toward MAT.
Submitted 8 April, 2025;
originally announced April 2025.
-
Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries
Authors:
Koustuv Saha,
Yoshee Jain,
Munmun De Choudhury
Abstract:
The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, its effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their (AI) responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack the linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as their neutrality of stance and the absence of back-and-forth clarification seeking. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI's scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.
Submitted 12 April, 2025;
originally announced April 2025.
-
Large-Scale Analysis of Online Questions Related to Opioid Use Disorder on Reddit
Authors:
Tanmay Laud,
Akadia Kacha-Ochana,
Steven A. Sumner,
Vikram Krishnasamy,
Royal Law,
Lyna Schieber,
Munmun De Choudhury,
Mai ElSherief
Abstract:
Opioid use disorder (OUD) is a leading health problem that affects individual well-being as well as general public health. Due to a variety of reasons, including the stigma faced by people using opioids, online communities for recovery and support were formed on different social media platforms. In these communities, people share their experiences and solicit information by asking questions to learn about opioid use and recovery. However, these communities do not always contain clinically verified information. In this paper, we study natural language questions asked in the context of OUD-related discourse on Reddit. We adopt transformer-based question detection along with hierarchical clustering across 19 subreddits to identify six coarse-grained categories and 69 fine-grained categories of OUD-related questions. Our analysis uncovers ten areas of information seeking from Reddit users in the context of OUD: drug sales, specific drug-related questions, OUD treatment, drug uses, side effects, withdrawal, lifestyle, drug testing, pain management and others, during the study period of 2018-2021. Our work provides a major step in improving the understanding of OUD-related questions people ask unobtrusively on Reddit. We finally discuss technological interventions and public health harm reduction techniques based on the topics of these questions.
Submitted 10 April, 2025;
originally announced April 2025.
-
Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups
Authors:
Rijul Magu,
Arka Dutta,
Sean Kim,
Ashiqur R. KhudaBukhsh,
Munmun De Choudhury
Abstract:
Large Language Models (LLMs) have been shown to demonstrate imbalanced biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as shown by a significantly higher mean closeness centrality (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing from sociological foundations of stigmatization theory, our stigmatization analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural predilections of large language models to heighten harmful discourse and highlight the need for suitable approaches for mitigation.
Submitted 11 April, 2025; v1 submitted 8 April, 2025;
originally announced April 2025.
-
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Authors:
Monojit Choudhury,
Shivam Chauhan,
Rocktim Jyoti Das,
Dhruv Sahnan,
Xudong Han,
Haonan Li,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Utkarsh Agarwal,
Mukund Choudhary,
Debopriyo Banerjee,
Fajri Koto,
Junaid Bhat,
Awantika Shukla,
Samujjwal Ghosh,
Samta Kamboj,
Onkar Pandit,
Lalit Pradhan,
Rahul Pal,
Sunil Sahu,
Soundar Doraiswamy,
Parvez Mullah,
Ali El Filali,
Neha Sengupta,
Gokul Ramakrishnan
, et al. (5 additional authors not shown)
Abstract:
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Submitted 8 April, 2025;
originally announced April 2025.
-
A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models
Authors:
Gaurav Verma,
Jiawei Zhou,
Mohit Chandra,
Srijan Kumar,
Munmun De Choudhury
Abstract:
Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars).
Submitted 3 April, 2025;
originally announced April 2025.
-
Square Kilometre Array Science Data Challenge 3a: foreground removal for an EoR experiment
Authors:
A. Bonaldi,
P. Hartley,
R. Braun,
S. Purser,
A. Acharya,
K. Ahn,
M. Aparicio Resco,
O. Bait,
M. Bianco,
A. Chakraborty,
E. Chapman,
S. Chatterjee,
K. Chege,
H. Chen,
X. Chen,
Z. Chen,
L. Conaboy,
M. Cruz,
L. Darriba,
M. De Santis,
P. Denzel,
K. Diao,
J. Feron,
C. Finlay,
B. Gehlot
, et al. (159 additional authors not shown)
Abstract:
We present and analyse the results of the Science Data Challenge 3a (SDC3a, https://sdc3.skao.int/challenges/foregrounds), an EoR foreground-removal community-wide exercise organised by the Square Kilometre Array Observatory (SKAO). The challenge ran for 8 months, from March to October 2023. Participants were provided with realistic simulations of SKA-Low data between 106 MHz and 196 MHz, including foreground contamination from extragalactic as well as Galactic emission, and instrumental and systematic effects. They were asked to deliver cylindrical power spectra of the EoR signal, cleaned from all corruptions, and the corresponding confidence levels. Here we describe the approaches taken by the 17 teams that completed the challenge, and we assess their performance using different metrics.
The challenge results provide a positive outlook on the capabilities of current foreground-mitigation approaches to recover the faint EoR signal from SKA-Low observations. The median error committed in the EoR power spectrum recovery is below the true signal for seven teams, although in some cases there are some significant outliers. The smallest residual overall is $4.2_{-4.2}^{+20} \times 10^{-4}\,\rm{K}^2h^{-3}$cMpc$^{3}$ across all considered scales and frequencies.
The estimation of confidence levels provided by the teams is overall less accurate, with the true error being typically under-estimated, sometimes very significantly. The most accurate error bars account for $60 \pm 20$\% of the true errors committed. The challenge results provide a means for all teams to understand and improve their performance. This challenge indicates that the comparison between independent pipelines could be a powerful tool to assess residual biases and improve error estimation.
Submitted 14 March, 2025;
originally announced March 2025.
-
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Authors:
Fajri Koto,
Rituraj Joshi,
Nurdaulet Mukhituly,
Yuxia Wang,
Zhuohan Xie,
Rahul Pal,
Daniil Orel,
Parvez Mullah,
Diana Turmakhan,
Maiya Goloburda,
Mohammed Kamran,
Samujjwal Ghosh,
Bokang Jia,
Jonibek Mansurov,
Mukhammed Togmanov,
Debopriyo Banerjee,
Nurkhan Laiyk,
Akhmed Sakip,
Xudong Han,
Ekaterina Kochmar,
Alham Fikri Aji,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Satheesh Katipomu,
Samta Kamboj
, et al. (10 additional authors not shown)
Abstract:
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
Submitted 3 March, 2025;
originally announced March 2025.
-
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
Authors:
Hanin Atwany,
Abdul Waheed,
Rita Singh,
Monojit Choudhury,
Bhiksha Raj
Abstract:
Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of 20 ASR models reveals three key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increases HER. (3) Distribution shift correlates strongly with HER ($\alpha = 0.91$). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
Submitted 17 February, 2025;
originally announced February 2025.
-
Meta-Cultural Competence: Climbing the Right Hill of Cultural Awareness
Authors:
Sougata Saha,
Saurabh Kumar Pandey,
Monojit Choudhury
Abstract:
Numerous recent studies have shown that Large Language Models (LLMs) are biased towards a Western and Anglo-centric worldview, which compromises their usefulness in non-Western cultural settings. However, "culture" is a complex, multifaceted topic, and its awareness, representation, and modeling in LLMs and LLM-based applications can be defined and measured in numerous ways. In this position paper, we ask what it means for an LLM to possess "cultural awareness". Through a thought experiment that extends the Octopus test proposed by Bender and Koller (2020), we argue that it is not cultural awareness or knowledge but rather meta-cultural competence that an LLM or LLM-based AI system needs in order to be useful across various cultures, including completely unseen ones. We lay out the principles of meta-culturally competent AI systems and discuss ways to measure and model this competence.
Submitted 8 February, 2025;
originally announced February 2025.
-
Reading between the Lines: Can LLMs Identify Cross-Cultural Communication Gaps?
Authors:
Sougata Saha,
Saurabh Kumar Pandey,
Harshit Gupta,
Monojit Choudhury
Abstract:
In a rapidly globalizing and digital world, content such as book and product reviews created by people from diverse cultures is read and consumed by others from different corners of the world. In this paper, we investigate the extent and patterns of gaps in understandability of book reviews due to the presence of culturally-specific items and elements that might be alien to users from another culture. Our user study on 57 book reviews from Goodreads reveals that 83% of the reviews had at least one culture-specific difficult-to-understand element. We also evaluate the efficacy of GPT-4o in identifying such items, given the cultural background of the reader; the results are mixed, implying a significant scope for improvement. Our datasets are available here: https://github.com/sougata-ub/reading_between_lines
Submitted 20 February, 2025; v1 submitted 8 February, 2025;
originally announced February 2025.
-
Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models
Authors:
Atharva Mehta,
Shivam Chauhan,
Amirbek Djanibekov,
Atharva Kulkarni,
Gus Xia,
Monojit Choudhury
Abstract:
The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models -- MusicGen and Mustango, for two underrepresented non-Western music traditions -- Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.
Submitted 6 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
Authors:
Saurabh Kumar Pandey,
Sachin Vashistha,
Debrup Das,
Somak Aditya,
Monojit Choudhury
Abstract:
To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Bandit framework (SMAB), which provides a scalable approach for calculating word-level local (sentence-level) and global (aggregated) sensitivities concerning an underlying text classifier for any dataset. We establish the effectiveness of our approach through various applications. We perform a case study on a CHECKLIST-generated sentiment analysis dataset, where we show that our algorithm indeed captures intuitively high- and low-sensitivity words. Through experiments on multiple tasks and languages, we show that sensitivity can serve as a proxy for accuracy in the absence of gold data. Lastly, we show that guiding perturbation prompts using sensitivity values in adversarial example generation improves attack success rate by 15.58%, whereas using sensitivity as an additional reward in adversarial paraphrase generation gives a 12.00% improvement over SOTA approaches. Warning: Contains potentially offensive content.
Submitted 10 February, 2025;
originally announced February 2025.
-
Employing Social Media to Improve Mental Health Outcomes
Authors:
Munmun De Choudhury
Abstract:
As social media platforms are increasingly adopted, the data people leave behind is shedding new light on our understanding of phenomena, ranging from socio-economic-political events to the spread of infectious diseases. This chapter presents research conducted in the past decade that has harnessed social media data in the service of mental health and well-being. The discussion is organized along three thrusts: a first that highlights how social media data has been utilized to detect and predict risk to varied mental health concerns; a second thrust that focuses on translation paradigms that can enable the use of such social media based algorithms in the real world; and the final thrust that brings to the fore the ethical considerations and challenges that engender the conduct of this research as well as its translation. The chapter concludes by noting open questions and problems in this emergent area, emphasizing the need for deeper interdisciplinary collaborations and participatory research design, incorporating and centering on human agency, and attention to societal inequities and harms that may result from or be exacerbated in this line of computational social science research.
Submitted 9 January, 2025;
originally announced January 2025.
-
Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reveal about the Socio-Cultural Norms
Authors:
Sourabrata Mukherjee,
Soumya Teotia,
Sougata Saha,
Monojit Choudhury
Abstract:
Honorifics serve as powerful linguistic markers that reflect social hierarchies and cultural values. This paper presents a large-scale, cross-linguistic exploration of the usage of honorific pronouns in Bengali and Hindi Wikipedia articles, shedding light on how socio-cultural factors shape language. Using an LLM (GPT-4o), we annotated 10,000 articles of real and fictional beings in each language for several sociodemographic features such as gender, age, fame, and exoticness, and the use of honorifics. We find that across all feature combinations, use of honorifics is consistently more common in Bengali than Hindi. For both languages, the use of non-honorific pronouns is more commonly observed for infamous, juvenile, and exotic beings. Notably, we observe a gender bias in the use of honorifics in Hindi, with men being more commonly referred to with honorifics than women.
Submitted 6 March, 2025; v1 submitted 6 January, 2025;
originally announced January 2025.
-
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Authors:
Haonan Li,
Xudong Han,
Zenan Zhai,
Honglin Mu,
Hao Wang,
Zhenxuan Zhang,
Yilin Geng,
Shom Lin,
Renxi Wang,
Artem Shelmanov,
Xiangyu Qi,
Yuxia Wang,
Donghai Hong,
Youliang Yuan,
Meng Chen,
Haoqin Tu,
Fajri Koto,
Tatsuki Kuribayashi,
Cong Zeng,
Rishabh Bhardwaj,
Bingchen Zhao,
Yawen Duan,
Yi Liu,
Emad A. Alghamdi,
Yaodong Yang
, et al. (10 additional authors not shown)
Abstract:
To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of others. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
Submitted 24 December, 2024;
originally announced December 2024.
-
From Lived Experience to Insight: Unpacking the Psychological Risks of Using AI Conversational Agents
Authors:
Mohit Chandra,
Suchismita Naik,
Denae Ford,
Ebele Okoli,
Munmun De Choudhury,
Mahsa Ershadi,
Gonzalo Ramos,
Javier Hernandez,
Ananya Bhattacharjee,
Shahed Warreth,
Jina Suh
Abstract:
Recent gains in popularity of AI conversational agents have led to their increased use for improving productivity and supporting well-being. While previous research has aimed to understand the risks associated with interactions with AI conversational agents, these studies often fall short in capturing the lived experiences of individuals. Additionally, psychological risks have often been presented as a sub-category within broader AI-related risks in past taxonomy works, leading to under-representation of the impact of psychological risks of AI use. To address these challenges, our work presents a novel risk taxonomy focusing on psychological risks of using AI gathered through the lived experiences of individuals. We employed a mixed-method approach, involving a comprehensive survey with 283 people with lived mental health experience and workshops involving experts with lived experience to develop a psychological risk taxonomy. Our taxonomy features 19 AI behaviors, 21 negative psychological impacts, and 15 contexts related to individuals. Additionally, we propose a novel multi-path vignette-based framework for understanding the complex interplay between AI behaviors, psychological impacts, and individual user contexts. Finally, based on the feedback obtained from the workshop sessions, we present design recommendations for developing safer and more robust AI agents. Our work offers an in-depth understanding of the psychological risks associated with AI conversational agents and provides actionable recommendations for policymakers, researchers, and developers.
Submitted 29 May, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
-
Missing Melodies: AI Music Generation and its "Nearly" Complete Omission of the Global South
Authors:
Atharva Mehta,
Shivam Chauhan,
Monojit Choudhury
Abstract:
Recent advances in generative AI have sparked renewed interest and expanded possibilities for music generation. However, the performance and versatility of these systems across musical genres are heavily influenced by the availability of training data. We conducted an extensive analysis of over one million hours of audio datasets used in AI music generation research and manually reviewed more than 200 papers from eleven prominent AI and music conferences and organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR, NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and inclusion of the musical genres of the Global South in AI research. Our findings reveal a stark imbalance: approximately 86% of the total dataset hours and over 93% of researchers focus primarily on music from the Global North. Although around 40% of these datasets include some form of non-Western music, genres from the Global South account for only 14.6% of the data. Furthermore, approximately 51% of the papers surveyed concentrate on symbolic music generation, a method that often fails to capture the cultural nuances inherent in music from regions such as South Asia, the Middle East, and Africa. As AI increasingly shapes the creation and dissemination of music, the significant underrepresentation of music genres in datasets and research presents a serious threat to global musical diversity. We also propose some important steps to mitigate these risks and foster a more inclusive future for AI-driven music generation.
Submitted 12 December, 2024; v1 submitted 5 December, 2024;
originally announced December 2024.
-
Efficiency Enhancement of c-Si/TiO$_2$ Heterojunction Thin Film Solar Cell Using Hybrid Metal-Dielectric Nanostructures
Authors:
Soikot Sarkar,
Sajid Muhaimin Choudhury
Abstract:
Hybrid metal-dielectric nanostructures (HMDN) are promising candidates for addressing the ohmic loss of conventional nanostructures in photovoltaic applications through strong confinement and high scattering directivity. In this study, we present a c-Si/TiO$_2$ heterojunction thin film solar cell (TFSC) in which a pair of triangular HMDN composed of Ag and AZO is utilized to enhance the absorption of longer-wavelength light. The presence of the TiO$_2$ inverted pyramid layer, in combination with the ITO and SiO$_2$-based pyramid layers at the front, enhanced the absorption of shorter-wavelength light by increasing the optical path and facilitating the coupling of incoming light in photonic mode. Consequently, the average absorption by the 1000 nm thick photoactive layer reached 83.32 % for AM 1.5G within the wavelength range of 300 - 1100 nm, as investigated using the finite-difference time-domain (FDTD) method. The electric field profile and current density profile demonstrated the respective contributions of each layer to the absorption of light at shorter and longer wavelengths. The structure exhibited a short circuit current density ($J_{sc}$) of 37.96 mA/cm$^2$ and a power conversion efficiency ($PCE$) of 17.42 %. The efficiency of our proposed structure experienced a maximum relative change of 0.34 % when exposed to polarized light at angles from 0$^\circ$ to 90$^\circ$. The incorporation of self-heating in non-isothermal conditions reduced $PCE$ by $13.77 \%$. In addition, the comparative analysis to assess the impact of HMDN on our structure revealed a $4.54 \%$ increase in $PCE$ of the structure with metallic nanostructures, paving the way for the utilization of HMDN to enhance the performance of TFSC.
Submitted 14 May, 2025; v1 submitted 29 November, 2024;
originally announced November 2024.
-
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Authors:
Ashmal Vayani,
Dinura Dissanayake,
Hasindri Watawana,
Noor Ahsan,
Nevasini Sasikumar,
Omkar Thawakar,
Henok Biadglign Ademtew,
Yahya Hmaiti,
Amandeep Kumar,
Kartik Kuckreja,
Mykola Maslych,
Wafa Al Ghallabi,
Mihail Mihaylov,
Chao Qin,
Abdelrahman M Shaker,
Mike Zhang,
Mahardika Krisna Ihsani,
Amiel Esplana,
Monil Gokani,
Shachar Mirkin,
Harsh Singh,
Ashay Srivastava,
Endre Hamerlik,
Fathinah Asma Izzati,
Fadillah Adamsyah Maani
, et al. (44 additional authors not shown)
Abstract:
Existing Large Multimodal Models (LMMs) generally focus on only a few regions and languages. As LMMs continue to improve, it is increasingly important to ensure they understand cultural contexts, respect local sensitivities, and support low-resource languages, all while effectively integrating corresponding visual cues. In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. ALM-bench challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages, including many low-resource languages traditionally underrepresented in LMM research. The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions, which are further divided into short and long-answer categories. ALM-bench design ensures a comprehensive assessment of a model's ability to handle varied levels of difficulty in visual and linguistic reasoning. To capture the rich tapestry of global cultures, ALM-bench carefully curates content from 13 distinct cultural aspects, ranging from traditions and rituals to famous personalities and celebrations. Through this, ALM-bench not only provides a rigorous testing ground for state-of-the-art open and closed-source LMMs but also highlights the importance of cultural and linguistic inclusivity, encouraging the development of models that can serve diverse global populations effectively. Our benchmark is publicly available.
Submitted 30 April, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Optimizing Social Media Annotation of HPV Vaccine Skepticism and Misinformation Using Large Language Models: An Experimental Evaluation of In-Context Learning and Fine-Tuning Stance Detection Across Multiple Models
Authors:
Luhang Sun,
Varsha Pendyala,
Yun-Shiuan Chuang,
Shanglin Yang,
Jonathan Feldman,
Andrew Zhao,
Munmun De Choudhury,
Sijia Yang,
Dhavan Shah
Abstract:
This paper leverages large language models (LLMs) to experimentally determine optimal strategies for scaling up social media content annotation for stance detection on HPV vaccine-related tweets. We examine both conventional fine-tuning and emergent in-context learning methods, systematically varying strategies of prompt engineering across widely used LLMs and their variants (e.g., GPT4, Mistral, and Llama3). Specifically, we varied prompt template design, shot sampling methods, and shot quantity to detect stance on HPV vaccination. Our findings reveal that 1) in general, in-context learning outperforms fine-tuning in stance detection for HPV vaccine social media content; 2) increasing shot quantity does not necessarily enhance performance across models; and 3) different LLMs and their variants present differing sensitivity to in-context learning conditions. We found that the optimal in-context learning configuration for stance detection on HPV vaccine tweets involves six stratified shots paired with detailed contextual prompts. This study highlights the potential of LLMs for research on social media stance and skepticism detection and provides a practical approach for applying them.
Submitted 2 April, 2025; v1 submitted 21 November, 2024;
originally announced November 2024.
-
Design of Dual-Band Plasmonic Absorber for Biomedical Sensing and Environmental Monitoring
Authors:
Ayon Sarker,
Sajid Muhaimin Choudhury
Abstract:
This study introduces a dual-band plasmonic absorber designed for simultaneous sensing applications in the near-infrared (NIR) and mid-infrared (MIR) regions. The absorber, composed of silver nanostructures on a metal plate with a dielectric spacer, exhibits a combination of localized and gap surface plasmon resonances, resulting in two distinct absorption peaks in theoretical analysis based on the FDTD method. Numerical simulations also validate the sensor's high refractive index sensitivity, enabling the detection of biomolecules, proteins, viruses, and various solutes in aqueous solutions. The absorber demonstrates significant resonance shifts, making it a promising candidate for environmental monitoring, medical diagnostics, and chemical sensing.
Submitted 19 November, 2024;
originally announced November 2024.
-
TRANSPOSE: Transitional Approaches for Spatially-Aware LFI Resilient FSM Encoding
Authors:
Muhtadi Choudhury,
Minyan Gao,
Avinash Varna,
Elad Peer,
Domenic Forte
Abstract:
Finite state machines (FSMs) regulate sequential circuits, including access to sensitive information and privileged CPU states. Courtesy of contemporary research on laser attacks, laser-based fault injection (LFI) is becoming even more precise, to the point where an adversary can thwart chip security by altering individual flip-flop (FF) values. Different laser models, e.g., bit flip, bit set, and bit reset, have been developed to appreciate LFI on practical targets. As traditional approaches may incorporate substantial overhead, state-based SPARSE and transition-based TAMED countermeasures were proposed in our prior work to improve FSM resiliency efficiently. TAMED overcame SPARSE's limitation of being too conservative and generates multiple LFI-resilient encodings for contemporary LFI models on demand. SPARSE, however, incorporated design layout information into its vulnerability estimation, which makes its vulnerability estimation metric more accurate. In this paper, we extend TAMED by proposing a transition-based encoding CAD framework (TRANSPOSE) that incorporates spatial transitional vulnerability metrics to quantify design susceptibility of FSMs based on both the bit flip model and the set-reset models. TRANSPOSE also incorporates floorplan optimization into its framework to accommodate secure spatial inter-distance of FF-sensitive regions. All TRANSPOSE approaches are demonstrated on 5 multifarious benchmarks and outperform existing FSM encoding schemes/frameworks in terms of security and overhead.
Submitted 4 November, 2024;
originally announced November 2024.
-
"It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health
Authors:
Jiawei Zhou,
Amy Z. Chen,
Darshi Shah,
Laura Schwab Reese,
Munmun De Choudhury
Abstract:
Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as accessible information sources or communication tools across different domains. In public health -- where stakes are high and impacts extend across populations -- adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with health professionals and health issue experiencers to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: vaccines, opioid use disorder, and intimate partner violence. We synthesize participants' perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk in individual behaviors, human-centered care, information ecosystem, and technology accountability. For each dimension, we discuss specific risks and example reflection questions to help practitioners adopt a risk-reflexive approach. This work offers a shared vocabulary and reflection tool for experts in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm when they are used.
Submitted 4 November, 2024;
originally announced November 2024.
-
Do Large Language Models Align with Core Mental Health Counseling Competencies?
Authors:
Viet Cuong Nguyen,
Mohammad Taher,
Dongwan Hong,
Vinicius Konkolics Possobom,
Vibha Thirunellayi Gopalakrishnan,
Ekta Raj,
Zihang Li,
Heather J. Soled,
Michael L. Birnbaum,
Srijan Kumar,
Munmun De Choudhury
Abstract:
The rapid evolution of Large Language Models (LLMs) presents a promising solution to the global shortage of mental health professionals. However, their alignment with essential counseling competencies remains underexplored. We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating 22 general-purpose and medical-finetuned LLMs across five key competencies. While frontier models surpass minimum aptitude thresholds, they fall short of expert-level performance, excelling in Intake, Assessment & Diagnosis but struggling with Core Counseling Attributes and Professional Practice & Ethics. Surprisingly, medical LLMs do not outperform generalist models in accuracy, though they provide slightly better justifications while making more context-related errors. These findings highlight the challenges of developing AI for mental health counseling, particularly in competencies requiring empathy and nuanced reasoning. Our results underscore the need for specialized, fine-tuned models aligned with core mental health counseling competencies and supported by human oversight before real-world deployment. Code and data associated with this manuscript can be found at: https://github.com/cuongnguyenx/CounselingBench
Submitted 26 February, 2025; v1 submitted 29 October, 2024;
originally announced October 2024.
-
The Zeno's Paradox of 'Low-Resource' Languages
Authors:
Hellina Hailu Nigatu,
Atnafu Lambebo Tonja,
Benjamin Rosman,
Thamar Solorio,
Monojit Choudhury
Abstract:
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a 'low-resource language.' To understand how NLP papers define and study 'low-resource' languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword 'low-resource.' Based on our analysis, we show how several interacting axes contribute to the 'low-resourcedness' of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.
Submitted 28 October, 2024;
originally announced October 2024.
-
Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use
Authors:
Mohit Chandra,
Siddharth Sriraman,
Gaurav Verma,
Harneet Singh Khanuja,
Jose Suarez Campayo,
Zihang Li,
Michael L. Birnbaum,
Munmun De Choudhury
Abstract:
Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not explored their capabilities in detecting ADRs related to psychiatric medications or in providing effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.
Submitted 7 January, 2025; v1 submitted 24 October, 2024;
originally announced October 2024.
-
MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Authors:
Vibhor Agarwal,
Yiqiao Jin,
Mohit Chandra,
Munmun De Choudhury,
Srijan Kumar,
Nishanth Sastry
Abstract:
The remarkable capabilities of large language models (LLMs) in language understanding and generation have not rendered them immune to hallucinations. LLMs can still generate plausible-sounding but factually incorrect or fabricated information. As LLM-empowered chatbots become popular, laypeople may frequently ask health-related queries and risk falling victim to these LLM hallucinations, resulting in various societal and healthcare implications. In this work, we conduct a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients. We propose MedHalu, a carefully crafted first-of-its-kind medical hallucination dataset with a diverse range of health-related topics and the corresponding hallucinated responses from LLMs with labeled hallucination types and hallucinated text spans. We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations. We also employ three groups of evaluators -- medical experts, LLMs, and laypeople -- to study who is more vulnerable to these medical hallucinations. We find that LLMs are much worse than the experts. They also perform no better than laypeople, and even worse in a few cases, in detecting hallucinations. To fill this gap, we propose an expert-in-the-loop approach to improve hallucination detection through LLMs by infusing expert reasoning. We observe significant performance gains for all the LLMs with an average macro-F1 improvement of 6.3 percentage points for GPT-4.
Submitted 28 September, 2024;
originally announced September 2024.
-
Assessing FIFO and Round Robin Scheduling: Effects on Data Pipeline Performance and Energy Usage
Authors:
Malobika Roy Choudhury,
Akshat Mehrotra
Abstract:
In compute-intensive machine learning, efficient operating system scheduling is crucial for performance and energy efficiency. This paper conducts a comparative study of the FIFO (First-In-First-Out) and RR (Round-Robin) scheduling policies applied to real-time machine learning training processes and data pipelines on Ubuntu-based systems. Using a few observed patterns of CPU usage and energy consumption, we identify which policy (the exclusive or the shared) provides higher performance and/or lower energy consumption for typical modern workloads. The results of this study should help in providing better operating system schedulers for modern systems like Ubuntu, improving performance and reducing energy consumption in compute-intensive workloads.
Submitted 23 September, 2024;
originally announced September 2024.
-
Lithium Niobate Photonic Topological Insulator-based Multi-Wavelength Optical Demultiplexer with Piezoelectric Switch-Off
Authors:
Prithu Mahmud,
Kaniz Fatema Supti,
Sajid Muhaimin Choudhury
Abstract:
Photonic topological insulators provide unidirectional, robust, wavelength-selective transport of light at an interface while keeping it insulated at the bulk of the material. The non-trivial topology results in an immunity to backscattering, sharp turns, and fabrication defects. This work leverages these unique properties to design a 2-channel optical demultiplexer based on a lithium niobate photonic topological insulator with piezoelectric switch-off capabilities. A photonic topological insulator design for the demultiplexer allows for good wavelength selectivity, crosstalk as low as $-$54 dB, and better isolation between output channels. The primary operating wavelengths presented are the telecommunication wavelengths of 1310 nm and 1550 nm, but the use of the lithium niobate material allows operation at multiple operating wavelengths. Furthermore, we propose a post-fabrication method to switch off the topological protection and, thus, optical transmittance via an applied voltage utilizing the inverse piezoelectric effect of lithium niobate. This work will contribute to advancing lithium niobate integrated photonics and developing efficient, multi-wavelength, electrically controlled optical communication systems and integrated photonic circuits.
Submitted 8 December, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
Authors:
Inhwa Song,
SoHyun Park,
Sachin R. Pendse,
Jessica Lee Schleider,
Munmun De Choudhury,
Young-Ho Kim
Abstract:
Expressing stressful experiences in words is proven to improve mental and physical health, but individuals often disengage from writing interventions as they struggle to organize their thoughts and emotions. Reflective prompts have been used to provide direction, and large language models (LLMs) have demonstrated the potential to provide tailored guidance. However, current systems often limit users' flexibility to direct their reflections. We thus present ExploreSelf, an LLM-driven application designed to empower users to control their reflective journey, providing adaptive support through dynamically generated questions. Through an exploratory study with 19 participants, we examine how participants explore and reflect on personal challenges using ExploreSelf. Our findings demonstrate that participants valued the flexible navigation of adaptive guidance to control their reflective journey, leading to deeper engagement and insight. Building on our findings, we discuss the implications of designing LLM-driven tools that facilitate user-driven and effective reflection on personal challenges.
Submitted 5 February, 2025; v1 submitted 15 September, 2024;
originally announced September 2024.
-
A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech
Authors:
Gaurav Verma,
Rynaa Grover,
Jiawei Zhou,
Binny Mathew,
Jordan Kraemer,
Munmun De Choudhury,
Srijan Kumar
Abstract:
Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community -- contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-Asian violence-provoking speech. Using data from ~420k Twitter posts spanning a 3-year duration (January 1, 2020 to February 1, 2023), we develop a codebook to characterize anti-Asian violence-provoking speech and collect a community-crowdsourced dataset to facilitate its large-scale detection using state-of-the-art classifiers. We contrast the capabilities of natural language processing classifiers, ranging from BERT-based to LLM-based classifiers, in detecting violence-provoking speech with their capabilities to detect anti-Asian hateful speech. In contrast to prior work that has demonstrated the effectiveness of such classifiers in detecting hateful speech ($F_1 = 0.89$), our work shows that accurate and reliable detection of violence-provoking speech is a challenging task ($F_1 = 0.69$). We discuss the implications of our findings, particularly the need for proactive interventions to support Asian communities during public health crises. The resources related to the study are available at https://claws-lab.github.io/violence-provoking-speech/.
Submitted 21 July, 2024;
originally announced July 2024.
-
Inferring IGM parameters from the redshifted 21-cm Power Spectrum using Artificial Neural Networks
Authors:
Madhurima Choudhury,
Raghunath Ghara,
Saleem Zaroubi,
Benedetta Ciardi,
Leon V. E. Koopmans,
Garrelt Mellema,
Abinash Kumar Shaw,
Anshuman Acharya,
I. T. Iliev,
Qing-Bo Ma,
Sambit K. Giri
Abstract:
The high-redshift 21-cm signal promises to be a crucial probe of the state of the intergalactic medium (IGM). Understanding the connection between the observed 21-cm power spectrum and the physical quantities intricately associated with the IGM is crucial to fully understand the evolution of our Universe. In this study, we develop an emulator using an artificial neural network (ANN) to predict the 21-cm power spectrum from a given set of IGM properties, namely, the bubble size distribution and the volume-averaged ionization fraction. This emulator is implemented within a standard Bayesian framework to constrain the IGM parameters from a given 21-cm power spectrum. We compare the performance of the Bayesian method to an alternate method using an ANN to predict the IGM parameters from a given input power spectrum, and find that both methods yield similar levels of accuracy, while the ANN is significantly faster. We also use this ANN method of parameter estimation to predict the IGM parameters from a test set contaminated with noise levels expected from the SKA-LOW instrument after 1000 hours of observation. Finally, we train a separate ANN to predict the source parameters from the IGM parameters directly, at a redshift of $z=9.1$, demonstrating the possibility of a non-analytic inference of the source parameters from the IGM parameters for the first time. We achieve high accuracies, with $R^2$ scores ranging between $0.898-0.978$ for the ANN emulator and between $0.966-0.986$ and $0.817-0.981$ for the predictions of IGM parameters from the 21-cm power spectrum and source parameters from IGM parameters, respectively. The predictions of the IGM parameters from the Bayesian method incorporating the ANN emulator lead to tight constraints with error bars around $\pm{0.14}$ on the IGM parameters.
Submitted 6 May, 2025; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Supporters and Skeptics: LLM-based Analysis of Engagement with Mental Health (Mis)Information Content on Video-sharing Platforms
Authors:
Viet Cuong Nguyen,
Mini Jain,
Abhijat Chauhan,
Heather Jaime Soled,
Santiago Alvarez Lesmes,
Zihang Li,
Michael L. Birnbaum,
Sunny X. Tang,
Srijan Kumar,
Munmun De Choudhury
Abstract:
Over one in five adults in the US lives with a mental illness. In the face of a shortage of mental health professionals and offline resources, online short-form video content has grown to serve as a crucial conduit for disseminating mental health help and resources. However, the ease of content creation and access also contributes to the spread of misinformation, posing risks to accurate diagnosis and treatment. Detecting and understanding engagement with such content is crucial to mitigating its harmful effects on public health. We perform the first quantitative study of the phenomenon using YouTube Shorts and Bitchute as the sites of study. We contribute MentalMisinfo, a novel labeled mental health misinformation (MHMisinfo) dataset of 739 videos (639 from YouTube and 100 from Bitchute) and 135,372 comments in total, using an expert-driven annotation schema. We first find that few-shot in-context learning with large language models (LLMs) is effective in detecting MHMisinfo videos. Next, we discover distinct and potentially alarming linguistic patterns in how audiences engage with MHMisinfo videos through commentary on both video-sharing platforms. Across the two platforms, comments could exacerbate prevailing stigma, with some groups showing heightened susceptibility to and alignment with MHMisinfo. We discuss technical and public health-driven adaptive solutions to tackling the "epidemic" of mental health misinformation online.
Submitted 2 July, 2024;
originally announced July 2024.
-
[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Authors:
Abhinav Rao,
Monojit Choudhury,
Somak Aditya
Abstract:
We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader theoretical and practical repercussions of these results.
Submitted 20 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting
Authors:
Sagnik Mukherjee,
Muhammad Farid Adilazuarda,
Sunayana Sitaram,
Kalika Bali,
Alham Fikri Aji,
Monojit Choudhury
Abstract:
Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls for rethinking the control experiment design to tease apart the cultural conditioning of responses from the "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt.
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Synergizing Deep Learning and Phase Change Materials for Four-state Broadband Multifunctional Metasurfaces in the Visible Range
Authors:
Md. Ehsanul Karim,
Md. Redwanul Karim,
Sajid Muhaimin Choudhury
Abstract:
In this article, we report, for the first time, broadband multifunctional metasurfaces with more than four distinct functionalities. The constituent meta-atoms combine two different phase change materials, $\mathrm{VO_2}$ and $\mathrm{Sb_2S_3}$, in a multi-stage configuration. FDTD simulations demonstrate a broadband reflection amplitude switching between the four states in the visible range due to the enhanced cavity length modulation effect from the cascaded Fabry-Perot cavities, overcoming the inherent small optical contrast between the phase change material (PCM) states. This, along with the reflection phase control between the four states, allows us to incorporate both amplitude and phase-dependent properties in the same metasurface - achromatic deflection, wavelength beam splitting, achromatic focusing, and broadband absorption, overcoming the limitations of previous functionality switching mechanisms for the visible band. We have used a Tandem Neural network-based inverse design scheme to ensure the stringent requirements of different states are realized. We have used two forward networks for predicting the reflection amplitude and phase for a meta-atom within the pre-defined design space. The excellent prediction capability of these surrogate models is utilized to train the reverse network. The inverse design network, trained with a labeled data set, is capable of producing the optimized meta-units given the desired figures of merit in terms of reflection amplitude and phase for the four states. The optical characteristics of two inverse-designed metasurfaces have been evaluated as test cases for two different sets of design parameters in the four states. Both structures demonstrate the four desired broadband functionalities while closely matching the design requirements, suggesting their potential in visible-range portable medical imaging devices.
Submitted 28 July, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
CASE: Efficient Curricular Data Pre-training for Building Assistive Psychology Expert Models
Authors:
Sarthak Harne,
Monjoy Narayan Choudhury,
Madhav Rao,
TK Srikanth,
Seema Mehrotra,
Apoorva Vashisht,
Aarushi Basu,
Manjit Sodhi
Abstract:
The limited availability of psychologists necessitates efficient identification of individuals requiring urgent mental healthcare. This study explores the use of Natural Language Processing (NLP) pipelines to analyze text data from online mental health forums used for consultations. By analyzing forum posts, these pipelines can flag users who may require immediate professional attention. A crucial challenge in this domain is data privacy and scarcity. To address this, we propose utilizing readily available curricular texts used in institutes specializing in mental health for pre-training the NLP pipelines. This helps us mimic the training process of a psychologist. Our work presents CASE-BERT, which flags potential mental health disorders based on forum text. CASE-BERT demonstrates superior performance compared to existing methods, achieving an F1 score of 0.91 for Depression and 0.88 for Anxiety, two of the most commonly reported mental health disorders. Our code and data are publicly available.
Submitted 2 October, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
Authors:
Andrew H. Lee,
Sina J. Semnani,
Galo Castillo-López,
Gäel de Chalendar,
Monojit Choudhury,
Ashna Dua,
Kapil Rajesh Kavitha,
Sungkyun Kim,
Prashant Kodali,
Ponnurangam Kumaraguru,
Alexis Lombard,
Mehrad Moradshahi,
Gihyun Park,
Nasredine Semmar,
Jiwon Seo,
Tianhao Shen,
Manish Shrivastava,
Deyi Xiong,
Monica S. Lam
Abstract:
Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show, for the first time, that in-context learning is sufficient to tackle multilingual TOD.
To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which range from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA.
However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
Submitted 16 June, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Authors:
Prashant Kodali,
Anmol Goel,
Likhith Asapu,
Vamshi Krishna Bonagiri,
Anirudh Govil,
Monojit Choudhury,
Ponnurangam Kumaraguru,
Manish Shrivastava
Abstract:
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models, when trained solely using code-mixing metrics as features, are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, among Encoder models, XLM-Roberta and Bernice outperform IndicBERT across different configurations. Among Encoder-Decoder models, mBART performs better than mT5; however, Encoder-Decoder models are not able to outperform Encoder-only models. Decoder-only models perform the best when compared to all other MLLMs, with Llama 3.2 - 3B models outperforming similarly sized Qwen and Phi models. Comparison with the zero- and few-shot capabilities of ChatGPT shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from En-Hi to En-Te acceptability judgments is better than random baselines.
Submitted 5 May, 2025; v1 submitted 9 May, 2024;
originally announced May 2024.
-
"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Authors:
Preetam Prabhu Srikar Dammu,
Hayoung Jung,
Anjali Singh,
Monojit Choudhury,
Tanushree Mitra
Abstract:
Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools. Despite their utility, research indicates that LLMs perpetuate systemic biases. Yet, prior work on LLM harms predominantly focuses on Western concepts like race and gender, often overlooking cultural concepts from other parts of the world. Additionally, these studies typically investigate "harm" as a singular dimension, ignoring the various and subtle forms in which harms manifest. To address this gap, we introduce Covert Harms and Social Threats (CHAST), a set of seven metrics grounded in social science literature. We utilize evaluation models aligned with human assessments to examine the presence of covert harms in LLM-generated conversations, particularly in the context of recruitment. Our experiments reveal that seven out of the eight LLMs included in this study generated conversations riddled with CHAST, characterized by malign views expressed in seemingly neutral language unlikely to be detected by existing methods. Notably, these LLMs manifested more extreme views and opinions when dealing with non-Western concepts like caste, compared to Western ones such as race.
Submitted 8 May, 2024;
originally announced May 2024.
-
Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in
Authors:
Utkarsh Agarwal,
Kumar Tanmay,
Aditi Khandelwal,
Monojit Choudhury
Abstract:
Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and whether their moral judgement depends on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup, following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias varies significantly across languages for all LLMs, including GPT-4.
Submitted 29 April, 2024;
originally announced April 2024.