Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

Authors: Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, Weibing Zheng, Chris Zhou, David Furniss, Danielle Weber, Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen

Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.

Submitted 21 April, 2025; originally announced May 2025.

arXiv:2504.02111 [pdf, other]

Exploring LLM Reasoning Through Controlled Prompt Variations

Authors: Giannis Chatziveroglou, Richard Yun, Maura Kelleher

Abstract: This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2209.07862 [pdf, other]

doi 10.1145/3585088.3589353

What Do Children and Parents Want and Perceive in Conversational Agents? Towards Transparent, Trustworthy, Democratized Agents

Authors: Jessica Van Brummelen, Maura Kelleher, Mingyan Claire Tian, Nghi Hoang Nguyen

Abstract: Historically, researchers have focused on analyzing WEIRD, adult perspectives on technology. This means we may not have technology developed appropriately for children and those from non-WEIRD countries. In this paper, we analyze children and parents from various countries' perspectives on an emerging technology: conversational agents. We aim to better understand participants' trust of agents, partner models, and their ideas of "ideal future agents" such that researchers can better design for these users. Additionally, we empower children and parents to program their own agents through educational workshops, and present changes in perceptions as participants create and learn about agents. Results from the study (n=49) included how children felt agents were significantly more human-like, warm, and dependable than parents did, how participants trusted agents more than parents or friends for correct information, how children described their ideal agents as being more artificial than human-like than parents did, and how children tended to focus more on fun features, approachable/friendly features and addressing concerns through agent design than parents did, among other results. We also discuss potential agent design implications of the results, including how designers may be able to best foster appropriate levels of trust towards agents by focusing on designing agents' competence and predictability indicators, as well as increasing transparency in terms of agents' information sources.

Submitted 20 January, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: 18 pages, 9 figures, submitted to IDC 2023, for associated appendix: https://gist.github.com/jessvb/fa1d4c75910106d730d194ffd4d725d3

arXiv:2209.05063 [pdf, other]

Learning Affects Trust: Design Recommendations and Concepts for Teaching Children -- and Nearly Anyone -- about Conversational Agents

Authors: Jessica Van Brummelen, Mingyan Claire Tian, Maura Kelleher, Nghi Hoang Nguyen

Abstract: Research has shown that human-agent relationships form in similar ways to human-human relationships. Since children do not have the same critical analysis skills as adults (and may over-trust technology, for example), this relationship-formation is concerning. Nonetheless, little research investigates children's perceptions of conversational agents in-depth, and even less investigates how education might change these perceptions. We present K-12 workshops with associated conversational AI concepts to encourage healthier understanding and relationships with agents. Through studies with the curriculum, and children and parents from various countries, we found participants' perceptions of agents -- specifically their partner models and trust -- changed. When participants discussed changes in trust of agents, we found they most often mentioned learning something. For example, they frequently mentioned learning where agents obtained information, what agents do with this information and how agents are programmed. Based on the results, we developed recommendations for teaching conversational agent concepts, including emphasizing the concepts students found most challenging, like training, turn-taking and terminology; supplementing agent development activities with related learning activities; fostering appropriate levels of trust towards agents; and fostering accurate partner models of agents. Through such pedagogy, students can learn to better understand conversational AI and what it means to have it in the world.

Submitted 12 September, 2022; originally announced September 2022.

Comments: 9 pages, 11 figures, submitted to EAAI at AAAI 2023, for associated appendix: https://gist.github.com/jessvb/e35bc0daf859c30f73008a1ad1b37824

arXiv:1812.11901 [pdf, other]

Large-Scale Object Detection of Images from Network Cameras in Variable Ambient Lighting Conditions

Authors: Caleb Tung, Matthew R. Kelleher, Ryan J. Schlueter, Binhan Xu, Yung-Hsiang Lu, George K. Thiruvathukal, Yen-Kuang Chen, Yang Lu

Abstract: Computer vision relies on labeled datasets for training and evaluation in detecting and recognizing objects. The popular computer vision program, YOLO ("You Only Look Once"), has been shown to accurately detect objects in many major image datasets. However, the images found in those datasets are independent of one another and cannot be used to test YOLO's consistency at detecting the same object as its environment (e.g. ambient lighting) changes. This paper describes a novel effort to evaluate YOLO's consistency for large-scale applications. It does so by working (a) at large scale and (b) by using consecutive images from a curated network of public video cameras deployed in a variety of real-world situations, including traffic intersections, national parks, shopping malls, university campuses, etc. We specifically examine YOLO's ability to detect objects in different scenarios (e.g., daytime vs. night), leveraging the cameras' ability to rapidly retrieve many successive images for evaluating detection consistency. Using our camera network and advanced computing resources (supercomputers), we analyzed more than 5 million images captured by 140 network cameras in 24 hours. Compared with labels marked by humans (considered as "ground truth"), YOLO struggles to consistently detect the same humans and cars as their positions change from one frame to the next; it also struggles to detect objects at night time. Our findings suggest that state-of-the-art vision solutions should be trained by data from network camera with contextual information before they can be deployed in applications that demand high consistency on object detection.

Submitted 31 December, 2018; originally announced December 2018.

Comments: Submitted to MIPR 2019 (Accepted)

Showing 1–5 of 5 results for author: Kelleher, M