
Showing 1–1 of 1 results for author: Furniss, D

Searching in archive cs.
  1. arXiv:2505.00010

    cs.CL cs.AI

    Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

    Authors: Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, Weibing Zheng, Chris Zhou, David Furniss, Danielle Weber, Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen

    Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate stron…

    Submitted 21 April, 2025; originally announced May 2025.