Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

Nguyen, Tri; Pentapalli, Lohith Srikanth; Sieverding, Magnus; Turner, Laurah; Overla, Seth; Zheng, Weibing; Zhou, Chris; Furniss, David; Weber, Danielle; Gharib, Michael; Kelleher, Matt; Shukis, Michael; Pawlik, Cameron; Cohen, Kelly

arXiv:2505.00010 (cs)

[Submitted on 21 Apr 2025]

Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.
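As a rough illustration of the feature-based approach the abstract describes, the sketch below fits a logistic regression to four numeric feature scores per prompt. It is not the authors' implementation: the four features are unnamed placeholders, `TOY_DATA` is made-up data for demonstration only, and the function names and hyperparameters are assumptions. It uses plain gradient descent so that it runs without third-party libraries.

```python
import math

# Hypothetical toy data (NOT from the paper): each row holds four placeholder
# feature scores in [0, 1] and a label, where 1 = jailbreak attempt.
TOY_DATA = [
    ([0.9, 0.8, 0.7, 0.9], 1),
    ([0.8, 0.9, 0.6, 0.7], 1),
    ([0.7, 0.6, 0.9, 0.8], 1),
    ([0.1, 0.2, 0.1, 0.0], 0),
    ([0.2, 0.1, 0.3, 0.1], 0),
    ([0.0, 0.3, 0.2, 0.2], 0),
]


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimated probability that a prompt with these features is a jailbreak."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train_logistic_regression(data, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in data:
            error = predict_proba(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        # Average the gradients over the data set before each update.
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


weights, bias = train_logistic_regression(TOY_DATA)
high_risk = predict_proba(weights, bias, [0.9, 0.9, 0.9, 0.9])
low_risk = predict_proba(weights, bias, [0.1, 0.1, 0.1, 0.1])
```

Because each learned weight attaches to a single feature, the fitted model can be inspected directly, which is one sense in which such models are explainable. The continuous probability output could also be read as a score on a spectrum rather than a binary flag.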

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.00010 [cs.CL]
	(or arXiv:2505.00010v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.00010

Submission history

From: Tri Nguyen
[v1] Mon, 21 Apr 2025 16:54:35 UTC (238 KB)
