LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge
Authors:
Sahar Abdelnabi,
Aideen Fay,
Ahmed Salem,
Egor Zverev,
Kai-Chieh Liao,
Chi-Huang Liu,
Chun-Chih Kuo,
Jannis Weigend,
Danyael Manlangit,
Alex Apostolov,
Haris Umair,
João Donato,
Masayuki Kawakita,
Athar Mahboob,
Tran Huu Bach,
Tsun-Han Chiang,
Myeongjin Cho,
Hajin Choi,
Byeonghyeon Kim,
Hyeonjin Lee,
Benjamin Pannell,
Conor McCauley,
Mark Russinovich,
Andrew Paverd,
Giovanni Cherubin
Abstract:
Indirect prompt injection attacks exploit the inherent inability of Large Language Models (LLMs) to distinguish between instructions and data in their inputs. Despite numerous defense proposals, systematic evaluation against adaptive adversaries remains limited, even though successful attacks can have wide security and privacy implications and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research towards practical structural solutions to prompt injection.
Submitted 11 June, 2025;
originally announced June 2025.
Offensive Language and Hate Speech Detection with Deep Learning and Transfer Learning
Authors:
Bencheng Wei,
Jason Li,
Ajay Gupta,
Hafiza Umair,
Atsu Vovor,
Natalie Durzynski
Abstract:
Toxic online speech has become a critical problem due to an exponential increase in the use of the internet by people from different cultures and educational backgrounds. Distinguishing whether a text message is hate speech or offensive language is a key challenge in the automatic detection of toxic text content. In this paper, we propose an approach to automatically classify tweets into three classes: Hate, Offensive, and Neither. Using a public tweet data set, we first perform experiments to build Bi-LSTM models from an empty embedding, and then try the same neural network architecture with pre-trained GloVe embeddings. Next, we introduce a transfer learning approach for hate speech detection using the existing pre-trained language models BERT (Bidirectional Encoder Representations from Transformers), DistilBERT (a distilled version of BERT), and GPT-2 (Generative Pre-Training). We perform a hyperparameter tuning analysis of our best model (Bi-LSTM), considering different neural network architectures, learning rates, and normalization methods. After tuning the model, the best combination of parameters achieves over 92 percent accuracy on the test data. We also create a class module that contains the main functionality, including text classification, sentiment checking, and text data augmentation. This model could serve as an intermediate module between the user and Twitter.
Submitted 22 August, 2021; v1 submitted 6 August, 2021;
originally announced August 2021.