BingoGuard: LLM Content Moderation Tools with Risk Levels

Yin, Fan; Laban, Philippe; Peng, Xiangyu; Zhou, Yilun; Mao, Yixin; Vats, Vaibhav; Ross, Linnea; Agarwal, Divyansh; Xiong, Caiming; Wu, Chien-Sheng

Abstract: Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of severity-level annotations, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity levels, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled according to our severity rubrics, which enables fine-grained analysis of model behavior across severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
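
The idea that platforms with different safety thresholds can tailor filtering once a moderator reports a severity level alongside a binary label can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ModerationResult` type, the `should_block` helper, and the integer severity scale (0 for benign, larger values for more severe) are hypothetical and are not taken from the paper or from any BingoGuard release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationResult:
    """Hypothetical moderator output: a binary label plus a severity level."""
    unsafe: bool   # binary safety label
    severity: int  # assumed scale: 0 = benign, larger = more severe


def should_block(result: ModerationResult, max_allowed_severity: int) -> bool:
    """Block a response only if it is flagged unsafe and its severity
    exceeds the tolerance chosen by the platform."""
    return result.unsafe and result.severity > max_allowed_severity


# Two hypothetical platforms with different tolerances.
STRICT_TOLERANCE = 0   # blocks anything flagged unsafe with severity >= 1
LENIENT_TOLERANCE = 2  # tolerates low-severity content

low_risk = ModerationResult(unsafe=True, severity=1)
blocked_on_strict = should_block(low_risk, STRICT_TOLERANCE)
blocked_on_lenient = should_block(low_risk, LENIENT_TOLERANCE)
```

Under these assumptions the same low-severity response is rejected by the strict platform and allowed by the lenient one, which is the kind of per-platform tailoring a binary label alone cannot express.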

Comments:	10 pages, 4 figures, 4 tables. ICLR 2025 poster
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2503.06550 [cs.CL]
	(or arXiv:2503.06550v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.06550
