ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

Wang, Weizhou; Liu, Eric; Guo, Xiangyu; Hu, Xiao; Grishchenko, Ilya; Lie, David

Computer Science > Cryptography and Security

arXiv:2408.16028 (cs)

[Submitted on 28 Aug 2024 (v1), last revised 1 Jun 2025 (this version, v3)]

Abstract: Supervised-learning-based vulnerability detectors often fall short due to limited labelled training data. In contrast, Large Language Models (LLMs) like GPT-4 are trained on vast unlabelled code corpora, yet perform only marginally better than coin flips when directly prompted to detect vulnerabilities. In this paper, we reframe vulnerability detection as anomaly detection, based on the premise that vulnerable code is rare and thus anomalous relative to patterns learned by LLMs. We introduce ANVIL, which performs a masked code reconstruction task: the LLM reconstructs a masked line of code, and deviations from the original are scored as anomalies. We propose a hybrid anomaly score that combines exact match, cross-entropy loss, prediction confidence, and structural complexity. We evaluate our approach across multiple LLM families, scoring methods, and context sizes, and against vulnerabilities after the LLM's training cut-off. On the PrimeVul dataset, ANVIL outperforms state-of-the-art supervised detectors (LineVul, LineVD, and LLMAO), achieving up to 2x higher Top-3 accuracy, 75% better Normalized MFR, and a significant improvement on ROC-AUC. Finally, by integrating ANVIL with fuzzers, we uncover two previously unknown vulnerabilities, demonstrating the practical utility of anomaly-guided detection.
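The scoring idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract names the four signals (exact match, cross-entropy loss, prediction confidence, structural complexity) but not how they are combined, so the weighted sum, the weights, the token-counting complexity proxy, and every name below (`Reconstruction`, `hybrid_score`, `rank_lines`) are assumptions made for illustration. The sketch also assumes the per-token probabilities have already been obtained from some LLM; no model is called here.

```python
import math
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Reconstruction:
    """Hypothetical result of asking a model to fill in one masked line."""
    predicted_line: str                 # the model's reconstruction of the masked line
    original_token_probs: List[float]   # probability assigned to each token of the original line
    predicted_token_probs: List[float]  # probability of each token the model actually emitted


def cross_entropy(probs: List[float]) -> float:
    """Mean negative log-likelihood of the original tokens under the model."""
    return -sum(math.log(max(p, 1e-12)) for p in probs) / len(probs)


def confidence(probs: List[float]) -> float:
    """Mean probability the model assigned to its own prediction."""
    return sum(probs) / len(probs)


def structural_complexity(line: str) -> int:
    """Crude proxy (an assumption): count identifiers, numbers and punctuation."""
    return len(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", line))


def hybrid_score(original_line: str, rec: Reconstruction,
                 w_match: float = 1.0, w_ce: float = 1.0,
                 w_conf: float = 1.0, w_cx: float = 0.1) -> float:
    """Combine the four signals into one anomaly score (higher = more anomalous).

    The linear combination and the weights are illustrative assumptions.
    """
    mismatch = 0.0 if rec.predicted_line.strip() == original_line.strip() else 1.0
    ce = cross_entropy(rec.original_token_probs)
    conf = confidence(rec.predicted_token_probs)
    cx = structural_complexity(original_line)
    # A confident prediction that disagrees with the original raises the score;
    # complex lines are discounted because they are harder to reproduce exactly.
    return w_match * mismatch + w_ce * ce + w_conf * conf * mismatch - w_cx * cx


def rank_lines(lines: List[str], recs: List[Reconstruction], top_k: int = 3) -> List[int]:
    """Return indices of the top_k most anomalous lines, most anomalous first."""
    scores = [hybrid_score(line, rec) for line, rec in zip(lines, recs)]
    order = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```

Under these assumptions, a line that the model reconstructs exactly and with high probability receives a low score, while a line the model confidently replaces with something else rises to the top of the ranking, which is the kind of line-level ordering that a Top-3 accuracy metric would evaluate.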

Subjects:	Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2408.16028 [cs.CR]
	(or arXiv:2408.16028v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2408.16028

Submission history

From: Weizhou Wang
[v1] Wed, 28 Aug 2024 03:28:17 UTC (3,951 KB)
[v2] Sat, 15 Feb 2025 14:28:31 UTC (6,904 KB)
[v3] Sun, 1 Jun 2025 19:41:06 UTC (8,759 KB)
