BaxBench: Can LLMs Generate Correct and Secure Backends?

Vero, Mark; Mündler, Niels; Chibotaru, Victor; Raychev, Veselin; Baader, Maximilian; Jovanović, Nikola; He, Jingxuan; Vechev, Martin

arXiv:2502.11844 (cs)

[Submitted on 17 Feb 2025 (v1), last revised 30 May 2025 (this version, v3)]

Abstract: Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.
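
The abstract reports two kinds of outcome per generated backend: whether it passes the functional tests, and whether an end-to-end exploit succeeds against it. The following is a minimal sketch of how such per-task outcomes could be aggregated into the rates the abstract mentions. It is not the BaxBench harness: the `TaskResult` record, the `summarize` function, the metric names, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one generated backend (hypothetical record, not BaxBench's schema)."""
    passed_tests: bool  # every functional test case passed
    exploited: bool     # at least one end-to-end exploit succeeded


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task outcomes into correctness and security rates."""
    if not results:
        raise ValueError("no results to summarize")
    correct = [r for r in results if r.passed_tests]
    exploited_correct = [r for r in correct if r.exploited]
    return {
        # Share of all tasks whose program passed the functional tests.
        "correct": len(correct) / len(results),
        # Among correct programs, the share that was successfully exploited.
        "exploited_given_correct": (
            len(exploited_correct) / len(correct) if correct else 0.0
        ),
        # Share of all tasks whose program is both correct and not exploited.
        "correct_and_secure": (len(correct) - len(exploited_correct)) / len(results),
    }


# Toy data, invented for illustration only; these are not benchmark results.
toy = [
    TaskResult(passed_tests=True, exploited=False),
    TaskResult(passed_tests=True, exploited=True),
    TaskResult(passed_tests=False, exploited=False),
    TaskResult(passed_tests=False, exploited=True),
]
summary = summarize(toy)
```

Conditioning the exploit rate on correctness, as this sketch does, mirrors the abstract's phrasing "around half of the correct programs"; a program that fails its functional tests is excluded from that ratio rather than counted as secure.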

Subjects:	Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Cite as:	arXiv:2502.11844 [cs.CR]
	(or arXiv:2502.11844v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2502.11844

Submission history

From: Mark Vero
[v1] Mon, 17 Feb 2025 14:37:47 UTC (2,860 KB)
[v2] Thu, 20 Feb 2025 14:52:31 UTC (2,860 KB)
[v3] Fri, 30 May 2025 13:01:16 UTC (2,853 KB)
