-
PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages
Authors:
Deniz Simsek,
Aryaz Eghbali,
Michael Pradel
Abstract:
Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and…
▽ More
Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and avoiding regressions. Creating a PoC exploit is challenging because vulnerability reports are informal and often incomplete, and because it requires a detailed understanding of how inputs passed to potentially vulnerable APIs may reach security-relevant sinks. In this paper, we present PoCGen, a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. This is the first fully autonomous approach to use large language models (LLMs) in tandem with static and dynamic analysis techniques for PoC exploit generation. PoCGen leverages an LLM for understanding vulnerability reports, for generating candidate PoC exploits, and for validating and refining them. Our approach successfully generates exploits for 77% of the vulnerabilities in the SecBench$.$js dataset and 39% in a new, more challenging dataset of 794 recent vulnerabilities. This success rate significantly outperforms a recent baseline (by 45 absolute percentage points), while imposing an average cost of $0.02 per generated exploit.
△ Less
Submitted 6 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding
Authors:
Aryaz Eghbali,
Michael Pradel
Abstract:
Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. However, these models are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. Instead, LLMs often invent, or "hallucinate", non-existent APIs or produce variants of already exi…
▽ More
Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. However, these models are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. Instead, LLMs often invent, or "hallucinate", non-existent APIs or produce variants of already existing code. This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt. The approach exploits the observation that predictions by LLMs often resemble the desired code, but they fail to correctly refer to already existing APIs. De-Hallucinator automatically identifies project-specific API references related to the model's initial predictions and adds these references into the prompt. Unlike retrieval-augmented generation (RAG), our approach uses the initial prediction(s) by the model to iteratively retrieve increasingly suitable API references. Our evaluation applies the approach to two tasks: predicting API usages in Python and generating tests in JavaScript. We show that De-Hallucinator consistently improves the generated code across five LLMs. In particular, the approach improves the edit distance by 23.3-50.6% and the recall of correctly predicted API usages by 23.9-61.0% for code completion, and improves the number of fixed tests that initially failed because of hallucinations by 63.2%, resulting in a 15.5% increase in statement coverage for test generation.
△ Less
Submitted 19 June, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
Authors:
Max Schäfer,
Sarah Nadi,
Aryaz Eghbali,
Frank Tip
Abstract:
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLM…
▽ More
Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
△ Less
Submitted 11 December, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
An Almost Tight RMR Lower Bound for Abortable Test-And-Set
Authors:
Aryaz Eghbali,
Philipp Woelfel
Abstract:
We prove a lower bound of Omega(log n/loglog n) for the remote memory reference (RMR) complexity of abortable test-and-set (leader election) in the cache-coherent (CC) and the distributed shared memory (DSM) model.
This separates the complexities of abortable and non-abortable test-and-set, as the latter has constant RMR complexity (Golab, Hendler, Woelfel, SIAM Journal of Computing Vol. 39, 201…
▽ More
We prove a lower bound of Omega(log n/loglog n) for the remote memory reference (RMR) complexity of abortable test-and-set (leader election) in the cache-coherent (CC) and the distributed shared memory (DSM) model.
This separates the complexities of abortable and non-abortable test-and-set, as the latter has constant RMR complexity (Golab, Hendler, Woelfel, SIAM Journal of Computing Vol. 39, 2010).
Golab, Hendler, Hadzilacos and Woelfel (Distributed Computing Vol. 25, 2012) showed that compare-and-swap can be implemented from registers and TAS objects with constant RMR complexity.
We observe that a small modification to that implementation is abortable, provided that the used TAS objects are abortable.
△ Less
Submitted 13 May, 2018;
originally announced May 2018.