-
CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning
Authors:
Monoshi Kumar Roy,
Simin Chen,
Benjamin Steenhoek,
Jinjun Peng,
Gail Kaiser,
Baishakhi Ray,
Wei Le
Abstract:
Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical…
▽ More
Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness in evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks concerned with the software engineering of real-world code. We collected Python, C and Java software projects from real-world repositories. We executed tests from these repositories, collected their execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks. We then performed comprehensive evaluations on state-of-the-art LLMs. Our results show a clear performance gap for the models to handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limit models' capabilities of code reasoning. Besides dataset, benchmark and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post training. Our code and data are located at https://codesense-bench.github.io/.
△ Less
Submitted 31 May, 2025;
originally announced June 2025.
-
Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation
Authors:
Benjamin Steenhoek,
Michele Tufano,
Neel Sundaresan,
Alexey Svyatkovskiy
Abstract:
Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we pr…
▽ More
Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and show that LLMs frequently do generate undesirable test smells -- up to 37% of the time. Then, we implemented lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.
△ Less
Submitted 6 January, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE
Authors:
Benjamin Steenhoek,
Kalpathy Sivaraman,
Renata Saldivar Gonzalez,
Yevhen Mohylevskyy,
Roshanak Zilouchian Moghaddam,
Wei Le
Abstract:
This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DeepVulGuard scans code for vulnerabilities (incl…
▽ More
This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DeepVulGuard scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, provides natural-language explanations for alerts and fixes, leveraging chat interfaces. We recruited 17 professional software developers at Microsoft, observed their usage of the tool on their code, and conducted interviews to assess the tool's usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users' perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user's codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at https://doi.org/10.6084/m9.figshare.26367139.
△ Less
Submitted 25 April, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
To Err is Machine: Vulnerability Detection Challenges LLM Reasoning
Authors:
Benjamin Steenhoek,
Md Mahbubur Rahman,
Monoshi Kumar Roy,
Mirza Sanjida Alam,
Hengbo Tong,
Swarna Das,
Earl T. Barr,
Wei Le
Abstract:
In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses s…
▽ More
In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses shows that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. We explored prominent models and training settings to understand their effects on vulnerability detection performance -- including better prompts, larger models, more pre-training data, and fine-tuning -- but none led to significant improvements. This raises the question of whether simply scaling training data and model size will allow us to "solve" complex code reasoning tasks like vulnerability detection, or if a fundamental shift in modeling and training techniques is required. We also explored adding domain knowledge to prompts; although it helped certain models understand some code semantics, vulnerability detection requires multi-step reasoning, and these models still failed in steps, such as reasoning about variable relations. Our results suggest that new models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially on the current pretraining mixtures, in which execution data is scarce. Success on vulnerability detection as a code reasoning task can benefit many areas of software engineering such as debugging, test input generation, and program repair. Our code and data are available at https://doi.org/10.6084/m9.figshare.27368025.
△ Less
Submitted 7 January, 2025; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection
Authors:
Benjamin Steenhoek,
Md Mahbubur Rahman,
Shaila Sharmin,
Wei Le
Abstract:
Recently, pretrained language models have shown state-of-the-art performance on the vulnerability detection task. These models are pretrained on a large corpus of source code, then fine-tuned on a smaller supervised vulnerability dataset. Due to the different training objectives and the performance of the models, it is interesting to consider whether the models have learned the semantics of code r…
▽ More
Recently, pretrained language models have shown state-of-the-art performance on the vulnerability detection task. These models are pretrained on a large corpus of source code, then fine-tuned on a smaller supervised vulnerability dataset. Due to the different training objectives and the performance of the models, it is interesting to consider whether the models have learned the semantics of code relevant to vulnerability detection, namely bug semantics, and if so, how the alignment to bug semantics relates to model performance. In this paper, we analyze the models using three distinct methods: interpretability tools, attention analysis, and interaction matrix analysis. We compare the models' influential feature sets with the bug semantic features which define the causes of bugs, including buggy paths and Potentially Vulnerable Statements (PVS). We find that (1) better-performing models also aligned better with PVS, (2) the models failed to align strongly to PVS, and (3) the models failed to align at all to buggy paths. Based on our analysis, we developed two annotation methods which highlight the bug semantics inside the model's inputs. We evaluated our approach on four distinct transformer models and four vulnerability datasets and found that our annotations improved the models' performance in the majority of settings - 11 out of 16, with up to 9.57 points improvement in F1 score compared to conventional fine-tuning. We further found that with our annotations, the models aligned up to 232% better to potentially vulnerable statements. Our findings indicate that it is helpful to provide the model with information of the bug semantics, that the model can attend to it, and motivate future work in learning more complex path-based bug semantics. Our code and data are available at https://figshare.com/s/4a16a528d6874aad51a0.
△ Less
Submitted 7 November, 2023;
originally announced November 2023.
-
Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation
Authors:
Benjamin Steenhoek,
Michele Tufano,
Neel Sundaresan,
Alexey Svyatkovskiy
Abstract:
Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may includ…
▽ More
Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). To begin, we analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. Thus, we train specific reward models for each static quality metric, then utilize Proximal Policy Optimization (PPO) to train models for optimizing a single quality metric at a time. Furthermore, we amalgamate these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how reliably utilize RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generated high-quality test cases compared to the base LLM, improving the model by up to 21%, and successfully generates nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at https://figshare.com/s/ded476c8d4c221222849.
△ Less
Submitted 6 January, 2025; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Reproducing Failures in Fault Signatures
Authors:
Ashwin Kallingal Joshy,
Benjamin Steenhoek,
Xiuyuan Guo,
Wei Le
Abstract:
Software often fails in the field, however reproducing and debugging field failures is very challenging: the failure-inducing input may be missing, and the program setup can be complicated and hard to reproduce by the developers. In this paper, we propose to generate fault signatures from the failure locations and the original source code to reproduce the faults in small executable programs. We sa…
▽ More
Software often fails in the field, however reproducing and debugging field failures is very challenging: the failure-inducing input may be missing, and the program setup can be complicated and hard to reproduce by the developers. In this paper, we propose to generate fault signatures from the failure locations and the original source code to reproduce the faults in small executable programs. We say that a fault signature reproduces the fault in the original program if the two failed in the same location, triggered the same error conditions after executing the same selective sequences of failure-inducing statements. A fault signature aims to contain only sufficient statements that can reproduce the faults. That way, it provides some context to inform how a fault is developed and also avoids unnecessary complexity and setups that may block fault diagnosis. To compute fault signatures from the failures, we applied a path-sensitive static analysis tool to generate a path that leads to the fault, and then applied an existing syntactic patching tool to convert the path into an executable program. Our evaluation on real-world bugs from Corebench, BugBench, and Manybugs shows that fault signatures can reproduce the fault for the original programs. Because fault signatures are less complex, automatic test input generation tools generated failure-inducing inputs that could not be generated by using the entire programs. Some failure-inducing inputs can be directly transferred to the original programs. Our experimental data are publicly available at https://doi.org/10.5281/zenodo.5430155.
△ Less
Submitted 19 September, 2023;
originally announced September 2023.
-
TRACED: Execution-aware Pre-training for Source Code
Authors:
Yangruibo Ding,
Ben Steenhoek,
Kexin Pei,
Gail Kaiser,
Wei Le,
Baishakhi Ray
Abstract:
Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic…
▽ More
Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic code properties, such as the branch coverage and the runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities.
To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during the pre-training, enabling the model to statically estimate the dynamic code properties without repeatedly executing code during task-specific fine-tuning.
To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves the statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value predictions. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.
△ Less
Submitted 12 June, 2023;
originally announced June 2023.
-
A Study of Static Warning Cascading Tools (Experience Paper)
Authors:
Xiuyuan Guo,
Ashwin Kallingal Joshy,
Benjamin Steenhoek,
Wei Le,
Lori Flynn
Abstract:
Static analysis is widely used for software assurance. However, static analysis tools can report an overwhelming number of warnings, many of which are false positives. Applying static analysis to a new version, a large number of warnings can be only relevant to the old version. Inspecting these warnings is a waste of time and can prevent developers from finding the new bugs in the new version. In…
▽ More
Static analysis is widely used for software assurance. However, static analysis tools can report an overwhelming number of warnings, many of which are false positives. Applying static analysis to a new version, a large number of warnings can be only relevant to the old version. Inspecting these warnings is a waste of time and can prevent developers from finding the new bugs in the new version. In this paper, we report the challenges of cascading warnings generated from two versions of programs. We investigated program differencing tools and extend them to perform warning cascading automatically. Specifically, we used textual based diff tool, namely SCALe, abstract syntax tree (AST) based diff tool, namely GumTree, and control flow graph (CFG) based diff tool, namely Hydrogen. We reported our experience of applying these tools and hopefully our findings can provide developers understandings of pros and cons of each approach. In our evaluation, we used 96 pairs of benchmark programs for which we know ground-truth bugs and fixes as well as 12 pairs of real-world open-source projects. Our tools and data are available at https: //github.com/WarningCas/WarningCascading_Data.
△ Less
Submitted 3 May, 2023;
originally announced May 2023.
-
An Empirical Study of Deep Learning Models for Vulnerability Detection
Authors:
Benjamin Steenhoek,
Md Mahbubur Rahman,
Richard Jiles,
Wei Le
Abstract:
Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for the vulnerability detection. In this…
▽ More
Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for the vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider "hard" to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://doi.org/10.6084/m9.figshare.20791240.
△ Less
Submitted 12 February, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection
Authors:
Benjamin Steenhoek,
Hongyang Gao,
Wei Le
Abstract:
Deep learning-based vulnerability detection has shown great performance and, in some studies, outperformed static analysis tools. However, the highest-performing approaches use token-based transformer models, which are not the most efficient to capture code semantics required for vulnerability detection. Classical program analysis techniques such as dataflow analysis can detect many types of bugs…
▽ More
Deep learning-based vulnerability detection has shown great performance and, in some studies, outperformed static analysis tools. However, the highest-performing approaches use token-based transformer models, which are not the most efficient to capture code semantics required for vulnerability detection. Classical program analysis techniques such as dataflow analysis can detect many types of bugs based on their root causes. In this paper, we propose to combine such causal-based vulnerability detection algorithms with deep learning, aiming to achieve more efficient and effective vulnerability detection. Specifically, we designed DeepDFA, a dataflow analysis-inspired graph learning framework and an embedding technique that enables graph learning to simulate dataflow computation. We show that DeepDFA is both performant and efficient. DeepDFA outperformed all non-transformer baselines. It was trained in 9 minutes, 75x faster than the highest-performing baseline model. When using only 50+ vulnerable and several hundreds of total examples as training data, the model retained the same performance as 100% of the dataset. DeepDFA also generalized to real-world vulnerabilities in DbgBench; it detected 8.7 out of 17 vulnerabilities on average across folds and was able to distinguish between patched and buggy versions, while the highest-performing baseline models did not detect any vulnerabilities. By combining DeepDFA with a large language model, we surpassed the state-of-the-art vulnerability detection performance on the Big-Vul dataset with 96.46 F1 score, 97.82 precision, and 95.14 recall. Our replication package is located at https://doi.org/10.6084/m9.figshare.21225413 .
△ Less
Submitted 1 October, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Validating Static Warnings via Testing Code Fragments
Authors:
Ashwin Kallingal Joshy,
Xueyuan Chen,
Benjamin Steenhoek,
Wei Le
Abstract:
Static analysis is an important approach for finding bugs and vulnerabilities in software. However, inspecting and confirming static warnings are challenging and time-consuming. In this paper, we present a novel solution that automatically generates test cases based on static warnings to validate true and false positives. We designed a syntactic patching algorithm that can generate syntactically v…
▽ More
Static analysis is an important approach for finding bugs and vulnerabilities in software. However, inspecting and confirming static warnings are challenging and time-consuming. In this paper, we present a novel solution that automatically generates test cases based on static warnings to validate true and false positives. We designed a syntactic patching algorithm that can generate syntactically valid, semantic preserving executable code fragments from static warnings. We developed a build and testing system to automatically test code fragments using fuzzers, KLEE and Valgrind. We evaluated our techniques using 12 real-world C projects and 1955 warnings from two commercial static analysis tools. We successfully built 68.5% code fragments and generated 1003 test cases. Through automatic testing, we identified 48 true positives and 27 false positives, and 205 likely false positives. We matched 4 CVE and real-world bugs using Helium, and they are only triggered by our tool but not other baseline tools. We found that testing code fragments is scalable and useful; it can trigger bugs that testing entire programs or testing procedures failed to trigger.
△ Less
Submitted 28 June, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.