-
Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE
Authors:
Benjamin Steenhoek,
Kalpathy Sivaraman,
Renata Saldivar Gonzalez,
Yevhen Mohylevskyy,
Roshanak Zilouchian Moghaddam,
Wei Le
Abstract:
This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DeepVulGuard scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, and provides natural-language explanations for alerts and fixes, leveraging chat interfaces. We recruited 17 professional software developers at Microsoft, observed their usage of the tool on their code, and conducted interviews to assess the tool's usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users' perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user's codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at https://doi.org/10.6084/m9.figshare.26367139.
Submitted 25 April, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Authors:
Anisha Agarwal,
Aaron Chan,
Shubham Chandel,
Jinu Jang,
Shaun Miller,
Roshanak Zilouchian Moghaddam,
Yevhen Mohylevskyy,
Neel Sundaresan,
Michele Tufano
Abstract:
The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
Submitted 21 February, 2024;
originally announced February 2024.
-
Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?
Authors:
Aaron Chan,
Anant Kharkar,
Roshanak Zilouchian Moghaddam,
Yevhen Mohylevskyy,
Alec Helyar,
Eslam Kamal,
Mohamed Elkamhawy,
Neel Sundaresan
Abstract:
Software vulnerabilities impose significant costs on enterprises. Despite extensive efforts in research and development of software vulnerability detection methods, uncaught vulnerabilities continue to put software owners and users at risk. Many current vulnerability detection methods require that code snippets can compile and build before attempting detection. This, unfortunately, introduces a long latency between the time a vulnerability is injected and the time it is removed, which can substantially increase the cost of fixing a vulnerability. We recognize that the current advances in machine learning can be used to detect vulnerable code patterns on syntactically incomplete code snippets as the developer is writing the code at EditTime. In this paper we present a practical system that leverages deep learning on a large-scale data set of vulnerable code patterns to learn complex manifestations of more than 250 vulnerability types and detect vulnerable code patterns at EditTime. We discuss zero-shot, few-shot, and fine-tuning approaches on state-of-the-art pre-trained Large Language Models (LLMs). We show that, in comparison with state-of-the-art vulnerability detection models, our approach improves the state of the art by 10%. We also evaluate our approach to detect vulnerabilities in code auto-generated by code LLMs. Evaluation on a benchmark of high-risk code scenarios shows a vulnerability reduction of up to 90%.
Submitted 22 May, 2023;
originally announced June 2023.
-
Generating Examples From CLI Usage: Can Transformers Help?
Authors:
Roshanak Zilouchian Moghaddam,
Spandan Garg,
Colin B. Clement,
Yevhen Mohylevskyy,
Neel Sundaresan
Abstract:
Continuous evolution in modern software often causes documentation, tutorials, and examples to be out of sync with changing interfaces and frameworks. Relying on outdated documentation and examples can lead programs to fail or be less efficient or even less secure. In response, programmers need to regularly turn to other resources on the web such as StackOverflow for examples to guide them in writing software. We recognize that this inconvenient, error-prone, and expensive process can be improved by using machine learning applied to software usage data. In this paper, we present our practical system which uses machine learning on large-scale telemetry data and documentation corpora, generating appropriate and complex examples that can be used to improve documentation. We discuss both feature-based and transformer-based machine learning approaches and demonstrate that our system achieves 100% coverage for the used functionalities in the product, provides up-to-date examples upon every release, and reduces the number of PRs submitted by software owners writing and editing documentation by >68%. We also share valuable lessons learnt during the 3 years that our production-quality system has been deployed for Azure Cloud Command Line Interface (Azure CLI).
Submitted 26 April, 2022;
originally announced April 2022.
-
The voter model chordal interface in two dimensions
Authors:
Mark Holmes,
Yevhen Mohylevskyy,
Charles M. Newman
Abstract:
Consider the voter model on a box of side length $L$ (in the triangular lattice) with boundary votes fixed forever as type 0 or type 1 on two different halves of the boundary. Motivated by analogous questions in percolation, we study several geometric objects at stationarity, as $L\rightarrow \infty$. One is the interface between the (large -- i.e., boundary connected) 0-cluster and 1-cluster. Another is the set of large "coalescing classes" determined by the coalescing walk process dual to the voter model.
Submitted 30 August, 2014;
originally announced September 2014.
-
Ergodicity and Percolation for Variants of One-dimensional Voter Models
Authors:
Y. Mohylevskyy,
C. M. Newman,
K. Ravishankar
Abstract:
We study variants of one-dimensional q-color voter models in discrete time. In addition to the usual voter model transitions in which a color is chosen from the left or right neighbor of a site, there are two types of noisy transitions. One is bulk nucleation where a new random color is chosen. The other is boundary nucleation where a random color is chosen only if the two neighbors have distinct colors. We prove under a variety of conditions on q and the magnitudes of the two noise parameters that the system is ergodic, i.e., there is convergence to a unique invariant distribution. The methods are percolation-based using the graphical structure of the model which consists of coalescing random walks combined with branching (boundary nucleation) and dying (bulk nucleation).
Submitted 23 April, 2013; v1 submitted 8 December, 2011;
originally announced December 2011.