Search | arXiv e-print repository

LLMorpheus: Mutation Testing using Large Language Models

Authors: Frank Tip, Jonathan Bell, Max Schaefer

Abstract: In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-", or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by… ▽ More In mutation testing, the quality of a test suite is evaluated by introducing faults into a program and determining whether the program's tests detect them. Most existing approaches for mutation testing involve the application of a fixed set of mutation operators, e.g., replacing a "+" with a "-", or removing a function's body. However, certain types of real-world bugs cannot easily be simulated by such approaches, limiting their effectiveness. This paper presents a technique for mutation testing where placeholders are introduced at designated locations in a program's source code and where a Large Language Model (LLM) is prompted to ask what they could be replaced with. The technique is implemented in LLMorpheus, a mutation testing tool for JavaScript, and evaluated on 13 subject packages, considering several variations on the prompting strategy, and using several LLMs. We find LLMorpheus to be capable of producing mutants that resemble existing bugs that cannot be produced by StrykerJS, a state-of-the-art mutation testing tool. Moreover, we report on the running time, cost, and number of mutants produced by LLMorpheus, demonstrating its practicality. △ Less

Submitted 7 March, 2025; v1 submitted 15 April, 2024; originally announced April 2024.

arXiv:2306.08741 [pdf, other]

A statistical approach for finding property-access errors

Authors: Ellen Arteca, Max Schäfer, Frank Tip

Abstract: We study the problem of finding incorrect property accesses in JavaScript where objects do not have a fixed layout, and properties (including methods) can be added, overwritten, and deleted freely throughout the lifetime of an object. Since referencing a non-existent property is not an error in JavaScript, accidental accesses to non-existent properties (caused, perhaps, by a typo or by a misunders… ▽ More We study the problem of finding incorrect property accesses in JavaScript where objects do not have a fixed layout, and properties (including methods) can be added, overwritten, and deleted freely throughout the lifetime of an object. Since referencing a non-existent property is not an error in JavaScript, accidental accesses to non-existent properties (caused, perhaps, by a typo or by a misunderstanding of API documentation) can go undetected without thorough testing, and may manifest far from the source of the problem. We propose a two-phase approach for detecting property access errors based on the observation that, in practice, most property accesses will be correct. First a large number of property access patterns is collected from an extensive corpus of real-world JavaScript code, and a statistical analysis is performed to identify anomalous usage patterns. Specific instances of these patterns may not be bugs (due, e.g., dynamic type checks), so a local data-flow analysis filters out instances of anomalous property accesses that are safe and leaves only those likely to be actual bugs. We experimentally validate our approach, showing that on a set of 100 concrete instances of anomalous property accesses, the approach achieves a precision of 82% with a recall of 90%, making it suitable for practical use. We also conducted an experiment to determine how effective the popular VSCode code completion feature is at suggesting object properties, and found that, while it never suggested an incorrect property (precision of 100%), it failed to suggest the correct property in 62 out of 80 cases (recall of 22.5%). This shows that developers cannot rely on VSCode's code completion alone to ensure that all property accesses are valid. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: 11 pages, 3 figures

arXiv:2302.06527 [pdf, other]

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Authors: Max Schäfer, Sarah Nadi, Aryaz Eghbali, Frank Tip

Abstract: Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLM… ▽ More Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model. △ Less

Submitted 11 December, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

arXiv:2110.14162 [pdf, ps, other]

Stubbifier: Debloating Dynamic Server-Side JavaScript Applications

Authors: Alexi Turcotte, Ellen Arteca, Ashish Mishra, Saba Alimadadi, Frank Tip

Abstract: JavaScript is an increasingly popular language for server-side development, thanks in part to the Node.js runtime environment and its vast ecosystem of modules. With the Node.js package manager npm, users are able to easily include external modules as dependencies in their projects. However, npm installs modules with all of their functionality, even if only a fraction is needed, which causes an un… ▽ More JavaScript is an increasingly popular language for server-side development, thanks in part to the Node.js runtime environment and its vast ecosystem of modules. With the Node.js package manager npm, users are able to easily include external modules as dependencies in their projects. However, npm installs modules with all of their functionality, even if only a fraction is needed, which causes an undue increase in code size. Eliminating this unused functionality from distributions is desirable, but the sound analysis required to find unused code is difficult due to JavaScript's extreme dynamicity. We present a fully automatic technique that identifies unused code by constructing static or dynamic call graphs from the application's tests, and replacing code deemed unreachable with either file- or function-level stubs. If a stub is called, it will fetch and execute the original code on-demand, thus relaxing the requirement that the call graph be sound. The technique also provides an optional guarded execution mode to guard application against injection vulnerabilities in untested code that resulted from stub expansion. This technique is implemented in an open source tool called Stubbifier, which supports the ECMAScript 2019 standard. In an empirical evaluation on 15 Node.js applications and 75 clients of these applications, Stubbifier reduced application size by 56% on average while incurring only minor performance overhead. The evaluation also shows that Stubbifier's guarded execution mode is capable of preventing several known injection vulnerabilities that are manifested in stubbed-out code. Finally, Stubbifier can work alongside bundlers, popular JavaScript tools for bundling an application with its dependencies. For the considered subject applications, we measured an average size reduction of 37% in bundled distributions. △ Less

Submitted 27 October, 2021; originally announced October 2021.

Comments: 29 pages. This work has been submitted to the Journal on Empirical Software Engineering

arXiv:2107.13708 [pdf, other]

Learning how to listen: Automatically finding bug patterns in event-driven JavaScript APIs

Authors: Ellen Arteca, Max Schäfer, Frank Tip

Abstract: Event-driven programming is widely practiced in the JavaScript community, both on the client side to handle UI events and AJAX requests, and on the server side to accommodate long-running operations such as file or network I/O. Many popular event-based APIs allow event names to be specified as free-form strings without any validation, potentially leading to lost events for which no listener has be… ▽ More Event-driven programming is widely practiced in the JavaScript community, both on the client side to handle UI events and AJAX requests, and on the server side to accommodate long-running operations such as file or network I/O. Many popular event-based APIs allow event names to be specified as free-form strings without any validation, potentially leading to lost events for which no listener has been registered and dead listeners for events that are never emitted. In previous work, Madsen et al. presented a precise static analysis for detecting such problems, but their analysis does not scale because it may require a number of contexts that is exponential in the size of the program. Concentrating on the problem of detecting dead listeners, we present an approach to learn how to correctly use event-based APIs by first mining a large corpus of JavaScript code using a simple static analysis to identify code snippets that register an event listener, and then applying statistical modeling to identify anomalous patterns, which often indicate incorrect API usage. From a large-scale evaluation on 127,531 open-source JavaScript code bases, our technique was able to detect 75 anomalous listener-registration patterns, while maintaining a precision of 90.9% and recall of 7.5% over our validation set, demonstrating that a learning-based approach to detecting event-handling bugs is feasible. In an additional experiment, we investigated instances of these patterns in 25 open-source projects, and reported 30 issues to the project maintainers, of which 7 have been confirmed as bugs. △ Less

Submitted 11 February, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

Comments: 19 pages, 6 figures. Accepted and to appear in IEEE TSE

arXiv:1910.12935 [pdf, other]

Precise Dataflow Analysis of Event-Driven Applications

Authors: Ming-Ho Yee, Ayaz Badouraly, Ondřej Lhoták, Frank Tip, Jan Vitek

Abstract: Event-driven programming is widely used for implementing user interfaces, web applications, and non-blocking I/O. An event-driven program is organized as a collection of event handlers whose execution is triggered by events. Traditional static analysis techniques are unable to reason precisely about event-driven code because they conservatively assume that event handlers may execute in any order.… ▽ More Event-driven programming is widely used for implementing user interfaces, web applications, and non-blocking I/O. An event-driven program is organized as a collection of event handlers whose execution is triggered by events. Traditional static analysis techniques are unable to reason precisely about event-driven code because they conservatively assume that event handlers may execute in any order. This paper proposes an automatic transformation from Interprocedural Finite Distributive Subset (IFDS) problems to Interprocedural Distributed Environment (IDE) problems as a general solution to obtain precise static analysis of event-driven applications; problems in both forms can be solved by existing implementations. Our contribution is to show how to improve analysis precision by automatically enriching the former with information about the state of event handlers to filter out infeasible paths. We prove the correctness of our transformation and report on experiments with a proof-of-concept implementation for a subset of JavaScript. △ Less

Submitted 28 October, 2019; originally announced October 2019.

arXiv:1608.07261 [pdf, other]

Type Inference for Static Compilation of JavaScript (Extended Version)

Authors: Satish Chandra, Colin S. Gordon, Jean-Baptiste Jeannin, Cole Schlesinger, Manu Sridharan, Frank Tip, Youngil Choi

Abstract: We present a type system and inference algorithm for a rich subset of JavaScript equipped with objects, structural subtyping, prototype inheritance, and first-class methods. The type system supports abstract and recursive objects, and is expressive enough to accommodate several standard benchmarks with only minor workarounds. The invariants enforced by the types enable an ahead-of-time compiler to… ▽ More We present a type system and inference algorithm for a rich subset of JavaScript equipped with objects, structural subtyping, prototype inheritance, and first-class methods. The type system supports abstract and recursive objects, and is expressive enough to accommodate several standard benchmarks with only minor workarounds. The invariants enforced by the types enable an ahead-of-time compiler to carry out optimizations typically beyond the reach of static compilers for dynamic languages. Unlike previous inference techniques for prototype inheritance, our algorithm uses a combination of lower and upper bound propagation to infer types and discover type errors in all code, including uninvoked functions. The inference is expressed in a simple constraint language, designed to leverage off-the-shelf fixed point solvers. We prove soundness for both the type system and inference algorithm. An experimental evaluation showed that the inference is powerful, handling the aforementioned benchmarks with no manual type annotation, and that the inferred types enable effective static compilation. △ Less

Submitted 18 October, 2016; v1 submitted 25 August, 2016; originally announced August 2016.

Comments: Extended version of OOPSLA 2016 paper of the same name

arXiv:1605.01362 [pdf, other]

Trace Typing: An Approach for Evaluating Retrofitted Type Systems (Extended Version)

Authors: Esben Andreasen, Colin S. Gordon, Satish Chandra, Manu Sridharan, Frank Tip, Koushik Sen

Abstract: Recent years have seen growing interest in the retrofitting of type systems onto dynamically-typed programming languages, in order to improve type safety, programmer productivity, or performance. In such cases, type system developers must strike a delicate balance between disallowing certain coding patterns to keep the type system simple, or including them at the expense of additional complexity a… ▽ More Recent years have seen growing interest in the retrofitting of type systems onto dynamically-typed programming languages, in order to improve type safety, programmer productivity, or performance. In such cases, type system developers must strike a delicate balance between disallowing certain coding patterns to keep the type system simple, or including them at the expense of additional complexity and effort. Thus far, the process for designing retrofitted type systems has been largely ad hoc, because evaluating multiple variations of a type system on large bodies of existing code is a significant undertaking. We present trace typing: a framework for automatically and quantitatively evaluating variations of a retrofitted type system on large code bases. The trace typing approach involves gathering traces of program executions, inferring types for instances of variables and expressions occurring in a trace, and merging types according to merge strategies that reflect specific (combinations of) choices in the source-level type system design space. We evaluated trace typing through several experiments. We compared several variations of type systems retrofitted onto JavaScript, measuring the number of program locations with type errors in each case on a suite of over fifty thousand lines of JavaScript code. We also used trace typing to validate and guide the design of a new retrofitted type system that enforces fixed object layout for JavaScript objects. Finally, we leveraged the types computed by trace typing to automatically identify tag tests --- dynamic checks that refine a type --- and examined the variety of tests identified. △ Less

Submitted 4 May, 2016; originally announced May 2016.

Comments: Samsung Research America Technical Report

Report number: SRA-CSIC-2016-001 ACM Class: F.3.3

Showing 1–8 of 8 results for author: Tip, F