-
Bogus Bugs, Duplicates, and Revealing Comments: Data Quality Issues in NPR
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
The performance of a machine learning system is not only determined by the model but also, to a substantial degree, by the data it is trained on. With the increasing use of machine learning, issues related to data quality have become a concern also in automated program repair research. In this position paper, we report some of the data-related issues we have come across when working with several l…
▽ More
The performance of a machine learning system is not only determined by the model but also, to a substantial degree, by the data it is trained on. With the increasing use of machine learning, issues related to data quality have become a concern also in automated program repair research. In this position paper, we report some of the data-related issues we have come across when working with several large APR datasets and benchmarks, including, for instance, duplicates or "bogus bugs". We briefly discuss the potential impact of these problems on repair performance and propose possible remedies. We believe that more data-focused approaches could improve the performance and robustness of current and future APR systems.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Simple Fault Localization using Execution Traces
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
Traditional spectrum-based fault localization (SBFL) exploits differences in a program's coverage spectrum when run on passing and failing test cases. However, such runs can provide a wealth of additional information beyond mere coverage. Working with thousands of execution traces of short programs submitted to competitive programming contests and leveraging machine learning and additional runtime…
▽ More
Traditional spectrum-based fault localization (SBFL) exploits differences in a program's coverage spectrum when run on passing and failing test cases. However, such runs can provide a wealth of additional information beyond mere coverage. Working with thousands of execution traces of short programs submitted to competitive programming contests and leveraging machine learning and additional runtime, control-flow and lexical features, we present simple ways to improve SBFL. We also propose a simple trick to integrate context information. Our approach outperforms SBFL formulae such as Ochiai on our evaluation set as well as QuixBugs and requires neither a GPU nor any form of advanced program analysis. Existing SBFL solutions could possibly be improved with reasonable effort by adopting some of the proposed ideas.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions. Hence, such benchmarks do not test for other forms of code understanding. Moreover, there have been concerns abou…
▽ More
Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions. Hence, such benchmarks do not test for other forms of code understanding. Moreover, there have been concerns about contamination and leakage. That is, benchmark problems (or closely related problems) may appear in training set, strongly biasing benchmark results. In this work we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs written in four different programming languages. The majority of these programs throw an exception during runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating our benchmark on six state-of-the-art code LLMs we see modest performance ranging from 19 to 38% (F1 score). Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, as ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Extracting Fix Ingredients using Language Models
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
Deep learning and language models are increasingly dominating automated program repair research. While previous generate-and-validate approaches were able to find and use fix ingredients on a file or even project level, neural language models are limited to the code that fits their input window. In this work we investigate how important identifier ingredients are in neural program repair and prese…
▽ More
Deep learning and language models are increasingly dominating automated program repair research. While previous generate-and-validate approaches were able to find and use fix ingredients on a file or even project level, neural language models are limited to the code that fits their input window. In this work we investigate how important identifier ingredients are in neural program repair and present ScanFix, an approach that leverages an additional scanner model to extract identifiers from a bug's file and potentially project-level context. We find that lack of knowledge of far-away identifiers is an important cause of failed repairs. Augmenting repair model input with scanner-extracted identifiers yields relative improvements of up to 31%. However, ScanFix is outperformed by a model with a large input window (> 5k tokens). When passing ingredients from the ground-truth fix, improvements are even higher. This shows that, with refined extraction techniques, ingredient scanning, similar to fix candidate ranking, could have the potential to become an important subtask of future automated repair systems. At the same time, it also demonstrates that this idea is subject to Sutton's bitter lesson and may be rendered unnecessary by new code models with ever-increasing context windows.
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Impermanent Identifiers: Enhanced Source Code Comprehension and Refactoring
Authors:
Eduardo Martins Guerra,
Andre A. S. Ivo,
Fernando O. Pereira,
Romain Robbes,
Andrea Janes,
Fabio Fagundes Silveira
Abstract:
In response to the prevailing challenges in contemporary software development, this article introduces an innovative approach to code augmentation centered around Impermanent Identifiers. The primary goal is to enhance the software development experience by introducing dynamic identifiers that adapt to changing contexts, facilitating more efficient interactions between developers and source code,…
▽ More
In response to the prevailing challenges in contemporary software development, this article introduces an innovative approach to code augmentation centered around Impermanent Identifiers. The primary goal is to enhance the software development experience by introducing dynamic identifiers that adapt to changing contexts, facilitating more efficient interactions between developers and source code, ultimately advancing comprehension, maintenance, and collaboration in software development. Additionally, this study rigorously evaluates the adoption and acceptance of Impermanent Identifiers within the software development landscape. Through a comprehensive empirical examination, we investigate how developers perceive and integrate this approach into their daily programming practices, exploring perceived benefits, potential barriers, and factors influencing its adoption. In summary, this article charts a new course for code augmentation, proposing Impermanent Identifiers as its cornerstone while assessing their feasibility and acceptance among developers. This interdisciplinary research seeks to contribute to the continuous improvement of software development practices and the progress of code augmentation technology.
△ Less
Submitted 14 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
What the Fix? A Study of ASATs Rule Documentation
Authors:
Corentin Latappy,
Thomas Degueule,
Jean-Rémy Falleri,
Romain Robbes,
Xavier Blanc,
Cédric Teyton
Abstract:
Automatic Static Analysis Tools (ASATs) are widely used by software developers to diffuse and enforce coding practices. Yet, we know little about the documentation of ASATs, despite it being critical to learn about the coding practices in the first place. We shed light on this through several contributions. First, we analyze the documentation of more than 100 rules of 16 ASATs for multiple program…
▽ More
Automatic Static Analysis Tools (ASATs) are widely used by software developers to diffuse and enforce coding practices. Yet, we know little about the documentation of ASATs, despite it being critical to learn about the coding practices in the first place. We shed light on this through several contributions. First, we analyze the documentation of more than 100 rules of 16 ASATs for multiple programming languages, and distill a taxonomy of the purposes of the documentation-What triggers a rule; Why it is important; and how to Fix an issue-and its types of contents. Then, we conduct a survey to assess the effectiveness of the documentation in terms of its goals and types of content. We highlight opportunities for improvement in ASAT documentation. In particular, we find that the Why purpose is missing in half of the rules we survey; moreover, when the Why is present, it is more likely to have quality issues than the What and the Fix.
△ Less
Submitted 13 February, 2024;
originally announced February 2024.
-
INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers
Authors:
Anjan Karmakar,
Romain Robbes
Abstract:
Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption in practice, e.g. for code completion. Yet, we still know very little about what these pre-trained models learn about source code. In this article, we use probing--simple diagnostic tasks that do not further train the models--to disc…
▽ More
Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption in practice, e.g. for code completion. Yet, we still know very little about what these pre-trained models learn about source code. In this article, we use probing--simple diagnostic tasks that do not further train the models--to discover to what extent pre-trained models learn about specific aspects of source code. We use an extensible framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code. We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline. We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics. Surprisingly, we find that for some probing tasks, BERT is competitive with the source code models, indicating that there are ample opportunities to improve source-code specific pre-training on the respective code characteristics. We encourage other researchers to evaluate their models with our probing task suite, so that they may peer into the hidden layers of the models and identify what intrinsic code characteristics are encoded.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
Out of Context: How important is Local Context in Neural Program Repair?
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
Deep learning source code models have been applied very successfully to the problem of automated program repair. One of the standing issues is the small input window of current models which often cannot fully fit the context code required for a bug fix (e.g., method or class declarations of a project). Instead, input is often restricted to the local context, that is, the lines below and above the…
▽ More
Deep learning source code models have been applied very successfully to the problem of automated program repair. One of the standing issues is the small input window of current models which often cannot fully fit the context code required for a bug fix (e.g., method or class declarations of a project). Instead, input is often restricted to the local context, that is, the lines below and above the bug location. In this work we study the importance of this local context on repair success: how much local context is needed?; is context before or after the bug location more important? how is local context tied to the bug type? To answer these questions we train and evaluate Transformer models in many different local context configurations on three datasets and two programming languages. Our results indicate that overall repair success increases with the size of the local context (albeit not for all bug types) and confirm the common practice that roughly 50-60% of the input window should be used for context leading the bug. Our results are not only relevant for researchers working on Transformer-based APR tools but also for benchmark and dataset creators who must decide what and how much context to include in their datasets.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
RunBugRun -- An Executable Dataset for Automated Program Repair
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
Recently, we can notice a transition to data-driven techniques in Automated Program Repair (APR), in particular towards deep neural networks. This entails training on hundreds of thousands or even millions of non-executable code fragments. We would like to bring more attention to an aspect of code often neglected in Neural Program Repair (NPR), namely its execution. Code execution has several sign…
▽ More
Recently, we can notice a transition to data-driven techniques in Automated Program Repair (APR), in particular towards deep neural networks. This entails training on hundreds of thousands or even millions of non-executable code fragments. We would like to bring more attention to an aspect of code often neglected in Neural Program Repair (NPR), namely its execution. Code execution has several significant advantages. It allows for test-based evaluation of candidate fixes and can provide valuable information to aid repair. In this work we present a fully executable dataset of 450,000 small buggy/fixed program pairs originally submitted to programming competition websites written in eight different programming languages. Along with the dataset we provide infrastructure to compile, safely execute and test programs as well as fine-grained bug-type labels. To give a point of reference, we provide basic evaluation results for two baselines, one based on a generate-and-validate approach and one on deep learning. With this dataset we follow several goals: we want to lift Neural Program Repair beyond fully static code representations, foster the use of execution-based features and, by including several different languages, counterbalance the predominance of Java in the current landscape of APR datasets and benchmarks.
△ Less
Submitted 3 April, 2023;
originally announced April 2023.
-
JEMMA: An Extensible Java Dataset for ML4Code Applications
Authors:
Anjan Karmakar,
Miltiadis Allamanis,
Romain Robbes
Abstract:
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barr…
▽ More
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
△ Less
Submitted 18 December, 2022;
originally announced December 2022.
-
Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation
Authors:
Anjan Karmakar,
Julian Aron Prenner,
Marco D'Ambros,
Romain Robbes
Abstract:
The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies.
In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python…
▽ More
The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies.
In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.
△ Less
Submitted 5 December, 2022;
originally announced December 2022.
-
Automatic Program Repair with OpenAI's Codex: Evaluating QuixBugs
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
OpenAI's Codex, a GPT-3 like model trained on a large code corpus, has made headlines in and outside of academia. Given a short user-provided description, it is capable of synthesizing code snippets that are syntactically and semantically valid in most cases. In this work, we want to investigate whether Codex is able to localize and fix bugs, a task of central interest in the field of automated pr…
▽ More
OpenAI's Codex, a GPT-3 like model trained on a large code corpus, has made headlines in and outside of academia. Given a short user-provided description, it is capable of synthesizing code snippets that are syntactically and semantically valid in most cases. In this work, we want to investigate whether Codex is able to localize and fix bugs, a task of central interest in the field of automated program repair. Our initial evaluation uses the multi-language QuixBugs benchmark (40 bugs in both Python and Java). We find that, despite not being trained for APR, Codex is surprisingly effective, and competitive with recent state of the art techniques. Our results also show that Codex is slightly more successful at repairing Python than Java.
△ Less
Submitted 6 November, 2021;
originally announced November 2021.
-
What do pre-trained code models know about code?
Authors:
Anjan Karmakar,
Romain Robbes
Abstract:
Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an op…
▽ More
Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.
One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and get insight into the model sample-efficiency.
We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.
△ Less
Submitted 25 August, 2021;
originally announced August 2021.
-
Making the most of small Software Engineering datasets with modern machine learning
Authors:
Julian Aron Prenner,
Romain Robbes
Abstract:
This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software Engineering,there exist many small (< 1 000 samples) and medium-sized (< 100 000 samples) datasets. While deep learning has set the state of the art in many mac…
▽ More
This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software Engineering,there exist many small (< 1 000 samples) and medium-sized (< 100 000 samples) datasets. While deep learning has set the state of the art in many machine learning tasks, it is only recently that it has proven effective on small-sized datasets, primarily thanks to pre-training, a semi-supervised learning technique that leverages abundant unlabelled data alongside scarce labelled data.In this work, we evaluate pre-trained Transformer models on a selection of 13 smaller datasets from the SE literature, covering both,source code and natural language. Our results suggest that pre-trained Transformers are competitive and in some cases superior to previous models, especially for tasks involving natural language; whereas for source code tasks, in particular for very small datasets,traditional machine learning methods often has the edge.In addition, we experiment with several techniques that ought to aid training on small datasets, including active learning, data augmentation, soft labels, self-training and intermediate-task fine-tuning, and issue recommendations on when they are effective. We also release all the data, scripts, and most importantly pre-trained models for the community to reuse on their own datasets.
△ Less
Submitted 29 June, 2021;
originally announced June 2021.
-
Mining Software Repositories with a Collaborative Heuristic Repository
Authors:
Hlib Babii,
Julian Aron Prenner,
Laurin Stricker,
Anjan Karmakar,
Andrea Janes,
Romain Robbes
Abstract:
Many software engineering studies or tasks rely on categorizing software engineering artifacts. In practice, this is done either by defining simple but often imprecise heuristics, or by manual labelling of the artifacts. Unfortunately, errors in these categorizations impact the tasks that rely on them. To improve the precision of these categorizations, we propose to gather heuristics in a collabor…
▽ More
Many software engineering studies or tasks rely on categorizing software engineering artifacts. In practice, this is done either by defining simple but often imprecise heuristics, or by manual labelling of the artifacts. Unfortunately, errors in these categorizations impact the tasks that rely on them. To improve the precision of these categorizations, we propose to gather heuristics in a collaborative heuristic repository, to which researchers can contribute a large amount of diverse heuristics for a variety of tasks on a variety of SE artifacts. These heuristics are then leveraged by state-of-the-art weak supervision techniques to train high-quality classifiers, thus improving the categorizations. We present an initial version of the heuristic repository, which we applied to the concrete task of commit classification.
△ Less
Submitted 2 March, 2021;
originally announced March 2021.
-
Empirical Standards for Software Engineering Research
Authors:
Paul Ralph,
Nauman bin Ali,
Sebastian Baltes,
Domenico Bianculli,
Jessica Diaz,
Yvonne Dittrich,
Neil Ernst,
Michael Felderer,
Robert Feldt,
Antonio Filieri,
Breno Bernard Nicolau de França,
Carlo Alberto Furia,
Greg Gay,
Nicolas Gold,
Daniel Graziotin,
Pinjia He,
Rashina Hoda,
Natalia Juristo,
Barbara Kitchenham,
Valentina Lenarduzzi,
Jorge Martínez,
Jorge Melegati,
Daniel Mendez,
Tim Menzies,
Jefferson Molleri
, et al. (18 additional authors not shown)
Abstract:
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around resear…
▽ More
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.
△ Less
Submitted 4 March, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Authors:
Rafael-Michael Karampatsis,
Hlib Babii,
Romain Robbes,
Charles Sutton,
Andrea Janes
Abstract:
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large…
▽ More
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
All datasets, code, and trained models used in this work are publicly available.
△ Less
Submitted 17 March, 2020;
originally announced March 2020.
-
Modeling Vocabulary for Big Code Machine Learning
Authors:
Hlib Babii,
Andrea Janes,
Romain Robbes
Abstract:
When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can lead to not being able to train models at all, others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are not often fully described. This paper lists important modeling…
▽ More
When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can lead to not being able to train models at all, others significantly affect performance, particularly for Neural Language Models. Yet, these decisions are not often fully described. This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, allowing to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.
△ Less
Submitted 3 April, 2019;
originally announced April 2019.