-
Leveraging Reward Models for Guiding Code Review Comment Generation
Authors:
Oussama Ben Sghaier,
Rosalia Tufano,
Gabriele Bavota,
Houari Sahraoui
Abstract:
Code review is a crucial component of modern software development, involving the evaluation of code quality, providing feedback on potential issues, and refining the code to address identified problems. Despite these benefits, code review can be rather time-consuming and is influenced by subjectivity and human factors. For these reasons, techniques to (partially) automate the code review process have been proposed in the literature. Among those, the ones exploiting deep learning (DL) are able to tackle the generative aspect of code review, by commenting on a given piece of code as a human reviewer would do (i.e., comment generation task) or by automatically implementing code changes required to address a reviewer's comment (i.e., code refinement task). In this paper, we introduce CoRAL, a deep learning framework automating review comment generation by exploiting reinforcement learning with a reward mechanism considering both the semantics of the generated comments and their usefulness as input for other models automating the code refinement task. The core idea is that if the DL model generates comments that are semantically similar to the expected ones or can be successfully implemented by a second model specialized in code refinement, these comments are likely to be meaningful and useful, thus deserving a high reward in the reinforcement learning framework. We present both quantitative and qualitative comparisons between the comments generated by CoRAL and those produced by the latest baseline techniques, highlighting the effectiveness and superiority of our approach.
Submitted 4 June, 2025;
originally announced June 2025.
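The reward idea described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the token-overlap score standing in for a semantic similarity measure, the `refinement_succeeds` callback standing in for a second code refinement model, and the weighted sum used to combine the two signals. The abstract does not specify how CoRAL computes or combines them.

```python
from typing import Callable


def token_jaccard(a: str, b: str) -> float:
    """Crude lexical stand-in for a semantic similarity score in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty comments are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def comment_reward(
    generated: str,
    reference: str,
    refinement_succeeds: Callable[[str], bool],
    weight: float = 0.5,
) -> float:
    """Hypothetical reward: similarity to the expected comment plus usefulness.

    `refinement_succeeds` is a placeholder for asking a code refinement model
    whether it can implement the generated comment; the weighted sum is an
    assumption, not the combination rule used in the paper.
    """
    similarity = token_jaccard(generated, reference)
    usefulness = 1.0 if refinement_succeeds(generated) else 0.0
    return weight * similarity + (1.0 - weight) * usefulness


if __name__ == "__main__":
    reward = comment_reward(
        "rename variable x to count",
        "please rename x to count",
        refinement_succeeds=lambda comment: True,  # pretend the refiner succeeded
    )
    print(f"reward = {reward:.3f}")
```

A comment that is close to the reference or that the refinement model can act on receives a higher reward, which is the signal a reinforcement learning loop would then optimize.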
-
How do Copilot Suggestions Impact Developers' Frustration and Productivity?
Authors:
Emanuela Guglielmi,
Venera Arnoudova,
Gabriele Bavota,
Rocco Oliveto,
Simone Scalabrino
Abstract:
Context. AI-based development tools, such as GitHub Copilot, are transforming the software development process by offering real-time code suggestions. These tools promise to improve productivity by reducing cognitive load and speeding up task completion. Previous exploratory studies, however, show that developers sometimes perceive the automatic suggestions as intrusive. As a result, they feel that their productivity has decreased. Theory. We propose two theories on the impact of automatic suggestions on frustration and productivity. First, we hypothesize that experienced developers are frustrated by automatic suggestions (mostly by irrelevant ones), and that this also negatively impacts their productivity. Second, we conjecture that novice developers benefit from automatic suggestions, which reduce the frustration caused by being stuck on a technical problem and thus increase their productivity. Objective. We plan to conduct a quasi-experimental study to test our theories. The empirical evidence we will collect will allow us to either corroborate or reject our theories. Method. We will involve at least 32 developers, both experts and novices. We will ask each of them to complete two software development tasks, one with automatic suggestions enabled and one with them disabled, allowing for within-subject comparisons. We will measure independent and dependent variables by monitoring developers' actions through an IDE plugin and screen recording. In addition, we will collect physiological data through a wearable device. We will use statistical hypothesis tests to study the effects of the treatments (i.e., automatic suggestions enabled/disabled) on the outcomes (frustration and productivity).
Submitted 9 April, 2025;
originally announced April 2025.
-
Why Personalizing Deep Learning-Based Code Completion Tools Matters
Authors:
Alessandro Giagnorio,
Alberto Martin-Lopez,
Gabriele Bavota
Abstract:
Deep learning (DL)-based code completion tools have transformed software development by enabling advanced code generation. These tools leverage models trained on vast amounts of code from numerous repositories, capturing general coding patterns. However, the impact of fine-tuning these models for specific organizations or developers to boost their performance on such subjects remains unexplored. In this work, we fill this gap by presenting solid empirical evidence answering this question. More specifically, we consider 136 developers from two organizations (Apache and Spring), two model architectures (T5 and Code Llama), and three model sizes (60M, 750M, and 7B trainable parameters). T5 models (60M, 750M) were pre-trained and fine-tuned on over 2,000 open-source projects, excluding the subject organizations' data, and compared against versions fine-tuned on organization- and developer-specific datasets. For the Code Llama model (7B), we compared the performance of the already pre-trained model publicly available online with the same model fine-tuned via parameter-efficient fine-tuning on organization- and developer-specific datasets. Our results show that there is a boost in prediction capabilities provided by both an organization-specific and a developer-specific additional fine-tuning, with the former being particularly performant. Such a finding generalizes across (i) the two subject organizations (i.e., Apache and Spring) and (ii) models of completely different magnitude (from 60M to 7B trainable parameters). Finally, we show that DL models fine-tuned on an organization-specific dataset achieve the same completion performance as pre-trained code models that are used out of the box and are roughly 10× larger, with consequent savings in terms of deployment and inference cost (e.g., smaller GPUs needed).
Submitted 18 March, 2025;
originally announced March 2025.
-
Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation
Authors:
Cristina Improta,
Rosalia Tufano,
Pietro Liguori,
Domenico Cotroneo,
Gabriele Bavota
Abstract:
Deep Learning-based code generators have seen significant advancements in recent years. Tools such as GitHub Copilot are used by thousands of developers with the main promise of a boost in productivity. However, researchers have recently questioned their impact on code quality, showing, for example, that code generated by DL-based tools may be affected by security vulnerabilities. Since DL models are trained on large code corpora, one may conjecture that the low-quality code they output is the result of low-quality code they have seen during training. However, there is very little empirical evidence documenting this phenomenon. Indeed, most previous work looks at the frequency with which commercial code generators recommend low-quality code without the possibility of relating this to their training set. We investigate the extent to which low-quality code instances seen during training affect the quality of the code generated at inference time. We start by fine-tuning a pre-trained DL model on a large-scale dataset representative of those usually adopted in the training of code generators. We show that 4.98% of functions in this dataset exhibit one or more quality issues related to security, maintainability, best practices, etc. We use the fine-tuned model to generate 551k Python functions, showing that 5.85% of them are affected by at least one quality issue. We then remove the low-quality functions from the training set and use the cleaned dataset to fine-tune a second model, which has been used to generate the same 551k Python functions. We show that the model trained on the cleaned dataset exhibits similar performance in terms of functional correctness as compared to the original model while, however, generating a statistically significantly lower number of low-quality functions (2.16%). Our study empirically documents the importance of high-quality training data for code generators.
Submitted 14 March, 2025;
originally announced March 2025.
-
Automating Code Review: A Systematic Literature Review
Authors:
Rosalia Tufano,
Gabriele Bavota
Abstract:
Code review consists of assessing the code written by teammates with the goal of increasing code quality. Empirical studies have documented the benefits brought by such a practice, which, however, comes at a cost in terms of developers' time. For this reason, researchers have proposed techniques and tools to automate code review tasks such as reviewer selection (i.e., identifying suitable reviewers for a given code change) or the actual review of a given change (i.e., recommending improvements to the contributor as a human reviewer would do). Given the substantial number of papers recently published on the topic, it may be challenging for researchers and practitioners to get a complete overview of the state of the art.
We present a systematic literature review (SLR) featuring 119 papers concerning the automation of code review tasks. We provide: (i) a categorization of the code review tasks automated in the literature; (ii) an overview of the under-the-hood techniques used for the automation, including the datasets used for training data-driven techniques; (iii) publicly available techniques and datasets used for their evaluation, with a description of the evaluation metrics usually adopted for each task.
The SLR concludes with a discussion of the current limitations of the state of the art, with insights for future research directions.
Submitted 12 March, 2025;
originally announced March 2025.
-
Investigating Execution-Aware Language Models for Code Optimization
Authors:
Federico Di Menna,
Luca Traini,
Gabriele Bavota,
Vittorio Cortellessa
Abstract:
Code optimization is the process of enhancing code efficiency while preserving its intended functionality. This process often requires a deep understanding of the code execution behavior at run-time to identify and address inefficiencies effectively. Recent studies have shown that language models can play a significant role in automating code optimization. However, these models may have insufficient knowledge of how code executes at run-time. To address this limitation, researchers have developed strategies that integrate code execution information into language models. These strategies have shown promise, enhancing the effectiveness of language models in various software engineering tasks. However, despite the close relationship between code execution behavior and efficiency, the specific impact of these strategies on code optimization remains largely unexplored. This study investigates how incorporating code execution information into language models affects their ability to optimize code. Specifically, we apply three different training strategies to incorporate four code execution aspects -- line executions, line coverage, branch coverage, and variable states -- into CodeT5+, a well-known language model for code. Our results indicate that execution-aware models provide limited benefits compared to the standard CodeT5+ model in optimizing code.
Submitted 11 March, 2025;
originally announced March 2025.
-
Quantizing Large Language Models for Code Generation: A Differentiated Replication
Authors:
Alessandro Giagnorio,
Antonio Mastropaolo,
Saima Afrin,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. A previous work by Wei et al. proposed to leverage quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their precision from 32-bit floating point down to 8-bit integers and showing the limited impact of this quantization on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing the compression to the extreme quantization level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without observing any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
Submitted 10 March, 2025;
originally announced March 2025.
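As a rough back-of-the-envelope illustration of why bits per parameter matter, the sketch below estimates weight-only memory for a model at different precisions. The 16-bit baseline and the helper names are assumptions made for illustration; the abstract does not state the precision of the original models, and it reports a measured average reduction of 70% at 4 bits rather than this idealized figure. Real quantized checkpoints may also store extra metadata or keep some tensors at higher precision, which this arithmetic ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def theoretical_reduction(baseline_bits: float, quantized_bits: float) -> float:
    """Idealized fraction of weight memory saved by lowering the precision."""
    return 1.0 - quantized_bits / baseline_bits


if __name__ == "__main__":
    params = 34e9        # largest model size mentioned in the abstract
    baseline_bits = 16   # assumed baseline precision, not stated in the abstract
    for bits in (16, 8, 4, 3, 2):
        size = weight_memory_gb(params, bits)
        saved = theoretical_reduction(baseline_bits, bits)
        print(f"{bits:>2} bits: {size:6.2f} GB, idealized saving {saved:.1%}")
```

Under these assumptions, the estimate only bounds what quantization could save on the weights; the figures reported in the abstract come from the authors' empirical measurements.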
-
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet
Authors:
Alessandro Giagnorio,
Alberto Martin-Lopez,
Gabriele Bavota
Abstract:
The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages (i.e., niche programming languages characterized by the scarcity of training data), the limited availability of such data hampers the models' ability to generalize effectively, resulting in poorer code generation performance as compared to high-resource languages. For this reason, there is a quest for techniques able to close this performance gap. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages, namely: (i) classic fine-tuning, which is however capped in size by the scarcity of training data; (ii) three variants of in-context learning, with prompts crafted to provide the LLM with additional information about the low-resource language (e.g., few-shot examples showcasing features of the targeted language); and (iii) a pre-training objective teaching the model how to translate between high- and low-resource languages. The context of our study consists of two low-resource languages (R and Racket) and six LLMs having different architectures and sizes. Our findings reveal that fine-tuning is usually the best choice for smaller LLMs, possibly due to the fact that even a small dataset is sufficient to train their limited number of parameters. As the size of the models increases, in-context learning becomes more and more effective, representing a safe and cheap bet (i.e., it always helps, but with different magnitudes). In contrast, very large LLMs may deteriorate in performance on low-resource languages when fine-tuning is performed, possibly due to the lack of enough data to effectively update their weights.
Submitted 31 January, 2025;
originally announced January 2025.
-
Deep Learning-based Code Completion: On the Impact on Performance of Contextual Information
Authors:
Matteo Ciniselli,
Luca Pascarella,
Gabriele Bavota
Abstract:
Code completion aims at speeding up code writing by recommending to developers the next tokens they are likely to type. Deep Learning (DL) models pushed the boundaries of code completion by redefining what these coding assistants can do: we moved from predicting a few code tokens to automatically generating entire functions. One important factor impacting the performance of DL-based code completion techniques is the context provided as input. By "context" we refer to what the model knows about the code to complete. In a simple scenario, the DL model might be fed with a partially implemented function to complete. In this case, the context is represented by the incomplete function and, based on it, the model must generate a prediction. It is however possible to expand such a context to include additional information, like the whole source code file containing the function to complete, which could be useful to boost the prediction performance. In this work, we present an empirical study investigating how the performance of a DL-based code completion technique is affected by different contexts. We experiment with 8 types of contexts and their combinations. These contexts include: (i) coding contexts, featuring information extracted from the code base in which the code completion is invoked (e.g., code components structurally related to the one to "complete"); (ii) process contexts, with information aimed at depicting the current status of the project in which a code completion task is triggered (e.g., a textual representation of open issues relevant for the code to complete); and (iii) developer contexts, capturing information about the developer invoking the code completion (e.g., the APIs frequently used). Our results show that additional contextual information can benefit the performance of DL-based code completion, with relative improvements up to +22% in terms of correct predictions.
Submitted 9 January, 2025;
originally announced January 2025.
-
On the Generalizability of Transformer Models to Code Completions of Different Lengths
Authors:
Nathan Cooper,
Rosalia Tufano,
Gabriele Bavota,
Denys Poshyvanyk
Abstract:
The programming landscape is nowadays being reshaped by the advent of Large Language Models (LLMs) able to automate tasks related to code implementation (e.g., code completion) and comprehension (e.g., code summarization). Such a paradigm shift comes with a number of implications related to how software will be written, maintained, and evolved. Also, these LLMs are extremely expensive to train, posing questions on their sustainability over time. Given their training cost, their ability to generalize, namely their ability to work on task instances different from those on which they have been trained, is an aspect worth investigating. Previous work already showed that transformer models can successfully support code completion in a cross-project setting. However, it is unclear whether LLMs are able to generalize to inputs having lengths not seen during training. For example, it is known that training a model on short instances substantially reduces the training cost. However, the extent to which such a model would provide good performance on sequences having lengths not seen during training is not known. Many recent works in Natural Language Processing (NLP) tackled this problem in the context of decoder-only LLMs (e.g., xPOS and ALiBi). To assess whether these solutions extend to the encoder-decoder LLMs usually adopted in code-related tasks, we present a large empirical study evaluating this generalization property of these and other encoding schemes proposed in the literature, namely Sinusoidal, xPOS, ALiBi, and T5. We found that none of these solutions successfully generalizes to unseen lengths and that the only safe solution is to ensure that the training set is representative of all lengths likely to be encountered at inference time.
Submitted 9 January, 2025;
originally announced January 2025.
-
Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?
Authors:
Rosalia Tufano,
Alberto Martin-Lopez,
Ahmad Tayeb,
Ozren Dabić,
Sonia Haiduc,
Gabriele Bavota
Abstract:
Several techniques have been proposed to automate code review. Early support consisted of recommending the most suitable reviewer for a given change or prioritizing the review tasks. With the advent of deep learning in software engineering, the level of automation has been pushed to new heights, with approaches able to provide feedback on source code in natural language as a human reviewer would do. Also, recent work documented open source projects adopting Large Language Models (LLMs) as co-reviewers. Although the research in this field is very active, little is known about the actual impact of including automatically generated code reviews in the code review process. While there are many aspects worth investigating, in this work we focus on three of them: (i) review quality, i.e., the reviewer's ability to identify issues in the code; (ii) review cost, i.e., the time spent reviewing the code; and (iii) reviewer's confidence, i.e., how confident the reviewer is about the provided feedback. We ran a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review. During the experiment we monitored the reviewers' activities, for over 50 hours of recorded code reviews. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior: reviewers tend to focus on the code locations indicated by the LLM rather than searching for additional issues in other parts of the code. The reviewers who started from an automated review identified a higher number of low-severity issues while, however, not identifying more high-severity issues as compared to a completely manual process. Finally, the automated support did not result in saved time and did not increase the reviewers' confidence.
Submitted 29 November, 2024; v1 submitted 18 November, 2024;
originally announced November 2024.
-
SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing
Authors:
Ozren Dabić,
Rosalia Tufano,
Gabriele Bavota
Abstract:
Large-scale code datasets have acquired an increasingly central role in software engineering (SE) research. This is the result of (i) the success of the mining software repositories (MSR) community, which pushed the standards of empirical studies in SE; and (ii) the recent advent of deep learning (DL) in software engineering, with models trained and tested on large source code datasets. While there exist some ready-to-use datasets in the literature, researchers often need to build and pre-process their own dataset to meet specific requirements of the study/technique they are working on. This implies a substantial cost in terms of time and computational resources. In this work we present the SEART Data Hub, a web application that makes it easy to build and pre-process large-scale datasets featuring code mined from public GitHub repositories. Through a simple web interface, researchers can specify a set of mining criteria (e.g., only collect code from repositories having more than 100 contributors and more than 1,000 commits) as well as specific pre-processing steps they want to perform (e.g., remove duplicates, test code, instances with syntax errors). After submitting the request, the user will receive an email with a download link for the required dataset within a few hours. A video showcasing the SEART Data Hub is available at https://youtu.be/lCgQaA7CYWA.
Submitted 27 September, 2024;
originally announced September 2024.
-
A Taxonomy of Self-Admitted Technical Debt in Deep Learning Systems
Authors:
Federica Pepe,
Fiorella Zampetti,
Antonio Mastropaolo,
Gabriele Bavota,
Massimiliano Di Penta
Abstract:
The development of Machine Learning (ML)- and, more recently, of Deep Learning (DL)-intensive systems requires suitable choices, e.g., in terms of technology, algorithms, and hyper-parameters. Such choices depend on developers' experience, as well as on proper experimentation. Due to limited time availability, developers may adopt suboptimal, sometimes temporary choices, leading to a technical debt (TD) specifically related to the ML code. This paper empirically analyzes the presence of Self-Admitted Technical Debt (SATD) in DL systems. After selecting 100 open-source Python projects using popular DL frameworks, we identified SATD from their source comments and created a stratified sample of 443 SATD instances to analyze manually. We derived a taxonomy of DL-specific SATD through open coding, featuring seven categories and 41 leaves. The identified SATD categories pertain to different aspects of DL models, some of which are technological (e.g., due to hardware or libraries) and some related to suboptimal choices in the DL process, model usage, or configuration. Our findings indicate that DL-specific SATD differs from DL bugs found in previous studies, as it typically pertains to suboptimal solutions rather than functional (e.g., blocking) problems. Last but not least, we found that state-of-the-art static analysis tools do not help developers avoid such problems, and therefore, specific support is needed to cope with DL-specific SATD.
Submitted 18 September, 2024;
originally announced September 2024.
-
How the Training Procedure Impacts the Performance of Deep Learning-based Vulnerability Patching
Authors:
Antonio Mastropaolo,
Vittoria Nardone,
Gabriele Bavota,
Massimiliano Di Penta
Abstract:
Generative deep learning (DL) models have been successfully adopted for vulnerability patching. However, such models require the availability of a large dataset of patches to learn from. To overcome this issue, researchers have proposed to start from models pre-trained with general knowledge, either on the programming language or on similar tasks such as bug fixing. Despite the efforts in the area of automated vulnerability patching, there is a lack of systematic studies on how these different training procedures impact the performance of DL models for such a task. This paper provides a manifold contribution to bridge this gap, by (i) comparing existing solutions of self-supervised and supervised pre-training for vulnerability patching; and (ii) for the first time, experimenting with different kinds of prompt-tuning for this task. The study required training and testing 23 DL models. We found that a supervised pre-training focused on bug-fixing, while expensive in terms of data collection, substantially improves DL-based vulnerability patching. When applying prompt-tuning on top of this supervised pre-trained model, there is no significant gain in performance. Instead, prompt-tuning is an effective and cheap solution to substantially boost the performance of self-supervised pre-trained models, i.e., those not relying on the bug-fixing pre-training.
Submitted 27 April, 2024;
originally announced April 2024.
-
On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions
Authors:
Matteo Ciniselli,
Alberto Martin-Lopez,
Gabriele Bavota
Abstract:
Code completion is a key feature of Integrated Development Environments (IDEs), aimed at predicting the next tokens a developer is likely to write, helping them write code faster and with less effort. Modern code completion approaches are often powered by deep learning (DL) models. However, the swift evolution of programming languages poses a critical challenge to the performance of DL-based code completion models: Can these models generalize across different language versions? This paper delves into such a question. In particular, we assess the capabilities of a state-of-the-art model, CodeT5, to generalize across nine different Java versions, ranging from Java 2 to Java 17, while being exclusively trained on Java 8 code. Our evaluation spans three completion scenarios, namely, predicting tokens, constructs (e.g., the condition of an if statement) and entire code blocks. The results of our study reveal a noticeable disparity among language versions, with the worst performance being obtained in Java 2 and 17 - the most far apart versions compared to Java 8. We investigate possible causes for the performance degradation and show that the adoption of a limited version-specific fine-tuning can partially alleviate the problem. Our work raises awareness on the importance of continuous model refinement, and it can inform the design of alternatives to make code completion models more robust to language evolution.
Submitted 22 March, 2024;
originally announced March 2024.
-
Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study
Authors:
Rosalia Tufano,
Antonio Mastropaolo,
Federica Pepe,
Ozren Dabić,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Large Language Models (LLMs) have gained significant attention in the software engineering community. Nowadays developers have the possibility to exploit these models through industrial-grade tools providing a handy interface toward LLMs, such as OpenAI's ChatGPT. While the potential of LLMs in assisting developers across several tasks has been documented in the literature, there is a lack of empirical evidence mapping the actual usage of LLMs in software projects. In this work, we aim at filling such a gap. First, we mine 1,501 commits, pull requests (PRs), and issues from open-source projects by matching regular expressions likely to indicate the usage of ChatGPT to accomplish the task. Then, we manually analyze these instances, discarding false positives (i.e., instances in which ChatGPT was mentioned but not actually used) and categorizing the task automated in the 467 true positive instances (165 commits, 159 PRs, 143 issues). This resulted in a taxonomy of 45 tasks which developers automate via ChatGPT. The taxonomy, accompanied with representative examples, provides (i) developers with valuable insights on how to exploit LLMs in their workflow and (ii) researchers with a clear overview of tasks that, according to developers, could benefit from automated solutions.
Submitted 26 February, 2024;
originally announced February 2024.
-
Towards Summarizing Code Snippets Using Pre-Trained Transformers
Authors:
Antonio Mastropaolo,
Matteo Ciniselli,
Luca Pascarella,
Rosalia Tufano,
Emad Aghajani,
Gabriele Bavota
Abstract:
When comprehending code, a helping hand may come from the natural language comments documenting it that, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: For example, in the case of Java, it is easy to create datasets composed of pairs <Method, Javadoc> that can be fed to DL models to teach them how to summarize a method. Such a comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used such a dataset to train a multi-task DL model, taking as input a comment and being able to (i) classify whether it represents a "code summary" or not and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we run this model on 10k projects, identifying and linking code summaries to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets that have then been used to train a new DL model able to document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach.
Submitted 1 February, 2024;
originally announced February 2024.
-
Code Review Automation: Strengths and Weaknesses of the State of the Art
Authors:
Rosalia Tufano,
Ozren Dabić,
Antonio Mastropaolo,
Matteo Ciniselli,
Gabriele Bavota
Abstract:
The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering pushed the automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would do or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions on the techniques' capabilities. For example, knowing that an approach is able to correctly address a reviewer's comment in 10% of cases is of little value without knowing what was asked by the reviewer: What if in all successful cases the code change required to address the comment was just the removal of an empty line? In this paper we aim at characterizing the cases in which three code review automation techniques tend to succeed or fail in the two above-described tasks. The study has a strong qualitative focus, with ~105 man-hours of manual inspection invested in manually analyzing correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis are two taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or to fail, pointing to areas for future work. A result of our manual analysis was also the identification of several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of researching in techniques specialized for code review automation by comparing their performance with ChatGPT, a general purpose large language model, finding that ChatGPT struggles in commenting code as a human reviewer would do.
Submitted 10 January, 2024;
originally announced January 2024.
-
Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
Authors:
Antonio Mastropaolo,
Matteo Ciniselli,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. However, in most cases, researchers rely on automatic evaluation metrics such as BLEU, ROUGE, and METEOR. These metrics are all based on the same assumption: The higher the textual similarity between the generated summary and a reference summary written by developers, the higher its quality. However, there are two reasons for which this assumption falls short: (i) reference summaries, e.g., code comments collected by mining software repositories, may be of low quality or even outdated; (ii) generated summaries, while using a different wording than a reference one, could be semantically equivalent to it, thus still being suitable to document the code snippet. In this paper, we perform a thorough empirical investigation on the complementarity of different types of metrics in capturing the quality of a generated summary. Also, we propose to address the limitations of existing metrics by considering a new dimension, capturing the extent to which the generated summary aligns with the semantics of the documented code snippet, independently from the reference summary. To this end, we present a new metric based on contrastive learning to capture said aspect. We empirically show that the inclusion of this novel dimension enables a more effective representation of developers' evaluations regarding the quality of automatically generated summaries.
Submitted 24 December, 2023;
originally announced December 2023.
-
Log Statements Generation via Deep Learning: Widening the Support Provided to Developers
Authors:
Antonio Mastropaolo,
Valentina Ferrari,
Luca Pascarella,
Gabriele Bavota
Abstract:
Logging assists in monitoring events that transpire during the execution of software. Previous research has highlighted the challenges confronted by developers when it comes to logging, including dilemmas such as where to log, what data to record, and which log level to employ (e.g., info, fatal). In this context, we introduced LANCE, an approach rooted in deep learning (DL) that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. Nevertheless, LANCE grapples with two primary constraints: (i) it presumes that a method necessitates the inclusion of logging statements and; (ii) it allows the injection of only a single (new) log statement, even in situations where the injection of multiple log statements might be essential. To address these limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).
Submitted 8 November, 2023;
originally announced November 2023.
-
Toward Automatically Completing GitHub Workflows
Authors:
Antonio Mastropaolo,
Fiorella Zampetti,
Gabriele Bavota,
Massimiliano Di Penta
Abstract:
Continuous integration and delivery (CI/CD) are nowadays at the core of software development. Their benefits come at the cost of setting up and maintaining the CI/CD pipeline, which requires knowledge and skills often orthogonal to those entailed in other software-related tasks. While several recommender systems have been proposed to support developers across a variety of tasks, little automated support is available when it comes to setting up and maintaining CI/CD pipelines. We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing a specific type of CI/CD pipelines, namely GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model's confidence is a reliable proxy for the recommendations' correctness likelihood.
Submitted 6 September, 2023; v1 submitted 31 August, 2023;
originally announced August 2023.
-
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
Authors:
Antonio Mastropaolo,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Upon evolving their software, organizations and individual developers have to spend a substantial effort to pay back technical debt, i.e., the fact that software is released in a shape not as good as it should be, e.g., in terms of functionality, reliability, or maintainability. This paper empirically investigates the extent to which technical debt can be automatically paid back by neural-based generative models, and in particular models exploiting different strategies for pre-training and fine-tuning. We start by extracting a dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595 open-source projects. SATD refers to technical debt instances documented (e.g., via code comments) by developers. We use this dataset to experiment with seven different generative deep learning (DL) model configurations. Specifically, we compare transformers pre-trained and fine-tuned with different combinations of training objectives, including the fixing of generic code changes, SATD removals, and SATD-comment prompt tuning. Also, we investigate the applicability in this context of a recently-available Large Language Model (LLM)-based chat bot. Results of our study indicate that the automated repayment of SATD is a challenging task, with the best model we experimented with able to automatically fix ~2% to 8% of test instances, depending on the number of attempts it is allowed to make. Given the limited size of the fine-tuning dataset (~5k instances), the model's pre-training plays a fundamental role in boosting performance. Also, the ability to remove SATD steadily drops if the comment documenting the SATD is not provided as input to the model. Finally, we found general-purpose LLMs to not be a competitive approach for addressing SATD.
Submitted 17 August, 2023;
originally announced August 2023.
-
Using Gameplay Videos for Detecting Issues in Video Games
Authors:
Emanuela Guglielmi,
Simone Scalabrino,
Gabriele Bavota,
Rocco Oliveto
Abstract:
Context. The game industry has grown steadily in recent years. Every day, millions of people play video games, not only as a hobby, but also for professional competitions (e.g., e-sports or speed-running) or for making business by entertaining others (e.g., streamers). The latter daily produce a large amount of gameplay videos in which they also comment live on what they experience. But no software and, thus, no video game is perfect: Streamers may encounter several problems (such as bugs, glitches, or performance issues) while they play. Also, it is unlikely that they explicitly report such issues to developers. The identified problems may negatively impact the user's gaming experience and, in turn, can harm the reputation of the game and of the producer. Objective. In this paper, we propose and empirically evaluate GELID, an approach for automatically extracting relevant information from gameplay videos by (i) identifying video segments in which streamers experienced anomalies; (ii) categorizing them based on their type (e.g., logic or presentation); and clustering them based on (iii) the context in which they appear (e.g., level or game area) and (iv) the specific issue type (e.g., game crashes). Method. We manually defined a training set for step 2 of GELID (categorization) and a test set for validating in isolation the four components of GELID. In total, we manually segmented, labeled, and clustered 170 videos related to 3 video games, defining a dataset containing 604 segments. Results. While in steps 1 (segmentation) and 4 (specific issue clustering) GELID achieves satisfactory results, it shows limitations on step 3 (game context clustering) and, above all, step 2 (categorization).
Submitted 27 July, 2023;
originally announced July 2023.
-
Automatically Generating Dockerfiles via Deep Learning: Challenges and Promises
Authors:
Giovanni Rosa,
Antonio Mastropaolo,
Simone Scalabrino,
Gabriele Bavota,
Rocco Oliveto
Abstract:
Containerization allows developers to define the execution environment in which their software needs to be installed. Docker is the leading platform in this field, and developers that use it are required to write a Dockerfile for their software. Writing Dockerfiles is far from trivial, especially when the system has unusual requirements for its execution environment. Although several tools exist to support developers in writing Dockerfiles, none of them is able to generate entire Dockerfiles from scratch given a high-level specification of the requirements of the execution environment. In this paper, we present a study in which we aim at understanding to what extent Deep Learning (DL), which has been proven successful for other coding tasks, can be used for this specific coding task. We preliminarily defined a structured natural language specification for Dockerfile requirements and a methodology that we use to automatically infer the requirements from the largest dataset of Dockerfiles currently available. We used the obtained dataset, with 670,982 instances, to train and test a Text-to-Text Transfer Transformer (T5) model, following the current state-of-the-art procedure for coding tasks, to automatically generate Dockerfiles from the structured specifications. The results of our evaluation show that T5 performs similarly to the more trivial IR-based baselines we considered. We also report the open challenges associated with the application of deep learning in the context of Dockerfile generation.
Submitted 28 March, 2023;
originally announced March 2023.
-
Source Code Recommender Systems: The Practitioners' Perspective
Authors:
Matteo Ciniselli,
Luca Pascarella,
Emad Aghajani,
Simone Scalabrino,
Rocco Oliveto,
Gabriele Bavota
Abstract:
The automatic generation of source code is one of the long-lasting dreams in software engineering research. Several techniques have been proposed to speed up the writing of new code. For example, code completion techniques can recommend to developers the next few tokens they are likely to type, while retrieval-based approaches can suggest code snippets relevant for the task at hand. Also, deep learning has been used to automatically generate code statements starting from a natural language description. While research in this field is very active, there is no study investigating what the users of code recommender systems (i.e., software practitioners) actually need from these tools. We present a study involving 80 software developers to investigate the characteristics of code recommender systems they consider important. The output of our study is a taxonomy of 70 "requirements" that should be considered when designing code recommender systems. For example, developers would like the recommended code to use the same coding style of the code under development. Also, code recommenders being "aware" of the developers' knowledge (e.g., what are the framework/libraries they already used in the past) and able to customize the recommendations based on this knowledge would be appreciated by practitioners. The taxonomy output of our study points to a wide set of future research directions for code recommenders.
Submitted 8 February, 2023;
originally announced February 2023.
-
Automating Code-Related Tasks Through Transformers: The Impact of Pre-training
Authors:
Rosalia Tufano,
Luca Pascarella,
Gabriele Bavota
Abstract:
Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned to support the specific downstream task of interest (e.g., code summarization). While there is evidence suggesting the boost in performance provided by pre-training, little is known about the impact of the specific pre-training objective(s) used. Indeed, MLM is just one of the possible pre-training objectives and recent work from the natural language processing field suggests that pre-training objectives tailored for the specific downstream task of interest may substantially boost the model's performance. In this study, we focus on the impact of pre-training objectives on the performance of transformers when automating code-related tasks. We start with a systematic literature review aimed at identifying the pre-training objectives used in SE. Then, we pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in SE; and (ii) pre-training objectives tailored to specific code-related tasks subject of our experimentation, namely bug-fixing, code summarization, and code completion. We also compare the pre-trained models with non pre-trained ones. Our results show that: (i) pre-training helps in boosting performance only if the amount of fine-tuning data available is small; (ii) the MLM objective is usually sufficient to maximize the prediction performance of the model, even when comparing it with pre-training objectives specialized for the downstream task at hand.
Submitted 8 February, 2023;
originally announced February 2023.
-
On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot
Authors:
Antonio Mastropaolo,
Luca Pascarella,
Emanuela Guglielmi,
Matteo Ciniselli,
Simone Scalabrino,
Rocco Oliveto,
Gabriele Bavota
Abstract:
Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While the usefulness of Copilot is evident, it is still unclear to what extent it is robust. Specifically, we do not know the extent to which semantic-preserving changes in the natural language description provided to the model have an effect on the generated code function. In this paper we present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function. A negative answer would pose questions on the robustness of deep learning (DL)-based code generators since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Then, we generated different semantically equivalent descriptions for each method both manually and automatically, and we analyzed the extent to which predictions generated by Copilot changed. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code (~28%).
Submitted 1 February, 2023;
originally announced February 2023.
-
Automated Variable Renaming: Are We There Yet?
Authors:
Antonio Mastropaolo,
Emad Aghajani,
Luca Pascarella,
Gabriele Bavota
Abstract:
Identifiers, such as method and variable names, form a large portion of source code. Therefore, low-quality identifiers can substantially hinder code comprehension. To support developers in using meaningful identifiers, several (semi-)automatic techniques have been proposed, mostly being data-driven (e.g. statistical language models, deep learning models) or relying on static code analysis. Still, limited empirical investigations have been performed on the effectiveness of such techniques for recommending developers with meaningful identifiers, possibly resulting in rename refactoring operations. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two DL-based models. The three approaches have been trained and tested on three datasets we built with the goal of evaluating their ability to recommend meaningful variable identifiers. Our quantitative and qualitative analyses show the potential of such techniques that, under specific conditions, can provide valuable recommendations and are ready to be integrated in rename refactoring tools. Nonetheless, our results also highlight limitations of the experimented approaches that call for further research in this field.
Submitted 12 December, 2022;
originally announced December 2022.
-
Don't Reinvent the Wheel: Towards Automatic Replacement of Custom Implementations with APIs
Authors:
Rosalia Tufano,
Emad Aghajani,
Gabriele Bavota
Abstract:
Reusing code is a common practice in software development: It helps developers speed up the implementation task while also reducing the chances of introducing bugs, given the assumption that the reused code has been tested, possibly in production. Despite these benefits, opportunities for reuse are not always in plain sight and, thus, developers may miss them. We present our preliminary steps in building RETIWA, a recommender able to automatically identify custom implementations in a given project that are good candidates to be replaced by open source APIs. RETIWA relies on a "knowledge base" consisting of real examples of custom implementation-to-API replacements. In this work, we present the mining strategy we tailored to automatically and reliably extract replacements of custom implementations with APIs from open source projects. This is the first step towards building the envisioned recommender.
Submitted 16 August, 2022;
originally announced August 2022.
-
Detecting Connectivity Issues in Android Apps
Authors:
Alejandro Mazuera-Rozo,
Camilo Escobar-Velásquez,
Juan Espitia-Acero,
Mario Linares-Vásquez,
Gabriele Bavota
Abstract:
Android is the most popular mobile operating system in the world, running on more than 70% of mobile devices. This implies a gigantic and very competitive market for Android apps. Being successful in such a market is far from trivial and requires, besides tackling a problem or need felt by a vast audience, the development of high-quality apps. As recently shown in the literature, connectivity issues (e.g., mishandling of zero/unreliable Internet connection) can result in bugs and/or crashes, negatively affecting the app's user experience. While these issues have been studied in the literature, there are no techniques able to automatically detect and report them to developers. We present CONAN, a tool able to statically detect 16 types of connectivity issues affecting Android apps. We assessed the ability of CONAN to precisely identify these issues in a set of 44 open source apps, observing an average precision of 80%. Then, we studied the relevance of these issues for developers by (i) conducting interviews with six practitioners working with commercial Android apps, and (ii) submitting 84 issue reports for 27 open source apps. Our results show that several of the identified connectivity issues are considered relevant by practitioners in specific contexts, in which connectivity is considered a first-class feature.
Submitted 17 June, 2022;
originally announced June 2022.
-
Using Transfer Learning for Code-Related Tasks
Authors:
Antonio Mastropaolo,
Nathan Cooper,
David Nader Palacio,
Simone Scalabrino,
Denys Poshyvanyk,
Rocco Oliveto,
Gabriele Bavota
Abstract:
Deep learning (DL) techniques have been used to support several code-related tasks such as code summarization and bug-fixing. In particular, pre-trained transformer models are on the rise, also thanks to the excellent results they achieved in Natural Language Processing (NLP) tasks. The basic idea behind these models is to first pre-train them on a generic dataset using a self-supervised task (e.g., filling masked words in sentences). Then, these models are fine-tuned to support specific tasks of interest (e.g., language translation). A single model can be fine-tuned to support multiple tasks, possibly exploiting the benefits of transfer learning. This means that knowledge acquired to solve a specific task (e.g., language translation) can be useful to boost performance on another task (e.g., sentiment classification). While the benefits of transfer learning have been widely studied in NLP, limited empirical evidence is available when it comes to code-related tasks. In this paper, we assess the performance of the Text-To-Text Transfer Transformer (T5) model in supporting four different code-related tasks: (i) automatic bug-fixing, (ii) injection of code mutants, (iii) generation of assert statements, and (iv) code summarization. We pay particular attention to studying the role played by pre-training and multi-task fine-tuning on the model's performance. We show that (i) the T5 can achieve better performance as compared to state-of-the-art baselines; and (ii) while pre-training helps the model, not all tasks benefit from a multi-task fine-tuning.
Submitted 17 June, 2022;
originally announced June 2022.
-
AI-driven Development Is Here: Should You Worry?
Authors:
Neil Ernst,
Gabriele Bavota
Abstract:
AI-Driven Development Environments (AIDEs) integrate the power of modern AI into IDEs like Visual Studio Code and JetBrains IntelliJ. By leveraging massive language models and the plethora of openly available source code, AIDEs promise to automate many of the obvious, routine tasks in programming. At the same time, AIDEs come with new challenges to think about, such as bias, legal compliance, security vulnerabilities, and their impact on learning programming.
Submitted 15 April, 2022;
originally announced April 2022.
-
To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?
Authors:
Matteo Ciniselli,
Luca Pascarella,
Gabriele Bavota
Abstract:
Deep Learning (DL) models have been widely used to support code completion. These models, once properly trained, can take as input an incomplete code component (e.g., an incomplete function) and predict the missing tokens to finalize it. GitHub Copilot is an example of code recommender built by training a DL model on millions of open source repositories: The source code of these repositories acts as training data, allowing the model to learn "how to program". The usage of such code is usually regulated by Free and Open Source Software (FOSS) licenses, which establish under which conditions the licensed code can be redistributed or modified. As of today, it is unclear whether the code generated by DL models trained on open source code should be considered as "new" or as "derivative" work, with possible implications on license infringements. In this work, we run a large-scale study investigating the extent to which DL models tend to clone code from their training set when recommending code completions. Such an exploratory study can help in assessing the magnitude of the potential licensing issues mentioned before: If these models tend to generate new code that is unseen in the training set, then licensing issues are unlikely to occur. Otherwise, a revision of these licenses is urgently needed to regulate how the code generated by these models should be treated when used, for example, in a commercial setting. Highlights from our results show that ~10% to ~0.1% of the predictions generated by a state-of-the-art DL-based code completion tool are Type-1 clones of instances in the training set, depending on the size of the predicted code. Long predictions are unlikely to be cloned.
Submitted 14 April, 2022;
originally announced April 2022.
-
Towards Using Gameplay Videos for Detecting Issues in Video Games
Authors:
Emanuela Guglielmi,
Simone Scalabrino,
Gabriele Bavota,
Rocco Oliveto
Abstract:
Context. The game industry has been growing steadily in recent years. Every day, millions of people play video games, not only as a hobby, but also for professional competitions (e.g., e-sports or speed-running) or for making business by entertaining others (e.g., streamers). The latter daily produce a large amount of gameplay videos in which they also comment live on what they experience. Since no software and, thus, no video game is perfect, streamers may encounter several problems (such as bugs, glitches, or performance issues). However, it is unlikely that they explicitly report such issues to developers. The identified problems may negatively impact the user's gaming experience and, in turn, can harm the reputation of the game and of the producer. Objective. We aim at proposing and empirically evaluating GELID, an approach for automatically extracting relevant information from gameplay videos by (i) identifying video segments in which streamers experienced anomalies; (ii) categorizing them based on their type and the context in which they appear (e.g., bugs or glitches appearing in a specific level or scene of the game); and (iii) clustering segments that regard the same specific issue. Method. We will build on top of existing approaches able to identify videos that are relevant for a specific video game. These represent the input of GELID, which processes them to achieve the defined objectives. We will experiment with GELID on several gameplay videos to understand the extent to which each of its steps is effective.
Submitted 8 April, 2022;
originally announced April 2022.
-
Taxonomy of Security Weaknesses in Java and Kotlin Android Apps
Authors:
Alejandro Mazuera-Rozo,
Camilo Escobar-Velásquez,
Juan Espitia-Acero,
David Vega-Guzmán,
Catia Trubiani,
Mario Linares-Vásquez,
Gabriele Bavota
Abstract:
Android is nowadays the most popular operating system in the world, not only in the realm of mobile devices, but also when considering desktop and laptop computers. Such popularity makes it an attractive target for security attacks, also due to the sensitive information often manipulated by mobile apps. The latter are going through a transition in which the Android ecosystem is moving from the usage of Java as the official language for developing apps, to the adoption of Kotlin as the first choice supported by Google. While previous studies have partially investigated security weaknesses affecting Java Android apps, there is no comprehensive empirical investigation studying software security weaknesses affecting Android apps considering (and comparing) the two main languages used for their development, namely Java and Kotlin. We present an empirical study in which we: (i) manually analyze 681 commits including security weaknesses fixed by developers in Java and Kotlin apps, with the goal of defining a taxonomy highlighting the types of software security weaknesses affecting Java and Kotlin Android apps; (ii) survey 43 Android developers to validate and complement our taxonomy. Based on our findings, we propose a list of future actions that could be performed by researchers and practitioners to improve the security of Android apps.
Submitted 27 January, 2022;
originally announced January 2022.
-
Using Reinforcement Learning for Load Testing of Video Games
Authors:
Rosalia Tufano,
Simone Scalabrino,
Luca Pascarella,
Emad Aghajani,
Rocco Oliveto,
Gabriele Bavota
Abstract:
Unlike most types of software systems, testing video games has largely remained a manual activity performed by human testers. This is mostly due to the continuous and intelligent user interaction video games require. Recently, reinforcement learning (RL) has been exploited to partially automate functional testing. RL enables training smart agents that can even achieve super-human performance in playing games, thus being suitable to explore them looking for bugs. We investigate the possibility of using RL for load testing video games. Indeed, the goal of game testing is not only to identify functional bugs, but also to examine the game's performance, such as its ability to avoid lags and keep a minimum number of frames per second (FPS) when highly demanding 3D scenes are shown on screen. We define a methodology employing RL to train an agent able to play the game as a human while also trying to identify areas of the game resulting in a drop of FPS. We demonstrate the feasibility of our approach on three games. Two of them are used as proof-of-concept, by injecting artificial performance bugs. The third one is an open-source 3D game that we load test using the trained agent, showing its potential to identify areas of the game resulting in lower FPS.
Submitted 18 January, 2022;
originally announced January 2022.
-
Using Pre-Trained Models to Boost Code Review Automation
Authors:
Rosalia Tufano,
Simone Masiero,
Antonio Mastropaolo,
Luca Pascarella,
Denys Poshyvanyk,
Gabriele Bavota
Abstract:
Code review is a practice widely adopted in open source and industrial projects. Given the non-negligible cost of such a process, researchers started investigating the possibility of automating specific code review tasks. We recently proposed Deep Learning (DL) models targeting the automation of two tasks: the first model takes as input a code submitted for review and implements in it changes likely to be recommended by a reviewer; the second takes as input the submitted code and a reviewer comment posted in natural language and automatically implements the change required by the reviewer. While the preliminary results we achieved are encouraging, both models had been tested in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices we made when designing both the technique and the experiments. In this paper, we build on top of that work by demonstrating that a pre-trained Text-To-Text Transfer Transformer (T5) model can outperform previous DL models for automating code review tasks. Also, we conducted our experiments on a larger and more realistic (and challenging) dataset of code review activities.
Submitted 18 January, 2022;
originally announced January 2022.
-
Using Deep Learning to Generate Complete Log Statements
Authors:
Antonio Mastropaolo,
Luca Pascarella,
Gabriele Bavota
Abstract:
Logging is a practice widely adopted in several phases of the software lifecycle. For example, during software development log statements allow engineers to verify and debug the system by exposing fine-grained information of the running software. While the benefits of logging are undisputed, making proper decisions about where to inject log statements, what information to log, and at which log level (e.g., error, warning) is crucial for the logging effectiveness. In this paper, we present LANCE (Log stAtemeNt reCommEnder), the first approach supporting developers in all these decisions. LANCE features a Text-To-Text-Transfer-Transformer (T5) model that has been trained on 6,894,456 Java methods. LANCE takes as input a Java method and injects in it a full log statement, including a human-comprehensible logging message and properly choosing the needed log level and the statement location. Our results show that LANCE is able to (i) properly identify the location in the code where to inject the statement in 65.9% of Java methods requiring it; (ii) select the proper log level in 66.2% of cases; and (iii) generate a completely correct log statement including a meaningful logging message in 15.2% of cases.
Submitted 14 January, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Studying Eventual Connectivity Issues in Android Apps
Authors:
Camilo Escobar-Velásquez,
Alejandro Mazuera-Rozo,
Claudia Bedoya,
Michael Osorio-Riaño,
Mario Linares-Vásquez,
Gabriele Bavota
Abstract:
Mobile apps have become indispensable for daily life, not only for individuals but also for companies/organizations that offer their services digitally. Owing to the mobility of devices, there are no limitations regarding the locations or conditions in which apps are being used. For example, apps can be used where no internet connection is available. Therefore, offline-first is a highly desired quality of mobile apps. Accordingly, inappropriate handling of connectivity issues and incorrect implementation of good practices lead to bugs and crashes that reduce the confidence of users in the apps' quality. In this paper, we present the first study on Eventual Connectivity (ECn) issues exhibited by Android apps, by manually inspecting 971 scenarios related to 50 open-source apps. We found 304 instances of ECn issues (6 issues per app, on average) that we organized in a taxonomy of 10 categories. We found that the majority of ECn issues are related to the use of messages not providing correct information to the user about the connectivity status and to the improper use of external libraries/apps to which the check of the connectivity status is delegated. Based on our findings, we distill a list of lessons learned for both practitioners and researchers, indicating directions for future work.
Submitted 17 October, 2021;
originally announced October 2021.
-
An Empirical Study on the Usage of Transformer Models for Code Completion
Authors:
Matteo Ciniselli,
Nathan Cooper,
Luca Pascarella,
Antonio Mastropaolo,
Emad Aghajani,
Denys Poshyvanyk,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Code completion aims at speeding up code writing by predicting the next code token(s) the developer is likely to write. Works in this field focused on improving the accuracy of the generated predictions, with substantial leaps forward made possible by deep learning (DL) models. However, code completion techniques are mostly evaluated in the scenario of predicting the next token to type, with few exceptions pushing the boundaries to the prediction of an entire code statement. Thus, little is known about the performance of state-of-the-art code completion approaches in more challenging scenarios in which, for example, an entire code block must be generated. We present a large-scale study exploring the capabilities of state-of-the-art Transformer-based models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks (e.g., the iterated block of a for loop). We experimented with several variants of two recently proposed Transformer-based models, namely RoBERTa and the Text-To-Text Transfer Transformer (T5), for the task of code completion. The achieved results show that Transformer-based models, and in particular the T5, represent a viable solution for code completion, with perfect predictions ranging from ~29%, obtained when asking the model to guess entire blocks, up to ~69%, reached in the simpler scenario of few tokens masked from the same code statement.
Submitted 18 November, 2021; v1 submitted 3 August, 2021;
originally announced August 2021.
-
An Empirical Study on Code Comment Completion
Authors:
Antonio Mastropaolo,
Emad Aghajani,
Luca Pascarella,
Gabriele Bavota
Abstract:
Code comments play a prominent role in program comprehension activities. However, source code is not always documented, and code and comments do not always co-evolve. To deal with these issues, researchers have proposed techniques to automatically generate comments documenting a given code at hand. The most recent works in the area applied deep learning (DL) techniques to support such a task. Despite the achieved advances, the empirical evaluations of these approaches show that they are still far from a performance level that would make them valuable for developers. We tackle a simpler and related problem: Code comment completion. Instead of generating a comment for a given code from scratch, we investigate the extent to which state-of-the-art techniques can help developers in writing comments faster. We present a large-scale study in which we empirically assess how a simple n-gram model and the recently proposed Text-To-Text Transfer Transformer (T5) architecture can perform in autocompleting a code comment the developer is typing. The achieved results show the superiority of the T5 model, despite the n-gram model being a competitive solution.
Submitted 22 July, 2021;
originally announced July 2021.
-
Shallow or Deep? An Empirical Study on Detecting Vulnerabilities using Deep Learning
Authors:
Alejandro Mazuera-Rozo,
Anamaria Mojica-Hanke,
Mario Linares-Vásquez,
Gabriele Bavota
Abstract:
Deep learning (DL) techniques are on the rise in the software engineering research community. More and more approaches have been developed on top of DL models, also due to the unprecedented amount of software-related data that can be used to train these models. One of the recent applications of DL in the software engineering domain concerns the automatic detection of software vulnerabilities. While several DL models have been developed to approach this problem, there is still limited empirical evidence concerning their actual effectiveness especially when compared with shallow machine learning techniques. In this paper, we partially fill this gap by presenting a large-scale empirical study using three vulnerability datasets and five different source code representations (i.e., the format in which the code is provided to the classifiers to assess whether it is vulnerable or not) to compare the effectiveness of two widely used DL-based models and of one shallow machine learning model in (i) classifying code functions as vulnerable or non-vulnerable (i.e., binary classification), and (ii) classifying code functions based on the specific type of vulnerability they contain (or "clean", if no vulnerability is there). As a baseline we include in our study the AutoML utility provided by the Google Cloud Platform. Our results show that the experimented models are still far from ensuring reliable vulnerability detection, and that a shallow learning classifier represents a competitive baseline for the newest DL-based models.
Submitted 22 March, 2021;
originally announced March 2021.
-
An Empirical Study on the Usage of BERT Models for Code Completion
Authors:
Matteo Ciniselli,
Nathan Cooper,
Luca Pascarella,
Denys Poshyvanyk,
Massimiliano Di Penta,
Gabriele Bavota
Abstract:
Code completion is one of the main features of modern Integrated Development Environments (IDEs). Its objective is to speed up code writing by predicting the next code token(s) the developer is likely to write. Research in this area has substantially bolstered the predictive performance of these techniques. However, the support to developers is still limited to the prediction of the next few tokens to type. In this work, we take a step further in this direction by presenting a large-scale empirical study aimed at exploring the capabilities of state-of-the-art deep learning (DL) models in supporting code completion at different granularity levels, including single tokens, one or multiple entire statements, up to entire code blocks (e.g., the iterated block of a for loop). To this aim, we train and test several adapted variants of the recently proposed RoBERTa model, and evaluate its predictions from several perspectives, including: (i) metrics usually adopted when assessing DL generative models (i.e., BLEU score and Levenshtein distance); (ii) the percentage of perfect predictions (i.e., the predicted code snippets that match those written by developers); and (iii) the "semantic" equivalence of the generated code as compared to the one written by developers. The achieved results show that BERT models represent a viable solution for code completion, with perfect predictions ranging from ~7%, obtained when asking the model to guess entire blocks, up to ~58%, reached in the simpler scenario of few tokens masked from the same code statement.
Submitted 12 March, 2021;
originally announced March 2021.
-
Sampling Projects in GitHub for MSR Studies
Authors:
Ozren Dabic,
Emad Aghajani,
Gabriele Bavota
Abstract:
Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting service. However, researchers have to deal with usage limitations imposed by these APIs and a lack of required information. For example, the GitHub search APIs allow 30 requests per minute and, when searching repositories, only provide limited information (e.g., the number of commits in a repository is not included). To support researchers in sampling projects from GitHub, we present GHS (GitHub Search), a dataset containing 25 characteristics (e.g., number of commits, license, etc.) of 735,669 repositories written in 10 programming languages. The set of characteristics has been derived by looking for frequently used project selection criteria in MSR studies and the dataset is continuously updated to (i) always provide fresh data about the existing projects, and (ii) increase the number of indexed projects. The GHS dataset can be queried through a web application we built that allows users to set many combinations of selection criteria needed for a study and download the information of matching repositories: https://seart-ghs.si.usi.ch.
Submitted 8 March, 2021;
originally announced March 2021.
-
Siri, Write the Next Method
Authors:
Fengcai Wen,
Emad Aghajani,
Csaba Nagy,
Michele Lanza,
Gabriele Bavota
Abstract:
Code completion is one of the killer features of Integrated Development Environments (IDEs), and researchers have proposed different methods to improve its accuracy. While these techniques are valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a few APIs that a developer is likely to use next. We present FeaRS, a novel retrieval-based approach that, given the current code a developer is writing in the IDE, can recommend the next complete method (i.e., signature and method body) that the developer is likely to implement. To do this, FeaRS exploits "implementation patterns" (i.e., groups of methods usually implemented within the same task) learned by mining thousands of open source projects. We instantiated our approach to the specific context of Android apps. A large-scale empirical evaluation we performed across more than 20k apps shows encouraging preliminary results, but also highlights future challenges to overcome.
Submitted 8 March, 2021;
originally announced March 2021.
-
Evaluating SZZ Implementations Through a Developer-informed Oracle
Authors:
Giovanni Rosa,
Luca Pascarella,
Simone Scalabrino,
Rosalia Tufano,
Gabriele Bavota,
Michele Lanza,
Rocco Oliveto
Abstract:
The SZZ algorithm for identifying bug-inducing changes has been widely used to evaluate defect prediction techniques and to empirically investigate when, how, and by whom bugs are introduced. Over the years, researchers have proposed several heuristics to improve the SZZ accuracy, providing various implementations of SZZ. However, fairly evaluating those implementations on a reliable oracle is an open problem: SZZ evaluations usually rely on (i) the manual analysis of the SZZ output to classify the identified bug-inducing commits as true or false positives; or (ii) a golden set linking bug-fixing and bug-inducing commits. In both cases, these manual evaluations are performed by researchers with limited knowledge of the studied subject systems. Ideally, there should be a golden set created by the original developers of the studied systems.
We propose a methodology to build a "developer-informed" oracle for the evaluation of SZZ variants. We use Natural Language Processing (NLP) to identify bug-fixing commits in which developers explicitly reference the commit(s) that introduced a fixed bug. This was followed by a manual filtering step aimed at ensuring the quality and accuracy of the oracle. Once built, we used the oracle to evaluate several variants of the SZZ algorithm in terms of their accuracy. Our evaluation helped us to distill a set of lessons learned to further improve the SZZ algorithm.
Submitted 5 February, 2021;
originally announced February 2021.
-
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
Authors:
Antonio Mastropaolo,
Simone Scalabrino,
Nathan Cooper,
David Nader Palacio,
Denys Poshyvanyk,
Rocco Oliveto,
Gabriele Bavota
Abstract:
Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comment generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset using a self-supervised task (e.g., filling masked words in sentences). Once the model is pre-trained, it is fine-tuned on smaller and specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune such a model by reusing datasets used in four previous works that used DL techniques to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared the performance of this single model with the results reported in the four original papers proposing DL-based solutions for those four tasks. We show that our T5 model, exploiting additional data for the self-supervised pre-training phase, can achieve performance improvements over the four baselines.
Submitted 3 February, 2021;
originally announced February 2021.
-
Towards Automating Code Review Activities
Authors:
Rosalia Tufano,
Luca Pascarella,
Michele Tufano,
Denys Poshyvanyk,
Gabriele Bavota
Abstract:
Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and lower likelihood of introducing bugs. However, since code review is a manual activity it comes at the cost of spending developers' time on reviewing their teammates' code.
Our goal is to make the first step towards partially automating the code review process, thus, possibly reducing the manual costs associated with it. We focus on both the contributor and the reviewer sides of the process, by training two different Deep Learning architectures. The first one learns code changes performed by developers during real code review activities, thus providing the contributor with a revised version of her code implementing code transformations usually recommended during code review before the code is even submitted for review. The second one automatically provides the reviewer commenting on a submitted code with the revised code implementing her comments expressed in natural language.
The empirical evaluation of the two models shows that, on the contributor side, the trained model succeeds in replicating the code transformations applied during code reviews in up to 16% of cases. On the reviewer side, the model can correctly implement a comment provided in natural language in up to 31% of cases. While these results are encouraging, more research is needed to make these models usable by developers.
Submitted 19 May, 2021; v1 submitted 7 January, 2021;
originally announced January 2021.
-
Why Developers Refactor Source Code: A Mining-based Study
Authors:
Jevgenija Pantiuchina,
Fiorella Zampetti,
Simone Scalabrino,
Valentina Piantadosi,
Rocco Oliveto,
Gabriele Bavota,
Massimiliano Di Penta
Abstract:
Refactoring aims at improving code non-functional attributes without modifying its external behavior. Previous studies investigated the motivations behind refactoring by surveying developers. With the aim of generalizing and complementing their findings, we present a large-scale study quantitatively and qualitatively investigating why developers perform refactoring in open source projects. First, we mine 287,813 refactoring operations performed in the history of 150 systems. Using this dataset, we investigate the interplay between refactoring operations and process (e.g., previous changes/fixes) and product (e.g., quality metrics) metrics. Then, we manually analyze 551 merged pull requests implementing refactoring operations and classify the motivations behind the implemented refactorings (e.g., removal of code duplication). Our results led to (i) quantitative evidence of the relationship existing between certain process/product metrics and refactoring operations and (ii) a detailed taxonomy, generalizing and complementing the ones existing in the literature, of motivations pushing developers to refactor source code.
Submitted 5 January, 2021;
originally announced January 2021.
-
Automated Identification of On-hold Self-admitted Technical Debt
Authors:
Rungroj Maipradit,
Bin Lin,
Csaba Nagy,
Gabriele Bavota,
Michele Lanza,
Hideaki Hata,
Kenichi Matsumoto
Abstract:
Modern software is developed under considerable time pressure, which implies that developers more often than not have to resort to compromises when it comes to code that is well written and code that just does the job. This has led over the past decades to the concept of "technical debt", a short-term hack that potentially generates long-term maintenance problems. Self-admitted technical debt (SATD) is a particular form of technical debt: developers consciously perform the hack but also document it in the code by adding comments as a reminder (or as an admission of guilt). We focus on a specific type of SATD, namely "On-hold" SATD, in which developers document in their comments the need to halt an implementation task due to conditions outside of their scope of work (e.g., an open issue must be closed before a function can be implemented). We present an approach, based on regular expressions and machine learning, which is able to detect issues referenced in code comments, and to automatically classify the detected instances as either "On-hold" (the issue is referenced to indicate the need to wait for its resolution before completing a task), or as "cross-reference" (the issue is referenced to document the code, for example to explain the rationale behind an implementation choice). Our approach also mines the issue tracker of the projects to check if the On-hold SATD instances are "superfluous" and can be removed (i.e., the referenced issue has been closed, but the SATD is still in the code). Our evaluation confirms that our approach can indeed identify relevant instances of On-hold SATD. We illustrate its usefulness by identifying superfluous On-hold SATD instances in open source projects as confirmed by the original developers.
Submitted 28 September, 2020;
originally announced September 2020.