-
Assessing the Bug-Proneness of Refactored Code: A Longitudinal Multi-Project Study
Authors:
Isabella Ferreira,
Lawrence Arkoh,
Anderson Uchôa,
Ana Carla Bibiano,
Alessandro Garcia,
Wesley K. G. Assunção
Abstract:
Refactoring is a common practice in software development, aimed at improving the internal code structure in order to make it easier to understand and modify. Consequently, it is often assumed that refactoring makes the code less prone to bugs. However, in practice, refactoring is a complex task and applied in different ways (e.g., various refactoring types, single vs. composite refactorings) and with a variety of purposes (e.g., root-canal vs. floss refactoring). Therefore, certain refactorings can inadvertently make the code more prone to bugs. Unfortunately, there is limited research in the literature on the long-term relationship between the different characteristics of refactorings and bugs. This paper presents a longitudinal study of 12 open source software projects, where 27,450 refactorings, 6,051 reported bugs, and 49,250 bugs detected with static analysis tools were analyzed. While our study confirms the common intuition that refactored code is less bug-prone than non-refactored code, we also extend or contradict the existing body of knowledge in other ways. First, a code element that undergoes multiple refactorings is not less bug-prone than an element that undergoes a single refactoring. A single refactoring is one not performed in conjunction with other refactorings in the same commit. Second, single refactorings often induce the occurrence of bugs across all analyzed projects. Third, code elements affected by refactorings made in conjunction with other non-refactoring changes in the same commit (i.e., floss refactorings) are often bug-prone. Finally, many of the bugs induced by refactoring cannot be revealed with state-of-the-art techniques for detecting behavior-preserving refactorings.
Submitted 12 May, 2025;
originally announced May 2025.
-
FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion
Authors:
Alef Iury Siqueira Ferreira,
Lucas Rafael Gris,
Augusto Seben da Rosa,
Frederico Santos de Oliveira,
Edresson Casanova,
Rafael Teixeira Sousa,
Arnaldo Candido Junior,
Anderson da Silva Soares,
Arlindo Galvão Filho
Abstract:
This work presents FreeSVC, a promising multilingual singing voice conversion approach that leverages an enhanced VITS model with Speaker-invariant Clustering (SPIN) for better content representation and the State-of-the-Art (SOTA) speaker encoder ECAPA2. FreeSVC incorporates trainable language embeddings to handle multiple languages and employs an advanced speaker encoder to disentangle speaker characteristics from linguistic content. Designed for zero-shot learning, FreeSVC enables cross-lingual singing voice conversion without extensive language-specific training. We demonstrate that a multilingual content extractor is crucial for optimal cross-language conversion. Our source code and models are publicly available.
Submitted 9 January, 2025;
originally announced January 2025.
-
FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models
Authors:
Diego A. B. Moreira,
Alef Iury Ferreira,
Jhessica Silva,
Gabriel Oliveira dos Santos,
Luiz Pereira,
João Medrado Gondim,
Gustavo Bonil,
Helena Maia,
Nádia da Silva,
Simone Tiemi Hashiguti,
Jefersson A. dos Santos,
Helio Pedrini,
Sandra Avila
Abstract:
Despite significant advancements and pervasive use of vision-language models, a paucity of studies has addressed their ethical implications. These models typically require extensive training data, often from hastily reviewed text and image datasets, leading to highly imbalanced datasets and ethical concerns. Additionally, models initially trained in English, such as CLIP, are frequently fine-tuned for other languages; they can be expanded with more data to enhance capabilities, but this can add new biases. CAPIVARA, a CLIP-based model adapted to Portuguese, has shown strong performance in zero-shot tasks. In this paper, we evaluate four different types of discriminatory practices within visual-language models and introduce FairPIVARA, a method to reduce them by removing the most affected dimensions of feature embeddings. The application of FairPIVARA has led to a significant reduction of up to 98% in observed biases while promoting a more balanced word distribution within the model. Our model and code are available at: https://github.com/hiaac-nlp/FairPIVARA.
Submitted 4 October, 2024; v1 submitted 28 September, 2024;
originally announced September 2024.
-
No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax
Authors:
Augusto Seben da Rosa,
Marlon Daniel Angeli,
Jorge Aikes Junior,
Alef Iury Ferreira,
Lucas Rafael Gris,
Anderson da Silva Soares,
Arnaldo Candido Junior,
Frederico Santos de Oliveira,
Gabriel Trevisan Damke,
Rafael Teixeira Sousa
Abstract:
We developed a jitted compiler for training Artificial Neural Networks using C++, LLVM and CUDA. It features object-oriented characteristics, strong typing, parallel workers for data pre-processing, Pythonic syntax for expressions, PyTorch-like model declaration, and automatic differentiation. We implement caching and pooling mechanisms to manage VRAM, and use cuBLAS for high-performance matrix multiplication and cuDNN for convolutional layers. In our experiments with Residual Convolutional Neural Networks on ImageNet, we reach similar speed but degraded performance. The GRU network experiments show similar accuracy, but our compiler has degraded speed on that task. However, our compiler demonstrates promising results on the CIFAR-10 benchmark, where we reach the same performance and about the same speed as PyTorch. We make the code publicly available at: https://github.com/NoSavedDATA/NoSavedKaleidoscope
Submitted 17 September, 2024;
originally announced September 2024.
-
Parallelization Strategies for the Randomized Kaczmarz Algorithm on Large-Scale Dense Systems
Authors:
Inês Ferreira,
Juan A. Acebrón,
José Monteiro
Abstract:
The Kaczmarz algorithm is an iterative technique designed to solve consistent linear systems of equations. It falls within the category of row-action methods, focusing on handling one equation per iteration. This characteristic makes it especially useful in solving very large systems. The recent introduction of a randomized version, the Randomized Kaczmarz method, renewed interest in the algorithm, leading to the development of numerous variations. Subsequently, parallel implementations of both the original and the Randomized Kaczmarz method have been proposed. However, previous work has addressed sparse linear systems, whereas we focus on solving dense systems. In this paper, we explore in detail approaches to parallelizing the Kaczmarz method for both shared and distributed memory for large dense systems. In particular, we implemented the Randomized Kaczmarz with Averaging (RKA) method that, for inconsistent systems, unlike the standard Randomized Kaczmarz algorithm, reduces the final error of the solution. While efficient parallelization of this algorithm is not achievable, we introduce a block version of the averaging method that can outperform the RKA method.
Submitted 30 January, 2024;
originally announced January 2024.
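The row-action idea in the abstract above can be illustrated with a minimal sequential sketch of the Randomized Kaczmarz iteration. This is not the authors' parallel implementation: the function name `randomized_kaczmarz`, its parameters, and the small example system are assumptions made for illustration, and rows are sampled with probability proportional to their squared norm, which is the usual choice for the randomized variant.

```python
import random


def randomized_kaczmarz(A, b, iterations=2000, seed=0):
    """Approximate the solution of a consistent system A x = b.

    Hypothetical helper for illustration; A is a list of rows, b a list.
    """
    rng = random.Random(seed)
    # Squared row norms are used both as sampling weights and in the update.
    row_norms_sq = [sum(a * a for a in row) for row in A]
    rows = range(len(A))
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        # Pick one equation per iteration (row-action method).
        i = rng.choices(rows, weights=row_norms_sq, k=1)[0]
        row = A[i]
        # Project the current iterate onto the hyperplane row . x = b[i].
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        scale = residual / row_norms_sq[i]
        x = [xj + scale * a for xj, a in zip(x, row)]
    return x


# Small consistent overdetermined system whose exact solution is (1, 2).
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
b = [4.0, 7.0, -1.0]
x = randomized_kaczmarz(A, b)
```

Each iteration touches a single row of the matrix, which is why the method suits very large systems; the averaging variant (RKA) mentioned in the abstract would instead combine several such projections per step.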
-
CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
Authors:
Gabriel Oliveira dos Santos,
Diego A. B. Moreira,
Alef Iury Ferreira,
Jhessica Silva,
Luiz Pereira,
Pedro Bueno,
Thiago Sousa,
Helena Maia,
Nádia Da Silva,
Esther Colombini,
Helio Pedrini,
Sandra Avila
Abstract:
This work introduces CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. While CLIP has excelled in zero-shot vision-language tasks, the resource-intensive nature of model training remains challenging. Many datasets lack linguistic diversity, featuring solely English descriptions for images. CAPIVARA addresses this by augmenting text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. We optimize the training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the computational cost. Through extensive experiments, CAPIVARA emerges as state of the art in zero-shot tasks involving images and Portuguese texts. We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a single GPU for 2 hours. Our model and code are available at https://github.com/hiaac-nlp/CAPIVARA.
Submitted 23 October, 2023; v1 submitted 20 October, 2023;
originally announced October 2023.
-
Anonymising Clinical Data for Secondary Use
Authors:
Irene Ferreira,
Chris Harbron,
Alex Hughes,
Tamsin Sargood,
Christoph Gerlinger
Abstract:
Secondary use of data already collected in clinical studies has become more and more popular in recent years, with the commitment of the pharmaceutical industry and many academic institutions in Europe and the US to provide access to their clinical trial data. Whilst this clearly provides societal benefit in helping to progress medical research, this has to be balanced against protection of subjects' privacy. There are two main scenarios for sharing subject data: within Clinical Study Reports and Individual Patient Level Data, and these scenarios have different associated risks and generally require different approaches. In any data sharing scenario, there is a trade-off between data utility and the risk of subject re-identification, and achieving this balance is key. Quantitative metrics can guide the amount of de-identification required and new technologies may also start to provide alternative ways to achieve the risk-utility balance.
Submitted 17 May, 2023;
originally announced July 2023.
-
ABL: An original active blacklist based on a modification of the SMTP
Authors:
Pablo M. Oliveira,
Mateus B. Vieira,
Isaac C. Ferreira,
João P. R. R. Leite,
Edvard M. Oliveira,
Bruno T. Kuehne,
Edmilson M. Moreira,
Otávio A. S. Carpinteiro
Abstract:
This paper presents a novel Active Blacklist (ABL) based on a modification of the Simple Mail Transfer Protocol (SMTP). ABL was implemented in the Mail Transfer Agent (MTA) Postfix of the e-mail server Zimbra and assessed exhaustively in a series of experiments. The modified Zimbra server showed computational performance and costs similar to those of the original Zimbra server when receiving legitimate e-mails. When receiving spam, however, it showed better computing performance and costs than the original Zimbra. Moreover, there was a considerable computational cost on the spammer's server when it sent spam e-mails. ABL was assessed at the Federal University of Itajubá, Brazil, during a period of sixty-one days. It rejected 20.94% of the spam e-mails received by the university during this period. After this period, it was deployed and remained in use at the university from July 2015 to July 2019. ABL is part of the new Open Machine-Learning-Based Anti-Spam (Open-MaLBAS). Both ABL and Open-MaLBAS are freely available on GitHub.
Submitted 22 August, 2022;
originally announced August 2022.
-
Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge
Authors:
Alef Iury Siqueira Ferreira,
Gustavo dos Reis Oliveira
Abstract:
This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to advance ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects. Our method consists of fine-tuning an ASR model in a domain-specific approach, applying gain normalization and selective noise insertion. The proposed method improved over the strong baseline provided on the test set in 3 of the 4 tracks available.
Submitted 28 July, 2022;
originally announced July 2022.
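The abstract above names gain normalization and selective noise insertion without giving details, so the following is only a rough sketch of what such a preprocessing step could look like. Everything here is an assumption: the helper names, the RMS target of 0.1, the 20 dB signal-to-noise ratio, the use of Gaussian noise, and the reading of "selective" as "applied to a random subset of utterances" are illustrative choices, not the authors' settings.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a non-empty list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_gain(samples, target_rms=0.1):
    """Scale the waveform so its RMS level equals target_rms (assumed value)."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silence cannot be rescaled
    gain = target_rms / current
    return [s * gain for s in samples]


def add_noise(samples, snr_db, rng):
    """Add Gaussian noise whose expected level sits snr_db below the signal."""
    noise_rms = rms(samples) / (10 ** (snr_db / 20))
    return [s + rng.gauss(0.0, noise_rms) for s in samples]


def augment(samples, noise_probability=0.5, snr_db=20.0, seed=0):
    """Normalize gain, then insert noise only for a random subset of inputs."""
    rng = random.Random(seed)
    out = normalize_gain(samples)
    if rng.random() < noise_probability:
        out = add_noise(out, snr_db, rng)
    return out


# One second of a 440 Hz tone sampled at 16 kHz, used as a stand-in utterance.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
normalized = normalize_gain(tone)
```

In a real fine-tuning pipeline the noise would more plausibly come from recorded background audio matched to the target domain than from a Gaussian generator; the sketch only shows where the two steps sit relative to each other.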
-
Incivility Detection in Open Source Code Review and Issue Discussions
Authors:
Isabella Ferreira,
Ahlaam Rafiq,
Jinghui Cheng
Abstract:
Given the democratic nature of open source development, code review and issue discussions may be uncivil. Incivility, defined as features of discussion that convey an unnecessarily disrespectful tone, can have negative consequences for open source communities. To prevent or minimize these negative consequences, open source platforms have included mechanisms for removing uncivil language from the discussions. However, such approaches require manual inspection, which can be overwhelming given the large number of discussions. To help open source communities deal with this problem, in this paper, we aim to compare six classical machine learning models with BERT to detect incivility in open source code review and issue discussions. Furthermore, we assess if adding contextual information improves the models' performance and how well the models perform in a cross-platform setting. We found that BERT performs better than classical machine learning models, with a best F1-score of 0.95. Furthermore, classical machine learning models tend to underperform in detecting non-technical and civil discussions. Our results show that adding the contextual information to BERT did not improve its performance and that none of the analyzed classifiers had an outstanding performance in a cross-platform setting. Finally, we provide insights into the tones that the classifiers misclassify.
Submitted 18 December, 2023; v1 submitted 27 June, 2022;
originally announced June 2022.
-
PetroGAN: A novel GAN-based approach to generate realistic, label-free petrographic datasets
Authors:
I. Ferreira,
L. Ochoa,
A. Koeshidayatullah
Abstract:
Deep learning architectures have enriched data analytics in the geosciences, complementing traditional approaches to geological problems. Although deep learning applications in geosciences show encouraging signs, the actual potential remains untapped. This is primarily because geological datasets, particularly petrography, are limited, time-consuming, and expensive to obtain, requiring in-depth knowledge to provide a high-quality labeled dataset. We approached these issues by developing a novel deep learning framework based on generative adversarial networks (GANs) to create the first realistic synthetic petrographic dataset. The StyleGAN2 architecture is selected to allow robust replication of statistical and aesthetic characteristics and to improve the internal variance of petrographic data. The training dataset consists of 10,070 images of rock thin sections in both plane- and cross-polarized light. The algorithm trained for 264 GPU hours and reached a state-of-the-art Fréchet Inception Distance (FID) score of 12.49 for petrographic images. We further observed that the FID values vary with lithology type and image resolution. Our survey established that subject matter experts found the generated images indistinguishable from real images. This study highlights that GANs are a powerful method for generating realistic synthetic data and experimenting with the latent space, and may serve as a future tool for self-labelling, reducing the effort of creating geological datasets.
Submitted 6 April, 2022;
originally announced April 2022.
-
How heated is it? Understanding GitHub locked issues
Authors:
Isabella Ferreira,
Bram Adams,
Jinghui Cheng
Abstract:
Although issues of open source software are created to discuss and solve technical problems, conversations can become heated, with discussants getting angry and/or agitated for a variety of reasons, such as poor suggestions or violation of community conventions. To prevent discussions from getting heated, and to mitigate those that do, tools like GitHub have introduced the ability to lock issue discussions that violate the code of conduct or other community guidelines. Despite some early research on locked issues, there is a lack of understanding of how communities use this feature and of potential threats to validity for researchers relying on a dataset of locked issues as an oracle for heated discussions. To address this gap, we (i) quantitatively analyzed 79 GitHub projects that have at least one issue locked as too heated, and (ii) qualitatively analyzed all issues locked as too heated in the 79 projects, a total of 205 issues comprising 5,511 comments. We found that projects have different behaviors when locking issues: while 54 projects locked less than 10% of their closed issues, 14 projects locked more than 90% of their closed issues. Additionally, locked issues tend to have a similar number of comments, participants, and emoji reactions to non-locked issues. For the 205 issues locked as too heated, we found that one-third do not contain any uncivil discourse, and only 8.82% of the analyzed comments are actually uncivil. Finally, we found that the locking justifications provided by maintainers do not always match the label used to lock the issue. Based on our results, we identified three pitfalls to avoid when using the GitHub locked issues data and we provide recommendations for researchers and practitioners.
Submitted 31 March, 2022;
originally announced April 2022.
-
Optimizing Packet Reception Rates for Low Duty-Cycle BLE Relay Nodes
Authors:
Nuno Paulino,
Luís Pessoa,
André Branquinho,
Rafael Tavares,
Igor Ferreira
Abstract:
In order to achieve the full potential of the Internet-of-Things, connectivity between devices should be ubiquitous and efficient. Wireless mesh networks are a critical component to achieve this ubiquitous connectivity for a wide range of services, and are composed of terminal devices (i.e., nodes), such as sensors of various types, and wall-powered gateway devices, which provide further internet connectivity (e.g., via WiFi). When considering large indoor areas, such as hospitals or industrial scenarios, the mesh must cover a large area, which introduces concerns regarding range and the number of gateways needed and respective wall cabling infrastructure. Solutions for mesh networks implemented over different wireless protocols exist, like the recent Bluetooth Low Energy (BLE) 5.1. Besides range concerns, choosing which nodes forward data through the mesh has a large impact on performance and power consumption. We address the area coverage issue via a battery-powered BLE relay device of our own design, which acts as a range extender by forwarding packets from end nodes to gateways. We present the relay's design and experimentally determine the packet forwarding efficiency for several scenarios and configurations. In the best case, up to 35% of the packets transmitted by 11 nodes can be forwarded to a gateway by a single relay under continuous operation. A battery lifetime of 1 year can be achieved with a relay duty cycle of 20%.
Submitted 29 November, 2021; v1 submitted 26 November, 2021;
originally announced November 2021.
-
The "Shut the f**k up" Phenomenon: Characterizing Incivility in Open Source Code Review Discussions
Authors:
Isabella Ferreira,
Jinghui Cheng,
Bram Adams
Abstract:
Code review is an important quality assurance activity for software development. Code review discussions among developers and maintainers can be heated and sometimes involve personal attacks and unnecessarily disrespectful comments, thereby demonstrating incivility. Although incivility in public discussions has received increasing attention from researchers in different domains, the knowledge about the characteristics, causes, and consequences of uncivil communication is still very limited in the context of software development, and more specifically, code review. To address this gap in the literature, we leverage the mature social construct of incivility as a lens to understand confrontational conflicts in open source code review discussions. For that, we conducted a qualitative analysis on 1,545 emails from the Linux Kernel Mailing List (LKML) that were associated with rejected changes. We found that more than half (66.66%) of the non-technical emails included uncivil features. In particular, frustration, name calling, and impatience are the most frequent features in uncivil emails. We also found that there are civil alternatives to address arguments, while uncivil comments can potentially be made by anyone when discussing any topic. Finally, we identified various causes and consequences of such uncivil communication. Our work serves as the first study about the phenomenon of (in)civility in open source software development, paving the road for a new field of research about collaboration and communication in the context of software engineering activities.
Submitted 22 August, 2021;
originally announced August 2021.