-
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices
Authors:
Luís Cruz,
João Paulo Fernandes,
Maja H. Kirkeby,
Silverio Martínez-Fernández,
June Sallou,
Hina Anwar,
Enrique Barba Roque,
Justus Bogner,
Joel Castaño,
Fernando Castor,
Aadil Chasmawala,
Simão Cunha,
Daniel Feitosa,
Alexandra González,
Andreas Jedlitschka,
Patricia Lago,
Henry Muccini,
Ana Oprescu,
Pooja Rani,
João Saraiva,
Federica Sarro,
Raghavendra Selvan,
Karthik Vaidhyanathan,
Roberto Verdecchia,
Ivan P. Yamshchikov
Abstract:
The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Européen de Calcul Atomique et Moléculaire and the Lorentz Center, provided an interdisciplinary forum for 29 parti…
▽ More
The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Européen de Calcul Atomique et Moléculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.
△ Less
Submitted 3 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
Understanding Underrepresented Groups in Open Source Software
Authors:
Reydne Santos,
Rafa Prado,
Ana Paula de Holanda Silva,
Kiev Gama,
Fernando Castor,
Ronnie de Souza Santos
Abstract:
Context: Diversity can impact team communication, productivity, cohesiveness, and creativity. Analyzing the existing knowledge about diversity in open source software (OSS) projects can provide directions for future research and raise awareness about barriers and biases against underrepresented groups in OSS. Objective: This study aims to analyze the knowledge about minority groups in OSS projects…
▽ More
Context: Diversity can impact team communication, productivity, cohesiveness, and creativity. Analyzing the existing knowledge about diversity in open source software (OSS) projects can provide directions for future research and raise awareness about barriers and biases against underrepresented groups in OSS. Objective: This study aims to analyze the knowledge about minority groups in OSS projects. We investigated which groups were studied in the OSS literature, the study methods used, their implications, and their recommendations to promote the inclusion of minority groups in OSS projects. Method: To achieve this goal, we performed a systematic literature review study that analyzed 42 papers that directly study underrepresented groups in OSS projects. Results: Most papers focus on gender (62.3%), while others like age or ethnicity are rarely studied. The neurodiversity dimension, have not been studied in the context of OSS. Our results also reveal that diversity in OSS projects faces several barriers but brings significant benefits, such as promoting safe and welcoming environments. Conclusion: Most analyzed papers adopt a myopic perspective that sees gender as strictly binary. Dimensions of diversity that affect how individuals interact and function in an OSS project, such as age, tenure, and ethnicity, have received very little attention.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code
Authors:
Lola Solovyeva,
Sophie Weidmann,
Fernando Castor
Abstract:
Large language models (LLMs) are used in software development to assist in various tasks, e.g., code generation and code completion, but empirical evaluations of the quality of the results produced by these models focus on correctness and ignore other relevant aspects, such as their performance and energy efficiency. Studying the performance of LLM-produced programs is essential to understand how…
▽ More
Large language models (LLMs) are used in software development to assist in various tasks, e.g., code generation and code completion, but empirical evaluations of the quality of the results produced by these models focus on correctness and ignore other relevant aspects, such as their performance and energy efficiency. Studying the performance of LLM-produced programs is essential to understand how well LLMs can support the construction of performance- and energy-critical software, such as operating systems, servers, and mobile applications. This paper presents the first study analyzing the energy efficiency and performance of LLM-generated code for three programming languages Python, Java, and C++, on two platforms, a Mac and a PC, leveraging three frontier LLMs, Github Copilot, GPT-4o, and the recently-released OpenAI o1-mini, and targeting ``hard'' programming problems from LeetCode. Our results show that the models are much more successful in generating Python and Java than C++ code.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy
Authors:
Negar Alizadeh,
Boris Belchev,
Nishant Saurabh,
Patricia Kelbert,
Fernando Castor
Abstract:
The use of generative AI-based coding assistants like ChatGPT and Github Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs. Using third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally-deployed language models. In this study, we explore the trade-off between model accuracy and ener…
▽ More
The use of generative AI-based coding assistants like ChatGPT and Github Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs. Using third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally-deployed language models. In this study, we explore the trade-off between model accuracy and energy consumption, aiming to provide valuable insights to help developers make informed decisions when selecting a language model. We investigate the performance of 18 families of LLMs in typical software development tasks on two real-world infrastructures, a commodity GPU and a powerful AI-specific GPU. Given that deploying LLMs locally requires powerful infrastructure which might not be affordable for everyone, we consider both full-precision and quantized models. Our findings reveal that employing a big LLM with a higher energy budget does not always translate to significantly improved accuracy. Additionally, quantized versions of large models generally offer better efficiency and accuracy compared to full-precision versions of medium-sized ones. Apart from that, not a single model is suitable for all types of software development tasks.
△ Less
Submitted 17 January, 2025; v1 submitted 29 November, 2024;
originally announced December 2024.
-
Understanding Code Understandability Improvements in Code Reviews
Authors:
Delano Oliveira,
Reydne Santos,
Benedito de Oliveira,
Martin Monperrus,
Fernando Castor,
Fernanda Madeiral
Abstract:
Motivation: Code understandability is crucial in software development, as developers spend 58% to 70% of their time reading source code. Improving it can improve productivity and reduce maintenance costs. Problem: Experimental studies often identify factors influencing code understandability in controlled settings but overlook real-world influences like project culture, guidelines, and developers'…
▽ More
Motivation: Code understandability is crucial in software development, as developers spend 58% to 70% of their time reading source code. Improving it can improve productivity and reduce maintenance costs. Problem: Experimental studies often identify factors influencing code understandability in controlled settings but overlook real-world influences like project culture, guidelines, and developers' backgrounds. Ignoring these factors may yield results with limited external validity. Objective: This study investigates how developers enhance code understandability through code review comments, assuming that code reviewers are specialists in code quality. Method and Results: We analyzed 2,401 code review comments from Java open-source projects on GitHub, finding that over 42% focus on improving code understandability. We further examined 385 comments specifically related to this aspect and identified eight categories of concerns, such as inadequate documentation and poor identifiers. Notably, 83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted. We identified various types of patches that enhance understandability, from simple changes like removing unused code to context-dependent improvements such as optimizing method calls. Additionally, we evaluated four well-known linters for their ability to flag these issues, finding they cover less than 30%, although many could be easily added as new rules. Implications: Our findings encourage the development of tools to enhance code understandability, as accepted changes can serve as reliable training data for specialized machine-learning models. Our dataset supports this training and can inform the development of evidence-based code style guides. Data Availability: Our data is publicly available at https://codeupcrc.github.io.
△ Less
Submitted 12 November, 2024; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Estimating the Energy Footprint of Software Systems: a Primer
Authors:
Fernando Castor
Abstract:
In Green Software Development, quantifying the energy footprint of a software system is one of the most basic activities. This documents provides a high-level overview of how the energy footprint of a software system can be estimated to support Green Software Development. We introduce basic concepts in the area, highlight methodological issues that must be accounted for when conducting experiments…
▽ More
In Green Software Development, quantifying the energy footprint of a software system is one of the most basic activities. This documents provides a high-level overview of how the energy footprint of a software system can be estimated to support Green Software Development. We introduce basic concepts in the area, highlight methodological issues that must be accounted for when conducting experiments, discuss trade-offs associated with different estimation approaches, and make some practical considerations. This document aims to be a starting point for researchers who want to begin conducting work in this area.
△ Less
Submitted 17 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures
Authors:
Negar Alizadeh,
Fernando Castor
Abstract:
Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware, managing memory, data transfers, and multi-accelerator execution, if applicable. Additionally, it is a common practice to deploy pre-trained models on environments distinct from their native development settings. This led to the introduction of i…
▽ More
Deep Learning (DL) frameworks such as PyTorch and TensorFlow include runtime infrastructures responsible for executing trained models on target hardware, managing memory, data transfers, and multi-accelerator execution, if applicable. Additionally, it is a common practice to deploy pre-trained models on environments distinct from their native development settings. This led to the introduction of interchange formats such as ONNX, which includes its runtime infrastructure, and ONNX Runtime, which work as standard formats that can be used across diverse DL frameworks and languages. Even though these runtime infrastructures have a great impact on inference performance, no previous paper has investigated their energy efficiency. In this study, we monitor the energy consumption and inference time in the runtime infrastructures of three well-known DL frameworks as well as ONNX, using three various DL models. To have nuance in our investigation, we also examine the impact of using different execution providers. We find out that the performance and energy efficiency of DL are difficult to predict. One framework, MXNet, outperforms both PyTorch and TensorFlow for the computer vision models using batch size 1, due to efficient GPU usage and thus low CPU usage. However, batch size 64 makes PyTorch and MXNet practically indistinguishable, while TensorFlow is outperformed consistently. For BERT, PyTorch exhibits the best performance. Converting the models to ONNX yields significant performance improvements in the majority of cases. Finally, in our preliminary investigation of execution providers, we observe that TensorRT always outperforms CUDA.
△ Less
Submitted 27 February, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
A Systematic Literature Review on the Impact of Formatting Elements on Code Legibility
Authors:
Delano Oliveira,
Reydne Santos,
Fernanda Madeiral,
Hidehiko Masuhara,
Fernando Castor
Abstract:
Context: Software programs can be written in different but functionally equivalent ways. Even though previous research has compared specific formatting elements to find out which alternatives affect code legibility, seeing the bigger picture of what makes code more or less legible is challenging. Goal: We aim to find which formatting elements have been investigated in empirical studies and which a…
▽ More
Context: Software programs can be written in different but functionally equivalent ways. Even though previous research has compared specific formatting elements to find out which alternatives affect code legibility, seeing the bigger picture of what makes code more or less legible is challenging. Goal: We aim to find which formatting elements have been investigated in empirical studies and which alternatives were found to be more legible for human subjects. Method: We conducted a systematic literature review and identified 15 papers containing human-centric studies that directly compared alternative formatting elements. We analyzed and organized these formatting elements using a card-sorting method. Results: We identified 13 formatting elements (e.g., indentation) and 33 levels of formatting elements (e.g., two-space indentation), which are about formatting styles, spacing, block delimiters, long or complex code lines, and word boundary styles. While some levels were found to be statistically better than other equivalent ones in terms of code legibility, e.g., appropriate use of indentation with blocks, others were not, e.g., formatting layout. For identifier style, we found divergent results, where one study found a significant difference in favor of camel case, while another study found a positive result in favor of snake case. Conclusion: The number of identified papers, some of which are outdated, and the many null and contradictory results emphasize the relative lack of work in this area and underline the importance of more research. There is much to be understood about how formatting elements influence code legibility before the creation of guidelines and automated aids to help developers make their code more legible.
△ Less
Submitted 1 June, 2023; v1 submitted 25 August, 2022;
originally announced August 2022.
-
On the Bug-proneness of Structures Inspired by Functional Programming in JavaScript Projects
Authors:
Fernando Alves,
Delano Oliveira,
Fernanda Madeiral,
Fernando Castor
Abstract:
Language constructs inspired by functional programming have made their way into most mainstream programming languages. Many researchers and developers consider that these constructs lead to programs that are more concise, reusable, and easier to understand. However, few studies investigate the implications of using them in mainstream programming languages. This paper quantifies the prevalence of f…
▽ More
Language constructs inspired by functional programming have made their way into most mainstream programming languages. Many researchers and developers consider that these constructs lead to programs that are more concise, reusable, and easier to understand. However, few studies investigate the implications of using them in mainstream programming languages. This paper quantifies the prevalence of four concepts typically associated with functional programming in JavaScript: recursion, immutability, lazy evaluation, and functions as values. We focus on JavaScript programs due to the availability of some of these concepts in the language since its inception, its inspiration from functional programming languages, and its popularity. We mine 91 GitHub repositories (22+ million LOC) written mostly in JavaScript (over 50% of the code), measuring the usage of these concepts from both static and temporal perspectives. We also measure the likelihood of bug-fixing commits removing uses of these concepts (which would hint at bug-proneness) and their association with the presence of code comments (which would hint at code that is hard to understand). We find that these concepts are in widespread use (1 for every 46.65 LOC, 43.59% of LOC). In addition, the usage of higher-order functions, immutability, and lazy evaluation-related structures has been growing throughout the years for the analyzed projects, while the usage of recursion and callbacks & promises has decreased. We also find statistical evidence that removing these structures, with the exception of the ones associated to immutability, is less common in bug-fixing commits than in other commits. In addition, their presence is not correlated with comment size. Our findings suggest that functional programming concepts are important for developers using a multi-paradigm language, and their usage does not make programs harder to understand.
△ Less
Submitted 17 June, 2022;
originally announced June 2022.
-
Evaluating Code Readability and Legibility: An Examination of Human-centric Studies
Authors:
Delano Oliveira,
Reydne Bruno,
Fernanda Madeiral,
Fernando Castor
Abstract:
Reading code is an essential activity in software maintenance and evolution. Several studies with human subjects have investigated how different factors, such as the employed programming constructs and naming conventions, can impact code readability, i.e., what makes a program easier or harder to read and apprehend by developers, and code legibility, i.e., what influences the ease of identifying e…
▽ More
Reading code is an essential activity in software maintenance and evolution. Several studies with human subjects have investigated how different factors, such as the employed programming constructs and naming conventions, can impact code readability, i.e., what makes a program easier or harder to read and apprehend by developers, and code legibility, i.e., what influences the ease of identifying elements of a program. These studies evaluate readability and legibility by means of different comprehension tasks and response variables. In this paper, we examine these tasks and variables in studies that compare programming constructs, coding idioms, naming conventions, and formatting guidelines, e.g., recursive vs. iterative code. To that end, we have conducted a systematic literature review where we found 54 relevant papers. Most of these studies evaluate code readability and legibility by measuring the correctness of the subjects' results (83.3%) or simply asking their opinions (55.6%). Some studies (16.7%) rely exclusively on the latter variable.There are still few studies that monitor subjects' physical signs, such as brain activation regions (5%). Moreover, our study shows that some variables are multi-faceted. For instance, correctness can be measured as the ability to predict the output of a program, answer questions about its behavior, or recall parts of it. These results make it clear that different evaluation approaches require different competencies from subjects, e.g., tracing the program vs. summarizing its goal vs. memorizing its text. To assist researchers in the design of new studies and improve our comprehension of existing ones, we model program comprehension as a learning activity by adapting a preexisting learning taxonomy. This adaptation indicates that some competencies are often exercised in these evaluations whereas others are rarely targeted.
△ Less
Submitted 2 October, 2021;
originally announced October 2021.
-
Small Changes, Big Impacts: Leveraging Diversity to Improve Energy Efficiency
Authors:
Wellington Oliveira,
Hugo Matalonga,
Gustavo Pinto,
Fernando Castor,
João Paulo Fernandes
Abstract:
In the last few years, a growing body of research has proposed methods, techniques, and tools to support developers in the construction of software that consumes less energy. These solutions leverage diverse approaches such as version history mining, analytical models, identifying energy-efficient color schemes, and optimizing the packaging of HTTP requests.
In this chapter, we present a complem…
▽ More
In the last few years, a growing body of research has proposed methods, techniques, and tools to support developers in the construction of software that consumes less energy. These solutions leverage diverse approaches such as version history mining, analytical models, identifying energy-efficient color schemes, and optimizing the packaging of HTTP requests.
In this chapter, we present a complementary approach. We advocate that developers should leverage software diversity to make software systems more energy-efficient. Our main insight is that non-specialists can build software that consumes less energy by alternating at development time between readily available, diversely-designed pieces of software implemented by third-parties. These pieces of software can vary in nature, granularity, and quality attributes. Examples include data structures and constructs for thread management and synchronization.
△ Less
Submitted 7 December, 2020;
originally announced December 2020.