-
Lessons from Building StackSpot AI: A Contextualized AI Coding Assistant
Authors:
Gustavo Pinto,
Cleidson de Souza,
João Batista Neto,
Alberto de Souza,
Tarcísio Gotto,
Edward Monteiro
Abstract:
With their exceptional natural language processing capabilities, tools based on Large Language Models (LLMs) like ChatGPT and Co-Pilot have swiftly become indispensable resources in the software developer's toolkit. While recent studies suggest the potential productivity gains these tools can unlock, users still encounter drawbacks, such as generic or incorrect answers. Additionally, the pursuit o…
▽ More
With their exceptional natural language processing capabilities, tools based on Large Language Models (LLMs) like ChatGPT and Co-Pilot have swiftly become indispensable resources in the software developer's toolkit. While recent studies suggest the potential productivity gains these tools can unlock, users still encounter drawbacks, such as generic or incorrect answers. Additionally, the pursuit of improved responses often leads to extensive prompt engineering efforts, diverting valuable time from writing code that delivers actual value. To address these challenges, a new breed of tools, built atop LLMs, is emerging. These tools aim to mitigate drawbacks by employing techniques like fine-tuning or enriching user prompts with contextualized information.
In this paper, we delve into the lessons learned by a software development team venturing into the creation of such a contextualized LLM-based application, using retrieval-based techniques, called CodeBuddy. Over a four-month period, the team, despite lacking prior professional experience in LLM-based applications, built the product from scratch. Following the initial product release, we engaged with the development team responsible for the code generative components. Through interviews and analysis of the application's issue tracker, we uncover various intriguing challenges that teams working on LLM-based applications might encounter. For instance, we found three main group of lessons: LLM-based lessons, User-based lessons, and Technical lessons. By understanding these lessons, software development teams could become better prepared to build LLM-based applications.
△ Less
Submitted 4 January, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
TRANSMUT-SPARK: Transformation Mutation for Apache Spark
Authors:
Joao Batista de Souza Neto,
Anamaria Martins Moreira,
Genoveva Vargas-Solar,
Martin A. Musicante
Abstract:
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data Processing. It hides the complexity inherent to Big Data parallel and distributed programming and processing through built-in functions, underlying parallel processes, and data management strategies. Nonetheless, programmers must cl…
▽ More
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data Processing. It hides the complexity inherent to Big Data parallel and distributed programming and processing through built-in functions, underlying parallel processes, and data management strategies. Nonetheless, programmers must cleverly combine these functions within programs and guide the engine to use the right data management strategies to exploit the large number of computational resources required by Big Data processing and avoid substantial production losses. Many programming details in data processing code within Spark programs are prone to false statements that need to be correctly and automatically tested. This paper explores the application of mutation testing in Spark programs, a fault-based testing technique that relies on fault simulation to evaluate and design test sets. The paper introduces the TRANSMUT-Spark solution for testing Spark programs. TRANSMUT-Spark automates the most laborious steps of the process and fully executes the mutation testing process. The paper describes how the tool automates the mutants generation, test execution, and adequacy analysis phases of mutation testing with TRANSMUT-Spark. It also discusses the results of experiments that were carried out to validate the tool to argue its scope and limitations.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
An Abstract View of Big Data Processing Programs
Authors:
Joao Batista de Souza Neto,
Anamaria Martins Moreira,
Genoveva Vargas-Solar,
Martin A. Musicante
Abstract:
This paper proposes a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on monoid AlgebraandPetri Netstoabstract Big Data pro…
▽ More
This paper proposes a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on monoid AlgebraandPetri Netstoabstract Big Data processing programs in two levels: a high level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1], to enable the use of iterative programs. The general specification of iterative data processing programs implemented by data flow-based parallel programming models is essential given the democratization of iterative and greedy Big Data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed byApache Spark, DryadLINQ, Apache Beam and Apache Flink. It discusses how the model achieves to generalize these strategies.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
Segmentation of large images based on super-pixels and community detection in graphs
Authors:
Oscar A. C. Linares,
Glenda Michele Botelho,
Francisco Aparecido Rodrigues,
João Batista Neto
Abstract:
Image segmentation has many applications which range from machine learning to medical diagnosis. In this paper, we propose a framework for the segmentation of images based on super-pixels and algorithms for community identification in graphs. The super-pixel pre-segmentation step reduces the number of nodes in the graph, rendering the method the ability to process large images. Moreover, community…
▽ More
Image segmentation has many applications which range from machine learning to medical diagnosis. In this paper, we propose a framework for the segmentation of images based on super-pixels and algorithms for community identification in graphs. The super-pixel pre-segmentation step reduces the number of nodes in the graph, rendering the method the ability to process large images. Moreover, community detection algorithms provide more accurate segmentation than traditional approaches, such as those based on spectral graph partition. We also compare our method with two algorithms: a) the graph-based approach by Felzenszwalb and Huttenlocher and b) the contour-based method by Arbelaez. Results have shown that our method provides more precise segmentation and is faster than both of them.
△ Less
Submitted 12 December, 2016;
originally announced December 2016.