Showing 1–17 of 17 results for author: Santos, C S

  1. arXiv:2501.02170  [pdf]

    cs.SE

    An Empirical Study of Safetensors' Usage Trends and Developers' Perceptions

    Authors: Beatrice Casey, Kaia Damian, Andrew Cotaj, Joanna C. S. Santos

    Abstract: Developers are sharing pre-trained Machine Learning (ML) models through a variety of model sharing platforms, such as Hugging Face, in an effort to make ML development more collaborative. To share the models, they must first be serialized. While there are many methods of serialization in Python, most of them are unsafe. To tame this insecurity, Hugging Face released safetensors as a way to mitigat…

    Submitted 3 January, 2025; originally announced January 2025.

  2. arXiv:2410.17736  [pdf, other]

    cs.CL

    MojoBench: Language Modeling and Benchmarks for Mojo

    Authors: Nishat Raihan, Joanna C. S. Santos, Marcos Zampieri

    Abstract: The recently introduced Mojo programming language (PL) by Modular has received significant attention in the scientific community due to its claimed significant speed boost over Python. Despite advancements in code Large Language Models (LLMs) across various PLs, Mojo remains unexplored in this context. To address this gap, we introduce MojoBench, the first framework for Mojo code generation. Mojo…

    Submitted 23 October, 2024; originally announced October 2024.

  3. arXiv:2410.16349  [pdf, other]

    cs.LG cs.HC

    Large Language Models in Computer Science Education: A Systematic Literature Review

    Authors: Nishat Raihan, Mohammed Latif Siddiq, Joanna C. S. Santos, Marcos Zampieri

    Abstract: Large language models (LLMs) are becoming increasingly better at a wide range of Natural Language Processing (NLP) tasks, such as text generation and understanding. Recently, these models have extended their capabilities to coding tasks, bridging the gap between natural languages (NL) and programming languages (PL). Foundational models such as the Generative Pre-trained Transformer (GPT) and LLaMA…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted at 56th ACM Technical Symposium on Computer Science Education (SIGCSE TS 2025)

  4. arXiv:2410.04490  [pdf]

    cs.CR cs.LG cs.SE

    A Large-Scale Exploit Instrumentation Study of AI/ML Supply Chain Attacks in Hugging Face Models

    Authors: Beatrice Casey, Joanna C. S. Santos, Mehdi Mirakhorli

    Abstract: The development of machine learning (ML) techniques has led to ample opportunities for developers to develop and deploy their own models. Hugging Face serves as an open source platform where developers can share and download other models in an effort to make ML development more collaborative. In order for models to be shared, they first need to be serialized. Certain Python serialization methods a…

    Submitted 6 October, 2024; originally announced October 2024.

  5. arXiv:2404.10155  [pdf, other]

    cs.SE cs.LG

    The Fault in our Stars: Quality Assessment of Code Generation Benchmarks

    Authors: Mohammed Latif Siddiq, Simantika Dristi, Joy Saha, Joanna C. S. Santos

    Abstract: Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can provide a false sense of performance. In this work, we conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the perfo…

    Submitted 4 September, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation (SCAM 2024) Research Track

  6. arXiv:2403.10646  [pdf]

    cs.LG cs.CR

    A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

    Authors: Beatrice Casey, Joanna C. S. Santos, George Perry

    Abstract: Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better unders…

    Submitted 9 April, 2025; v1 submitted 15 March, 2024; originally announced March 2024.

  7. arXiv:2401.01200  [pdf, other]

    cs.CV cs.AI

    Skin cancer diagnosis using NIR spectroscopy data of skin lesions in vivo using machine learning algorithms

    Authors: Flavio P. Loss, Pedro H. da Cunha, Matheus B. Rocha, Madson Poltronieri Zanoni, Leandro M. de Lima, Isadora Tavares Nascimento, Isabella Rezende, Tania R. P. Canuto, Luciana de Paula Vieira, Renan Rossoni, Maria C. S. Santos, Patricia Lyra Frasson, Wanderson Romão, Paulo R. Filgueiras, Renato A. Krohling

    Abstract: Skin lesions are classified as benign or malignant. Among the malignant lesions, melanoma is a very aggressive cancer and the major cause of deaths. Early diagnosis of skin cancer is therefore highly desirable. In the last few years, there has been growing interest in computer-aided diagnosis (CAD) using mostly image and clinical data of the lesion. These sources of information present limitations due to their inability…

    Submitted 2 January, 2024; originally announced January 2024.

  8. arXiv:2312.12598  [pdf, other]

    cs.SE cs.AI

    A Case Study on Test Case Construction with Large Language Models: Unveiling Practical Insights and Challenges

    Authors: Roberto Francisco de Lima Junior, Luiz Fernando Paes de Barros Presta, Lucca Santos Borborema, Vanderson Nogueira da Silva, Marcio Leal de Melo Dahia, Anderson Carlos Sousa e Santos

    Abstract: This paper presents a detailed case study examining the application of Large Language Models (LLMs) in the construction of test cases within the context of software engineering. LLMs, characterized by their advanced natural language processing capabilities, are increasingly garnering attention as tools to automate and enhance various aspects of the software development life cycle. Leveraging a cas…

    Submitted 21 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

  9. Seneca: Taint-Based Call Graph Construction for Java Object Deserialization

    Authors: Joanna C. S. Santos, Mehdi Mirakhorli, Ali Shokri

    Abstract: Object serialization and deserialization are widely used for storing and preserving objects in files, memory, or databases as well as for transporting them across machines, enabling remote interaction among processes and many more. This mechanism relies on reflection, a dynamic language feature that introduces serious challenges for static analyses. Current state-of-the-art call graph construction algorith…

    Submitted 2 September, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted at OOPSLA 2024

  10. SALLM: Security Assessment of Generated Code

    Authors: Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, Anna Muller

    Abstract: With the growing popularity of Large Language Models (LLMs) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers to be more productive, prior empirical studies have shown that LLMs can generate insecure code. There are two contributing factors to…

    Submitted 4 September, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Accepted at the 6th International Workshop on Automated and verifiable Software sYstem DEvelopment (ASYDE) with ASE Conference 2024

    Journal ref: 39th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW '24), October 27-November 1, 2024, Sacramento, CA, USA, ACM, New York, NY, USA, 12 pages

  11. arXiv:2307.08220  [pdf, other]

    cs.SE cs.LG

    FRANC: A Lightweight Framework for High-Quality Code Generation

    Authors: Mohammed Latif Siddiq, Beatrice Casey, Joanna C. S. Santos

    Abstract: In recent years, the use of automated source code generation utilizing transformer-based generative models has expanded, and these models can generate functional code according to the requirements of the developers. However, recent research revealed that this automatically generated source code can contain vulnerabilities and other quality issues. Despite researchers' and practitioners' attempts…

    Submitted 28 August, 2024; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: Accepted at the 24th IEEE International Conference on Source Code Analysis and Manipulation (SCAM 2024)

  12. Using Large Language Models to Generate JUnit Tests: An Empirical Study

    Authors: Mohammed Latif Siddiq, Joanna C. S. Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, Vinicius Carvalho Lopes

    Abstract: A code generation model generates code by taking a prompt from a code comment, existing code, or a combination of both. Although code generation models (e.g., GitHub Copilot) are increasingly being adopted in practice, it is unclear whether they can successfully be used for unit test generation without fine-tuning for a strongly typed language like Java. To fill this gap, we investigated how well…

    Submitted 8 March, 2024; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: Accepted in Research Track of The 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024)

    Journal ref: The 28th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2024, 313-322

  13. A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese

    Authors: Hugo Sousa, Arian Pasquali, Alípio Jorge, Catarina Sousa Santos, Mário Amorim Lopes

    Abstract: Textual health records of cancer patients are usually protracted and highly unstructured, making it very time-consuming for health professionals to get a complete overview of the patient's therapeutic course. As such limitations can lead to suboptimal and/or inefficient treatment procedures, healthcare providers would greatly benefit from a system that effectively summarizes the information of tho…

    Submitted 18 April, 2023; originally announced April 2023.

  14. arXiv:2304.07840  [pdf, other]

    cs.LG cs.SE

    Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering

    Authors: Rishov Paul, Md. Mohib Hossain, Mohammed Latif Siddiq, Masum Hasan, Anindya Iqbal, Joanna C. S. Santos

    Abstract: Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset. Some recent studies also demonstrated strong empirical evidence that code review could improve the program repair further. Large language models, trained with Natural Language (NL) and Programming Language (PL), can contain inherent knowledge of both. In this study…

    Submitted 21 July, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

    Comments: 12 pages, 2 figures, 4 tables

  15. ArCode: Facilitating the Use of Application Frameworks to Implement Tactics and Patterns

    Authors: Ali Shokri, Joanna C. S. Santos, Mehdi Mirakhorli

    Abstract: Software designers and developers are increasingly relying on application frameworks as first-class design concepts. They instantiate the services that frameworks provide to implement various architectural tactics and patterns. One of the challenges in using frameworks for such tasks is the difficulty of learning and correctly using frameworks' APIs. This paper introduces a learning-based approach…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: This paper has been accepted in the main track of 2021 IEEE International Conference on Software Architecture (ICSA 2021) and is going to be published. Please feel free to cite it

  16. arXiv:1710.04132  [pdf]

    cs.CY cs.PL

    Aprendendo Programacao Orientada a Objetos com uma Abordagem Ludica Baseada em Greenfoot e Robocode

    Authors: Cleison Simoes Santos, Allen Hichard Marques Santos, Suenny Mascarenhas Souza, Roberto Almeida Bittencourt

    Abstract: One of the major challenges in undergraduate computing programs is the learning of object-oriented programming (OOP). This paradigm has a variety of concepts with an abstraction level usually high for most beginners, even the ones who already code in an imperative language. Furthermore, transitioning from imperative programming to OOP is a complex issue, with various inappropriate side effects. A sig…

    Submitted 16 October, 2017; v1 submitted 7 October, 2017; originally announced October 2017.

    Comments: 10 pages, 3 figures, 2 tables, COBENGE 2015 - XLIII Congresso Brasileiro de Educação em Engenharia, in Portuguese

  17. A Large-Scale Study on the Usage of Testing Patterns that Address Maintainability Attributes (Patterns for Ease of Modification, Diagnoses, and Comprehension)

    Authors: Danielle Gonzalez, Joanna C. S. Santos, Andrew Popovich, Mehdi Mirakhorli, Mei Nagappan

    Abstract: Test case maintainability is an important concern, especially in open source and distributed development environments where projects typically have high contributor turnover with varying backgrounds and experience, and where code ownership changes often. Similar to design patterns, patterns for unit testing promote maintainability quality attributes such as ease of diagnoses, modifiability, and co…

    Submitted 26 April, 2017; originally announced April 2017.

    Comments: Mining Software Repositories (MSR) 2017 Research Track

    Journal ref: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), Buenos Aires, 2017, pp. 391-401