Search | arXiv e-print repository

AS400-DET: Detection using Deep Learning Model for IBM i (AS/400)

Authors: Thanh Tran, Son T. Luu, Quan Bui, Shoshin Nomura

Abstract: This paper proposes a method for automatic GUI component detection for the IBM i system (formerly and still more commonly known as AS/400). We introduce a human-annotated dataset consisting of 1,050 system screen images, in which 381 images are screenshots of IBM i system screens in Japanese. Each image contains multiple components, including text labels, text boxes, options, tables, instructions, keyboards, and command lines. We then develop a detection system based on state-of-the-art deep learning models and evaluate different approaches using our dataset. The experimental results demonstrate the effectiveness of our dataset in constructing a system for component detection from GUI screens. By automatically detecting GUI components from the screen, AS400-DET has the potential to perform automated testing on systems that operate via GUI screens.

Submitted 15 June, 2025; originally announced June 2025.

Comments: Accepted at the IVSP 2025 conference

arXiv:2506.02529 [pdf, ps, other]

Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs

Authors: Nguyen-Khang Le, Quan Minh Bui, Minh Ngoc Nguyen, Hiep Nguyen, Trung Vo, Son T. Luu, Shoshin Nomura, Minh Le Nguyen

Abstract: Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces. Recent advances in large language models (LLMs) have shown promise in automating complex tasks, but limitations persist in handling dynamic navigation flows and complex form interactions. This paper presents an automated system for generating test cases for two key aspects of web application testing: site navigation and form filling. For site navigation, the system employs screen transition graphs and LLMs to model navigation flows and generate test scenarios. For form filling, it uses state graphs to handle conditional forms and automates Selenium script generation. Key contributions include: (1) a novel integration of graph structures and LLMs for site navigation testing, (2) a state graph-based approach for automating form-filling test cases, and (3) a comprehensive dataset for evaluating form-interaction testing. Experimental results demonstrate the system's effectiveness in improving test coverage and robustness, advancing the state of web application testing.
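The abstract above does not say how the screen transition graph is represented or traversed. As a rough illustration only, the sketch below assumes the graph is an adjacency map from each screen to its outgoing (action, next screen) pairs and enumerates cycle-free navigation paths that could serve as candidate test scenarios. The names `enumerate_scenarios`, `site`, and the example screens are hypothetical and are not taken from the paper; the LLM step is omitted entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: screen -> list of (action, next_screen) edges.
Graph = Dict[str, List[Tuple[str, str]]]
Step = Tuple[str, str, str]  # (source screen, action, destination screen)


def enumerate_scenarios(graph: Graph, start: str, max_steps: int) -> List[List[Step]]:
    """Return every cycle-free path of 1..max_steps transitions starting at `start`."""
    scenarios: List[List[Step]] = []

    def walk(screen: str, path: List[Step], visited: frozenset) -> None:
        if path:
            scenarios.append(list(path))  # every non-empty prefix is a scenario
        if len(path) == max_steps:
            return
        for action, nxt in graph.get(screen, []):
            if nxt in visited:  # skip screens already on this path
                continue
            path.append((screen, action, nxt))
            walk(nxt, path, visited | {nxt})
            path.pop()

    walk(start, [], frozenset({start}))
    return scenarios


# Made-up example site, for illustration only.
site: Graph = {
    "home": [("click Login", "login"), ("click Products", "products")],
    "login": [("submit credentials", "dashboard")],
    "products": [("click item", "product_detail")],
    "product_detail": [("click Add to cart", "cart")],
}

scenarios = enumerate_scenarios(site, "home", max_steps=2)
for steps in scenarios:
    print(" -> ".join(f"{src} [{action}] {dst}" for src, action, dst in steps))
```

Each enumerated path could then be handed to a script generator or an LLM prompt; that hand-off is outside this sketch.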

Submitted 3 June, 2025; originally announced June 2025.

Comments: Published in the Proceedings of JSAI 2025

ACM Class: I.2.7

arXiv:2505.20247 [pdf]

Translation of Enterprise Architecture Concept to Facilitate Digital Transformation Initiatives in Vietnam: Processes, Mechanisms and Impacts

Authors: Duong Dang, Quang Bui

Abstract: Governments around the world have increasingly adopted digital transformation (DT) initiatives to increase their strategic competitiveness in the global market. To support successful DT, governments have to introduce new governance logics and revise IT strategies to facilitate DT initiatives. In this study, we report a case study of how Enterprise Architecture (EA) concepts were introduced and translated into practices in Vietnamese government agencies over a span of 15 years. This translation process has enabled EA concepts to facilitate various DT initiatives such as e-government and digitalization, to name a few. Our findings suggest two mechanisms in the translation process: a theorization mechanism that generalizes local practices into field-level abstract concepts, making them easier to spread, and a contextualization mechanism that unpacks these concepts into practical, adaptable approaches, aligning EA with adopters' priorities and increasing its chances of dissemination. Furthermore, our findings illustrate how translation happens when the initial concepts are ambiguous and not well understood by adopters. In this situation, there is a need for widespread experiments and sense-making among pioneers before field- and organizational-level translation can occur.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.10860 [pdf, ps, other]

On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Authors: Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.
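The abstract names the two features but does not define them. The sketch below is one plausible reading, stated as an assumption: each routed expert gets a sigmoid affinity from its router score, the top-k affinities are kept and divided by their sum (the "normalized sigmoid" part), and a shared expert is always applied regardless of routing. The exact formulation analyzed in the paper may differ; the function names and the toy scalar experts are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def normalized_sigmoid_gates(scores: Sequence[float], top_k: int) -> List[float]:
    """Sigmoid affinities, keep the top_k largest, renormalize them to sum to 1."""
    affinities = [sigmoid(s) for s in scores]
    chosen = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)[:top_k]
    total = sum(affinities[i] for i in chosen)
    gates = [0.0] * len(affinities)
    for i in chosen:
        gates[i] = affinities[i] / total
    return gates


def moe_output(
    x: float,
    shared_expert: Callable[[float], float],
    routed_experts: Sequence[Callable[[float], float]],
    router_scores: Sequence[float],
    top_k: int,
) -> float:
    """Shared expert output plus the gate-weighted sum of the selected routed experts."""
    gates = normalized_sigmoid_gates(router_scores, top_k)
    y = shared_expert(x)  # assumed: the shared expert bypasses the router
    for gate, expert in zip(gates, routed_experts):
        if gate > 0.0:
            y += gate * expert(x)
    return y


# Toy scalar experts and router scores, for illustration only.
gates = normalized_sigmoid_gates([2.0, -1.0, 0.5], top_k=2)
y = moe_output(
    1.0,
    shared_expert=lambda v: v,
    routed_experts=[lambda v: 2.0 * v, lambda v: -v, lambda v: v + 1.0],
    router_scores=[2.0, -1.0, 0.5],
    top_k=2,
)
print(gates, y)
```

Unlike a softmax over all experts, the sigmoid scores each expert independently, so the normalization over the selected experts is the only step that couples their gate values.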

Submitted 16 May, 2025; originally announced May 2025.

Comments: 100 pages

arXiv:2504.14757 [pdf, other]

SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

Authors: Minh V. T. Pham, Huy N. Phan, Hoang N. Phan, Cuong Le Chi, Tien N. Nguyen, Nghi D. Q. Bui

Abstract: Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially those with verifiable outputs and intermediate reasoning traces, limits progress, particularly for open-source models. In this work, we present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. SWE-Synth leverages LLM agents to simulate debugging workflows, producing not only bug-fix pairs but also test cases and structured repair trajectories. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Experiments show that models trained on SWE-Synth outperform those trained on real-world datasets by 2.3% on SWE-Bench Lite. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.

Submitted 20 April, 2025; originally announced April 2025.

Comments: Work in progress

arXiv:2502.12673 [pdf, other]

ROI-NeRFs: Hi-Fi Visualization of Objects of Interest within a Scene by NeRFs Composition

Authors: Quoc-Anh Bui, Gilles Rougeron, Géraldine Morin, Simone Gasparini

Abstract: Efficient and accurate 3D reconstruction is essential for applications in cultural heritage. This study addresses the challenge of visualizing objects within large-scale scenes at a high level of detail (LOD) using Neural Radiance Fields (NeRFs). The aim is to improve the visual fidelity of chosen objects while maintaining the efficiency of the computations by focusing on details only for relevant content. The proposed ROI-NeRFs framework divides the scene into a Scene NeRF, which represents the overall scene at moderate detail, and multiple ROI NeRFs that focus on user-defined objects of interest. An object-focused camera selection module automatically groups relevant cameras for each NeRF training during the decomposition phase. In the composition phase, a Ray-level Compositional Rendering technique combines information from the Scene NeRF and ROI NeRFs, allowing simultaneous multi-object rendering composition. Quantitative and qualitative experiments conducted on two real-world datasets, including one on a complex eighteenth-century cultural heritage room, demonstrate superior performance compared to baseline methods, improving LOD for object regions and minimizing artifacts without significantly increasing inference time.

Submitted 18 February, 2025; originally announced February 2025.

Comments: 17 pages including appendix, 16 figures, 8 tables

MSC Class: 68U05; 68T45 (Primary) 68T07; 68-04 (Secondary) ACM Class: I.2.10; I.3.3; I.3.5; I.3.7; I.4.5; I.4.6; I.4.8; I.4.10

arXiv:2502.04953 [pdf, other]

A Systematic Literature Review on Automated Exploit and Security Test Generation

Authors: Quang-Cuong Bui, Emanuele Iannone, Maria Camporese, Torge Hinrichs, Catherine Tony, László Tóth, Fabio Palomba, Péter Hegedűs, Fabio Massacci, Riccardo Scandariato

Abstract: The exploit or the Proof of Concept of the vulnerability plays an important role in developing superior vulnerability repair techniques, as it can be used as an oracle to verify the correctness of the patches generated by the tools. However, the vulnerability exploits are often unavailable and require time and expert knowledge to craft. Obtaining them from the exploit generation techniques is another potential solution. The goal of this survey is to aid the researchers and practitioners in understanding the existing techniques for exploit generation through the analysis of their characteristics and their usability in practice. We identify a list of exploit generation techniques from literature and group them into four categories: automated exploit generation, security testing, fuzzing, and other techniques. Most of the techniques focus on the memory-based vulnerabilities in C/C++ programs and web-based injection vulnerabilities in PHP and Java applications. We found only a few studies that publicly provided usable tools associated with their techniques.

Submitted 7 February, 2025; originally announced February 2025.

Comments: This work was partially supported by EU-funded project Sec4AI4Sec (grant no. 101120393)

ACM Class: A.2

arXiv:2502.03365 [pdf, other]

A Match Made in Heaven? Matching Test Cases and Vulnerabilities With the VUTECO Approach

Authors: Emanuele Iannone, Quang-Cuong Bui, Riccardo Scandariato

Abstract: Software vulnerabilities are commonly detected via static analysis, penetration testing, and fuzzing. They can also be found by running unit tests (so-called vulnerability-witnessing tests) that stimulate the security-sensitive behavior with crafted inputs. Developing such tests is difficult and time-consuming; thus, automated data-driven approaches could help developers intercept vulnerabilities earlier. However, training and validating such approaches require a lot of data, which is currently scarce. This paper introduces VUTECO, a deep learning-based approach for collecting instances of vulnerability-witnessing tests from Java repositories. VUTECO carries out two tasks: (1) the "Finding" task to determine whether a test case is security-related, and (2) the "Matching" task to relate a test case to the exact vulnerability it is witnessing. VUTECO successfully addresses the Finding task, achieving perfect precision and 0.83 F0.5 score on validated test cases in VUL4J and returning 102 out of 145 (70%) correct security-related test cases from 244 open-source Java projects. Despite showing sufficiently good performance for the Matching task (i.e., 0.86 precision and 0.68 F0.5 score), VUTECO failed to retrieve any valid match in the wild. Nevertheless, we observed that in almost all of the matches, the test case was still security-related despite being matched to the wrong vulnerability. In the end, VUTECO can help find vulnerability-witnessing tests, though the matching with the right vulnerability is yet to be solved; the findings obtained lay the stepping stone for future research on the matter.
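The abstract reports F0.5 scores without spelling out the metric. For reference, the standard F-beta score is (1 + beta^2) * P * R / (beta^2 * P + R), and beta = 0.5 weights precision more heavily than recall. The sketch below implements that textbook formula; the precision and recall inputs are made-up values chosen to show the weighting, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both inputs are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical inputs: the same pair of values with precision and recall swapped.
high_precision = f_beta(precision=1.0, recall=0.5)
high_recall = f_beta(precision=0.5, recall=1.0)
print(round(high_precision, 3), round(high_recall, 3))
```

With beta = 0.5, the high-precision case scores higher than the high-recall case, which is why F0.5 suits a setting where false positives are costlier than misses.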

Submitted 5 February, 2025; originally announced February 2025.

Comments: This work was partially supported by EU-funded project Sec4AI4Sec (grant no. 101120393)

ACM Class: D.2.5; D.2.7

arXiv:2501.00520 [pdf, other]

Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

Authors: Bao Q. Bui, Tien T. T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham

Abstract: This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX that is tailored to the nuances of lung inflammation caused by distinct agents, providing a valuable resource for the silicosis and pneumonia research community; and 2) a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.

Submitted 31 December, 2024; originally announced January 2025.

arXiv:2412.19606 [pdf, other]

Enhancing Fine-grained Image Classification through Attentive Batch Training

Authors: Duy M. Le, Bao Q. Bui, Anh Tran, Cong Tran, Cuong Pham

Abstract: Fine-grained image classification, which is a challenging task in computer vision, requires precise differentiation among visually similar object categories. In this paper, we propose 1) a novel module called Residual Relationship Attention (RRA) that leverages the relationships between images within each training batch to effectively integrate visual feature vectors of batch images and 2) a novel technique called Relationship Position Encoding (RPE), which encodes the positions of relationships between original images in a batch and effectively preserves the relationship information between images within the batch. Additionally, we design a novel framework, namely Relationship Batch Integration (RBI), which utilizes RRA in conjunction with RPE, allowing the discernment of vital visual features that may remain elusive when examining a singular image representative of a particular class. Through extensive experiments, our proposed method demonstrates significant improvements in the accuracy of different fine-grained classifiers, with an average increase of $(+2.78\%)$ and $(+3.83\%)$ on the CUB200-2011 and Stanford Dog datasets, respectively, while achieving a state-of-the-art result $(95.79\%)$ on the Stanford Dog dataset. Despite not achieving the same level of improvement as in fine-grained image classification, our method still demonstrates its prowess in leveraging general image classification by attaining a state-of-the-art result of $(93.71\%)$ on the Tiny-Imagenet dataset. Furthermore, our method serves as a plug-in refinement module and can be easily integrated into different networks.

Submitted 27 December, 2024; originally announced December 2024.

arXiv:2410.23402 [pdf, other]

VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

Authors: Cuong Chi Le, Hoang-Chau Truong-Vinh, Huy Nhat Phan, Dung Duy Le, Tien N. Nguyen, Nghi D. Q. Bui

Abstract: Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static syntax, they often struggle with dynamic reasoning tasks. We introduce VisualCoder, a simple yet effective approach that enhances code reasoning by integrating multimodal Chain-of-Thought (CoT) reasoning with a visual Control Flow Graph (CFG). By aligning code snippets with their corresponding CFGs, VisualCoder provides deeper insights into execution flows. We address challenges in multimodal CoT integration through a reference mechanism, ensuring consistency between code and its execution path, thereby improving performance in program behavior prediction, error detection, and output generation.

Submitted 9 February, 2025; v1 submitted 30 October, 2024; originally announced October 2024.

Comments: NAACL 2025

arXiv:2410.01999 [pdf, other]

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Authors: Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

Abstract: Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a comprehensive multiple-choice benchmark designed to evaluate the depth of software and code comprehension in LLMs. CodeMMLU includes nearly 20,000 questions spanning diverse domains, including code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks that emphasize code generation, CodeMMLU assesses a model's ability to reason about programs across a wide range of tasks such as code repair, execution reasoning, and fill-in-the-blank challenges. Our extensive evaluation reveals that even state-of-the-art models struggle with CodeMMLU, highlighting significant gaps in comprehension beyond generation. By emphasizing the essential connection between code understanding and effective AI-assisted development, CodeMMLU provides a critical resource for advancing more reliable and capable coding assistants.

Submitted 9 April, 2025; v1 submitted 2 October, 2024; originally announced October 2024.

arXiv:2409.16299 [pdf, other]

HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Authors: Huy Nhat Phan, Tien N. Nguyen, Phong X. Nguyen, Nghi D. Q. Bui

Abstract: Large Language Models (LLMs) have revolutionized software engineering (SE), showcasing remarkable proficiency in various coding tasks. Despite recent advancements that have enabled the creation of autonomous software agents utilizing LLMs for end-to-end development tasks, these systems are typically designed for specific SE functions. We introduce HyperAgent, an innovative generalist multi-agent system designed to tackle a wide range of SE tasks across different programming languages by mimicking the workflows of human developers. HyperAgent features four specialized agents (Planner, Navigator, Code Editor, and Executor) capable of handling the entire lifecycle of SE tasks, from initial planning to final verification. HyperAgent sets new benchmarks in diverse SE tasks, including GitHub issue resolution on the renowned SWE-Bench benchmark, outperforming robust baselines. Furthermore, HyperAgent demonstrates exceptional performance in repository-level code generation (RepoExec) and fault localization and program repair (Defects4J), often surpassing state-of-the-art baselines.

Submitted 5 November, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

Comments: 49 pages

arXiv:2408.04663 [pdf, other]

Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

Authors: Nam Le Hai, Nghi D. Q. Bui

Abstract: Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across multiple languages, but also in achieving robust performance in comment classification by improving comment representation. As a result, it outperforms the STACC baseline by 3% on the NLBSE'24 Tool Competition dataset in terms of average F1-score, while maintaining a comparable inference time for practical use. The source code is publicly available at https://github.com/FSoft-AI4Code/Dopamin.

Submitted 6 August, 2024; originally announced August 2024.

Comments: Accepted at The 3rd Intl. Workshop on NL-based Software Engineering, 2024

arXiv:2408.04660 [pdf, other]

XMainframe: A Large Language Model for Mainframe Modernization

Authors: Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

Abstract: Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.

Submitted 26 August, 2024; v1 submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.02816 [pdf, other]

CodeFlow: Program Behavior Prediction with Dynamic Dependencies Learning

Authors: Cuong Chi Le, Hoang Nhat Phan, Huy Nhat Phan, Tien N. Nguyen, Nghi D. Q. Bui

Abstract: Predicting program behavior without execution is a critical task in software engineering. Existing models often fall short in capturing the dynamic dependencies among program elements. To address this, we present CodeFlow, a novel machine learning-based approach that predicts code coverage and detects runtime errors by learning both static and dynamic dependencies within the code. By using control flow graphs (CFGs), CodeFlow effectively represents all possible execution paths and the statistic relations between different statements, providing a more comprehensive understanding of program behaviors. CodeFlow constructs CFGs to represent possible execution paths and learns vector representations (embeddings) for CFG nodes, capturing static control-flow dependencies. Additionally, it learns dynamic dependencies by leveraging execution traces, which reflect the impacts among statements during execution. This combination enables CodeFlow to accurately predict code coverage and identify runtime errors. Our empirical evaluation demonstrates that CodeFlow significantly improves code coverage prediction accuracy and effectively localizes runtime errors, outperforming state-of-the-art models.

Submitted 9 February, 2025; v1 submitted 5 August, 2024; originally announced August 2024.

Comments: FORGE 2025

arXiv:2406.11927 [pdf, other]

On the Impacts of Contexts on Repository-Level Code Generation

Authors: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

Abstract: CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of leveraging repository-level contexts to generate executable and functionally correct code. We present RepoExec, a novel benchmark designed to evaluate repository-level code generation, with a focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts. Our study examines a controlled scenario where developers specify essential code dependencies (contexts), challenging models to integrate them effectively. Additionally, we introduce an instruction-tuned dataset that enhances CodeLLMs' ability to leverage dependencies, along with a new metric, Dependency Invocation Rate (DIR), to quantify context utilization. Experimental results reveal that while pretrained LLMs demonstrate superior performance in terms of correctness, instruction-tuned models excel in context utilization and debugging capabilities. RepoExec offers a comprehensive evaluation framework for assessing code functionality and alignment with developer intent, thereby advancing the development of more reliable CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.

Submitted 9 February, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted to NAACL 2025

arXiv:2406.11912 [pdf, other]

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

Authors: Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, Nghi D. Q. Bui

Abstract: Software agents have emerged as promising tools for addressing complex software engineering tasks. Existing works, however, frequently oversimplify software development workflows, even though such workflows are typically more complex in the real world. We therefore propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. The system assigns specific AM roles, such as Product Manager, Developer, and Tester, to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints and developing software incrementally across them. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing benchmarks, like ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments.

Submitted 14 July, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: Work in progress

arXiv:2404.04408 [pdf, other]

A novel section-section potential for short-range interactions between plane beams

Authors: A. Borković, M. H. Gfrerer, R. A. Sauer, B. Marussig, T. Q. Bui

Abstract: We derive a novel formulation for the interaction potential between deformable fibers due to short-range fields arising from intermolecular forces. The formulation improves the existing section-section interaction potential law for in-plane beams by considering an offset between interacting cross sections. The new law is asymptotically consistent, which is particularly beneficial for computationally demanding scenarios involving short-range interactions like van der Waals and steric forces. The formulation is implemented within a framework of rotation-free Bernoulli-Euler beams utilizing the isogeometric paradigm. The improved accuracy of the novel law is confirmed through thorough numerical studies. We apply the developed formulation to investigate the complex behavior observed during peeling and pull-off of elastic fibers interacting via the Lennard-Jones potential.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2403.14592 [pdf, other]

Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals

Authors: Khanh Nghiem, Anh Minh Nguyen, Nghi D. Q. Bui

Abstract: As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.12923 [pdf, other]

doi 10.1287/ijoc.2024.0686

Solving Combinatorial Pricing Problems using Embedded Dynamic Programming Models

Authors: Quang Minh Bui, Margarida Carvalho, José Neto

Abstract: The combinatorial pricing problem (CPP) is a bilevel problem in which the leader maximizes their revenue by imposing tolls on certain items that they can control. Based on the tolls set by the leader, the follower selects a subset of items corresponding to an optimal solution of a combinatorial optimization problem. To accomplish the leader's goal, the tolls need to be sufficiently low to discourage the follower from choosing the items offered by the competitors. In this paper, we derive a single-level reformulation for the CPP by rewriting the follower's problem as a longest path problem using a dynamic programming model, and then taking its dual and applying strong duality. We proceed to solve the reformulation in a dynamic fashion with a cutting plane method. We apply this methodology to two distinct dynamic programming models, namely, a novel formulation designated as selection diagram and the well-known decision diagram. We also produce numerical results to evaluate their performance across three different specializations of the CPP and a closely related problem, the knapsack interdiction problem. Our results showcase the potential of the two proposed reformulations over the natural value function approach, expanding the set of tools to solve combinatorial bilevel programs.

Submitted 29 March, 2025; v1 submitted 19 March, 2024; originally announced March 2024.

MSC Class: 90C46; 90C27; 90C39

Journal ref: INFORMS Journal on Computing, 2025

arXiv:2403.06095 [pdf, other]

RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion

Authors: Huy N. Phan, Hoang N. Phan, Tien N. Nguyen, Nghi D. Q. Bui

Abstract: Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages an Expand and Refine retrieval method, including a graph expansion and a link prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.

Submitted 14 August, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

arXiv:2311.03366 [pdf, other]

Functional Overlap Reranking for Neural Code Generation

Authors: Hung Quoc To, Minh Huynh Nguyen, Nghi D. Q. Bui

Abstract: Code Large Language Models (CodeLLMs) have ushered in a new era in code generation advancements. However, selecting the best code solutions from all possible CodeLLM outputs remains a challenge. Previous methods often overlooked the intricate functional similarities and interactions between solution clusters. We introduce SRank, a novel reranking strategy for selecting the best solutions from code generation, focusing on modeling the relationships between clusters of solutions. By quantifying the functional overlap between solution clusters, our approach provides a better ranking strategy for code solutions. Empirical results show that our method achieves remarkable results on the pass@1 score. For instance, on the HumanEval benchmark, we achieve 69.66% in pass@1 with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, surpassing state-of-the-art code generation reranking methods such as CodeT and Coder-Reviewer on the same CodeLLM by a significant margin (approximately 6.1% improvement on average). Even in scenarios with a limited number of sampled solutions and test cases, our approach demonstrates robustness and superiority, marking a new benchmark in code generation reranking. Our implementation can be found at https://github.com/FSoft-AI4Code/SRank-CodeRanker.

Submitted 7 August, 2024; v1 submitted 16 October, 2023; originally announced November 2023.

Comments: ACL 2024, Long Findings

arXiv:2311.00993 [pdf, other]

Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner's Approach

Authors: Xueying Long, Quang Bui, Grady Oktavian, Daniel F. Schmidt, Christoph Bergmeir, Rakshitha Godahewa, Seong Per Lee, Kaifeng Zhao, Paul Condylis

Abstract: The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, we notice important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, we first investigate a two-layer hierarchy and propose a top-down approach that forecasts at an aggregated level, with fewer series and less intermittency, and then disaggregates to obtain the decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. Second, direct training at the lower level with subsamples can also be an alternative way of scaling. Performance of modelling with subsets is evaluated with the main dataset. Apart from a proprietary dataset, the proposed scalable methods are evaluated using the Favorita dataset and the M5 dataset. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.

Submitted 2 November, 2023; originally announced November 2023.

arXiv:2306.06347 [pdf, other]

DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies

Authors: Anh T. V. Dau, Jin L. C. Guo, Nghi D. Q. Bui

Abstract: Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Despite the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current methods rely primarily on heuristic rules. In contrast, this paper presents DocChecker, a tool powered by deep learning. DocChecker is adept at identifying inconsistencies between code and comments, and it can also generate synthetic comments. This capability enables the tool to detect and correct instances where comments do not accurately reflect their corresponding code segments. We demonstrate the effectiveness of DocChecker using the Just-In-Time and CodeXGlue datasets in different settings. In particular, DocChecker achieves a new state-of-the-art result of 72.3% accuracy on the Inconsistency Code-Comment Detection (ICCD) task and 33.64 BLEU-4 on the code summarization task against other Large Language Models (LLMs), even surpassing GPT 3.5 and CodeLlama. DocChecker is accessible for use and evaluation. It can be found on our GitHub https://github.com/FSoft-AI4Code/DocChecker and as an Online Tool http://4.193.50.237:5000/. For a more comprehensive understanding of its functionality, a demonstration video is available on YouTube https://youtu.be/FqnPmd531xw.

Submitted 2 February, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

Journal ref: EACL 2024 - Demonstration track

arXiv:2306.00029 [pdf, other]

CodeTF: One-stop Transformer Library for State-of-the-art Code LLM

Authors: Nghi D. Q. Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, Steven C. H. Hoi

Abstract: Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier to model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, key modules and components, and compare with other related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.

Submitted 31 May, 2023; originally announced June 2023.

Comments: Ongoing work - Draft Preview

arXiv:2305.07922 [pdf, other]

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi

Abstract: Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.

Submitted 20 May, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

Comments: 26 pages, preprint

arXiv:2305.06156 [pdf, other]

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Authors: Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

Abstract: We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.

Submitted 30 October, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at EMNLP 2023, Long Findings

arXiv:2305.01384 [pdf, other]

Class based Influence Functions for Error Detection

Authors: Thang Nguyen-Duc, Hoang Thanh-Tung, Quan Hung Tran, Dang Huu-Tien, Hieu Ngoc Nguyen, Anh T. V. Dau, Nghi D. Q. Bui

Abstract: Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.

Submitted 2 May, 2023; originally announced May 2023.

Comments: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 2023

arXiv:2304.01228 [pdf, other]

Better Language Models of Code through Self-Improvement

Authors: Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen

Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to address this issue by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into state-of-the-art language models, such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks, such as code summarization and code generation in the CodeXGLUE benchmark.

Submitted 9 May, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: Accepted to Findings, ACL 2023

arXiv:2212.10723 [pdf, other]

Predict+Optimize Problem in Renewable Energy Scheduling

Authors: Christoph Bergmeir, Frits de Nijs, Evgenii Genov, Abishek Sriramulu, Mahdi Abolghasemi, Richard Bean, John Betts, Quang Bui, Nam Trong Dinh, Nils Einecke, Rasul Esmaeilbeigi, Scott Ferraro, Priya Galketiya, Robert Glasgow, Rakshitha Godahewa, Yanfei Kang, Steffen Limmer, Luis Magdalena, Pablo Montero-Manso, Daniel Peralta, Yogesh Pipada Sunil Kumar, Alejandro Rosales-Pérez, Julian Ruddick, Akylas Stratigakos, Peter Stuckey, et al. (3 additional authors not shown)

Abstract: Predict+Optimize frameworks integrate forecasting and optimization to address real-world challenges such as renewable energy scheduling, where variability and uncertainty are critical factors. This paper benchmarks solutions from the IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling, focusing on forecasting renewable production and demand and optimizing energy cost. The competition attracted 49 participants in total. The top-ranked method employed stochastic optimization using LightGBM ensembles, and achieved at least a 2% reduction in energy costs compared to deterministic approaches, demonstrating that the most accurate point forecast does not necessarily guarantee the best performance in downstream optimization. The published data and problem setting establish a benchmark for further research into integrated forecasting-optimization methods for energy systems, highlighting the importance of considering forecast uncertainty in optimization models to achieve cost-effective and reliable energy management. The novelty of this work lies in its comprehensive evaluation of Predict+Optimize methodologies applied to a real-world renewable energy scheduling problem, providing insights into the scalability, generalizability, and effectiveness of the proposed solutions. Potential applications extend beyond energy systems to any domain requiring integrated forecasting and optimization, such as supply chain management, transportation planning, and financial portfolio optimization.

Submitted 14 April, 2025; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2211.14875 [pdf, other]

Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5

Authors: Nghi D. Q. Bui, Yue Wang, Steven Hoi

Abstract: Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have been proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus on only one of these tasks or approach them in a stage-wise manner, ignoring the mutual benefits between them. In this work, we propose a novel unified Detect-Localize-Repair framework based on a pretrained programming language model CodeT5 to seamlessly address these tasks, named CodeT5-DLR. Specifically, we propose three objectives to adapt the generic CodeT5 for debugging: a bug detection objective to determine whether a given code snippet is buggy or not, a bug localization objective to identify the buggy lines, and a program repair objective to translate the buggy code to its fixed version. We evaluate it on each of these tasks and their combined setting on two newly collected line-level debugging datasets in Java and Python. Extensive results show that our model significantly outperforms existing baselines from both NLP and software engineering domains.

Submitted 22 December, 2022; v1 submitted 27 November, 2022; originally announced November 2022.

Comments: Accepted to EMNLP 2022 Findings Track

arXiv:2208.06202 [pdf, other]

Image Translation Based Nuclei Segmentation for Immunohistochemistry Images

Authors: Roger Trullo, Quoc-Anh Bui, Qi Tang, Reza Olfati-Saber

Abstract: Numerous deep learning based methods have been developed for nuclei segmentation for H&E images and have achieved close to human performance. However, direct application of such methods to another modality of images, such as Immunohistochemistry (IHC) images, may not achieve satisfactory performance. Thus, we developed a Generative Adversarial Network (GAN) based approach to translate an IHC image to an H&E image while preserving nuclei location and morphology and then apply pre-trained nuclei segmentation models to the virtual H&E image. Using two public IHC image datasets, we demonstrated that the proposed methods work better than several baseline methods, including direct application of state-of-the-art nuclei segmentation methods such as Cellpose and HoVer-Net, trained on H&E, and a generative method, DeepLIIF.

Submitted 12 August, 2022; originally announced August 2022.

arXiv:2205.15479 [pdf, other]

HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations

Authors: Minh Huynh Nguyen, Nghi D. Q. Bui, Truong Son Hy, Long Tran-Thanh, Tien N. Nguyen

Abstract: We propose a novel method for code summarization utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs effectively capture essential code features at lexical, syntactic, and semantic levels by abstracting coarse-grained code elements and incorporating fine-grained program elements in a hierarchical structure. Our HierarchyNet method processes each layer of the HCR separately through a unique combination of the Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. This approach preserves dependencies between code elements and captures relations through a novel Hierarchical-Aware Cross Attention layer. Our method surpasses current state-of-the-art techniques, such as PA-Former, CAST, and NeuralCodeSum.

Submitted 9 May, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

arXiv:2205.13022 [pdf, ps, other]

doi 10.1145/3551349.3561168

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Authors: Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

Abstract: Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.

Submitted 2 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: The 37th IEEE/ACM International Conference on Automated Software Engineering

arXiv:2203.10233 [pdf, other]

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

Authors: Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong, Han-Seok Seo, Son Lam Phung, Xin Li, Khoa Luu

Abstract: Human action recognition has recently become one of the popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions in the task of video action recognition with competitive results. However, these methods have suffered some fundamental limitations such as lack of robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition. The method takes a simple but novel perspective of Transformer-based approach to understand the right order of sequence actions. Therefore, the contributions of this work are three-fold. Firstly, we introduce the problem of ordered temporal learning issues to the action recognition problem. Secondly, a new Directed Attention mechanism is introduced to understand and provide attentions to human actions in the right order. Thirdly, we introduce the conditional dependency in action sequence modeling that includes orders and classes. The proposed approach consistently achieves the state-of-the-art (SOTA) results compared with the recent action recognition methods, on three standard large-scale benchmarks, i.e. Jester, Kinetics-400 and Something-Something-V2.

Submitted 18 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

arXiv:2112.11226

Energy-bounded Learning for Robust Models of Code

Authors: Nghi D. Q. Bui, Yijun Yu

Abstract: In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed; however, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance the robustness, existing approaches focus on recognizing adversarial samples rather than on the valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment the in-distribution datasets with out-of-distribution samples such that, when trained together, they will enhance the model's robustness. We propose the use of an energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial samples detection, our evaluation results demonstrate a greater robustness for existing source code models to become more accurate at recognizing OOD data while being more resistant to adversarial attacks at the same time. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN.

Submitted 9 May, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

Comments: There are some flaws in our experiments, we would like to fix it and publish a fixed version again in the very near future
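
As a rough illustration of the scoring idea described in the abstract above (higher score for in-distribution inputs, lower for out-of-distribution ones), the sketch below uses the log-sum-exp energy score that is common in energy-based OOD detection. The abstract does not state the formula, so this choice, the temperature parameter, the threshold, and the example logits are all assumptions made for illustration, not the authors' implementation.

```python
import math


def energy_score(logits, temperature=1.0):
    """Negative free energy T * log(sum_i exp(logit_i / T)).

    Assumed scoring rule: a larger value suggests a more
    in-distribution-like input.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    return temperature * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))


def is_out_of_distribution(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its score falls below a chosen threshold."""
    return energy_score(logits, temperature) < threshold


# Hypothetical classifier outputs: one confident, one nearly flat.
confident_logits = [9.0, 0.5, -1.0]
flat_logits = [0.3, 0.2, 0.1]
```

In this sketch the threshold would have to be tuned on held-out in-distribution data; the value is not something the listing specifies.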

arXiv:2108.06185 [pdf]

CNN-based Two-Stage Parking Slot Detection Using Region-Specific Multi-Scale Feature Extraction

Authors: Quang Huy Bui, Jae Kyu Suhr

Abstract: Autonomous parking systems start with the detection of available parking slots. Parking slot detection performance has been dramatically improved by deep learning techniques. Deep learning-based object detection methods can be categorized into one-stage and two-stage approaches. Although it is well-known that the two-stage approach outperforms the one-stage approach in general object detection, they have performed similarly in parking slot detection so far. We consider this is because the two-stage approach has not yet been adequately specialized for parking slot detection. Thus, this paper proposes a highly specialized two-stage parking slot detector that uses region-specific multi-scale feature extraction. In the first stage, the proposed method finds the entrance of the parking slot as a region proposal by estimating its center, length, and orientation. The second stage of this method designates specific regions that most contain the desired information and extracts features from them. That is, features for the location and orientation are separately extracted from only the specific regions that most contain the locational and orientational information. In addition, multi-resolution feature maps are utilized to increase both positioning and classification accuracies. A high-resolution feature map is used to extract detailed information (location and orientation), while another low-resolution feature map is used to extract semantic information (type and occupancy). In experiments, the proposed method was quantitatively evaluated with two large-scale public parking slot detection datasets and outperformed previous methods, including both one-stage and two-stage approaches.

Submitted 13 August, 2021; originally announced August 2021.

arXiv:2106.13405 [pdf, other]

JNLP Team: Deep Learning Approaches for Legal Processing Tasks in COLIEE 2021

Authors: Ha-Thanh Nguyen, Phuong Minh Nguyen, Thi-Hai-Yen Vuong, Quan Minh Bui, Chau Minh Nguyen, Binh Tran Dang, Vu Tran, Minh Le Nguyen, Ken Satoh

Abstract: COLIEE is an annual competition in automatic computerized legal text processing. Automatic legal document processing is an ambitious goal, and the structure and semantics of the law are often far more complex than everyday language. In this article, we survey and report our methods and experimental results in using deep learning in legal document processing. The results show the difficulties as well as potentials in this family of approaches.

Submitted 7 September, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: Also published in COLIEE 2021's proceeding

arXiv:2106.13403 [pdf, other]

ParaLaw Nets -- Cross-lingual Sentence-level Pretraining for Legal Text Processing

Authors: Ha-Thanh Nguyen, Vu Tran, Phuong Minh Nguyen, Thi-Hai-Yen Vuong, Quan Minh Bui, Chau Minh Nguyen, Binh Tran Dang, Minh Le Nguyen, Ken Satoh

Abstract: Ambiguity is a characteristic of natural language, which makes the expression of ideas flexible. However, in a domain that requires accurate statements, it becomes a barrier. Specifically, a single word can have many meanings and multiple words can have the same meaning. When translating a text into a foreign language, the translator needs to determine the exact meaning of each element in the original sentence to produce the correct translation sentence. From that observation, in this paper, we propose ParaLaw Nets, a pretrained model family using sentence-level cross-lingual information to reduce ambiguity and increase the performance in legal text processing. This approach achieved the best result in the Question Answering task of COLIEE-2021.

Submitted 24 June, 2021; originally announced June 2021.

Comments: Also published in COLIEE 2021's Proceeding

arXiv:2106.03887 [pdf, ps, other]

doi 10.1287/ijoc.2022.1198

A Catalog of Formulations for the Network Pricing Problem

Authors: Quang Minh Bui, Bernard Gendron, Margarida Carvalho

Abstract: We study the network pricing problem where the leader maximizes their revenue by determining the optimal amounts of tolls to charge on a set of arcs, under the assumption that the followers will react rationally and choose the shortest paths to travel. Many distinct single-level reformulations to this bilevel optimization program have been proposed; however, their relationship has not been established. In this paper, we aim to build a connection between those reformulations and explore the combination of the path representation with various modeling options, allowing us to generate 12 different reformulations of the problem. Moreover, we propose a new path enumeration scheme, path-based preprocessing, and hybrid framework to further improve performance and robustness when solving the final model. We provide numerical results, comparing all the derived reformulations and confirming the efficiency of the novel dimensionality reduction procedures.

Submitted 7 June, 2021; originally announced June 2021.

Comments: 35 pages, 7 figures

MSC Class: 90C35; ACM Class: G.1.6

Journal ref: INFORMS Journal on Computing 2022 34:5, 2658-2674

arXiv:2101.11649 [pdf, other]

doi 10.1016/j.cma.2021.114111

Multigrid reduction preconditioning framework for coupled processes in porous and fractured media

Authors: Quan M. Bui, Francois P. Hamon, Nicola Castelletto, Daniel Osei-Kuffuor, Randolph R. Settgast, Joshua A. White

Abstract: Many subsurface engineering applications involve tight-coupling between fluid flow, solid deformation, fracturing, and similar processes. To better understand the complex interplay of different governing equations, and therefore design efficient and safe operations, numerical simulations are widely used. Given the relatively long time-scales of interest, fully-implicit time-stepping schemes are often necessary to avoid time-step stability restrictions. A major computational bottleneck for these methods, however, is the linear solver. These systems are extremely large and ill-conditioned. Because of the wide range of processes and couplings that may be involved--e.g. formation and propagation of fractures, deformation of the solid porous medium, viscous flow of one or more fluids in the pores and fractures, complicated well sources and sinks, etc.--it is difficult to develop general-purpose but scalable linear solver frameworks. This challenge is further aggravated by the range of different discretization schemes that may be adopted, which have a direct impact on the linear system structure. To address this obstacle, we describe a flexible framework based on multigrid reduction that can produce purely algebraic preconditioners for a wide spectrum of relevant physics and discretizations. We demonstrate its broad applicability by constructing scalable preconditioners for several problems, notably: a hybrid discretization of single-phase flow, compositional multiphase flow with complex wells, and hydraulic fracturing simulations. Extension to other systems can be handled quite naturally. We demonstrate the efficiency and scalability of the resulting solvers through numerical examples of difficult, field-scale problems.

Submitted 30 July, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

MSC Class: 65Z05; 65F08; 65F50

arXiv:2012.07023 [pdf, other]

InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

Abstract: Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build a source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder on a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or reused it under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, higher performance results are achieved using our pre-trained InferCode model with a significant margin for most tasks, including those involving different programming languages.

Submitted 15 December, 2020; v1 submitted 13 December, 2020; originally announced December 2020.

Comments: Accepted at ICSE 2021

arXiv:2011.08071 [pdf, other]

JNLP Team: Deep Learning for Legal Processing in COLIEE 2020

Authors: Ha-Thanh Nguyen, Hai-Yen Thi Vuong, Phuong Minh Nguyen, Binh Tran Dang, Quan Minh Bui, Sinh Trong Vu, Chau Minh Nguyen, Vu Tran, Ken Satoh, Minh Le Nguyen

Abstract: We propose deep learning based methods for automatic systems of legal retrieval and legal question-answering in COLIEE 2020. These systems are all characterized by being pre-trained on large amounts of data before being finetuned for the specified tasks. This approach helps to overcome the data scarcity and achieve good performance, thus can be useful for tackling related problems in information retrieval, and decision support in the legal domain. Besides, the approach can be explored to deal with other domain specific problems.

Submitted 4 November, 2020; originally announced November 2020.

Comments: Also be published in JURISIN2020

arXiv:2009.09777 [pdf, other]

TreeCaps: Tree-Based Capsule Networks for Source Code Processing

Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

Abstract: Recently, program learning techniques have been proposed to process source code based on syntactical structures (e.g., Abstract Syntax Trees) and/or semantic information (e.g., Dependency Graphs). Although graphs may be better at capturing various viewpoints of code semantics than trees, constructing graph inputs from code needs static code semantic analysis that may not be accurate and introduces noise during learning. Although syntax trees are precisely defined according to the language grammar and easier to construct and process than graphs, previous tree-based learning techniques have not been able to learn semantic information from trees to achieve better accuracy than graph-based techniques. We propose a new learning technique, named TreeCaps, by fusing together capsule networks with tree-based convolutional neural networks, to achieve learning accuracy higher than existing graph-based techniques while it is based only on trees. TreeCaps introduces novel variable-to-static routing algorithms into the capsule networks to compensate for the loss of previous routing algorithms. Aside from accuracy, we also find that TreeCaps is the most robust to withstand those semantic-preserving program transformations that change code syntax without modifying the semantics. Evaluated on a large number of Java and C/C++ programs, TreeCaps models outperform prior deep learning models of program source code, in terms of both accuracy and robustness for program comprehension tasks such as code functionality classification and function name prediction.

Submitted 14 December, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: Accepted at AAAI 2021

arXiv:2009.02731 [pdf, other]

doi 10.1145/3404835.3462840

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

Abstract: We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representations of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.

Submitted 23 May, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

Comments: Accepted at SIGIR 2021

arXiv:2008.01566 [pdf, other]

doi 10.1016/j.infsof.2021.106552

On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

Authors: Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour

Abstract: With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies, the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.

Submitted 18 March, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: Information and Software Technology, IST Journal 2021, Elsevier. Related to arXiv:2004.07313

arXiv:1910.12306 [pdf, ps, other]

TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing

Authors: Vinoj Jayasundara, Nghi Duy Quoc Bui, Lingxiao Jiang, David Lo

Abstract: Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs.

Submitted 27 October, 2019; originally announced October 2019.

Comments: in NeurIPS Workshop on ML for Systems, 2019

arXiv:1906.03835 [pdf, other]

SAR: Learning Cross-Language API Mappings with Little Knowledge

Authors: Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

Abstract: To save manual effort, developers often translate programs from one programming language to another, instead of implementing them from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, all these approaches still require a large amount of manual effort in preparing parallel program corpora, ranging from pairs of APIs to manually identified code in different languages that is considered functionally equivalent. To minimize the manual effort in identifying parallel program corpora and API mappings, this paper aims at an automated approach to map APIs across languages with much less a priori knowledge than other existing approaches need. The approach is based on a realization of the notion of domain adaptation combined with code embedding, which can better align two vector spaces: taking as input large sets of programs, our approach first generates numeric vector representations of the programs, especially the APIs used in each language, and it adapts generative adversarial networks (GAN) to align the vectors from the spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from optional API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross-language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces.

Submitted 10 June, 2019; originally announced June 2019.

Comments: Accepted at the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)
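
The final step mentioned in the abstract above, identifying mappings via nearest-neighbor queries in aligned vector spaces, can be sketched as follows. This is a minimal illustration only: it assumes the alignment has already been done, uses cosine similarity as the distance (the abstract does not name a metric), and the API names, the Java/C# language pair, and the three-dimensional vectors are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_api(query_vec, target_embeddings):
    """Return the target-language API whose vector is closest to query_vec."""
    return max(target_embeddings, key=lambda name: cosine(query_vec, target_embeddings[name]))


# Hypothetical embeddings, assumed to be already aligned into one space.
source_apis = {
    "java.io.File.exists": [0.90, 0.10, 0.00],
    "java.util.List.add": [0.00, 0.20, 0.95],
}
target_apis = {
    "System.IO.File.Exists": [0.88, 0.15, 0.05],
    "System.Collections.Generic.List.Add": [0.05, 0.10, 0.90],
}

# One nearest-neighbor query per source API yields the candidate mapping.
mapping = {name: nearest_api(vec, target_apis) for name, vec in source_apis.items()}
```

A brute-force scan like this is only practical for small vocabularies; an approximate nearest-neighbor index would be the usual substitute at scale.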

arXiv:1804.09322 [pdf]

Robust Anomaly-Based Ship Proposals Detection from Pan-sharpened High-Resolution Satellite Image

Authors: Viet Hung Luu, Nguyen Hoang Hoa Luong, Quang Hung Bui, Thi Nhat Thanh Nguyen

Abstract: Pre-screening of ship proposals is now employed by top ship detectors to avoid an exhaustive search across the image. In very high resolution (VHR) optical images, ships appear as clusters of abnormally bright pixels in open sea clutter (a noise-like background). Anomaly-based detectors utilizing Panchromatic (PAN) data have been widely used to detect ships; however, they still face two main drawbacks: 1) the detection rate tends to be low, particularly when a ship has low contrast, and 2) these models require substantial manual configuration to select a threshold value that best separates ships from the sea surface background. This paper further investigates the anomaly-based model to solve those issues. First, pan-sharpened Multi Spectral (MS) data is incorporated together with PAN to enhance ship discrimination. Second, we propose an improved anomaly-based model combining both global intensity anomaly and local texture anomaly maps. To handle noise arising from sea clutter and from the pan-sharpening process, a texture abnormality suppression term based on quantization theory is introduced. Experimental results on VNREDSat-1 VHR optical satellite images suggest that the pan-sharpened near-infrared (P-NIR) band can improve discrimination of ships from surrounding waters. Compared to state-of-the-art anomaly-based detectors, our proposed anomaly-based model on the combination of PAN and P-NIR data not only achieves the highest ship detection recall rate (91.14% and 45.9% on the high-contrast and low-contrast datasets, respectively) but is also robust to different automatic threshold selection techniques.

Submitted 24 April, 2018; originally announced April 2018.
