-
Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models
Authors:
Harvey Dam,
Jonas Knochelmann,
Vinu Joseph,
Ganesh Gopalakrishnan
Abstract:
We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by the specific token sequence of a token marking the beginning of the chain-of-thought (CoT) block (<think>) followed by a double newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think> and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
Submitted 28 May, 2025;
originally announced May 2025.
-
Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite Differences
Authors:
Harvey Dam,
Tripti Agarwal,
Ganesh Gopalakrishnan
Abstract:
Preserving critical topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), a novel loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the mathematical formulation, complexity analysis, and practical implementation of DSL, comparing its behavior to its non-differentiable counterpart and to other topological measures. Experiments on one-, two-, and three-dimensional data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. Moreover, DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling its use in gradient-based optimization frameworks.
Submitted 8 May, 2025; v1 submitted 5 April, 2025;
originally announced April 2025.
-
Synergistic Fusion of Multi-Source Knowledge via Evidence Theory for High-Entropy Alloy Discovery
Authors:
Minh-Quyet Ha,
Dinh-Khiet Le,
Duc-Anh Dao,
Tien-Sinh Vu,
Duong-Nguyen Nguyen,
Viet-Cuong Nguyen,
Hiori Kino,
Van-Nam Huynh,
Hieu-Chi Dam
Abstract:
Discovering novel high-entropy alloys (HEAs) with desirable properties is challenging due to the vast compositional space and complex phase formation mechanisms. Efficient exploration of this space requires a strategic approach that integrates heterogeneous knowledge sources. Here, we propose a framework that systematically combines knowledge extracted from computational material datasets with domain knowledge distilled from scientific literature using large language models (LLMs). A central feature of this approach is the explicit consideration of element substitutability, identifying chemically similar elements that can be interchanged to potentially stabilize desired HEAs. Dempster-Shafer theory, a mathematical framework for reasoning under uncertainty, is employed to model and combine substitutabilities based on aggregated evidence from multiple sources. The framework predicts the phase stability of candidate HEA compositions and is systematically evaluated on both quaternary alloy systems, demonstrating superior performance compared to baseline machine learning models and methods reliant on single-source evidence in cross-validation experiments. By leveraging multi-source knowledge, the framework retains robust predictive power even when key elements are absent from the training data, underscoring its potential for knowledge transfer and extrapolation. Furthermore, the enhanced interpretability of the methodology offers insights into the fundamental factors governing HEA formation. Overall, this work provides a promising strategy for accelerating HEA discovery by integrating computational and textual knowledge sources, enabling efficient exploration of vast compositional spaces with improved generalization and interpretability.
Submitted 20 February, 2025;
originally announced February 2025.
-
REM: A Scalable Reinforced Multi-Expert Framework for Multiplex Influence Maximization
Authors:
Huyen Nguyen,
Hieu Dam,
Nguyen Do,
Cong Tran,
Cuong Pham
Abstract:
On online social platforms, identifying influential seed users to maximize influence spread is crucial, as it can greatly reduce the cost and effort required for information dissemination. While effective, traditional methods for Multiplex Influence Maximization (MIM) have reached their performance limits, prompting the emergence of learning-based approaches. These novel methods aim for better generalization and scalability on larger graphs but face significant challenges, such as (1) inability to handle unknown diffusion patterns and (2) reliance on high-quality training samples. To address these issues, we propose the Reinforced Expert Maximization framework (REM). REM leverages a Propagation Mixture of Experts technique to encode dynamic propagation of large multiplex networks effectively in order to generate enhanced influence propagation. Notably, REM treats a generative model as a policy to autonomously generate different seed sets and learn how to improve them from a Reinforcement Learning perspective. Extensive experiments on several real-world datasets demonstrate that REM surpasses state-of-the-art methods in terms of influence spread, scalability, and inference time in influence maximization tasks.
Submitted 1 January, 2025;
originally announced January 2025.
-
Interactive GDPR-Compliant Privacy Policy Generation for Software Applications
Authors:
Pattaraporn Sangaroonsilp,
Hoa Khanh Dam,
Omar Haggag,
John Grundy
Abstract:
Software applications are designed to assist users in conducting a wide range of tasks or interactions. They have become prevalent and play an integral part in people's lives in this digital era. To use those software applications, users are sometimes requested to provide their personal information. As privacy has become a significant concern and many data protection regulations exist worldwide, software applications must provide users with a privacy policy detailing how their personal information is collected and processed. We propose an approach that generates a comprehensive and compliant privacy policy with respect to the General Data Protection Regulation (GDPR) for diverse software applications. To support this, we first built a library of privacy clauses based on existing privacy policy analysis. We then developed an interactive rule-based system that prompts software developers with a series of questions and uses their answers to generate a customised privacy policy for a given software application. We evaluated privacy policies generated by our approach in terms of readability, completeness and coverage and compared them to privacy policies generated by three existing privacy policy generators and a Generative AI-based tool. Our evaluation results show that the privacy policy generated by our approach is the most complete and comprehensive.
Submitted 3 October, 2024;
originally announced October 2024.
-
Categorical data clustering: 25 years beyond K-modes
Authors:
Tai Dinh,
Wong Hauchi,
Philippe Fournier-Viger,
Daniil Lisik,
Minh-Quyet Ha,
Hieu-Chi Dam,
Van-Nam Huynh
Abstract:
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
Submitted 24 January, 2025; v1 submitted 30 August, 2024;
originally announced August 2024.
-
What Operations can be Performed Directly on Compressed Arrays, and with What Error?
Authors:
Tripti Agarwal,
Harvey Dam,
Dorra Ben Khalifa,
Matthieu Martel,
P. Sadayappan,
Ganesh Gopalakrishnan
Abstract:
In response to the rapidly escalating costs of computing with large matrices and tensors caused by data movement, several lossy compression methods have been developed to significantly reduce data volumes. Unfortunately, all these methods require the data to be decompressed before further computations are done. In this work, we develop a lossy compressor that allows a dozen fairly fundamental operations directly on compressed data while offering good compression ratios and modest errors. We implement a new compressor, PyBlaz, based on the familiar GPU-powered PyTorch framework, and evaluate it on three non-trivial applications, choosing different number systems for internal representation. Our results demonstrate that the compressed-domain operations scale well with problem size while incurring errors well within acceptable limits. To the best of our knowledge, this is the first such lossy compressor that supports compressed-domain operations while achieving acceptable performance as well as error.
Submitted 17 June, 2024;
originally announced June 2024.
-
Density Functional Theory Calculations of the thermochemistry of the dehydration of 2-propanol
Authors:
Eugene Stephane Mananga,
Aissata Diop,
Paulin Dongomale,
Fambougouri Diane,
Hubertus van Dam
Abstract:
Electronic structure theory provides a foundation for understanding chemical transformations and processes in complex chemical environments. Our work is focused on the NWChemEx project, which has selected two interrelated science challenges that address the production of advanced biomass-derived fuels and other value-added chemical compounds. One of these is the dehydration of 2-propanol over a zeolite catalyst. Aqueous phase dehydration of 2-propanol was investigated using density functional theory (DFT) calculations. We considered and analyzed the thermochemistry of the dehydration of 2-propanol using NWChem calculations while the NWChemEx code is still under development. Realistically modeling the reaction in this study requires simulations using extended atomistic models. We validated our computational models by comparing the predicted outcomes for 2-propanol dehydration with the calculated results from 1-propanol dehydration studies. We used first-principles DFT calculations to investigate aqueous phase dehydration of 2-propanol, examined the enthalpy of the 2-propanol reaction, and computed the energy for geometry optimization for increasingly better basis sets: cc-pVDZ, cc-pVTZ, cc-pVQZ, cc-pV5Z, and cc-pV6Z. The various transition states and minima along the reaction pathway are critical to inform the NWChemEx science challenge calculations. In this work, we established how the accuracy of the calculations depends on the basis sets, and we determined what basis sets are needed to achieve sufficiently accurate results. We also calculated the reaction free energy as a function of temperature as a thermodynamic parameter. We found that at low temperature the reaction is thermodynamically unfavorable. Nevertheless, dehydrating 2-propanol increases entropy, underscoring the need for high temperatures to facilitate the reaction.
Submitted 18 April, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
An Evaluation of Real-time Adaptive Sampling Change Point Detection Algorithm using KCUSUM
Authors:
Vijayalakshmi Saravanan,
Perry Siehien,
Shinjae Yoo,
Hubertus Van Dam,
Thomas Flynn,
Christopher Kelly,
Khaled Z Ibrahim
Abstract:
Detecting abrupt changes in real-time data streams from scientific simulations is a challenging task that demands accurate and efficient algorithms. Identifying change points in a live data stream involves continuous scrutiny of incoming observations for deviations in their statistical characteristics, particularly in high-volume data scenarios. Maintaining a balance between detecting sudden changes and minimizing false alarms is vital. Many existing algorithms for this purpose rely on known probability distributions, limiting their feasibility. In this study, we introduce the Kernel-based Cumulative Sum (KCUSUM) algorithm, a non-parametric extension of the traditional Cumulative Sum (CUSUM) method, which has gained prominence for its efficacy in online change point detection under less restrictive conditions. KCUSUM distinguishes itself by comparing incoming samples directly with reference samples and computing a statistic grounded in the Maximum Mean Discrepancy (MMD) non-parametric framework. This approach extends KCUSUM's applicability to scenarios where only reference samples are available, such as atomic trajectories of proteins in vacuum, facilitating the detection of deviations from the reference sample without prior knowledge of the data's underlying distribution. Furthermore, by harnessing MMD's inherent random-walk structure, we can theoretically analyze KCUSUM's performance across various use cases, including metrics like expected delay and mean runtime to false alarms. Finally, we discuss real-world use cases from scientific simulations such as NWChem CODAR and protein folding data, demonstrating KCUSUM's practical effectiveness in online change point detection.
Submitted 4 April, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications
Authors:
Bo Fang,
Xinyi Li,
Harvey Dam,
Cheng Tan,
Siva Kumar Sastry Hari,
Timothy Tsai,
Ignacio Laguna,
Dingwen Tao,
Ganesh Gopalakrishnan,
Prashant Nair,
Kevin Barker,
Ang Li
Abstract:
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet this demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness along with significant improvements in performance, area, and memory footprint. While promising, the effect of mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into the mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy if the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
Submitted 9 November, 2023;
originally announced November 2023.
-
Transferable Graph Neural Fingerprint Models for Quick Response to Future Bio-Threats
Authors:
Wei Chen,
Yihui Ren,
Ai Kagawa,
Matthew R. Carbone,
Samuel Yen-Chi Chen,
Xiaohui Qu,
Shinjae Yoo,
Austin Clyde,
Arvind Ramanathan,
Rick L. Stevens,
Hubertus J. J. van Dam,
Deyu Lu
Abstract:
Fast screening of drug molecules based on ligand binding affinity is an important step in the drug discovery pipeline. The graph neural fingerprint is a promising method for developing molecular docking surrogates with high throughput and great fidelity. In this study, we built a COVID-19 drug docking dataset of about 300,000 drug candidates on 23 coronavirus protein targets. With this dataset, we trained graph neural fingerprint docking models for high-throughput virtual COVID-19 drug screening. The graph neural fingerprint models yield high prediction accuracy on docking scores with the mean squared error lower than $0.21$ kcal/mol for most of the docking targets, showing significant improvement over conventional circular fingerprint methods. To make the neural fingerprints transferable to unknown targets, we also propose a transferable graph neural fingerprint method trained on multiple targets. With comparable accuracy to target-specific graph neural fingerprint models, the transferable model exhibits superb training and data efficiency. We highlight that the impact of this study extends beyond the COVID-19 dataset, as our approach for fast virtual ligand screening can be easily adapted and integrated into a general machine learning-accelerated pipeline to battle future bio-threats.
Submitted 14 September, 2023; v1 submitted 17 July, 2023;
originally announced August 2023.
-
Quantum Software Analytics: Opportunities and Challenges
Authors:
Thong Hoang,
Hoa Khanh Dam,
Tingting Bi,
Qinghua Lu,
Zhenchang Xing,
Liming Zhu,
Lam Duc Nguyen,
Shiping Chen
Abstract:
Quantum computing systems depend on the principles of quantum mechanics to perform multiple challenging tasks more efficiently than their classical counterparts. In classical software engineering, the software life cycle is used to document and structure the processes of design, implementation, and maintenance of software applications. It helps stakeholders understand how to build an application. In this paper, we summarize a set of software analytics topics and techniques in the development life cycle that can be leveraged and integrated into quantum software application development. The results of this work can assist researchers and practitioners in better understanding the quantum-specific emerging development activities, challenges, and opportunities in the next generation of quantum software.
Submitted 20 July, 2023;
originally announced July 2023.
-
Understanding the Effect of the Long Tail on Neural Network Compression
Authors:
Harvey Dam,
Vinu Joseph,
Aditya Bhaskara,
Ganesh Gopalakrishnan,
Saurav Muralidharan,
Michael Garland
Abstract:
Network compression is now a mature sub-field of neural network research: over the last decade, significant progress has been made towards reducing the size of models and speeding up inference, while maintaining the classification accuracy. However, many works have observed that focusing on just the overall accuracy can be misguided. E.g., it has been shown that mismatches between the full and compressed models can be biased towards under-represented classes. This raises the important research question, can we achieve network compression while maintaining "semantic equivalence" with the original network? In this work, we study this question in the context of the "long tail" phenomenon in computer vision datasets observed by Feldman, et al. They argue that memorization of certain inputs (appropriately defined) is essential to achieving good generalization. As compression limits the capacity of a network (and hence also its ability to memorize), we study the question: are mismatches between the full and compressed models correlated with the memorized training data? We present positive evidence in this direction for image classification tasks, by considering different base architectures and compression schemes.
Submitted 27 June, 2023; v1 submitted 9 June, 2023;
originally announced June 2023.
-
Workflows Community Summit 2022: A Roadmap Revolution
Authors:
Rafael Ferreira da Silva,
Rosa M. Badia,
Venkat Bala,
Debbie Bard,
Peer-Timo Bremer,
Ian Buckley,
Silvina Caino-Lores,
Kyle Chard,
Carole Goble,
Shantenu Jha,
Daniel S. Katz,
Daniel Laney,
Manish Parashar,
Frederic Suter,
Nick Tyler,
Thomas Uram,
Ilkay Altintas,
Stefan Andersson,
William Arndt,
Juan Aznar,
Jonathan Bader,
Bartosz Balis,
Chris Blanton,
Kelly Rosa Braghetto,
Aharon Brodutch
, et al. (80 additional authors not shown)
Abstract:
Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, the need for processing large-scale datasets produced by instruments at the edge, the intensification of near real-time data processing, support for long-term experiment campaigns, and the emergence of quantum computing as an adjunct to HPC have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge-to-cloud-to-HPC, enable the management of many small-sized files, allow data reduction while ensuring high accuracy, and orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among others. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, as well as enabling the publication of workflows and their associated products according to the FAIR principles. This document reports on discussions and findings from the 2022 international edition of the Workflows Community Summit that took place on November 29 and 30, 2022.
Submitted 31 March, 2023;
originally announced April 2023.
-
Towards Knowledge-Centric Process Mining
Authors:
Asjad Khan,
Arsal Huda,
Aditya Ghose,
Hoa Khanh Dam
Abstract:
Process analytic approaches play a critical role in supporting the practice of business process management and continuous process improvement by leveraging process-related data to identify performance bottlenecks, extracting insights about reducing costs and optimizing the utilization of available resources. Process analytic techniques often have to contend with real-world settings where available logs are noisy or incomplete. In this paper we present an approach that permits process analytics techniques to deliver value in the face of noisy/incomplete event logs. Our approach leverages knowledge graphs to mitigate the effects of noise in event logs while supporting process analysts in understanding variability associated with event logs.
Submitted 25 January, 2023;
originally announced January 2023.
-
Advances in Process Optimization: A Comprehensive Survey of Process Mining, Predictive Process Monitoring, and Process-Aware Recommender Systems
Authors:
Asjad Khan,
Aditya Ghose,
Hoa Dam,
Arsal Syed
Abstract:
Process analytics approaches allow organizations to support the practice of Business Process Management and continuous improvement by leveraging all process-related data to extract knowledge, improve process performance and support decision-making across the organization. Process execution data, once collected, will contain hidden insights and actionable knowledge that are of considerable business value, enabling firms to take a data-driven approach to identifying performance bottlenecks, reducing costs, extracting insights and optimizing the utilization of available resources. Understanding the properties of the 'currently deployed process' (whose execution trace is often available in these logs) is critical to understanding the variation across process instances, identifying the root causes of inefficiencies, and determining where to invest improvement efforts. In this survey, we discuss various methods that allow organizations to understand the behaviour of their processes, monitor currently running process instances, predict the future behaviour of those instances and provide better support for operational decision-making across the organization.
Submitted 23 February, 2025; v1 submitted 24 January, 2023;
originally announced January 2023.
-
TaDeR: A New Task Dependency Recommendation for Project Management Platform
Authors:
Quynh Nguyen,
Dac H. Nguyen,
Son T. Huynh,
Hoa K. Dam,
Binh T. Nguyen
Abstract:
Many startups and companies worldwide have been using project management software and tools to monitor, track and manage their projects. For software projects, the number of tasks from the beginning to the end is quite a large number that sometimes takes a lot of time and effort to search and link the current task to a group of previous ones for further references. This paper proposes an efficient task dependency recommendation algorithm to suggest tasks dependent on a given task that the user has just created. We present an efficient feature engineering step and construct a deep neural network to this aim. We performed extensive experiments on two different large projects (MDLSITE from moodle.org and FLUME from apache.org) to find the best features in 28 combinations of features and the best performance model using two embedding methods (GloVe and FastText). We consider three types of models (GRU, CNN, LSTM) using Accuracy@K, MRR@K, and Recall@K (where K = 1, 2, 3, and 5) and baseline models using traditional methods: TF-IDF with various matching score calculating such as cosine similarity, Euclidean distance, Manhattan distance, and Chebyshev distance. After many experiments, the GloVe Embedding and CNN model reached the best result in our dataset, so we chose this model as our proposed method. In addition, adding the time filter in the post-processing step can significantly improve the recommendation system's performance. The experimental results show that our proposed method can reach 0.2335 in Accuracy@1 and MRR@1 and 0.2011 in Recall@1 of dataset FLUME. With the MDLSITE dataset, we obtained 0.1258 in Accuracy@1 and MRR@1 and 0.1141 in Recall@1. In the top 5, our model reached 0.3040 in Accuracy@5, 0.2563 MRR@5, and 0.2651 Recall@5 in FLUME. In the MDLSITE dataset, our model got 0.5270 Accuracy@5, 0.2689 MRR@5, and 0.2651 Recall@5.
Submitted 12 May, 2022;
originally announced May 2022.
-
Function Decomposition Tree with Causality-First Perspective and Systematic Description of Problems in Materials Informatics
Authors:
Hiori Kino,
Hieu-Chi Dam,
Takashi Miyake,
Riichiro Mizoguchi
Abstract:
As interdisciplinary science flourishes because of materials informatics and additional factors, a systematic way is required for expressing knowledge and facilitating communication between scientists in various fields. A function decomposition tree is such a representation, but domain scientists face difficulty in constructing it. Thus, this study cites the general problems encountered by beginners in generating function decomposition trees and proposes a new function decomposition representation method based on a causality-first perspective to resolve these problems. The causality-first decomposition tree was obtained from a workflow expressed according to the processing sequence. Moreover, we developed a program that performed automatic conversion using the features of the causality-first decomposition trees. The proposed method was applied to materials informatics to demonstrate the systematic representation of expert knowledge and its usefulness.
Submitted 26 April, 2022;
originally announced May 2022.
-
Symbolic analysis meets federated learning to enhance malware identifier
Authors:
Khanh Huu The Dam,
Charles-Henry Bertrand Van Ouytsel,
Axel Legay
Abstract:
Over the past years, manual methods for creating detection rules have become impractical in anti-malware products because the number of malware threats has been growing. Thus, turning to machine learning approaches is a promising way to make malware recognition more efficient. Traditional centralized machine learning requires a large amount of data to train a model with excellent performance. To boost malware detection, the training data might come from various kinds of data sources, such as data on host, network and cloud-based anti-malware components, or even data from different enterprises. To avoid the expense of data collection as well as the leakage of private data, we present a federated learning system to identify malware through behavioural graphs, i.e., system call dependency graphs. It is based on a deep learning model including a graph autoencoder and a multi-classifier module. This model is trained by a secure learning protocol among clients to protect the private data against inference attacks. Using the model to identify malware, we achieve an accuracy of 85% for the homogeneous graph data and 93% for the inhomogeneous graph data.
Submitted 29 April, 2022;
originally announced April 2022.
-
On Privacy Weaknesses and Vulnerabilities in Software Systems
Authors:
Pattaraporn Sangaroonsilp,
Hoa Khanh Dam,
Aditya Ghose
Abstract:
In this digital era, our privacy is under constant threat as our personal data and traceable online/offline activities are frequently collected, processed and transferred by many software applications. Privacy attacks are often formed by exploiting vulnerabilities found in those software applications. The Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) systems are currently the main sources that software engineers rely on for understanding and preventing publicly disclosed software vulnerabilities. However, our study of all 922 weaknesses in the CWE and 156,537 vulnerabilities registered in the CVE to date has found very small coverage of privacy-related vulnerabilities in both systems: only 4.45% in CWE and 0.1% in CVE. These also cover only a small number of the areas of privacy threats that have been raised in existing privacy software engineering research, privacy regulations and frameworks, and relevant reputable organisations. The actionable insights generated from our study led to the introduction of 11 new common privacy weaknesses to supplement the CWE system, making it a source for both security and privacy vulnerabilities.
Submitted 9 February, 2023; v1 submitted 28 December, 2021;
originally announced December 2021.
-
Mining and Classifying Privacy and Data Protection Requirements in Issue Reports
Authors:
Pattaraporn Sangaroonsilp,
Hoa Khanh Dam,
Morakot Choetkiertikul,
Chaiyong Ragkhitwetsagul,
Aditya Ghose
Abstract:
Digital and physical footprints are a trail of user activities collected over the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With the increase of user privacy awareness and advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing data protection and privacy regulations provide key principles in high-level, making it difficult for software engineers to design and implement privacy-aware systems. In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on four well-established personal data protection regulations and privacy frameworks, the General Data Protection Regulation (GDPR), ISO/IEC 29100, Thailand Personal Data Protection Act (Thailand PDPA) and Asia-Pacific Economic Cooperation (APEC) privacy framework. These requirements are extracted, refined and classified into a level that can be used to map with issue reports. We have also performed a study on how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy through mining their issue reports. The paper discusses how the collected issues were classified, and presents the findings and insights generated from our study. Mining and classifying privacy requirements in issue reports can help organisations be aware of their state of compliance by identifying privacy requirements that have not been addressed in their software projects. The taxonomy can also trace back to regulations, standards and frameworks that the software projects have not complied with based on the identified privacy requirements.
Submitted 27 March, 2022; v1 submitted 27 December, 2021;
originally announced December 2021.
-
Human Values in Software Release Planning
Authors:
Davoud Mougouei,
Aditya Ghose,
Hoa Dam,
David Powers
Abstract:
Software products have become an integral part of human lives, and therefore need to account for human values such as privacy, fairness, and equality. Ignoring human values in software development leads to biases and violations of human values: racial biases in recidivism assessment and facial recognition software are well-known examples of such issues. One of the most critical steps in software development is Software Release Planning (SRP), where decisions are made about the presence or absence of the requirements (features) in the software. Such decisions are primarily guided by the economic value of the requirements, ignoring their impacts on a broader range of human values. That may result in ignoring (selecting) requirements that positively (negatively) impact human values, increasing the risk of value breaches in the software. To address this, we have proposed an Integer Programming approach to considering human values in software release planning. In this regard, an Integer Linear Programming (ILP) model has been proposed, that explicitly accounts for human values in finding an "optimal" subset of the requirements. The ILP model exploits the algebraic structure of fuzzy graphs to capture dependencies and conflicts among the values of the requirements.
Submitted 4 February, 2021;
originally announced February 2021.
-
A Taxonomy for Mining and Classifying Privacy Requirements in Issue Reports
Authors:
Pattaraporn Sangaroonsilp,
Hoa Khanh Dam,
Morakot Choetkiertikul,
Chaiyong Ragkhitwetsagul,
Aditya Ghose
Abstract:
Context: Digital and physical trails of user activities are collected over the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With the increase of user privacy awareness and advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing data protection and privacy regulations provide key principles in high-level, making it difficult for software engineers to design and implement privacy-aware systems. Objective: In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on four well-established personal data protection regulations and privacy frameworks, the General Data Protection Regulation (GDPR), ISO/IEC 29100, Thailand Personal Data Protection Act (Thailand PDPA) and Asia-Pacific Economic Cooperation (APEC) privacy framework. Methods: These requirements are extracted, refined and classified (using the goal-based requirements analysis method) into a level that can be used to map with issue reports. We have also performed a study on how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy through mining their issue reports. Results: The paper discusses how the collected issues were classified, and presents the findings and insights generated from our study. Conclusion: Mining and classifying privacy requirements in issue reports can help organisations be aware of their state of compliance by identifying privacy requirements that have not been addressed in their software projects. The taxonomy can also trace back to regulations, standards and frameworks that the software projects have not complied with based on the identified privacy requirements.
Submitted 5 February, 2023; v1 submitted 4 January, 2021;
originally announced January 2021.
-
A Framework for Conditional Statement Technical Debt Identification and Description
Authors:
Abdulaziz Alhefdhi,
Hoa Khanh Dam,
Yusuf Sulistyo Nugroho,
Hideaki Hata,
Takashi Ishio,
Aditya Ghose
Abstract:
Technical Debt occurs when development teams favour short-term operability over long-term stability. Since this places software maintainability at risk, technical debt requires early attention to avoid paying for accumulated interest. Most of the existing work focuses on detecting technical debt using code comments, known as Self-Admitted Technical Debt (SATD). However, there are many cases where technical debt instances are not explicitly acknowledged but deeply hidden in the code. In this paper, we propose a framework that caters for the absence of SATD comments in code. Our Self-Admitted Technical Debt Identification and Description (SATDID) framework determines if technical debt should be self-admitted for an input code fragment. If that is the case, SATDID will automatically generate the appropriate descriptive SATD comment that can be attached with the code. While our approach is applicable in principle to any type of code fragments, we focus in this study on technical debt hidden in conditional statements, one of the most TD-carrying parts of code. We explore and evaluate different implementations of SATDID. The evaluation results demonstrate the applicability and effectiveness of our framework over multiple benchmarks. Comparing with the results from the benchmarks, our approach provides at least 21.35%, 59.36%, 31.78%, and 583.33% improvements in terms of Precision, Recall, F-1, and Bleu-4 scores, respectively. In addition, we conduct human evaluation to the SATD comments generated by SATDID. In 1-5 and 0-5 scales for Acceptability and Understandability, the total means achieved by our approach are 3.128 and 3.172, respectively.
Submitted 13 October, 2022; v1 submitted 22 December, 2020;
originally announced December 2020.
-
Adversarial Patch Generation for Automated Program Repair
Authors:
Abdulaziz Alhefdhi,
Hoa Khanh Dam,
Thanh Le-Cong,
Bach Le,
Aditya Ghose
Abstract:
Automated Program Repair has attracted significant research in recent years, leading to diverse techniques that focus on two main directions: search-based and semantic-based program repair. The former techniques often face challenges due to the vast search space, resulting in difficulties in identifying correct solutions, while the latter approaches are constrained by the capabilities of the underlying semantic analyser, limiting their scalability. In this paper, we propose NEVERMORE, a novel learning-based mechanism inspired by the adversarial nature of bugs and fixes. NEVERMORE is built upon the Generative Adversarial Networks architecture and trained on historical bug fixes to generate repairs that closely mimic human-produced fixes. Our empirical evaluation on 500 real-world bugs demonstrates the effectiveness of NEVERMORE in bug-fixing, generating repairs that match human fixes for 21.2% of the examined bugs. Moreover, we evaluate NEVERMORE on the Defects4J dataset, where our approach generates repairs for 4 bugs that remained unresolved by state-of-the-art baselines. NEVERMORE also fixes another 8 bugs which were only resolved by a subset of these baselines. Finally, we conduct an in-depth analysis of the impact of input and training styles on NEVERMORE's performance, revealing where the chosen style influences the model's bug-fixing capabilities.
Submitted 3 September, 2023; v1 submitted 20 December, 2020;
originally announced December 2020.
-
Scalable HPC and AI Infrastructure for COVID-19 Therapeutics
Authors:
Hyungro Lee,
Andre Merzky,
Li Tan,
Mikhail Titov,
Matteo Turilli,
Dario Alfe,
Agastya Bhati,
Alex Brace,
Austin Clyde,
Peter Coveney,
Heng Ma,
Arvind Ramanathan,
Rick Stevens,
Anda Trifan,
Hubertus Van Dam,
Shunzhou Wan,
Sean Wilkinson,
Shantenu Jha
Abstract:
COVID-19 has claimed more than 1 million lives and resulted in over 40 million infections. There is an urgent need to identify drugs that can inhibit SARS-CoV-2. In response, the DOE recently established the Medical Therapeutics project as part of the National Virtual Biotechnology Laboratory, and tasked it with creating the computational infrastructure and methods necessary to advance therapeutics development. We discuss innovations in computational infrastructure and methods that are accelerating and advancing drug design. Specifically, we describe several methods that integrate artificial intelligence and simulation-based approaches, and the design of computational infrastructure to support these methods at scale. We discuss their implementation and characterize their performance, and highlight science advances that these capabilities have enabled.
Submitted 20 October, 2020;
originally announced October 2020.
-
IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads
Authors:
Aymen Al Saadi,
Dario Alfe,
Yadu Babuji,
Agastya Bhati,
Ben Blaiszik,
Thomas Brettin,
Kyle Chard,
Ryan Chard,
Peter Coveney,
Anda Trifan,
Alex Brace,
Austin Clyde,
Ian Foster,
Tom Gibbs,
Shantenu Jha,
Kristopher Keipert,
Thorsten Kurth,
Dieter Kranzlmüller,
Hyungro Lee,
Zhuozhao Li,
Heng Ma,
Andre Merzky,
Gerald Mathias,
Alexander Partin,
Junqi Yin
, et al. (11 additional authors not shown)
Abstract:
The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol, accelerating the entire process. No single methodological approach can achieve the necessary accuracy with the required efficiency. Here we describe multiple algorithmic innovations to overcome this fundamental limitation, along with the development and deployment of computational infrastructure at scale that integrates multiple artificial intelligence and simulation-based approaches. Three measures of performance are: (i) throughput, the number of ligands per unit time; (ii) scientific performance, the number of effective ligands sampled per unit time; and (iii) peak performance, in flop/s. The capabilities outlined here have been used in production for several months as the workhorse of the computational infrastructure to support the capabilities of the US-DOE National Virtual Biotechnology Laboratory in combination with resources from the EU Centre of Excellence in Computational Biomedicine.
Submitted 13 October, 2020;
originally announced October 2020.
-
Indoor environment data time-series reconstruction using autoencoder neural networks
Authors:
Antonio Liguori,
Romana Markovic,
Thi Thu Ha Dam,
Jérôme Frisch,
Christoph van Treeck,
Francesco Causone
Abstract:
As the number of installed meters in buildings increases, there is a growing number of data time-series that could be used to develop data-driven models to support and optimize building operation. However, building data sets are often characterized by errors and missing values, which are considered, by the recent research, among the main limiting factors on the performance of the proposed models. Motivated by the need to address the problem of missing data in building operation, this work presents a data-driven approach to fill these gaps. In this study, three different autoencoder neural networks are trained to reconstruct missing short-term indoor environment data time-series in a data set collected in an office building in Aachen, Germany. This consisted of a four year-long monitoring campaign in and between the years 2014 and 2017, of 84 different rooms. The models are applicable for different time-series obtained from room automation, such as indoor air temperature, relative humidity and $CO_{2}$ data streams. The results prove that the proposed methods outperform classic numerical approaches and they result in reconstructing the corresponding variables with average RMSEs of 0.42 °C, 1.30 % and 78.41 ppm, respectively.
Submitted 21 January, 2021; v1 submitted 17 September, 2020;
originally announced September 2020.
-
Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
Authors:
Sungsoo Ha,
Wonyong Jeong,
Gyorgy Matyasfalvi,
Cong Xie,
Kevin Huck,
Jong Youl Choi,
Abid Malik,
Li Tang,
Hubertus Van Dam,
Line Pouchard,
Wei Xu,
Shinjae Yoo,
Nicholas D'Imperio,
Kerstin Kleese Van Dam
Abstract:
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework, and we demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.
Submitted 31 August, 2020;
originally announced August 2020.
-
Ensemble learning reveals dissimilarity between rare-earth transition metal binary alloys with respect to the Curie temperature
Authors:
Duong-Nguyen Nguyen,
Tien-Lam Pham,
Viet-Cuong Nguyen,
Hiori Kino,
Takashi Miyake,
Hieu-Chi Dam
Abstract:
We propose a data-driven method to extract dissimilarity between materials, with respect to a given target physical property. The technique is based on an ensemble method with Kernel ridge regression as the predicting model; multiple random subset sampling of the materials is done to generate prediction models and the corresponding contributions of the reference training materials in detail. The distribution of the predicted values for each material can be approximated by a Gaussian mixture model. The reference training materials contributed to the prediction model that accurately predicts the physical property value of a specific material, are considered to be similar to that material, or vice versa. Evaluations using synthesized data demonstrate that the proposed method can effectively measure the dissimilarity between data instances. An application of the analysis method on the data of Curie temperature (TC) of binary 3d transition metal 4f rare earth binary alloys also reveals meaningful results on the relations between the materials. The proposed method can be considered as a potential tool for obtaining a deeper understanding of the structure of data, with respect to a target property, in particular.
Submitted 20 August, 2020;
originally announced August 2020.
-
Explainable Machine Learning for Materials Discovery: Predicting the Potentially Formable Nd-Fe-B Crystal Structures and Extracting Structure-Stability Relationship
Authors:
Tien-Lam Pham,
Duong-Nguyen Nguyen,
Minh-Quyet Ha,
Hiori Kino,
Takashi Miyake,
Hieu-Chi Dam
Abstract:
New Nd-Fe-B crystal structures can be formed via the elemental substitution of LATX host structures, including lanthanides LA, transition metals T, and light elements X as B, C, N, and O. The 5967 samples of ternary LATX materials that are collected are then used as the host structures. For each host crystal structure, a substituted crystal structure is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe, and all light element sites with B. High throughput first-principles calculations are applied to evaluate the phase stability of the newly created crystal structures, and 20 of them are found to be potentially formable. A data driven approach based on supervised and unsupervised learning techniques is applied to estimate the stability and analyze the structure stability relationship of the newly created NdFeB crystal structures. For predicting the stability for the newly created NdFeB structures, three supervised learning models, kernel ridge regression, logistic classification, and decision tree model, are learned from the LATX host crystal structures; the models achieve the maximum accuracy and recall scores of 70.4 and 68.7 percent, respectively. On the other hand, our proposed unsupervised learning model based on the integration of descriptor-relevance analysis and a Gaussian mixture model achieves accuracy and recall score of 72.9 and 82.1 percent, respectively, which are significantly better than those of the supervised models. While capturing and interpreting the structure stability relationship of the NdFeB crystal structures, the unsupervised learning model indicates that the average atomic coordination number and coordination number of the Fe sites are the most important factors in determining the phase stability of the new substituted NdFeB crystal structures.
Submitted 20 August, 2020;
originally announced August 2020.
-
Boron cage effects on Nd-Fe-B crystal structure's stability
Authors:
Duong-Nguyen Nguyen,
Duc-Anh Dao,
Takashi Miyake,
Hieu-Chi Dam
Abstract:
In this study, we investigate the structure-stability relationship of hypothetical Nd-Fe-B crystal structures using descriptor-relevance analysis and the t-SNE dimensionality reduction method. 149 hypothetical Nd-Fe-B crystal structures are generated from 5967 LA-T-X host structures in Open Quantum Materials Database by using the elemental substitution method, with LA denoting lanthanides, T denoting transition metals, and X denoting light elements such as B, C, N and O. A hypothetical crystal structure is created by substituting all lanthanide sites with Nd, all transition metal sites with Fe, and all light element sites with B. High-throughput first-principle calculations are applied to evaluate the phase stability of these structures. Twenty of them are found to be potentially formable. The descriptor-relevance analysis on the orbital field matrix (OFM) materials' descriptor reveals the average atomic coordination number as the essential factor in determining the structure stability of these substituted Nd-Fe-B crystal structures. 19 among 20 hypothetical structures that are found potentially formable have an average coordination number larger than 6.5. In addition, all the local structures represented by the OFM descriptors are integrated into a visible space to study the detailed correlation between their characteristics and the stability of the crystal structure to which they belong. We discover that unstable substituted structures frequently carry Nd and Fe local structures with two prominent points: low average coordination numbers and fully occupied B neighboring atoms. Moreover, there are only three popular forms of B local structures appearing on all potentially formable substituted structures: cage networks, planar networks, and interstitial sites. The discovered relationships are promising to speed up the screening process for the new formable crystal structures.
Submitted 20 August, 2020;
originally announced August 2020.
-
On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters
Authors:
David B. Williams-Young,
Wibe A. de Jong,
Hubertus J. J. van Dam,
Chao Yang
Abstract:
The predominance of Kohn-Sham density functional theory (KS-DFT) for the theoretical treatment of large experimentally relevant systems in molecular chemistry and materials science relies primarily on the existence of efficient software implementations which are capable of leveraging the latest advances in modern high performance computing (HPC). With recent trends in HPC leading towards an increasing reliance on heterogeneous accelerator-based architectures such as graphics processing units (GPU), existing code bases must embrace these architectural advances to maintain the high levels of performance which have come to be expected for these methods. In this work, we propose a three-level parallelism scheme for the distributed numerical integration of the exchange-correlation (XC) potential in the Gaussian basis set discretization of the Kohn-Sham equations on large computing clusters consisting of multiple GPUs per compute node. In addition, we propose and demonstrate the efficacy of the use of batched kernels, including batched level-3 BLAS operations, in achieving high levels of performance on the GPU. We demonstrate the performance and scalability of the implementation of the proposed method in the NWChemEx software package by comparing to the existing scalable CPU XC integration in NWChem.
Submitted 6 July, 2020;
originally announced July 2020.
-
Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release
Authors:
Yadu Babuji,
Ben Blaiszik,
Tom Brettin,
Kyle Chard,
Ryan Chard,
Austin Clyde,
Ian Foster,
Zhi Hong,
Shantenu Jha,
Zhuozhao Li,
Xuefeng Liu,
Arvind Ramanathan,
Yi Ren,
Nicholaus Saint,
Marcus Schwarting,
Rick Stevens,
Hubertus van Dam,
Rick Wagner
Abstract:
Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.
Submitted 27 May, 2020;
originally announced June 2020.
-
Variational Hyper-Encoding Networks
Authors:
Phuoc Nguyen,
Truyen Tran,
Sunil Gupta,
Santu Rana,
Hieu-Chi Dam,
Svetha Venkatesh
Abstract:
We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters θ are drawn from a distribution p(θ) which is modeled by a hyper-level VAE. We propose a variational inference using Gaussian mixture models to implicitly encode the parameters θ into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(θ). HyperVAE can encode the parameters θ in full in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.
Submitted 12 May, 2022; v1 submitted 18 May, 2020;
originally announced May 2020.
-
NWChem: Past, Present, and Future
Authors:
E. Aprà,
E. J. Bylaska,
W. A. de Jong,
N. Govind,
K. Kowalski,
T. P. Straatsma,
M. Valiev,
H. J. J. van Dam,
Y. Alexeev,
J. Anchell,
V. Anisimov,
F. W. Aquino,
R. Atta-Fynn,
J. Autschbach,
N. P. Bauman,
J. C. Becca,
D. E. Bernholdt,
K. Bhaskaran-Nair,
S. Bogatko,
P. Borowski,
J. Boschen,
J. Brabec,
A. Bruner,
E. Cauët,
Y. Chen
, et al. (89 additional authors not shown)
Abstract:
Specialized computational chemistry packages have permanently reshaped the landscape of chemical and materials science by providing tools to support and guide experimental efforts and for the prediction of atomistic and electronic properties. In this regard, electronic structure packages have played a special role by using first-principle-driven methodologies to model complex chemical and materials processes. Over the last few decades, the rapid development of computing technologies and the tremendous increase in computational power have offered a unique chance to study complex transformations using sophisticated and predictive many-body techniques that describe correlated behavior of electrons in molecular and condensed phase systems at different levels of theory. In enabling these simulations, novel parallel algorithms have been able to take advantage of computational resources to address the polynomial scaling of electronic structure methods. In this paper, we briefly review the NWChem computational chemistry suite, including its history, design principles, parallel tools, current capabilities, outreach and outlook.
Submitted 26 May, 2020; v1 submitted 24 April, 2020;
originally announced April 2020.
-
Theory-Software Translation: Research Challenges and Future Directions
Authors:
Caroline Jay,
Robert Haines,
Daniel S. Katz,
Jeffrey Carver,
James C. Phillips,
Anshu Dubey,
Sandra Gesing,
Matthew Turk,
Hui Wan,
Hubertus van Dam,
James Howison,
Vitali Morozov,
Steven R. Brandt
Abstract:
The Theory-Software Translation Workshop, held in New Orleans in February 2019, explored in depth the process of both instantiating theory in software - for example, implementing a mathematical model in code as part of a simulation - and using the outputs of software - such as the behavior of a simulation - to advance knowledge. As computation within research is now ubiquitous, the workshop provided a timely opportunity to reflect on the particular challenges of research software engineering - the process of developing and maintaining software for scientific discovery. In addition to the general challenges common to all software development projects, research software additionally must represent, manipulate, and provide data for complex theoretical constructs. Ensuring this process is robust is essential to maintaining the integrity of the science resulting from it, and the workshop highlighted a number of areas where the current approach to research software engineering would benefit from an evidence base that could be used to inform best practice.
The workshop brought together expert research software engineers and academics to discuss the challenges of Theory-Software Translation over a two-day period. This report provides an overview of the workshop activities, and a synthesis of the discussion that was recorded. The body of the report presents a thematic analysis of the challenges of Theory-Software Translation as identified by workshop participants, summarises these into a set of research areas, and provides recommendations for the future direction of this work.
Submitted 22 October, 2019;
originally announced October 2019.
-
On Conforming and Conflicting Values
Authors:
Kinzang Chhogyal,
Abhaya Nayak,
Aditya Ghose,
Mehmet Orgun,
Hoa Dam
Abstract:
Values are things that are important to us. Actions activate values - they either go against our values or they promote our values. Values themselves can either be conforming or conflicting depending on the action that is taken. In this short paper, we argue that values may be classified as one of two types - conflicting and inherently conflicting values. They are distinguished by the fact that the latter in some sense can be thought of as being independent of actions. This allows us to do two things: i) check whether a set of values is consistent and ii) check whether it is in conflict with other sets of values.
Submitted 7 July, 2019; v1 submitted 2 July, 2019;
originally announced July 2019.
-
A Value-based Trust Assessment Model for Multi-agent Systems
Authors:
Kinzang Chhogyal,
Abhaya Nayak,
Aditya Ghose,
Hoa Khanh Dam
Abstract:
An agent's assessment of its trust in another agent is commonly taken to be a measure of the reliability/predictability of the latter's actions. It is based on the trustor's past observations of the behaviour of the trustee and requires no knowledge of the inner-workings of the trustee. However, in situations that are new or unfamiliar, past observations are of little help in assessing trust. In such cases, knowledge about the trustee can help. A particular type of knowledge is that of values - things that are important to the trustor and the trustee. In this paper, based on the premise that the more values two agents share, the more they should trust one another, we propose a simple approach to trust assessment between agents based on values, taking into account if agents trust cautiously or boldly, and if they depend on others in carrying out a task.
Submitted 30 May, 2019;
originally announced May 2019.
-
Measuring the Similarity between Materials with an Emphasis on the Materials Distinctiveness
Authors:
Tran-Thai Dang,
Tien-Lam Pham,
Hiori Kino,
Takashi Miyake,
Hieu-Chi Dam
Abstract:
In this study, we establish a basis for selecting similarity measures when applying machine learning techniques to solve materials science problems. This selection is considered with an emphasis on the distinctiveness between materials that reflect their nature well. We perform a case study with a dataset of rare-earth transition metal crystalline compounds represented using the Orbital Field Matrix descriptor and the Coulomb Matrix descriptor. We perform predictions of the formation energies using k-nearest neighbors regression, ridge regression, and kernel ridge regression. Through detailed analyses of the yield prediction accuracy, we examine the relationship between the characteristics of the material representation and similarity measures, and the complexity of the energy function they can capture. Empirical experiments and theoretical analysis reveal that similarity measures and kernels that minimize the loss of materials distinctiveness improve the prediction performance.
Submitted 23 March, 2019;
originally announced March 2019.
-
Towards effective AI-powered agile project management
Authors:
Hoa Khanh Dam,
Truyen Tran,
John Grundy,
Aditya Ghose,
Yasutaka Kamei
Abstract:
The rise of Artificial intelligence (AI) has the potential to significantly transform the practice of project management. Project management has a large socio-technical element with many uncertainties arising from variability in human aspects e.g., customers' needs, developers' performance and team dynamics. AI can assist project managers and team members by automating repetitive, high-volume tasks to enable project analytics for estimation and risk prediction, providing actionable recommendations, and even making decisions. AI is potentially a game changer for project management in helping to accelerate productivity and increase project success rates. In this paper, we propose a framework where AI technologies can be leveraged to offer support for managing agile projects, which have become increasingly popular in the industry.
Submitted 26 December, 2018;
originally announced December 2018.
-
Important descriptors and descriptor groups of Curie temperatures of rare-earth transition-metal binary alloys
Authors:
Hieu Chi Dam,
Viet Cuong Nguyen,
Tien Lam Pham,
Anh Tuan Nguyen,
Kiyoyuki Terakura,
Takashi Miyake,
Hiori Kino
Abstract:
We analyze Curie temperatures of rare-earth transition metal binary alloys with a machine learning method. In order to select important descriptors and descriptor groups, we introduce a newly developed subgroup relevance analysis and adopt hierarchical clustering in the representation. We execute an exhaustive search and illustrate that our approach indeed leads to the successful selection of important descriptors and descriptor groups. It helps us to choose the combination of the descriptors and to understand the meaning of the selected combination of descriptors.
Submitted 15 October, 2018; v1 submitted 12 September, 2018;
originally announced September 2018.
-
Committee machine that votes for similarity between materials
Authors:
Duong-Nguyen Nguyen,
Tien-Lam Pham,
Viet-Cuong Nguyen,
Tuan-Dung Ho,
Truyen Tran,
Keisuke Takahashi,
Hieu-Chi Dam
Abstract:
We developed a method for measuring the similarity between materials, focusing on specific physical properties. The obtained information can be utilized to understand the underlying mechanisms and to support the prediction of the physical properties of materials. The method consists of three steps: variable evaluation based on non-linear regression, regression-based clustering, and similarity measurement with a committee machine constructed from the clustering results. Three datasets of well-characterized crystalline materials represented by critical atomic predicting variables are used as test beds. Herein, we focus on the formation energy, lattice parameter, and Curie temperature of the examined materials. Based on the information obtained on the similarities between the materials, a hierarchical clustering technique is applied to learn the cluster structures of the materials that facilitate interpreting the mechanism, and an improvement of regression models is introduced for predicting the physical properties of the materials. Our experiments show that rational and meaningful group structures can be obtained and that the prediction accuracy of the materials physical properties can be significantly increased, confirming the rationality of the proposed similarity measure.
Submitted 26 July, 2018;
originally announced July 2018.
-
DeepProcess: Supporting business process execution using a MANN-based recommender system
Authors:
Asjad Khan,
Hung Le,
Kien Do,
Truyen Tran,
Aditya Ghose,
Hoa Dam,
Renuka Sindhgatta
Abstract:
Process-aware recommender systems can provide critical decision support functionality to aid business process execution by recommending what actions to take next. Based on recent advances in the field of deep learning, we present a novel memory-augmented neural network (MANN) based approach for constructing a process-aware recommender system. We propose a novel network architecture, namely Write-Protected Dual Controller Memory-Augmented Neural Network (DCw-MANN), for building prescriptive models. To evaluate the feasibility and usefulness of our approach, we consider three real-world datasets and show that our approach leads to better performance than several baselines on the tasks of suffix recommendation and next task prediction.
Submitted 23 November, 2021; v1 submitted 3 February, 2018;
originally announced February 2018.
-
A deep tree-based model for software defect prediction
Authors:
Hoa Khanh Dam,
Trang Pham,
Shien Wee Ng,
Truyen Tran,
John Grundy,
Aditya Ghose,
Taeksu Kim,
Chul-Joo Kim
Abstract:
Defects are common in software systems and can potentially cause various problems to software users. Different methods have been developed to quickly predict the most likely locations of defects in large code bases. Most of them focus on designing features (e.g. complexity metrics) that correlate with potentially defective code. Those approaches however do not sufficiently capture the syntax and different levels of semantics of source code, an important capability for building accurate prediction models. In this paper, we develop a novel prediction model which is capable of automatically learning features for representing source code and using them for defect prediction. Our prediction system is built upon the powerful deep learning, tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code. An evaluation on two datasets, one from open source projects contributed by Samsung and the other from the public PROMISE repository, demonstrates the effectiveness of our approach for both within-project and cross-project predictions.
Submitted 3 February, 2018;
originally announced February 2018.
-
Explainable Software Analytics
Authors:
Hoa Khanh Dam,
Truyen Tran,
Aditya Ghose
Abstract:
Software analytics has been the subject of considerable recent attention but is yet to receive significant industry traction. One of the key reasons is that software practitioners are reluctant to trust predictions produced by the analytics machinery without understanding the rationale for those predictions. While complex models such as deep learning and ensemble methods improve predictive performance, they have limited explainability. In this paper, we argue that making software analytics models explainable to software practitioners is as important as achieving accurate predictions. Explainability should therefore be a key measure for evaluating software analytics models. We envision that explainability will be a key driver for developing software analytics models that are useful in practice. We outline a research roadmap for this space, building on social science, explainable artificial intelligence and software engineering.
Submitted 2 February, 2018;
originally announced February 2018.
-
Rational Models for Inflation-Linked Derivatives
Authors:
Henrik Dam,
Andrea Macrina,
David Skovmand,
David Sloth
Abstract:
We construct models for the pricing and risk management of inflation-linked derivatives. The models are rational in the sense that linear payoffs written on the consumer price index have prices that are rational functions of the state variables. The nominal pricing kernel is constructed in a multiplicative manner that allows for closed-form pricing of vanilla inflation products such as zero-coupon swaps, year-on-year swaps, caps and floors, and the exotic limited-price-index swap. We study the conditions necessary for the multiplicative nominal pricing kernel to give rise to short rate models for the nominal interest rate process. The proposed class of pricing kernel models retains the attractive features of a nominal multi-curve interest rate model, such as closed-form pricing of nominal swaptions, and it isolates the so-called inflation convexity-adjustment term arising from the covariance between the underlying stochastic drivers. We conclude with examples of how the model can be calibrated to EUR data.
Submitted 16 July, 2020; v1 submitted 26 January, 2018;
originally announced January 2018.
-
Graph Classification via Deep Learning with Virtual Nodes
Authors:
Trang Pham,
Truyen Tran,
Hoa Dam,
Svetha Venkatesh
Abstract:
Learning representation for graph classification turns a variable-size graph into a fixed-size vector (or matrix). Such a representation works nicely with algebraic manipulations. Here we introduce a simple method to augment an attributed graph with a virtual node that is bidirectionally connected to all existing nodes. The virtual node represents the latent aspects of the graph, which are not immediately available from the attributes and local connectivity structures. The expanded graph is then put through any node representation method. The representation of the virtual node is then the representation of the entire graph. In this paper, we use the recently introduced Column Network for the expanded graph, resulting in a new end-to-end graph classification model dubbed Virtual Column Network (VCN). The model is validated on two tasks: (i) predicting bio-activity of chemical compounds, and (ii) finding software vulnerability from source code. Results demonstrate that VCN is competitive against well-established rivals.
Submitted 14 August, 2017;
originally announced August 2017.
-
Automatic feature learning for vulnerability prediction
Authors:
Hoa Khanh Dam,
Truyen Tran,
Trang Pham,
Shien Wee Ng,
John Grundy,
Aditya Ghose
Abstract:
Code flaws or vulnerabilities are prevalent in software systems and can potentially cause a variety of problems including deadlock, information loss, or system failure. A variety of approaches have been developed to try and detect the most likely locations of such code vulnerabilities in large code bases. Most of them rely on manually designing features (e.g. complexity metrics or frequencies of code tokens) that represent the characteristics of the code. However, all suffer from challenges in sufficiently capturing both semantic and syntactic representation of source code, an important capability for building accurate prediction models. In this paper, we describe a new approach, built upon the powerful deep learning Long Short Term Memory model, to automatically learn both semantic and syntactic features in code. Our evaluation on 18 Android applications demonstrates that the prediction power obtained from our learned features is equal or even superior to what is achieved by state of the art vulnerability prediction models: 3%–58% improvement for within-project prediction and 85% for cross-project prediction.
Submitted 8 August, 2017;
originally announced August 2017.
-
Apparent cosmic acceleration from type Ia supernovae
Authors:
Lawrence H. Dam,
Asta Heinesen,
David L. Wiltshire
Abstract:
Parameters that quantify the acceleration of cosmic expansion are conventionally determined within the standard Friedmann-Lemaitre-Robertson-Walker (FLRW) model, which fixes spatial curvature to be homogeneous. Generic averages of Einstein's equations in inhomogeneous cosmology lead to models with non-rigidly evolving average spatial curvature, and different parametrizations of apparent cosmic acceleration. The timescape cosmology is a viable example of such a model without dark energy. Using the largest available supernova data set, the JLA catalogue, we find that the timescape model fits the luminosity distance-redshift data with a likelihood that is statistically indistinguishable from the standard spatially flat $Λ$ cold dark matter cosmology by Bayesian comparison. In the timescape case cosmic acceleration is non-zero but has a marginal amplitude, with best-fitting apparent deceleration parameter, $q_0=-0.043^{+0.004}_{-0.000}$. Systematic issues regarding standardization of supernova light curves are analysed. Cuts of data at the statistical homogeneity scale affect light curve parameter fits independent of cosmology. A cosmological model dependence of empirical changes to the mean colour parameter is also found. Irrespective of which model ultimately fits better, we argue that as a competitive model with a non-FLRW expansion history, the timescape model may prove a useful diagnostic tool for disentangling selection effects and astrophysical systematics from the underlying expansion history.
Submitted 13 September, 2017; v1 submitted 22 June, 2017;
originally announced June 2017.