-
Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians
Authors:
Ishan Amin,
Sanjeev Raja,
Aditi Krishnapriyan
Abstract:
The foundation model (FM) paradigm is transforming Machine Learning Force Fields (MLFFs), leveraging general-purpose representations and scalable training to perform a variety of computational chemistry tasks. Although MLFF FMs have begun to close the accuracy gap relative to first-principles methods, there is still a strong need for faster inference speed. Additionally, while research is increasingly focused on general-purpose models which transfer across chemical space, practitioners typically only study a small subset of systems at a given time. This underscores the need for fast, specialized MLFFs relevant to specific downstream applications, which preserve test-time physical soundness while maintaining train-time scalability. In this work, we introduce a method for transferring general-purpose representations from MLFF foundation models to smaller, faster MLFFs specialized to specific regions of chemical space. We formulate our approach as a knowledge distillation procedure, where the smaller "student" MLFF is trained to match the Hessians of the energy predictions of the "teacher" foundation model. Our specialized MLFFs can be up to 20 $\times$ faster than the original foundation model, while retaining, and in some cases exceeding, its performance and that of undistilled models. We also show that distilling from a teacher model with a direct force parameterization into a student model trained with conservative forces (i.e., computed as derivatives of the potential energy) successfully leverages the representations from the large-scale teacher for improved accuracy, while maintaining energy conservation during test-time molecular dynamics simulations. More broadly, our work suggests a new paradigm for MLFF development, in which foundation models are released along with smaller, specialized simulation "engines" for common chemical subsets.
Submitted 31 January, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
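The Hessian-matching objective described in the abstract above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the spring-chain energies, the finite-difference Hessian, and every function name are hypothetical stand-ins and not the authors' models or training code.

```python
# Toy sketch of Hessian-matching distillation. The spring energies, the
# finite-difference Hessian, and all names here are illustrative assumptions.

def chain_energy(positions, stiffness):
    """Energy of a 1D chain of harmonic springs between neighbouring atoms."""
    return 0.5 * stiffness * sum(
        (b - a) ** 2 for a, b in zip(positions, positions[1:])
    )

def hessian(energy, positions, step=1e-3):
    """Central finite-difference Hessian of a scalar energy function."""
    n = len(positions)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                moved = list(positions)
                moved[i] += di
                moved[j] += dj
                return energy(moved)
            result[i][j] = (
                shifted(step, step) - shifted(step, -step)
                - shifted(-step, step) + shifted(-step, -step)
            ) / (4.0 * step * step)
    return result

def hessian_loss(teacher_energy, student_energy, positions):
    """Mean squared difference between teacher and student Hessian entries."""
    h_teacher = hessian(teacher_energy, positions)
    h_student = hessian(student_energy, positions)
    n = len(positions)
    total = sum(
        (h_teacher[i][j] - h_student[i][j]) ** 2
        for i in range(n) for j in range(n)
    )
    return total / (n * n)

# Hypothetical teacher and two candidate students on one configuration.
positions = [0.0, 1.1, 1.9, 3.2]
teacher = lambda x: chain_energy(x, stiffness=2.0)
good_student = lambda x: chain_energy(x, stiffness=2.0)
poor_student = lambda x: chain_energy(x, stiffness=1.0)

print(hessian_loss(teacher, good_student, positions))
print(hessian_loss(teacher, poor_student, positions))
```

In a real setting the Hessians would come from automatic differentiation of neural network energies, and the loss would be minimized over the student's parameters rather than only evaluated.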
-
Stability-Aware Training of Machine Learning Force Fields with Differentiable Boltzmann Estimators
Authors:
Sanjeev Raja,
Ishan Amin,
Fabian Pedregosa,
Aditi S. Krishnapriyan
Abstract:
Machine learning force fields (MLFFs) are an attractive alternative to ab-initio methods for molecular dynamics (MD) simulations. However, they can produce unstable simulations, limiting their ability to model phenomena occurring over longer timescales and compromising the quality of estimated observables. To address these challenges, we present Stability-Aware Boltzmann Estimator (StABlE) Training, a multi-modal training procedure which leverages joint supervision from reference quantum-mechanical calculations and system observables. StABlE Training iteratively runs many MD simulations in parallel to seek out unstable regions, and corrects the instabilities via supervision with a reference observable. We achieve efficient end-to-end automatic differentiation through MD simulations using our Boltzmann Estimator, a generalization of implicit differentiation techniques to a broader class of stochastic algorithms. Unlike existing techniques based on active learning, our approach requires no additional ab-initio energy and forces calculations to correct instabilities. We demonstrate our methodology across organic molecules, tetrapeptides, and condensed phase systems, using three modern MLFF architectures. StABlE-trained models achieve significant improvements in simulation stability, data efficiency, and agreement with reference observables. The stability improvements cannot be matched by reducing the simulation timestep; thus, StABlE Training effectively allows for larger timesteps. By incorporating observables into the training process alongside first-principles calculations, StABlE Training can be viewed as a general semi-empirical framework applicable across MLFF architectures and systems. This makes it a powerful tool for training stable and accurate MLFFs, particularly in the absence of large reference datasets. Our code is available at https://github.com/ASK-Berkeley/StABlE-Training.
Submitted 25 February, 2025; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Rule-Based Error Classification for Analyzing Differences in Frequent Errors
Authors:
Atsushi Shirafuji,
Taku Matsumoto,
Md Faizul Ibne Amin,
Yutaka Watanobe
Abstract:
Finding and fixing errors is time-consuming for novice and expert programmers alike. Prior work has identified frequent error patterns among programmers of various levels, but the differences in tendencies between novices and experts have yet to be revealed. Knowing which errors are frequent at each level would let instructors give advice suited to each level of learner. In this paper, we propose a rule-based error classification tool to classify errors in code pairs consisting of wrong and correct programs. We classify errors for 95,631 code pairs, submitted by programmers of various levels on an online judge system, and identify 3.47 errors on average. The classified errors are used to analyze the differences in frequent errors between novice and expert programmers. The results show that, for the same introductory problems, errors made by novices stem from a lack of programming knowledge, and these mistakes are considered an essential part of the learning process. Errors made by experts, on the other hand, stem from misunderstandings caused by careless reading of the problem or from the challenge of solving a problem differently than usual. The proposed tool can be used to create error-labeled datasets and for further code-related educational research.
Submitted 1 November, 2023;
originally announced November 2023.
-
Program Repair with Minimal Edits Using CodeT5
Authors:
Atsushi Shirafuji,
Md. Mostafizer Rahman,
Md Faizul Ibne Amin,
Yutaka Watanobe
Abstract:
Programmers often struggle to identify and fix bugs in their programs. In recent years, many language models (LMs) have been proposed to fix erroneous programs and support error recovery. However, the LMs tend to generate solutions that differ from the original input programs. This leads to potential comprehension difficulties for users. In this paper, we propose an approach to suggest a correct program with minimal repair edits using CodeT5. We fine-tune a pre-trained CodeT5 on code pairs of wrong and correct programs and evaluate its performance against several baseline models. The experimental results show that the fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance of 6.84 to the most similar correct program, which indicates that at least one correct program can be suggested by generating 100 candidate programs. We demonstrate the effectiveness of LMs in suggesting program repair with minimal edits for solving introductory programming problems.
Submitted 26 September, 2023;
originally announced September 2023.
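For readers unfamiliar with the pass@k metric quoted in the abstract above, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for one problem with n generated candidates of which c are correct. Whether the paper computes pass@100 exactly this way is an assumption; the function name is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k candidates drawn without
    replacement from n generated candidates (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect candidates: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 100 the metric is 1.0 for a problem as soon as one of the
# 100 candidates is correct, and 0.0 otherwise.
print(pass_at_k(100, 1, 100))
print(pass_at_k(100, 0, 100))
print(pass_at_k(10, 5, 1))
```

A dataset-level score is then the mean of this value over all problems.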
-
Secrecy Outage Probability Analysis for Downlink NOMA with Imperfect SIC at Untrusted Users
Authors:
Sapna Thapar,
Insha Amin,
Deepak Mishra,
Ravikant Saini
Abstract:
Non-orthogonal multiple access (NOMA) has come to the fore as a spectrally efficient technique for fifth-generation networks and beyond. At the same time, NOMA faces severe security issues in the presence of untrusted users due to successive interference cancellation (SIC)-based decoding at receivers. In this paper, to make the system model more realistic, we consider the impact of imperfect SIC during the decoding process. Assuming the downlink mode, we focus on designing a secure NOMA communication protocol for the considered system model with two untrusted users. In this regard, we obtain the power allocation bounds to achieve a positive secrecy rate for both near and far users. Analytical expressions of secrecy outage probability (SOP) for both users are derived to analyze secrecy performance. Closed-form approximations of SOPs are also provided to gain analytical insights. Lastly, numerical results have been presented, which validate the exactness of the analysis and reveal the effect of various key parameters on achieved secrecy performance.
Submitted 17 August, 2023;
originally announced August 2023.
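The secrecy quantities named in the abstract above can be made concrete with the textbook definitions: the secrecy rate is the positive part of the gap between the legitimate and eavesdropping link rates, and the secrecy outage probability is the chance that this rate falls below a target. The sketch below assumes those generic definitions and takes SINR values as given; it does not reproduce the paper's imperfect-SIC SINR model or its closed-form results, and all names and numbers are hypothetical.

```python
from math import log2

def secrecy_rate(sinr_legitimate, sinr_eavesdropper):
    """Positive part of the rate gap (bits/s/Hz) between the intended
    receiver and the untrusted user trying to decode the same signal."""
    gap = log2(1.0 + sinr_legitimate) - log2(1.0 + sinr_eavesdropper)
    return max(0.0, gap)

def outage_probability(sinr_pairs, target_rate):
    """Empirical fraction of channel realizations whose secrecy rate is
    below target_rate. sinr_pairs holds (legitimate, eavesdropper) SINRs."""
    outages = sum(
        1 for legit, eve in sinr_pairs if secrecy_rate(legit, eve) < target_rate
    )
    return outages / len(sinr_pairs)

# Made-up SINR realizations, purely for illustration.
samples = [(3.0, 1.0), (1.0, 3.0), (7.0, 1.0), (1.0, 1.0)]
print(secrecy_rate(3.0, 1.0))
print(outage_probability(samples, target_rate=0.5))
```

A positive secrecy rate requires the legitimate SINR to exceed the eavesdropping SINR, which is the kind of condition the power allocation bounds in the abstract are meant to guarantee.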
-
Pengembangan Model untuk Mendeteksi Kerusakan pada Terumbu Karang dengan Klasifikasi Citra (Developing a Model to Detect Coral Reef Damage with Image Classification)
Authors:
Fadhil Muhammad,
Alif Bintang Elfandra,
Iqbal Pahlevi Amin,
Alfan Farizki Wicaksono
Abstract:
The abundant biodiversity of coral reefs in Indonesian waters is a valuable asset that needs to be preserved. Rapid climate change and uncontrolled human activities have led to the degradation of coral reef ecosystems, including coral bleaching, which is a critical indicator of coral health conditions. Therefore, this research aims to develop an accurate classification model to distinguish between healthy corals and corals experiencing bleaching. This study utilizes a specialized dataset consisting of 923 images collected from Flickr using the Flickr API. The dataset comprises two distinct classes: healthy corals (438 images) and bleached corals (485 images). These images have been resized to a maximum of 300 pixels in width or height, whichever is larger, to maintain consistent sizes across the dataset.
The method employed in this research involves the use of machine learning models, particularly convolutional neural networks (CNN), to recognize and differentiate visual patterns associated with healthy and bleached corals. In this context, the dataset can be used to train and test various classification models to achieve optimal results. By leveraging the ResNet model, it was found that a from-scratch ResNet model can outperform pretrained models in terms of precision and accuracy. The success in developing accurate classification models will greatly benefit researchers and marine biologists in gaining a better understanding of coral reef health. These models can also be employed to monitor changes in the coral reef environment, thereby making a significant contribution to conservation and ecosystem restoration efforts that have far-reaching impacts on life.
Submitted 8 August, 2023;
originally announced August 2023.
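The resizing rule mentioned in the abstract above (longest side capped at 300 pixels) amounts to a small calculation. The sketch below is one plausible reading: it assumes the aspect ratio is preserved and that images already within the limit are left unchanged, neither of which the abstract states explicitly. The function name is hypothetical.

```python
def fit_within(width, height, max_side=300):
    """Return (width, height) scaled so the longer side is at most max_side,
    preserving aspect ratio. Smaller images are returned unchanged (assumed)."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(600, 400))
print(fit_within(400, 600))
print(fit_within(200, 100))
```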
-
Insights into performance evaluation of compound-protein interaction prediction methods
Authors:
Adiba Yaseen,
Imran Amin,
Naeem Akhter,
Asa Ben-Hur,
Fayyaz Minhas
Abstract:
Motivation: Machine-learning-based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies and can improve the efficiency and cost-effectiveness of wet lab assays. Despite the publication of many research papers reporting CPI predictors in recent years, we have observed a number of fundamental issues in experiment design that lead to over-optimistic estimates of model performance. Results: In this paper, we analyze the impact of several important factors affecting the generalization performance of CPI predictors that are overlooked in existing work: (1) similarity between training and test examples in cross-validation; (2) the strategy for generating negative examples in the absence of experimentally verified negative examples; and (3) the choice of evaluation protocols and performance metrics and their alignment with real-world use of CPI predictors in screening large compound libraries. Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel-based approach, we have found that assessment of the predictive performance of CPI predictors requires careful control over similarity between training and test examples. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization performance in comparison to more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel-based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins including SARS-CoV-2 Spike and Human ACE2 proteins and found strong evidence in support of its top hits. Availability: Code and raw experimental results available at https://github.com/adibayaseen/HKRCPI Contact: [email protected]
Submitted 28 January, 2022;
originally announced February 2022.
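The random-pairing strategy for synthetic negatives mentioned in the abstract above can be sketched as follows. This is an illustrative assumption about what random pairing means (draw a compound and a protein uniformly and keep the pair if it is not a known positive), not the authors' released code; the function name and the example identifiers are hypothetical.

```python
import random

def sample_random_negatives(positive_pairs, n_negatives, seed=0):
    """Draw distinct (compound, protein) pairs that are not known positives."""
    positives = set(positive_pairs)
    compounds = sorted({compound for compound, _ in positives})
    proteins = sorted({protein for _, protein in positives})
    capacity = len(compounds) * len(proteins) - len(positives)
    if n_negatives > capacity:
        raise ValueError("not enough unobserved pairs to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

# Hypothetical identifiers, for illustration only.
known = [("c1", "p1"), ("c2", "p2"), ("c3", "p1")]
print(sample_random_negatives(known, 3))
```

Unobserved pairs are only presumed negative here, since no experimentally verified negatives are available.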
-
Mining Social Media to Inform Peatland Fire and Haze Disaster Management
Authors:
Mark Kibanov,
Gerd Stumme,
Imaduddin Amin,
Jong Gun Lee
Abstract:
Peatland fires and haze events are disasters with national, regional and international implications. The phenomena lead to direct damage to local assets, as well as broader economic and environmental losses. Satellite imagery is still the main and often the only available source of information for disaster management. In this article, we test the potential of social media to assist disaster management. To this end, we compare insights from two datasets: fire hotspots detected via NASA satellite imagery and almost all GPS-stamped tweets from Sumatra Island, Indonesia, posted during 2014. Sumatra Island is chosen as it regularly experiences a significant number of haze events, which affect citizens in Indonesia as well as in nearby countries including Malaysia and Singapore. We analyse temporal correlations between the datasets and their geo-spatial interdependence. Furthermore, we show how Twitter data reveals changes in users' behavior during severe haze events. Overall, we demonstrate that social media is a valuable source of complementary and supplementary information for haze disaster management. Based on our methodology and findings, an analytics tool to improve peatland fire and haze disaster management by the Indonesian authorities is under development.
Submitted 2 August, 2017; v1 submitted 16 June, 2017;
originally announced June 2017.
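One simple way to examine a temporal correlation like the one analysed in the abstract above is a Pearson correlation between daily hotspot counts and daily tweet counts, optionally with a lag. The sketch below assumes that approach; the abstract does not say which correlation measure the authors used, and the counts are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

def lagged_correlation(hotspots, tweets, lag):
    """Correlate hotspots on day t with tweets on day t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(hotspots, tweets)
    return pearson(hotspots[:-lag], tweets[lag:])

# Made-up daily counts in which tweets follow hotspots by one day.
daily_hotspots = [1, 3, 2, 5, 4]
daily_tweets = [0, 1, 3, 2, 5]
print(lagged_correlation(daily_hotspots, daily_tweets, lag=1))
```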
-
An Architecture of Active Learning SVMs with Relevance Feedback for Classifying E-mail
Authors:
Md. Saiful Islam,
Md. Iftekharul Amin
Abstract:
In this paper, we propose an architecture of active learning SVMs with relevance feedback (RF) for classifying e-mail. The architecture combines two strategies: active learning, where instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels of some number of them; and relevance feedback, where if any mail is misclassified the next set of support vectors will differ from the present set, and otherwise the next set will not change. Our proposed architecture ensures that a legitimate e-mail will not be dropped in the event of an overflowing mailbox. The proposed architecture also exhibits dynamic updating characteristics, making life as difficult for the spammer as possible.
Submitted 27 August, 2010;
originally announced August 2010.
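The pool-based querying step described in the abstract above is often implemented by asking for labels of the unlabeled instances closest to the current SVM decision boundary. The sketch below assumes that margin-based strategy and a linear decision function; the abstract does not specify the selection rule, and the weights, feature vectors, and function names are hypothetical.

```python
def decision_value(weights, bias, features):
    """Signed value of a linear decision function w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def select_queries(weights, bias, pool, n_queries):
    """Indices of the n_queries pool instances with the smallest absolute
    decision value, i.e. those the current classifier is least sure about."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: abs(decision_value(weights, bias, pool[i])),
    )
    return ranked[:n_queries]

# Hypothetical linear classifier and unlabeled e-mail feature vectors.
weights, bias = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [1.0, 0.5], [0.0, 2.0], [0.25, 0.0]]
print(select_queries(weights, bias, pool, n_queries=2))
```

After the requested labels arrive, the classifier would be retrained, which is where a changed set of support vectors could come from.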