-
A Foundation Model for Chemical Design and Property Prediction
Authors:
Feiyang Cai,
Katelin Hanna,
Tianyu Zhu,
Tzuen-Rong Tzeng,
Yongping Duan,
Ling Liu,
Srikanth Pilla,
Gang Li,
Feng Luo
Abstract:
Artificial intelligence (AI) has significantly advanced computational chemistry research in various tasks. However, traditional AI methods often rely on task-specific model designs and training, which constrain both the scalability of model size and generalization across different tasks. Here, we introduce ChemFM, a large foundation model specifically developed for chemicals. ChemFM comprises 3 bi…
▽ More
Artificial intelligence (AI) has significantly advanced computational chemistry research in various tasks. However, traditional AI methods often rely on task-specific model designs and training, which constrain both the scalability of model size and generalization across different tasks. Here, we introduce ChemFM, a large foundation model specifically developed for chemicals. ChemFM comprises 3 billion parameters and is pre-trained on 178 million molecules using self-supervised causal language modeling to extract generalizable molecular representations. This model can be adapted to diverse downstream chemical applications using either full-parameter or parameter-efficient fine-tuning methods. ChemFM consistently outperforms state-of-the-art task-specific AI models across all tested tasks. Notably, it achieves up to 67.48% performance improvement across 34 property prediction benchmarks, up to 33.80% reduction in mean average deviation between conditioned and actual properties of generated molecules in conditional molecular generation tasks, and up to 3.7% top-1 accuracy improvement across 4 reaction prediction datasets. Moreover, ChemFM demonstrates its superior performance in predicting antibiotic activity and cytotoxicity, highlighting its potential to advance the discovery of novel antibiotics. We anticipate that ChemFM will significantly advance chemistry research by providing a foundation model capable of effectively generalizing across a broad range of tasks with minimal additional training.
△ Less
Submitted 23 January, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Will releasing the weights of future large language models grant widespread access to pandemic agents?
Authors:
Anjali Gopal,
Nathan Helm-Burger,
Lennart Justen,
Emily H. Soice,
Tiffany Tzeng,
Geetha Jeyapragasan,
Simon Grimm,
Benjamin Mueller,
Kevin M. Esvelt
Abstract:
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether c…
▽ More
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help malicious actors leverage more capable future models to inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version tuned to remove censorship. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Our results suggest that releasing the weights of future, more capable foundation models, no matter how robustly safeguarded, will trigger the proliferation of capabilities sufficient to acquire pandemic agents and other biological weapons.
△ Less
Submitted 1 November, 2023; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Development of a Respiratory Sound Labeling Software for Training a Deep Learning-Based Respiratory Sound Analysis Model
Authors:
Fu-Shun Hsu,
Chao-Jung Huang,
Chen-Yi Kuo,
Shang-Ran Huang,
Yuan-Ren Cheng,
Jia-Horng Wang,
Yi-Lin Wu,
Tzu-Ling Tzeng,
Feipei Lai
Abstract:
Respiratory auscultation can help healthcare professionals detect abnormal respiratory conditions if adventitious lung sounds are heard. The state-of-the-art artificial intelligence technologies based on deep learning show great potential in the development of automated respiratory sound analysis. To train a deep learning-based model, a huge number of accurate labels of normal breath sounds and ad…
▽ More
Respiratory auscultation can help healthcare professionals detect abnormal respiratory conditions if adventitious lung sounds are heard. The state-of-the-art artificial intelligence technologies based on deep learning show great potential in the development of automated respiratory sound analysis. To train a deep learning-based model, a huge number of accurate labels of normal breath sounds and adventitious sounds are needed. In this paper, we demonstrate the work of developing a respiratory sound labeling software to help annotators identify and label the inhalation, exhalation, and adventitious respiratory sound more accurately and quickly. Our labeling software integrates six features from MATLAB Audio Labeler, and one commercial audio editor, RX7. As of October, 2019, we have labeled 9,765 15-second-long audio files of breathing lung sounds, and accrued 34,095 inhalation labels,18,349 exhalation labels, 13,883 continuous adventitious sounds (CASs) labels and 15,606 discontinuous adventitious sounds (DASs) labels, which are significantly larger than previously published studies. The trained convolutional recurrent neural networks based on these labels showed good performance with F1-scores of 86.0% on inhalation event detection, 51.6% on CASs event detection and 71.4% on DASs event detection. In conclusion, our results show that our proposed respiratory sound labeling software could easily pre-define a label, perform one-click labeling, and overall facilitate the process of accurately labeling. This software helps develop deep learning-based models that require a huge amount of labeled acoustic data.
△ Less
Submitted 4 January, 2021;
originally announced January 2021.