-
34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery
Authors:
Yoel Zimmermann,
Adib Bazgir,
Alexander Al-Feghali,
Mehrad Ansari,
Joshua Bocarsly,
L. Catherine Brinson,
Yuan Chiang,
Defne Circi,
Min-Hsueh Chiu,
Nathan Daelman,
Matthew L. Evans,
Abhijeet S. Gangan,
Janine George,
Hassan Harb,
Ghazal Khalighinejad,
Sartaaj Takrim Khan,
Sascha Klawohn,
Magdalena Lederbauer,
Soroush Mahjoubi,
Bernadette Mohr,
Seyed Mohamad Moosavi,
Aakash Naik,
Aleyna Beste Ozhan,
Dieter Plessers,
Aritra Roy, et al. (10 additional authors not shown)
Abstract:
Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models is able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. Improvements in both open-source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded their effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.
Submitted 15 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
MOFA: Discovering Materials for Carbon Capture with a GenAI- and Simulation-Based Workflow
Authors:
Xiaoli Yan,
Nathaniel Hudson,
Hyun Park,
Daniel Grzenda,
J. Gregory Pauloski,
Marcus Schwarting,
Haochen Pan,
Hassan Harb,
Samuel Foreman,
Chris Knight,
Tom Gibbs,
Kyle Chard,
Santanu Chaudhuri,
Emad Tajkhorshid,
Ian Foster,
Mohamad Moosavi,
Logan Ward,
E. A. Huerta
Abstract:
We present MOFA, an open-source generative AI (GenAI) plus simulation workflow for high-throughput generation of metal-organic frameworks (MOFs) on large-scale high-performance computing (HPC) systems. MOFA addresses key challenges in integrating GPU-accelerated computing for GPU-intensive GenAI tasks, including distributed training and inference, alongside CPU- and GPU-optimized tasks for screening and filtering AI-generated MOFs using molecular dynamics, density functional theory, and Monte Carlo simulations. These heterogeneous tasks are unified within an online learning framework that optimizes the utilization of available CPU and GPU resources across HPC systems. Performance metrics from a 450-node (14,400 AMD Zen 3 CPUs + 1,800 NVIDIA A100 GPUs) supercomputer run demonstrate that MOFA achieves high-throughput generation of novel MOF structures, with CO$_2$ adsorption capacities ranking among the top 10 in the hypothetical MOF (hMOF) dataset. Furthermore, the production of high-quality MOFs exhibits a linear relationship with the number of nodes utilized. The modular architecture of MOFA will facilitate its integration into other scientific applications that dynamically combine GenAI with large-scale simulations.
Submitted 17 January, 2025;
originally announced January 2025.
-
Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
Authors:
Yoel Zimmermann,
Adib Bazgir,
Zartashia Afzal,
Fariha Agbere,
Qianxiang Ai,
Nawaf Alampara,
Alexander Al-Feghali,
Mehrad Ansari,
Dmytro Antypov,
Amro Aswad,
Jiaru Bai,
Viktoriia Baibakova,
Devi Dutta Biswajeet,
Erik Bitzek,
Joshua D. Bocarsly,
Anna Borisova,
Andres M Bran,
L. Catherine Brinson,
Marcel Moran Calderon,
Alessandro Canalicchio,
Victor Chen,
Yuan Chiang,
Defne Circi,
Benjamin Charmes,
Vikrant Chaudhary, et al. (119 additional authors not shown)
Abstract:
Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as a brief paper in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapidly prototyping custom applications in scientific research.
Submitted 2 January, 2025; v1 submitted 20 November, 2024;
originally announced November 2024.
-
Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
Authors:
Kevin Maik Jablonka,
Daniele Ongari,
Seyed Mohamad Moosavi,
Berend Smit
Abstract:
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). At present, we have libraries of over ten thousand synthesized materials and millions of in-silico predicted materials. The fact that we have so many materials opens many exciting avenues to tailor-make a material that is optimal for a given application. However, from an experimental and computational point of view we simply have too many materials to screen using brute-force techniques. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We emphasize the importance of data collection, methods to augment small data sets, and how to select appropriate training sets. An important part of this review is the different approaches that are used to represent these materials in feature space. The review also includes a general overview of the different ML techniques, but as most applications in porous materials use supervised ML, our review is focused on the different approaches for supervised ML. In particular, we review the different methods to optimize the ML process and how to quantify the performance of the different methods. In the second part, we review how the different approaches of ML have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. The range of topics illustrates the large variety of problems that can be studied with big-data science. Given the increasing interest of the scientific community in ML, we expect this list to rapidly expand in the coming years.
Submitted 8 June, 2020; v1 submitted 18 January, 2020;
originally announced January 2020.
-
Pore-geometry recognition: on the importance of quantifying similarity in nanoporous materials
Authors:
Yongjin Lee,
Senja D. Barthel,
Paweł Dłotko,
S. Mohamad Moosavi,
Kathryn Hess,
Berend Smit
Abstract:
In most applications of nanoporous materials the pore structure is as important as the chemical composition as a determinant of performance. For example, one can alter performance in applications like carbon capture or methane storage by orders of magnitude by modifying only the pore structure (1,2). For these applications it is therefore important to identify the optimal pore geometry and use this information to find similar materials. However, the mathematical language and tools to identify materials with similar pore structures, but different composition, have been lacking. Here we develop a pore recognition approach to quantify similarity of pore structures and classify them using topological data analysis (3,4). Our approach allows us to identify materials with similar pore geometries, and to screen for materials that are similar to given top-performing structures. Using methane storage as a case study, we also show that materials can be divided into topologically distinct classes, and that each class requires different optimization strategies. In this work we have focused on pore space, but our topological approach can be generalised to quantify similarity of any geometric object, which, given the many different Materials Genomics initiatives (5,6), opens many interesting avenues for big-data science.
Submitted 19 January, 2017;
originally announced January 2017.