-
Predicting New Research Directions in Materials Science using Large Language Models and Concept Graphs
Authors:
Thomas Marwitz,
Alexander Colsmann,
Ben Breitung,
Christoph Brabec,
Christoph Kirchlechner,
Eva Blasco,
Gabriel Cadilha Marques,
Horst Hahn,
Michael Hirtz,
Pavel A. Levkin,
Yolita M. Eggeler,
Tobias Schlöder,
Pascal Friederich
Abstract:
Due to an exponential increase in published research articles, it is impossible for individual scientists to read all publications, even within their own research field. In this work, we investigate the use of large language models (LLMs) for the purpose of extracting the main concepts and semantic information from scientific abstracts in the domain of materials science to find links that were not…
▽ More
Due to an exponential increase in published research articles, it is impossible for individual scientists to read all publications, even within their own research field. In this work, we investigate the use of large language models (LLMs) for the purpose of extracting the main concepts and semantic information from scientific abstracts in the domain of materials science to find links that were not noticed by humans and thus to suggest inspiring near/mid-term future research directions. We show that LLMs can extract concepts more efficiently than automated keyword extraction methods to build a concept graph as an abstraction of the scientific literature. A machine learning model is trained to predict emerging combinations of concepts, i.e. new research ideas, based on historical data. We demonstrate that integrating semantic concept information leads to an increased prediction performance. The applicability of our model is demonstrated in qualitative interviews with domain experts based on individualized model suggestions. We show that the model can inspire materials scientists in their creative thinking process by predicting innovative combinations of topics that have not yet been investigated.
△ Less
Submitted 20 June, 2025;
originally announced June 2025.
-
Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification
Authors:
Jonas Teufel,
Annika Leinweber,
Pascal Friederich
Abstract:
Explainable AI (xAI) interventions aim to improve interpretability for complex black-box models, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular struc…
▽ More
Explainable AI (xAI) interventions aim to improve interpretability for complex black-box models, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular structure cause the greatest deviation in the predicted property. However, such explanations only allow for meaningful scientific insights if they reflect the distribution of the true underlying property -- a feature we define as counterfactual truthfulness. To increase this truthfulness, we propose the integration of uncertainty estimation techniques to filter counterfactual candidates with high predicted uncertainty. Through computational experiments with synthetic and real-world datasets, we demonstrate that traditional uncertainty estimation methods, such as ensembles and mean-variance estimation, can already substantially reduce the average prediction error and increase counterfactual truthfulness, especially for out-of-distribution settings. Our results highlight the importance and potential impact of incorporating uncertainty estimation into explainability methods, especially considering the relatively high effectiveness of low-effort interventions like model ensembles.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.
-
opXRD: Open Experimental Powder X-ray Diffraction Database
Authors:
Daniel Hollarek,
Henrik Schopmans,
Jona Östreicher,
Jonas Teufel,
Bin Cao,
Adie Alwen,
Simon Schweidler,
Mriganka Singh,
Tim Kodalle,
Hanlin Hu,
Gregoire Heymans,
Maged Abdelsamie,
Arthur Hardiagon,
Alexander Wieczorek,
Siarhei Zhuk,
Ruth Schwaiger,
Sebastian Siol,
François-Xavier Coudert,
Moritz Wolf,
Carolin M. Sutter-Fella,
Ben Breitung,
Andrea M. Hodge,
Tong-yi Zhang,
Pascal Friederich
Abstract:
Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A n…
▽ More
Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs.
△ Less
Submitted 10 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
Symmetry-Aware Bayesian Flow Networks for Crystal Generation
Authors:
Laura Ruple,
Luca Torresi,
Henrik Schopmans,
Pascal Friederich
Abstract:
The discovery of new crystalline materials is essential to scientific and technological progress. However, traditional trial-and-error approaches are inefficient due to the vast search space. Recent advancements in machine learning have enabled generative models to predict new stable materials by incorporating structural symmetries and to condition the generation on desired properties. In this wor…
▽ More
The discovery of new crystalline materials is essential to scientific and technological progress. However, traditional trial-and-error approaches are inefficient due to the vast search space. Recent advancements in machine learning have enabled generative models to predict new stable materials by incorporating structural symmetries and to condition the generation on desired properties. In this work, we introduce SymmBFN, a novel symmetry-aware Bayesian Flow Network (BFN) for crystalline material generation that accurately reproduces the distribution of space groups found in experimentally observed crystals. SymmBFN substantially improves efficiency, generating stable structures at least 50 times faster than the next-best method. Furthermore, we demonstrate its capability for property-conditioned generation, enabling the design of materials with tailored properties. Our findings establish BFNs as an effective tool for accelerating the discovery of crystalline materials.
△ Less
Submitted 14 February, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Temperature-Annealed Boltzmann Generators
Authors:
Henrik Schopmans,
Pascal Friederich
Abstract:
Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collap…
▽ More
Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
△ Less
Submitted 17 June, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
PAL -- Parallel active learning for machine-learned potentials
Authors:
Chen Zhou,
Marlen Neubert,
Yuri Koide,
Yumeng Zhang,
Van-Quan Vuong,
Tobias Schlöder,
Stefanie Dehnen,
Pascal Friederich
Abstract:
Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilizati…
▽ More
Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilization of modern computational resources. In this work, we introduce PAL, an automated, modular, and parallel active learning library that integrates AL tasks and manages their execution and communication on shared- and distributed-memory systems using the Message Passing Interface (MPI). PAL provides users with the flexibility to design and customize all components of their active learning scenarios, including machine learning models with uncertainty estimation, oracles for ground truth labeling, and strategies for exploring the target space. We demonstrate that PAL significantly reduces computational overhead and improves scalability, achieving substantial speed-ups through asynchronous parallelization on CPU and GPU hardware. Applications of PAL to several real-world scenarios - including ground-state reactions in biomolecular systems, excited-state dynamics of molecules, simulations of inorganic clusters, and thermo-fluid dynamics - illustrate its effectiveness in accelerating the development of machine learning models. Our results show that PAL enables efficient utilization of high-performance computing resources in active learning workflows, fostering advancements in scientific research and engineering applications.
△ Less
Submitted 30 November, 2024;
originally announced December 2024.
-
Global Concept Explanations for Graphs by Contrastive Learning
Authors:
Jonas Teufel,
Pascal Friederich
Abstract:
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks underlying structure-property r…
▽ More
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the tasks underlying structure-property relationships. We identify concept explanations as dense clusters in the self-explaining Megan models subgraph latent space. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Conditional Normalizing Flows for Active Learning of Coarse-Grained Molecular Representations
Authors:
Henrik Schopmans,
Pascal Friederich
Abstract:
Efficient sampling of the Boltzmann distribution of molecular systems is a long-standing challenge. Recently, instead of generating long molecular dynamics simulations, generative machine learning methods such as normalizing flows have been used to learn the Boltzmann distribution directly, without samples. However, this approach is susceptible to mode collapse and thus often does not explore the…
▽ More
Efficient sampling of the Boltzmann distribution of molecular systems is a long-standing challenge. Recently, instead of generating long molecular dynamics simulations, generative machine learning methods such as normalizing flows have been used to learn the Boltzmann distribution directly, without samples. However, this approach is susceptible to mode collapse and thus often does not explore the full configurational space. In this work, we address this challenge by separating the problem into two levels, the fine-grained and coarse-grained degrees of freedom. A normalizing flow conditioned on the coarse-grained space yields a probabilistic connection between the two levels. To explore the configurational space, we employ coarse-grained simulations with active learning which allows us to update the flow and make all-atom potential energy evaluations only when necessary. Using alanine dipeptide as an example, we show that our methods obtain a speedup to molecular dynamics simulations of approximately 15.9 to 216.2 compared to the speedup of 4.5 of the current state-of-the-art machine learning approach.
△ Less
Submitted 24 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning
Authors:
Jannik Deuschel,
Caleb N. Ellington,
Yingtao Luo,
Benjamin J. Lengerich,
Pascal Friederich,
Eric P. Xing
Abstract:
Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models force a tradeoff between accuracy and interpretability, limiting data-driven interpretations of human decision-making processes. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, w…
▽ More
Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models force a tradeoff between accuracy and interpretability, limiting data-driven interpretations of human decision-making processes. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically under different contexts. Thus, we develop Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem, where each context poses a unique task and complex decision policies can be constructed piece-wise from many simple context-specific policies. CPR models each context-specific policy as a linear map, and generates new policy models $\textit{on-demand}$ as contexts are updated with new observations. We provide two flavors of the CPR framework: one focusing on exact local interpretability, and one retaining full global interpretability. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement, CPR closes the accuracy gap between interpretable and black-box methods, allowing high-resolution exploration and analysis of context-specific decision models.
△ Less
Submitted 7 May, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Mitigating Molecular Aggregation in Drug Discovery with Predictive Insights from Explainable AI
Authors:
Hunter Sturm,
Jonas Teufel,
Kaitlin A. Isfeld,
Pascal Friederich,
Rebecca L. Davis
Abstract:
Herein, we present the application of MEGAN, our explainable AI (xAI) model, for the identification of small colloidally aggregating molecules (SCAMs). This work offers solutions to the long-standing problem of false positives caused by SCAMs in high throughput screening for drug discovery and demonstrates the power of xAI in the classification of molecular properties that are not chemically intui…
▽ More
Herein, we present the application of MEGAN, our explainable AI (xAI) model, for the identification of small colloidally aggregating molecules (SCAMs). This work offers solutions to the long-standing problem of false positives caused by SCAMs in high throughput screening for drug discovery and demonstrates the power of xAI in the classification of molecular properties that are not chemically intuitive based on our current understanding. We leverage xAI insights and molecular counterfactuals to design alternatives to problematic compounds in drug screening libraries. Additionally, we experimentally validate the MEGAN prediction classification for one of the counterfactuals and demonstrate the utility of counterfactuals for altering the aggregation properties of a compound through minor structural modifications. The integration of this method in high-throughput screening approaches will help combat and circumvent false positives, providing better lead molecules more rapidly and thus accelerating drug discovery cycles.
△ Less
Submitted 27 May, 2025; v1 submitted 3 June, 2023;
originally announced June 2023.
-
Quantifying the Intrinsic Usefulness of Attributional Explanations for Graph Neural Networks with Artificial Simulatability Studies
Authors:
Jonas Teufel,
Luca Torresi,
Pascal Friederich
Abstract:
Despite the increasing relevance of explainable AI, assessing the quality of explanations remains a challenging issue. Due to the high costs associated with human-subject experiments, various proxy metrics are often used to approximately quantify explanation quality. Generally, one possible interpretation of the quality of an explanation is its inherent value for teaching a related concept to a st…
▽ More
Despite the increasing relevance of explainable AI, assessing the quality of explanations remains a challenging issue. Due to the high costs associated with human-subject experiments, various proxy metrics are often used to approximately quantify explanation quality. Generally, one possible interpretation of the quality of an explanation is its inherent value for teaching a related concept to a student. In this work, we extend artificial simulatability studies to the domain of graph neural networks. Instead of costly human trials, we use explanation-supervisable graph neural networks to perform simulatability studies to quantify the inherent usefulness of attributional graph explanations. We perform an extensive ablation study to investigate the conditions under which the proposed analyses are most meaningful. We additionally validate our methods applicability on real-world graph classification and regression datasets. We find that relevant explanations can significantly boost the sample efficiency of graph neural networks and analyze the robustness towards noise and bias in the explanations. We believe that the notion of usefulness obtained from our proposed simulatability analysis provides a dimension of explanation quality that is largely orthogonal to the common practice of faithfulness and has great potential to expand the toolbox of explanation quality assessments, specifically for graph explanations.
△ Less
Submitted 25 May, 2023;
originally announced May 2023.
-
Neural networks trained on synthetically generated crystals can extract structural information from ICSD powder X-ray diffractograms
Authors:
Henrik Schopmans,
Patrick Reiser,
Pascal Friederich
Abstract:
Machine learning techniques have successfully been used to extract structural information such as the crystal space group from powder X-ray diffractograms. However, training directly on simulated diffractograms from databases such as the ICSD is challenging due to its limited size, class-inhomogeneity, and bias toward certain structure types. We propose an alternative approach of generating synthe…
▽ More
Machine learning techniques have successfully been used to extract structural information such as the crystal space group from powder X-ray diffractograms. However, training directly on simulated diffractograms from databases such as the ICSD is challenging due to its limited size, class-inhomogeneity, and bias toward certain structure types. We propose an alternative approach of generating synthetic crystals with random coordinates by using the symmetry operations of each space group. Based on this approach, we demonstrate online training of deep ResNet-like models on up to a few million unique on-the-fly generated synthetic diffractograms per hour. For our chosen task of space group classification, we achieved a test accuracy of 79.9% on unseen ICSD structure types from most space groups. This surpasses the 56.1% accuracy of the current state-of-the-art approach of training on ICSD crystals directly. Our results demonstrate that synthetically generated crystals can be used to extract structural information from ICSD powder diffractograms, which makes it possible to apply very large state-of-the-art machine learning models in the area of powder X-ray diffraction. We further show first steps toward applying our methodology to experimental data, where automated XRD data analysis is crucial, especially in high-throughput settings. While we focused on the prediction of the space group, our approach has the potential to be extended to related tasks in the future.
△ Less
Submitted 19 September, 2023; v1 submitted 21 March, 2023;
originally announced March 2023.
-
Connectivity Optimized Nested Graph Networks for Crystal Structures
Authors:
Robin Ruff,
Patrick Reiser,
Jan Stühmer,
Pascal Friederich
Abstract:
Graph neural networks (GNNs) have been applied to a large variety of applications in materials science and chemistry. Here, we recapitulate the graph construction for crystalline (periodic) materials and investigate its impact on the GNNs model performance. We suggest the asymmetric unit cell as a representation to reduce the number of atoms by using all symmetries of the system. This substantiall…
▽ More
Graph neural networks (GNNs) have been applied to a large variety of applications in materials science and chemistry. Here, we recapitulate the graph construction for crystalline (periodic) materials and investigate its impact on the GNNs model performance. We suggest the asymmetric unit cell as a representation to reduce the number of atoms by using all symmetries of the system. This substantially reduced the computational cost and thus time needed to train large graph neural networks without any loss in accuracy. Furthermore, with a simple but systematically built GNN architecture based on message passing and line graph templates, we introduce a general architecture (Nested Graph Network, NGN) that is applicable to a wide range of tasks. We show that our suggested models systematically improve state-of-the-art results across all tasks within the MatBench benchmark. Further analysis shows that optimized connectivity and deeper message functions are responsible for the improvement. Asymmetric unit cells and connectivity optimization can be generally applied to (crystal) graph networks, while our suggested nested graph framework will open new ways of systematic comparison of GNN architectures.
△ Less
Submitted 9 August, 2023; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Actively Learning Costly Reward Functions for Reinforcement Learning
Authors:
André Eberhard,
Houssam Metni,
Georg Fahland,
Alexander Stroh,
Pascal Friederich
Abstract:
Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark ta…
▽ More
Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g., in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. Therefore, training might become prohibitively expensive without an extensive amount of computational resources and time. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that using our proposed ACRL method (Actively learning Costly rewards for Reinforcement Learning), it is possible to train agents in complex real-world environments orders of magnitudes faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering.
△ Less
Submitted 23 November, 2022;
originally announced November 2022.
-
MEGAN: Multi-Explanation Graph Attention Network
Authors:
Jonas Teufel,
Luca Torresi,
Patrick Reiser,
Pascal Friederich
Abstract:
We propose a multi-explanation graph attention network (MEGAN). Unlike existing graph explainability methods, our network can produce node and edge attributional explanations along multiple channels, the number of which is independent of task specifications. This proves crucial to improve the interpretability of graph regression predictions, as explanations can be split into positive and negative…
▽ More
We propose a multi-explanation graph attention network (MEGAN). Unlike existing graph explainability methods, our network can produce node and edge attributional explanations along multiple channels, the number of which is independent of task specifications. This proves crucial to improve the interpretability of graph regression predictions, as explanations can be split into positive and negative evidence w.r.t to a reference value. Additionally, our attention-based network is fully differentiable and explanations can actively be trained in an explanation-supervised manner. We first validate our model on a synthetic graph regression dataset with known ground-truth explanations. Our network outperforms existing baseline explainability methods for the single- as well as the multi-explanation case, achieving near-perfect explanation accuracy during explanation supervision. Finally, we demonstrate our model's capabilities on multiple real-world datasets. We find that our model produces sparse high-fidelity explanations consistent with human intuition about those tasks.
△ Less
Submitted 25 May, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Graph neural networks to learn joint representations of disjoint molecular graphs
Authors:
Chen Shao,
Zhou Chen,
Pascal Friederich
Abstract:
Graph neural networks are widely used to learn global representations of graphs, which are then used for regression or classification tasks. Typically, the graphs in such data sets are connected, i.e. each training sample consists of a single internally connected graph associated with a global label. However, there is a wide variety of yet unconsidered but application-relevant tasks, where labels…
▽ More
Graph neural networks are widely used to learn global representations of graphs, which are then used for regression or classification tasks. Typically, the graphs in such data sets are connected, i.e. each training sample consists of a single internally connected graph associated with a global label. However, there is a wide variety of yet unconsidered but application-relevant tasks, where labels are assigned to sets of disjoint graphs, which requires the generation of global representations of disjoint graphs. In this paper, we present a new data set with chemical reactions, which is illustrating this task. Each sample consists of a pair of disjoint molecular graphs and a joint label representing a scalar measure associated with the chemical reaction of the molecules. We show the initial results of graph neural networks that are able to solve the task within a combinatorial subset of the dataset but do not generalize well to the full data set and unseen (sub)graphs.
△ Less
Submitted 30 October, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Graph neural networks for materials science and chemistry
Authors:
Patrick Reiser,
Marlen Neubert,
André Eberhard,
Luca Torresi,
Chen Zhou,
Chen Shao,
Houssam Metni,
Clint van Hoesel,
Henrik Schopmans,
Timo Sommer,
Pascal Friederich
Abstract:
Machine learning plays an increasingly important role in many areas of chemistry and materials science, e.g. to predict materials properties, to accelerate simulations, to design new materials, and to predict synthesis routes of new materials. Graph neural networks (GNNs) are one of the fastest growing classes of machine learning models. They are of particular relevance for chemistry and materials…
▽ More
Machine learning plays an increasingly important role in many areas of chemistry and materials science, e.g. to predict materials properties, to accelerate simulations, to design new materials, and to predict synthesis routes of new materials. Graph neural networks (GNNs) are one of the fastest growing classes of machine learning models. They are of particular relevance for chemistry and materials science, as they directly work on a graph or structural representation of molecules and materials and therefore have full access to all relevant information required to characterize materials. In this review article, we provide an overview of the basic principles of GNNs, widely used datasets, and state-of-the-art architectures, followed by a discussion of a wide range of recent applications of GNNs in chemistry and materials science, and concluding with a road-map for the further development and application of GNNs.
△ Less
Submitted 5 August, 2022;
originally announced August 2022.
-
On scientific understanding with artificial intelligence
Authors:
Mario Krenn,
Robert Pollice,
Si Yue Guo,
Matteo Aldeghi,
Alba Cervera-Lierta,
Pascal Friederich,
Gabriel dos Passos Gomes,
Florian Häse,
Adrian Jinich,
AkshatKumar Nigam,
Zhenpeng Yao,
Alán Aspuru-Guzik
Abstract:
Imagine an oracle that correctly predicts the outcome of every particle physics experiment, the products of every chemical reaction, or the function of every protein. Such an oracle would revolutionize science and technology as we know them. However, as scientists, we would not be satisfied with the oracle itself. We want more. We want to comprehend how the oracle conceived these predictions. This…
▽ More
Imagine an oracle that correctly predicts the outcome of every particle physics experiment, the products of every chemical reaction, or the function of every protein. Such an oracle would revolutionize science and technology as we know them. However, as scientists, we would not be satisfied with the oracle itself. We want more. We want to comprehend how the oracle conceived these predictions. This feat, denoted as scientific understanding, has frequently been recognized as the essential aim of science. Now, the ever-growing power of computers and artificial intelligence poses one ultimate question: How can advanced artificial systems contribute to scientific understanding or achieve it autonomously?
We are convinced that this is not a mere technical question but lies at the core of science. Therefore, here we set out to answer where we are and where we can go from here. We first seek advice from the philosophy of science to understand scientific understanding. Then we review the current state of the art, both from literature and by collecting dozens of anecdotes from scientists about how they acquired new conceptual understanding with the help of computers. Those combined insights help us to define three dimensions of android-assisted scientific understanding: The android as a I) computational microscope, II) resource of inspiration and the ultimate, not yet existent III) agent of understanding. For each dimension, we explain new avenues to push beyond the status quo and unleash the full power of artificial intelligence's contribution to the central aim of science. We hope our perspective inspires and focuses research towards androids that get new scientific understanding and ultimately bring us closer to true artificial scientists.
△ Less
Submitted 4 April, 2022;
originally announced April 2022.
-
SELFIES and the future of molecular string representations
Authors:
Mario Krenn,
Qianxiang Ai,
Senja Barthel,
Nessa Carson,
Angelo Frei,
Nathan C. Frey,
Pascal Friederich,
Théophile Gaudin,
Alberto Alexander Gayle,
Kevin Maik Jablonka,
Rafael F. Lameiro,
Dominik Lemm,
Alston Lo,
Seyed Mohamad Moosavi,
José Manuel Nápoles-Duarte,
AkshatKumar Nigam,
Robert Pollice,
Kohulan Rajan,
Ulrich Schatzschneider,
Philippe Schwaller,
Marta Skreta,
Berend Smit,
Felix Strieth-Kalthoff,
Chong Sun,
Gary Tom
, et al. (6 additional authors not shown)
Abstract:
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool…
▽ More
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100\% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
△ Less
Submitted 31 March, 2022;
originally announced April 2022.
-
Implementing graph neural networks with TensorFlow-Keras
Authors:
Patrick Reiser,
Andre Eberhard,
Pascal Friederich
Abstract:
Graph neural networks are a versatile machine learning architecture that received a lot of attention recently. In this technical report, we present an implementation of convolution and pooling layers for TensorFlow-Keras models, which allows a seamless and flexible integration into standard Keras layers to set up graph models in a functional way. This implies the usage of mini-batches as the first…
▽ More
Graph neural networks are a versatile machine learning architecture that received a lot of attention recently. In this technical report, we present an implementation of convolution and pooling layers for TensorFlow-Keras models, which allows a seamless and flexible integration into standard Keras layers to set up graph models in a functional way. This implies the usage of mini-batches as the first tensor dimension, which can be realized via the new RaggedTensor class of TensorFlow best suited for graphs. We developed the Keras Graph Convolutional Neural Network Python package kgcnn based on TensorFlow-Keras that provides a set of Keras layers for graph networks which focus on a transparent tensor structure passed between layers and an ease-of-use mindset.
△ Less
Submitted 7 March, 2021;
originally announced March 2021.
-
Analyzing dynamical disorder for charge transport in organic semiconductors via machine learning
Authors:
Patrick Reiser,
Manuel Konrad,
Artem Fediai,
Salvador Léon,
Wolfgang Wenzel,
Pascal Friederich
Abstract:
Organic semiconductors are indispensable for today's display technologies in form of organic light emitting diodes (OLEDs) and further optoelectronic applications. However, organic materials do not reach the same charge carrier mobility as inorganic semiconductors, limiting the efficiency of devices. To find or even design new organic semiconductors with higher charge carrier mobility, computation…
▽ More
Organic semiconductors are indispensable for today's display technologies in form of organic light emitting diodes (OLEDs) and further optoelectronic applications. However, organic materials do not reach the same charge carrier mobility as inorganic semiconductors, limiting the efficiency of devices. To find or even design new organic semiconductors with higher charge carrier mobility, computational approaches, in particular multiscale models, are becoming increasingly important. However, such models are computationally very costly, especially when large systems and long time scales are required, which is the case to compute static and dynamic energy disorder, i.e. dominant factor to determine charge transport. Here we overcome this drawback by integrating machine learning models into multiscale simulations. This allows us to obtain unprecedented insight into relevant microscopic materials properties, in particular static and dynamic disorder contributions for a series of application-relevant molecules. We find that static disorder and thus the distribution of shallow traps is highly asymmetrical for many materials, impacting widely considered Gaussian disorder models. We furthermore analyse characteristic energy level fluctuation times and compare them to typical hopping rates to evaluate the importance of dynamic disorder for charge transport. We hope that our findings will significantly improve the accuracy of computational methods used to predict application relevant materials properties of organic semiconductors, and thus make these methods applicable for virtual materials design.
△ Less
Submitted 2 February, 2021;
originally announced February 2021.
-
Machine learning for rapid discovery of laminar flow channel wall modifications that enhance heat transfer
Authors:
Yuri Koide,
Arjun J. Kaithakkal,
Matthias Schniewind,
Bradley P. Ladewig,
Alexander Stroh,
Pascal Friederich
Abstract:
Numerical simulation of fluids plays an essential role in modeling many physical phenomena, which enables technological advancements, contributes to sustainable practices, and expands our understanding of various natural and engineered systems. The calculation of heat transfer in fluid flow in simple flat channels is a relatively easy task for various simulation methods. However, once the channel…
▽ More
Numerical simulation of fluids plays an essential role in modeling many physical phenomena, which enables technological advancements, contributes to sustainable practices, and expands our understanding of various natural and engineered systems. The calculation of heat transfer in fluid flow in simple flat channels is a relatively easy task for various simulation methods. However, once the channel geometry becomes more complex, numerical simulations become a bottleneck in optimizing wall geometries. We present a combination of accurate numerical simulations of arbitrary, flat, and non-flat channels and machine learning models predicting drag coefficient and Stanton number. We show that convolutional neural networks (CNN) can accurately predict the target properties at a fraction of the time of numerical simulations. We use the CNN models in a virtual high-throughput screening approach to explore a large number of possible, randomly generated wall architectures. Data Augmentation was applied to existing geometries data to add generated new training data which have the same number of parameters of heat transfer to improve the model's generalization. The general approach is not only applicable to simple flow setups as presented here but can be extended to more complex tasks, such as multiphase or even reactive unit operations in chemical engineering.
△ Less
Submitted 8 August, 2023; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Scientific intuition inspired by machine learning generated hypotheses
Authors:
Pascal Friederich,
Mario Krenn,
Isaac Tamblyn,
Alan Aspuru-Guzik
Abstract:
Machine learning with application to questions in the physical sciences has become a widely used tool, successfully applied to classification, regression and optimization tasks in many areas. Research focus mostly lies in improving the accuracy of the machine learning models in numerical predictions, while scientific understanding is still almost exclusively generated by human researchers analysin…
▽ More
Machine learning with application to questions in the physical sciences has become a widely used tool, successfully applied to classification, regression and optimization tasks in many areas. Research focus mostly lies in improving the accuracy of the machine learning models in numerical predictions, while scientific understanding is still almost exclusively generated by human researchers analysing numerical results and drawing conclusions. In this work, we shift the focus on the insights and the knowledge obtained by the machine learning models themselves. In particular, we study how it can be extracted and used to inspire human scientists to increase their intuitions and understanding of natural systems. We apply gradient boosting in decision trees to extract human interpretable insights from big data sets from chemistry and physics. In chemistry, we not only rediscover widely know rules of thumb but also find new interesting motifs that tell us how to control solubility and energy levels of organic molecules. At the same time, in quantum physics, we gain new understanding on experiments for quantum entanglement. The ability to go beyond numerics and to enter the realm of scientific insight and hypothesis generation opens the door to use machine learning to accelerate the discovery of conceptual understanding in some of the most challenging domains of science.
△ Less
Submitted 14 December, 2020; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Neural Message Passing on High Order Paths
Authors:
Daniel Flam-Shepherd,
Tony Wu,
Pascal Friederich,
Alan Aspuru-Guzik
Abstract:
Graph neural network have achieved impressive results in predicting molecular properties, but they do not directly account for local and hidden structures in the graph such as functional groups and molecular geometry. At each propagation step, GNNs aggregate only over first order neighbours, ignoring important information contained in subsequent neighbours as well as the relationships between thos…
▽ More
Graph neural network have achieved impressive results in predicting molecular properties, but they do not directly account for local and hidden structures in the graph such as functional groups and molecular geometry. At each propagation step, GNNs aggregate only over first order neighbours, ignoring important information contained in subsequent neighbours as well as the relationships between those higher order connections. In this work, we generalize graph neural nets to pass messages and aggregate across higher order paths. This allows for information to propagate over various levels and substructures of the graph. We demonstrate our model on a few tasks in molecular property prediction.
△ Less
Submitted 24 February, 2020;
originally announced February 2020.
-
Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space
Authors:
AkshatKumar Nigam,
Pascal Friederich,
Mario Krenn,
Alán Aspuru-Guzik
Abstract:
Challenges in natural sciences can often be phrased as optimization problems. Machine learning techniques have recently been applied to solve such problems. One example in chemistry is the design of tailor-made organic materials and molecules, which requires efficient methods to explore the chemical space. We present a genetic algorithm (GA) that is enhanced with a neural network (DNN) based discr…
▽ More
Challenges in natural sciences can often be phrased as optimization problems. Machine learning techniques have recently been applied to solve such problems. One example in chemistry is the design of tailor-made organic materials and molecules, which requires efficient methods to explore the chemical space. We present a genetic algorithm (GA) that is enhanced with a neural network (DNN) based discriminator model to improve the diversity of generated molecules and at the same time steer the GA. We show that our algorithm outperforms other generative models in optimization tasks. We furthermore present a way to increase interpretability of genetic algorithms, which helped us to derive design principles.
△ Less
Submitted 15 January, 2020; v1 submitted 25 September, 2019;
originally announced September 2019.
-
Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation
Authors:
Mario Krenn,
Florian Häse,
AkshatKumar Nigam,
Pascal Friederich,
Alán Aspuru-Guzik
Abstract:
The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have…
▽ More
The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard strings molecular representation SMILES shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100\% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.
△ Less
Submitted 4 March, 2020; v1 submitted 31 May, 2019;
originally announced May 2019.