Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification

Authors: Jonas Teufel, Annika Leinweber, Pascal Friederich

Abstract: Explainable AI (xAI) interventions aim to improve interpretability for complex black-box models, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular structure cause the greatest deviation in the predicted property. However, such explanations only allow for meaningful scientific insights if they reflect the distribution of the true underlying property -- a feature we define as counterfactual truthfulness. To increase this truthfulness, we propose the integration of uncertainty estimation techniques to filter counterfactual candidates with high predicted uncertainty. Through computational experiments with synthetic and real-world datasets, we demonstrate that traditional uncertainty estimation methods, such as ensembles and mean-variance estimation, can already substantially reduce the average prediction error and increase counterfactual truthfulness, especially for out-of-distribution settings. Our results highlight the importance and potential impact of incorporating uncertainty estimation into explainability methods, especially considering the relatively high effectiveness of low-effort interventions like model ensembles.

Submitted 3 April, 2025; originally announced April 2025.

Comments: 24 pages, 5 figures, 4 tables, accepted at the 3rd xAI World Conference

arXiv:2503.05577 [pdf, other]

opXRD: Open Experimental Powder X-ray Diffraction Database

Authors: Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie, Arthur Hardiagon, Alexander Wieczorek, Siarhei Zhuk, Ruth Schwaiger, Sebastian Siol, François-Xavier Coudert, Moritz Wolf, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi Zhang, Pascal Friederich

Abstract: Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs.

Submitted 10 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

arXiv:2502.03146 [pdf, other]

Symmetry-Aware Bayesian Flow Networks for Crystal Generation

Authors: Laura Ruple, Luca Torresi, Henrik Schopmans, Pascal Friederich

Abstract: The discovery of new crystalline materials is essential to scientific and technological progress. However, traditional trial-and-error approaches are inefficient due to the vast search space. Recent advancements in machine learning have enabled generative models to predict new stable materials by incorporating structural symmetries and to condition the generation on desired properties. In this work, we introduce SymmBFN, a novel symmetry-aware Bayesian Flow Network (BFN) for crystalline material generation that accurately reproduces the distribution of space groups found in experimentally observed crystals. SymmBFN substantially improves efficiency, generating stable structures at least 50 times faster than the next-best method. Furthermore, we demonstrate its capability for property-conditioned generation, enabling the design of materials with tailored properties. Our findings establish BFNs as an effective tool for accelerating the discovery of crystalline materials.

Submitted 14 February, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2501.19077 [pdf, other]

Temperature-Annealed Boltzmann Generators

Authors: Henrik Schopmans, Pascal Friederich

Abstract: Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.

Submitted 31 January, 2025; originally announced January 2025.

arXiv:2412.00401 [pdf, other]

PAL -- Parallel active learning for machine-learned potentials

Authors: Chen Zhou, Marlen Neubert, Yuri Koide, Yumeng Zhang, Van-Quan Vuong, Tobias Schlöder, Stefanie Dehnen, Pascal Friederich

Abstract: Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilization of modern computational resources. In this work, we introduce PAL, an automated, modular, and parallel active learning library that integrates AL tasks and manages their execution and communication on shared- and distributed-memory systems using the Message Passing Interface (MPI). PAL provides users with the flexibility to design and customize all components of their active learning scenarios, including machine learning models with uncertainty estimation, oracles for ground truth labeling, and strategies for exploring the target space. We demonstrate that PAL significantly reduces computational overhead and improves scalability, achieving substantial speed-ups through asynchronous parallelization on CPU and GPU hardware. Applications of PAL to several real-world scenarios - including ground-state reactions in biomolecular systems, excited-state dynamics of molecules, simulations of inorganic clusters, and thermo-fluid dynamics - illustrate its effectiveness in accelerating the development of machine learning models. Our results show that PAL enables efficient utilization of high-performance computing resources in active learning workflows, fostering advancements in scientific research and engineering applications.

Submitted 30 November, 2024; originally announced December 2024.

Comments: 25 pages, 4 figures, and 1 table (references and SI included)

arXiv:2407.00729 [pdf]

Discovering one molecule out of a million: inverse design of molecular hole transporting semiconductors tailored for perovskite solar cells

Authors: Jianchang Wu, Luca Torresi, ManMan Hu, Patrick Reiser, Jiyun Zhang, Juan S. Rocha-Ortiz, Luyao Wang, Zhiqiang Xie, Kaicheng Zhang, Byung-wook Park, Anastasia Barabash, Yicheng Zhao, Junsheng Luo, Yunuo Wang, Larry Lüer, Lin-Long Deng, Jens A. Hauch, Sang Il Seok, Pascal Friederich, Christoph J. Brabec

Abstract: The inverse design of tailored organic molecules for specific optoelectronic devices of high complexity holds an enormous potential, but has not yet been realized [1,2]. The complexity and literally infinite diversity of conjugated molecular structures present both an unprecedented opportunity for technological breakthroughs and an unseen optimization challenge. Current models rely on big data which do not exist for specialized research films. However, a hybrid computational and high throughput experimental screening workflow allowed us to train predictive models with as little as 149 molecules. We demonstrate a unique closed-loop workflow combining high throughput synthesis and Bayesian optimization that discovers new hole transporting materials with tailored properties for solar cell applications. A series of high-performance molecules were identified from minimal suggestions, achieving up to 26.23% (certified 25.88%) power conversion efficiency in perovskite solar cells. Our work paves the way for rapid, informed discovery in vast molecular libraries, revolutionizing material selection for complex devices. We believe that our approach can be generalized to other emerging fields and indeed accelerate the development of optoelectronic semiconductor devices in general.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 21 pages, 5 figures

arXiv:2404.16532 [pdf, other]

Global Concept Explanations for Graphs by Contrastive Learning

Authors: Jonas Teufel, Pascal Friederich

Abstract: Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks to develop a deeper understanding of the task's underlying structure-property relationships. We identify concept explanations as dense clusters in the self-explaining Megan model's subgraph latent space. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 25 pages, 9 figures, accepted at xAI world conference 2024

arXiv:2402.01195 [pdf, other]

Conditional Normalizing Flows for Active Learning of Coarse-Grained Molecular Representations

Authors: Henrik Schopmans, Pascal Friederich

Abstract: Efficient sampling of the Boltzmann distribution of molecular systems is a long-standing challenge. Recently, instead of generating long molecular dynamics simulations, generative machine learning methods such as normalizing flows have been used to learn the Boltzmann distribution directly, without samples. However, this approach is susceptible to mode collapse and thus often does not explore the full configurational space. In this work, we address this challenge by separating the problem into two levels, the fine-grained and coarse-grained degrees of freedom. A normalizing flow conditioned on the coarse-grained space yields a probabilistic connection between the two levels. To explore the configurational space, we employ coarse-grained simulations with active learning which allows us to update the flow and make all-atom potential energy evaluations only when necessary. Using alanine dipeptide as an example, we show that our methods obtain a speedup to molecular dynamics simulations of approximately 15.9 to 216.2 compared to the speedup of 4.5 of the current state-of-the-art machine learning approach.

Submitted 24 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Journal ref: Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR 235:43804-43827, 2024

arXiv:2310.07918 [pdf, other]

Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning

Authors: Jannik Deuschel, Caleb N. Ellington, Yingtao Luo, Benjamin J. Lengerich, Pascal Friederich, Eric P. Xing

Abstract: Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models force a tradeoff between accuracy and interpretability, limiting data-driven interpretations of human decision-making processes. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically under different contexts. Thus, we develop Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem, where each context poses a unique task and complex decision policies can be constructed piece-wise from many simple context-specific policies. CPR models each context-specific policy as a linear map, and generates new policy models $\textit{on-demand}$ as contexts are updated with new observations. We provide two flavors of the CPR framework: one focusing on exact local interpretability, and one retaining full global interpretability. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement, CPR closes the accuracy gap between interpretable and black-box methods, allowing high-resolution exploration and analysis of context-specific decision models.

Submitted 7 May, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2306.11688 [pdf, other]

doi 10.1038/s41524-024-01259-w

JARVIS-Leaderboard: A Large Scale Benchmark of Materials Design Methods

Authors: Kamal Choudhary, Daniel Wines, Kangming Li, Kevin F. Garrity, Vishu Gupta, Aldo H. Romero, Jaron T. Krogel, Kayahan Saritas, Addis Fuhr, Panchapakesan Ganesh, Paul R. C. Kent, Keqiang Yan, Yuchao Lin, Shuiwang Ji, Ben Blaiszik, Patrick Reiser, Pascal Friederich, Ankit Agrawal, Pratyush Tiwary, Eric Beyerle, Peter Minch, Trevor David Rhone, Ichiro Takeuchi, Robert B. Wexler, Arun Mannodi-Kanakkithodi, et al. (13 additional authors not shown)

Abstract: Lack of rigorous reproducibility and validation are major hurdles for scientific development across many fields. Materials science in particular encompasses a variety of experimental and theoretical approaches that require careful benchmarking. Leaderboard efforts have been developed previously to mitigate these issues. However, a comprehensive comparison and benchmarking on an integrated platform with multiple data modalities with both perfect and defect materials data is still lacking. This work introduces JARVIS-Leaderboard, an open-source and community-driven platform that facilitates benchmarking and enhances reproducibility. The platform allows users to set up benchmarks with custom tasks and enables contributions in the form of dataset, code, and meta-data submissions. We cover the following materials design categories: Artificial Intelligence (AI), Electronic Structure (ES), Force-fields (FF), Quantum Computation (QC) and Experiments (EXP). For AI, we cover several types of input data, including atomic structures, atomistic images, spectra, and text. For ES, we consider multiple ES approaches, software packages, pseudopotentials, materials, and properties, comparing results to experiment. For FF, we compare multiple approaches for material property predictions. For QC, we benchmark Hamiltonian simulations using various quantum algorithms and circuits. Finally, for experiments, we use the inter-laboratory approach to establish benchmarks. There are 1281 contributions to 274 benchmarks using 152 methods with more than 8 million data-points, and the leaderboard is continuously expanding. The JARVIS-Leaderboard is available at the website: https://pages.nist.gov/jarvis_leaderboard

Submitted 26 March, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2306.02206 [pdf]

Mitigating Molecular Aggregation in Drug Discovery with Predictive Insights from Explainable AI

Authors: Hunter Sturm, Jonas Teufel, Kaitlin A. Isfeld, Pascal Friederich, Rebecca L. Davis

Abstract: Herein, we present the application of MEGAN, our explainable AI (xAI) model, for the identification of small colloidally aggregating molecules (SCAMs). This work offers solutions to the long-standing problem of false positives caused by SCAMs in high throughput screening for drug discovery and demonstrates the power of xAI in the classification of molecular properties that are not chemically intuitive based on our current understanding. We leverage xAI insights and molecular counterfactuals to design alternatives to problematic compounds in drug screening libraries. Additionally, we experimentally validate the MEGAN prediction classification for one of the counterfactuals and demonstrate the utility of counterfactuals for altering the aggregation properties of a compound through minor structural modifications. The integration of this method in high-throughput screening approaches will help combat and circumvent false positives, providing better lead molecules more rapidly and thus accelerating drug discovery cycles.

Submitted 27 May, 2025; v1 submitted 3 June, 2023; originally announced June 2023.

Comments: 10 pages, 6 figures, one TOC figure, plus SI

arXiv:2305.15961 [pdf, other]

doi 10.1007/978-3-031-44067-0_19

Quantifying the Intrinsic Usefulness of Attributional Explanations for Graph Neural Networks with Artificial Simulatability Studies

Authors: Jonas Teufel, Luca Torresi, Pascal Friederich

Abstract: Despite the increasing relevance of explainable AI, assessing the quality of explanations remains a challenging issue. Due to the high costs associated with human-subject experiments, various proxy metrics are often used to approximately quantify explanation quality. Generally, one possible interpretation of the quality of an explanation is its inherent value for teaching a related concept to a student. In this work, we extend artificial simulatability studies to the domain of graph neural networks. Instead of costly human trials, we use explanation-supervisable graph neural networks to perform simulatability studies to quantify the inherent usefulness of attributional graph explanations. We perform an extensive ablation study to investigate the conditions under which the proposed analyses are most meaningful. We additionally validate our method's applicability on real-world graph classification and regression datasets. We find that relevant explanations can significantly boost the sample efficiency of graph neural networks and analyze the robustness towards noise and bias in the explanations. We believe that the notion of usefulness obtained from our proposed simulatability analysis provides a dimension of explanation quality that is largely orthogonal to the common practice of faithfulness and has great potential to expand the toolbox of explanation quality assessments, specifically for graph explanations.

Submitted 25 May, 2023; originally announced May 2023.

Comments: 22 pages, accepted at xAI conference 2023 Portugal

arXiv:2305.07867 [pdf]

doi 10.1021/jacs.3c03271

An integrated system built for small-molecule semiconductors via high-throughput approaches

Authors: Jianchang Wu, Jiyun Zhang, Manman Hu, Patrick Reiser, Luca Torresi, Pascal Friederich, Leopold Lahn, Olga Kasian, Dirk M. Guldi, M. Eugenia Pérez-Ojeda, Anastasia Barabash, Juan S. Rocha-Ortiz, Yicheng Zhao, Zhiqiang Xie, Junsheng Luo, Yunuo Wang, Sang Il Seok, Jens A. Hauch, Christoph J. Brabec

Abstract: High-throughput synthesis of solution-processable structurally variable small-molecule semiconductors is both an opportunity and a challenge. A large number of diverse molecules provide a possibility for quick material discovery and machine learning based on experimental data. However, the diversity of molecular structure leads to the complexity of molecular properties, such as solubility, polarity, and crystallinity, which poses great challenges to solution processing and purification. Here, we first report an integrated system for the high-throughput synthesis, purification, and characterization of molecules with a large variety. Based on the principle of "like dissolves like", we combine theoretical calculations and a robotic platform to accelerate the purification of those molecules. With this platform, a material library containing 125 molecules and their optical-electric properties was built within a timeframe of weeks. More importantly, the highly repeatable recrystallization procedure we designed offers a reliable approach to further upgrading and industrial production.

Submitted 13 May, 2023; originally announced May 2023.

Comments: 18 pages, 5 figures

Journal ref: J. Am. Chem. Soc. 2023, 145, 30, 1651-16525

arXiv:2304.11120 [pdf, other]

What is missing in autonomous discovery: Open challenges for the community

Authors: Phillip M. Maffettone, Pascal Friederich, Sterling G. Baird, Ben Blaiszik, Keith A. Brown, Stuart I. Campbell, Orion A. Cohen, Tantum Collins, Rebecca L. Davis, Ian T. Foster, Navid Haghmoradi, Mark Hereld, Nicole Jung, Ha-Kyung Kwon, Gabriella Pizzuto, Jacob Rintamaki, Casper Steinmann, Luca Torresi, Shijing Sun

Abstract: Self-driving labs (SDLs) leverage combinations of artificial intelligence, automation, and advanced computing to accelerate scientific discovery. The promise of this field has given rise to a rich community of passionate scientists, engineers, and social scientists, as evidenced by the development of the Acceleration Consortium and recent Accelerate Conference. Despite its strengths, this rapidly developing field presents numerous opportunities for growth, challenges to overcome, and potential risks of which to remain aware. This community perspective builds on a discourse instantiated during the first Accelerate Conference, and looks to the future of self-driving labs with a tempered optimism. Incorporating input from academia, government, and industry, we briefly describe the current status of self-driving labs, then turn our attention to barriers, opportunities, and a vision for what is possible. Our field is delivering solutions in technology and infrastructure, artificial intelligence and knowledge generation, and education and workforce development. In the spirit of community, we intend for this work to foster discussion and drive best practices as our field grows.

Submitted 2 May, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:2303.11699 [pdf, other]

doi 10.1039/D3DD00071K

Neural networks trained on synthetically generated crystals can extract structural information from ICSD powder X-ray diffractograms

Authors: Henrik Schopmans, Patrick Reiser, Pascal Friederich

Abstract: Machine learning techniques have successfully been used to extract structural information such as the crystal space group from powder X-ray diffractograms. However, training directly on simulated diffractograms from databases such as the ICSD is challenging due to its limited size, class-inhomogeneity, and bias toward certain structure types. We propose an alternative approach of generating synthetic crystals with random coordinates by using the symmetry operations of each space group. Based on this approach, we demonstrate online training of deep ResNet-like models on up to a few million unique on-the-fly generated synthetic diffractograms per hour. For our chosen task of space group classification, we achieved a test accuracy of 79.9% on unseen ICSD structure types from most space groups. This surpasses the 56.1% accuracy of the current state-of-the-art approach of training on ICSD crystals directly. Our results demonstrate that synthetically generated crystals can be used to extract structural information from ICSD powder diffractograms, which makes it possible to apply very large state-of-the-art machine learning models in the area of powder X-ray diffraction. We further show first steps toward applying our methodology to experimental data, where automated XRD data analysis is crucial, especially in high-throughput settings. While we focused on the prediction of the space group, our approach has the potential to be extended to related tasks in the future.

Submitted 19 September, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

Journal ref: Digital Discovery, 2023,2, 1414-1424

arXiv:2303.08708 [pdf]

doi 10.1038/s41597-023-02486-4

Accurate GW frontier orbital energies of 134 kilo molecules

Authors: Artem Fediai, Patrick Reiser, Jorge Enrique Olivares Peña, Pascal Friederich, Wolfgang Wenzel

Abstract: The QM9 dataset [Scientific Data, Vol. 1, 140022 (2014)] became a standard dataset to benchmark machine learning methods, especially on molecular graphs. It contains geometries as well as multiple computed molecular properties of 133,885 compounds at B3LYP/6-31G(2df,p) level of theory, including frontier orbital (HOMO and LUMO) energies. However, the accuracy of HOMO/LUMO predictions from density functional theory, including hybrid methods such as B3LYP, is limited for many applications. In contrast, the GW method significantly improves HOMO/LUMO prediction accuracy, with mean unsigned errors in the GW100 benchmark dataset of 100 meV. In this work, we present a new dataset of HOMO/LUMO energies for the QM9 compounds, computed using the GW method. This database may serve as a benchmark of HOMO/LUMO prediction, delta-learning, and transfer learning, particularly for larger molecules where GW is the most accurate but still numerically feasible method. We expect this dataset to enable the development of more accurate machine learning models for predicting molecular properties.

Submitted 15 March, 2023; originally announced March 2023.

Journal ref: Sci Data 10, 581 (2023)

arXiv:2302.14102 [pdf, other]

Connectivity Optimized Nested Graph Networks for Crystal Structures

Authors: Robin Ruff, Patrick Reiser, Jan Stühmer, Pascal Friederich

Abstract: Graph neural networks (GNNs) have been applied to a large variety of applications in materials science and chemistry. Here, we recapitulate the graph construction for crystalline (periodic) materials and investigate its impact on the GNNs model performance. We suggest the asymmetric unit cell as a representation to reduce the number of atoms by using all symmetries of the system. This substantially reduced the computational cost and thus time needed to train large graph neural networks without any loss in accuracy. Furthermore, with a simple but systematically built GNN architecture based on message passing and line graph templates, we introduce a general architecture (Nested Graph Network, NGN) that is applicable to a wide range of tasks. We show that our suggested models systematically improve state-of-the-art results across all tasks within the MatBench benchmark. Further analysis shows that optimized connectivity and deeper message functions are responsible for the improvement. Asymmetric unit cells and connectivity optimization can be generally applied to (crystal) graph networks, while our suggested nested graph framework will open new ways of systematic comparison of GNN architectures.

Submitted 9 August, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: 19 pages, 13 figures

ACM Class: J.2

arXiv:2212.06071 [pdf, other]

3DSC - A New Dataset of Superconductors Including Crystal Structures

Authors: Timo Sommer, Roland Willa, Jörg Schmalian, Pascal Friederich

Abstract: Data-driven methods, in particular machine learning, can help to speed up the discovery of new materials by finding hidden patterns in existing data and using them to identify promising candidate materials. In the case of superconductors, which are a highly interesting but also a complex class of materials with many relevant applications, the use of data science tools is to date slowed down by a lack of accessible data. In this work, we present a new and publicly available superconductivity dataset ('3DSC'), featuring the critical temperature $T_\mathrm{c}$ of superconducting materials in addition to tested non-superconductors. In contrast to existing databases such as the SuperCon database which contains information on the chemical composition, the 3DSC is augmented by the approximate three-dimensional crystal structure of each material. We perform a statistical analysis and machine learning experiments to show that access to this structural information improves the prediction of the critical temperature $T_\mathrm{c}$ of materials. Furthermore, we see the 3DSC not as a finished dataset, but we provide ideas and directions for further research to improve the 3DSC in multiple ways. We are confident that this database will be useful in applying state-of-the-art machine learning methods to eventually find new superconductors.

Submitted 14 December, 2022; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: 15 pages + 10 pages of supporting information; UPDATE: standardised formatting, removed double dash from title & updated github links

arXiv:2211.13260 [pdf, other]

Actively Learning Costly Reward Functions for Reinforcement Learning

Authors: André Eberhard, Houssam Metni, Georg Fahland, Alexander Stroh, Pascal Friederich

Abstract: Transfer of recent advances in deep reinforcement learning to real-world applications is hindered by high data demands and thus low efficiency and scalability. Through independent improvements of components such as replay buffers or more stable learning algorithms, and through massively distributed systems, training time could be reduced from several days to several hours for standard benchmark tasks. However, while rewards in simulated environments are well-defined and easy to compute, reward evaluation becomes the bottleneck in many real-world environments, e.g., in molecular optimization tasks, where computationally demanding simulations or even experiments are required to evaluate states and to quantify rewards. Therefore, training might become prohibitively expensive without an extensive amount of computational resources and time. We propose to alleviate this problem by replacing costly ground-truth rewards with rewards modeled by neural networks, counteracting non-stationarity of state and reward distributions during training with an active learning component. We demonstrate that using our proposed ACRL method (Actively learning Costly rewards for Reinforcement Learning), it is possible to train agents in complex real-world environments orders of magnitude faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions to real-world optimization problems in chemistry, materials science and engineering.

Submitted 23 November, 2022; originally announced November 2022.

arXiv:2211.13236 [pdf, other]

doi 10.1007/978-3-031-44067-0_18

MEGAN: Multi-Explanation Graph Attention Network

Authors: Jonas Teufel, Luca Torresi, Patrick Reiser, Pascal Friederich

Abstract: We propose a multi-explanation graph attention network (MEGAN). Unlike existing graph explainability methods, our network can produce node and edge attributional explanations along multiple channels, the number of which is independent of task specifications. This proves crucial to improve the interpretability of graph regression predictions, as explanations can be split into positive and negative evidence w.r.t. a reference value. Additionally, our attention-based network is fully differentiable and explanations can actively be trained in an explanation-supervised manner. We first validate our model on a synthetic graph regression dataset with known ground-truth explanations. Our network outperforms existing baseline explainability methods for the single- as well as the multi-explanation case, achieving near-perfect explanation accuracy during explanation supervision. Finally, we demonstrate our model's capabilities on multiple real-world datasets. We find that our model produces sparse high-fidelity explanations consistent with human intuition about those tasks.

Submitted 25 May, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

Comments: 24 pages, accepted for xAI 2023 conference portugal

arXiv:2210.09517 [pdf, other]

Graph neural networks to learn joint representations of disjoint molecular graphs

Authors: Chen Shao, Zhou Chen, Pascal Friederich

Abstract: Graph neural networks are widely used to learn global representations of graphs, which are then used for regression or classification tasks. Typically, the graphs in such data sets are connected, i.e. each training sample consists of a single internally connected graph associated with a global label. However, there is a wide variety of yet unconsidered but application-relevant tasks, where labels are assigned to sets of disjoint graphs, which requires the generation of global representations of disjoint graphs. In this paper, we present a new data set with chemical reactions, which illustrates this task. Each sample consists of a pair of disjoint molecular graphs and a joint label representing a scalar measure associated with the chemical reaction of the molecules. We show the initial results of graph neural networks that are able to solve the task within a combinatorial subset of the dataset but do not generalize well to the full data set and unseen (sub)graphs.

Submitted 30 October, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: 5 pages, 4 figures

arXiv:2208.09481 [pdf, other]

Graph neural networks for materials science and chemistry

Authors: Patrick Reiser, Marlen Neubert, André Eberhard, Luca Torresi, Chen Zhou, Chen Shao, Houssam Metni, Clint van Hoesel, Henrik Schopmans, Timo Sommer, Pascal Friederich

Abstract: Machine learning plays an increasingly important role in many areas of chemistry and materials science, e.g. to predict materials properties, to accelerate simulations, to design new materials, and to predict synthesis routes of new materials. Graph neural networks (GNNs) are one of the fastest growing classes of machine learning models. They are of particular relevance for chemistry and materials science, as they directly work on a graph or structural representation of molecules and materials and therefore have full access to all relevant information required to characterize materials. In this review article, we provide an overview of the basic principles of GNNs, widely used datasets, and state-of-the-art architectures, followed by a discussion of a wide range of recent applications of GNNs in chemistry and materials science, and concluding with a road-map for the further development and application of GNNs.

Submitted 5 August, 2022; originally announced August 2022.

Comments: 37 pages, 2 figures

arXiv:2204.01467 [pdf, other]

doi 10.1038/s42254-022-00518-3

On scientific understanding with artificial intelligence

Authors: Mario Krenn, Robert Pollice, Si Yue Guo, Matteo Aldeghi, Alba Cervera-Lierta, Pascal Friederich, Gabriel dos Passos Gomes, Florian Häse, Adrian Jinich, AkshatKumar Nigam, Zhenpeng Yao, Alán Aspuru-Guzik

Abstract: Imagine an oracle that correctly predicts the outcome of every particle physics experiment, the products of every chemical reaction, or the function of every protein. Such an oracle would revolutionize science and technology as we know them. However, as scientists, we would not be satisfied with the oracle itself. We want more. We want to comprehend how the oracle conceived these predictions. This feat, denoted as scientific understanding, has frequently been recognized as the essential aim of science. Now, the ever-growing power of computers and artificial intelligence poses one ultimate question: How can advanced artificial systems contribute to scientific understanding or achieve it autonomously? We are convinced that this is not a mere technical question but lies at the core of science. Therefore, here we set out to answer where we are and where we can go from here. We first seek advice from the philosophy of science to understand scientific understanding. Then we review the current state of the art, both from literature and by collecting dozens of anecdotes from scientists about how they acquired new conceptual understanding with the help of computers. Those combined insights help us to define three dimensions of android-assisted scientific understanding: The android as a I) computational microscope, II) resource of inspiration and the ultimate, not yet existent III) agent of understanding. For each dimension, we explain new avenues to push beyond the status quo and unleash the full power of artificial intelligence's contribution to the central aim of science. We hope our perspective inspires and focuses research towards androids that get new scientific understanding and ultimately bring us closer to true artificial scientists.

Submitted 4 April, 2022; originally announced April 2022.

Comments: 13 pages, 3 figures, comments welcome!

Journal ref: Nature Reviews Physics 4, 761 (2022)

arXiv:2204.00056 [pdf, other]

doi 10.1016/j.patter.2022.100588

SELFIES and the future of molecular string representations

Authors: Mario Krenn, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C. Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, Rafael F. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, AkshatKumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom , et al. (6 additional authors not shown)

Abstract: Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.

Submitted 31 March, 2022; originally announced April 2022.

Comments: 34 pages, 15 figures, comments and suggestions for additional references are welcome!

Journal ref: Cell Patterns 3(10), 100588(2022)

arXiv:2203.03083 [pdf, other]

Charge Transfer Simulations using Hamiltonian Elements and Forces from Neural Networks

Authors: Philipp M. Dohmen, Mila Krämer, Patrick Reiser, Pascal Friederich, Marcus Elstner, Weiwei Xie

Abstract: The trajectory surface hopping method has been widely used in the simulation of charge transport in organic semiconductors. In the present study, we employ the machine learning (ML) based Hamiltonian to simulate the charge transport in anthracene and pentacene. The neural network (NN) based models are able to predict not just site energies and couplings but also the gradients of the site energy as well as off-diagonal gradients necessary for forces. We train the models on DFTB-quality data for both anthracene and pentacene. By using the obtained models in propagation simulations, we evaluate their performance in reproducing hole mobilities in these materials in terms of both quality and computational cost. The results show that the charge mobilities obtained using the NN-based Hamiltonian are in very good agreement with the charge mobilities computed using the DFTB-based Hamiltonian.

Submitted 22 March, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

arXiv:2103.04318 [pdf]

doi 10.1016/j.simpa.2021.100095

Implementing graph neural networks with TensorFlow-Keras

Authors: Patrick Reiser, Andre Eberhard, Pascal Friederich

Abstract: Graph neural networks are a versatile machine learning architecture that received a lot of attention recently. In this technical report, we present an implementation of convolution and pooling layers for TensorFlow-Keras models, which allows a seamless and flexible integration into standard Keras layers to set up graph models in a functional way. This implies the usage of mini-batches as the first tensor dimension, which can be realized via the new RaggedTensor class of TensorFlow best suited for graphs. We developed the Keras Graph Convolutional Neural Network Python package kgcnn based on TensorFlow-Keras that provides a set of Keras layers for graph networks which focus on a transparent tensor structure passed between layers and an ease-of-use mindset.

Submitted 7 March, 2021; originally announced March 2021.

Journal ref: Softw. Impacts 2021, 9, 100095

arXiv:2102.01479 [pdf]

doi 10.1021/acs.jctc.1c00191

Analyzing dynamical disorder for charge transport in organic semiconductors via machine learning

Authors: Patrick Reiser, Manuel Konrad, Artem Fediai, Salvador Léon, Wolfgang Wenzel, Pascal Friederich

Abstract: Organic semiconductors are indispensable for today's display technologies in the form of organic light emitting diodes (OLEDs) and further optoelectronic applications. However, organic materials do not reach the same charge carrier mobility as inorganic semiconductors, limiting the efficiency of devices. To find or even design new organic semiconductors with higher charge carrier mobility, computational approaches, in particular multiscale models, are becoming increasingly important. However, such models are computationally very costly, especially when large systems and long time scales are required, which is the case to compute static and dynamic energy disorder, i.e. the dominant factor that determines charge transport. Here we overcome this drawback by integrating machine learning models into multiscale simulations. This allows us to obtain unprecedented insight into relevant microscopic materials properties, in particular static and dynamic disorder contributions for a series of application-relevant molecules. We find that static disorder and thus the distribution of shallow traps is highly asymmetrical for many materials, impacting widely considered Gaussian disorder models. We furthermore analyse characteristic energy level fluctuation times and compare them to typical hopping rates to evaluate the importance of dynamic disorder for charge transport. We hope that our findings will significantly improve the accuracy of computational methods used to predict application relevant materials properties of organic semiconductors, and thus make these methods applicable for virtual materials design.

Submitted 2 February, 2021; originally announced February 2021.

Journal ref: J. Chem. Theory Comput. 2021, 17, 6, 3750-3759

arXiv:2101.08130 [pdf, other]

Machine learning for rapid discovery of laminar flow channel wall modifications that enhance heat transfer

Authors: Yuri Koide, Arjun J. Kaithakkal, Matthias Schniewind, Bradley P. Ladewig, Alexander Stroh, Pascal Friederich

Abstract: Numerical simulation of fluids plays an essential role in modeling many physical phenomena, which enables technological advancements, contributes to sustainable practices, and expands our understanding of various natural and engineered systems. The calculation of heat transfer in fluid flow in simple flat channels is a relatively easy task for various simulation methods. However, once the channel geometry becomes more complex, numerical simulations become a bottleneck in optimizing wall geometries. We present a combination of accurate numerical simulations of arbitrary, flat, and non-flat channels and machine learning models predicting drag coefficient and Stanton number. We show that convolutional neural networks (CNN) can accurately predict the target properties at a fraction of the time of numerical simulations. We use the CNN models in a virtual high-throughput screening approach to explore a large number of possible, randomly generated wall architectures. Data augmentation was applied to the existing geometry data to generate new training data with the same number of heat transfer parameters, in order to improve the model's generalization. The general approach is not only applicable to simple flow setups as presented here but can be extended to more complex tasks, such as multiphase or even reactive unit operations in chemical engineering.

Submitted 8 August, 2023; v1 submitted 19 January, 2021; originally announced January 2021.

arXiv:2010.14236 [pdf, other]

doi 10.1088/2632-2153/abda08

Scientific intuition inspired by machine learning generated hypotheses

Authors: Pascal Friederich, Mario Krenn, Isaac Tamblyn, Alan Aspuru-Guzik

Abstract: Machine learning with application to questions in the physical sciences has become a widely used tool, successfully applied to classification, regression and optimization tasks in many areas. Research focus mostly lies in improving the accuracy of the machine learning models in numerical predictions, while scientific understanding is still almost exclusively generated by human researchers analysing numerical results and drawing conclusions. In this work, we shift the focus to the insights and the knowledge obtained by the machine learning models themselves. In particular, we study how it can be extracted and used to inspire human scientists to increase their intuitions and understanding of natural systems. We apply gradient boosting in decision trees to extract human interpretable insights from big data sets from chemistry and physics. In chemistry, we not only rediscover widely known rules of thumb but also find new interesting motifs that tell us how to control solubility and energy levels of organic molecules. At the same time, in quantum physics, we gain new understanding on experiments for quantum entanglement. The ability to go beyond numerics and to enter the realm of scientific insight and hypothesis generation opens the door to use machine learning to accelerate the discovery of conceptual understanding in some of the most challenging domains of science.

Submitted 14 December, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

Journal ref: Machine Learning: Science and Technology 2, 025027 (2021)

arXiv:2002.10413 [pdf, other]

Neural Message Passing on High Order Paths

Authors: Daniel Flam-Shepherd, Tony Wu, Pascal Friederich, Alan Aspuru-Guzik

Abstract: Graph neural networks have achieved impressive results in predicting molecular properties, but they do not directly account for local and hidden structures in the graph such as functional groups and molecular geometry. At each propagation step, GNNs aggregate only over first order neighbours, ignoring important information contained in subsequent neighbours as well as the relationships between those higher order connections. In this work, we generalize graph neural nets to pass messages and aggregate across higher order paths. This allows for information to propagate over various levels and substructures of the graph. We demonstrate our model on a few tasks in molecular property prediction.

Submitted 24 February, 2020; originally announced February 2020.

arXiv:1909.11655 [pdf, other]

Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space

Authors: AkshatKumar Nigam, Pascal Friederich, Mario Krenn, Alán Aspuru-Guzik

Abstract: Challenges in natural sciences can often be phrased as optimization problems. Machine learning techniques have recently been applied to solve such problems. One example in chemistry is the design of tailor-made organic materials and molecules, which requires efficient methods to explore the chemical space. We present a genetic algorithm (GA) that is enhanced with a neural network (DNN) based discriminator model to improve the diversity of generated molecules and at the same time steer the GA. We show that our algorithm outperforms other generative models in optimization tasks. We furthermore present a way to increase interpretability of genetic algorithms, which helped us to derive design principles.

Submitted 15 January, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: 9+3 Pages, 7+4 figures, 2 tables. Comments are welcome! (code is available at: https://github.com/aspuru-guzik-group/GA)

Journal ref: International Conference on Learning Representations (ICLR-2020)

arXiv:1909.10768 [pdf, other]

From absorption spectra to charge transfer in PEDOT nanoaggregates with machine learning

Authors: Loïc M. Roch, Semion K. Saikin, Florian Häse, Pascal Friederich, Randall H. Goldsmith, Salvador León, Alán Aspuru-Guzik

Abstract: Fast and inexpensive characterization of materials properties is a key element to discover novel functional materials. In this work, we suggest an approach employing three classes of Bayesian machine learning (ML) models to correlate electronic absorption spectra of nanoaggregates with the strength of intermolecular electronic couplings in organic conducting and semiconducting materials. As a specific model system, we consider PEDOT:PSS, a cornerstone material for organic electronic applications, and so analyze the couplings between charged dimers of closely packed PEDOT oligomers that are at the heart of the material's unrivaled conductivity. We demonstrate that ML algorithms can identify correlations between the coupling strengths and the electronic absorption spectra. We also show that ML models can be trained to be transferable across a broad range of spectral resolutions, and that the electronic couplings can be predicted from the simulated spectra with an 88% accuracy when ML models are used as classifiers. Although the ML models employed in this study were trained on data generated by a multi-scale computational workflow, they were able to leverage experimental data.

Submitted 24 September, 2019; originally announced September 2019.

arXiv:1908.11854 [pdf]

The influence of impurities on the charge carrier mobility of small molecule organic semiconductors

Authors: Pascal Friederich, Artem Fediai, Jing Li, Anirban Mondal, Naresh B. Kotadiya, Franz Symalla, Gert-Jan A. H. Wetzelaer, Denis Andrienko, Xavier Blase, David Beljonne, Paul W. M. Blom, Jean-Luc Brédas, Wolfgang Wenzel

Abstract: Amorphous organic semiconductors based on small molecules and polymers are used in many applications, most prominently organic light emitting diodes (OLEDs) and organic solar cells. Impurities and charge traps are omnipresent in most currently available organic semiconductors and limit charge transport and thus device efficiency. The microscopic cause as well as the chemical nature of these traps are presently not well understood. Using a multiscale model we characterize the influence of impurities on the density of states and charge transport in small-molecule amorphous organic semiconductors. We use the model to quantitatively describe the influence of water molecules and water-oxygen complexes on the electron and hole mobilities. These species are seen to impact the shape of the density of states and to act as explicit charge traps within the energy gap. Our results show that trap states introduced by molecular oxygen can be deep enough to limit the electron mobility in widely used materials.

Submitted 8 November, 2020; v1 submitted 30 August, 2019; originally announced August 2019.

Comments: 13 pages + SI, 7 figures + TOC-graphic + 2 figures in SI

arXiv:1905.13741 [pdf, other]

doi 10.1088/2632-2153/aba947

Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation

Authors: Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, Alán Aspuru-Guzik

Abstract: The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation SMILES shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.

Submitted 4 March, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

Comments: 6+3 pages, 6+1 figures

Journal ref: Machine Learning: Science and Technology 1, 045024 (2020)

Showing 1–34 of 34 results for author: Friederich, P