-
Integer linear programming for unsupervised training set selection in molecular machine learning
Authors:
Matthieu Haeberle,
Puck van Gerwen,
Ruben Laplaza,
Ksenia R. Briling,
Jan Weinreich,
Friedrich Eisenbrand,
Clemence Corminboeuf
Abstract:
Integer linear programming (ILP) is an elegant approach to solving linear optimization problems that are naturally described using integer decision variables. Within the context of physics-inspired machine learning applied to chemistry, we demonstrate the relevance of an ILP formulation to select molecular training sets for predictions of size-extensive properties. We show that our algorithm outperforms existing unsupervised training set selection approaches, especially when predicting properties of molecules larger than those present in the training set. We argue that the improved performance stems from a selection based on the notion of local (i.e., per-atom) similarity and from a unique ILP approach that finds optimal solutions efficiently. Altogether, this work provides a practical algorithm to improve the performance of physics-inspired machine learning models and offers insights into the conceptual differences with existing training set selection approaches.
Submitted 1 May, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
3DReact: Geometric deep learning for chemical reactions
Authors:
Puck van Gerwen,
Ksenia R. Briling,
Charlotte Bunne,
Vignesh Ram Somnath,
Ruben Laplaza,
Andreas Krause,
Clemence Corminboeuf
Abstract:
Geometric deep learning models, which incorporate the relevant molecular symmetries within the neural network architecture, have considerably improved the accuracy and data efficiency of predictions of molecular properties. Building on this success, we introduce 3DReact, a geometric deep learning model to predict reaction properties from three-dimensional structures of reactants and products. We demonstrate that the invariant version of the model is sufficient for existing reaction datasets. We illustrate its competitive performance on the prediction of activation barriers on the GDB7-22-TS, Cyclo-23-TS and Proparg-21-TS datasets in different atom-mapping regimes. We show that, compared to existing models for reaction property prediction, 3DReact offers a flexible framework that exploits atom-mapping information, if available, as well as geometries of reactants and products (in an invariant or equivariant fashion). Accordingly, it performs systematically well across different datasets, atom-mapping regimes, as well as both interpolation and extrapolation tasks.
Submitted 12 July, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Assessing the persistence of chalcogen bonds in solution with neural network potentials
Authors:
Veronika Juraskova,
Frederic Celerse,
Ruben Laplaza,
Clemence Corminboeuf
Abstract:
Non-covalent bonding patterns are commonly harnessed as a design principle in fields such as catalysis, supramolecular chemistry and functional materials. Yet, their computational description generally neglects finite temperature and environment effects, which promote competing interactions and alter their static gas-phase properties. Recently, neural network potentials (NNPs) trained on Density Functional Theory (DFT) data have become increasingly popular to simulate molecular phenomena in condensed phase with an accuracy comparable to ab initio methods. To date, most applications have centered on solid-state materials or fairly simple molecules made of a limited number of elements. Herein, we focus on the persistence and strength of chalcogen bonds involving a benzotelluradiazole in condensed phase. While tellurium-containing heteroaromatic molecules are known to exhibit pronounced interactions with anions and lone pairs of different atoms, the relevance of competing intermolecular interactions, notably with the solvent, is not only complicated to monitor experimentally but also challenging to model at an accurate electronic structure level. Here, we train direct and baselined NNPs to reproduce hybrid DFT energies and forces in order to identify the most prevalent non-covalent interactions occurring in a solute-Cl$^-$-THF mixture. The simulations in explicit solvent highlight the clear competition with chalcogen bonds formed with the solvent and the short-range directionality of the interaction, with direct consequences for the molecular properties in solution. The comparison with other potentials (e.g., AMOEBA, direct NNP and continuum solvent model) also demonstrates that baselined NNPs offer a reliable picture of the non-covalent interaction interplay occurring in solution.
Submitted 12 January, 2022;
originally announced January 2022.
-
Overcoming Distrust in Solid State Simulations: Adding Error Bars to Computational Data
Authors:
Francesca Peccati,
Rubén Laplaza,
Julia Contreras-García
Abstract:
Simulation techniques provide, with each passing day, deeper insight into the structure and properties of materials. Two main obstacles hinder the cooperation of simulation and experiment: on the one hand, the frequent lack of a degree of uncertainty associated with calculated data; on the other, the concomitant underlying feeling that calculation parameters can be tuned with the explicit aim of matching the experimental results, even at the expense of the quality of the simulation. Without the definition of an error bar for estimating the precision of the calculation, direct comparison of calculated and experimental data can lack physical significance. In this contribution, we employ the well-known delocalization error of DFT and HF to develop a simple and robust procedure to quickly estimate an error bar for calculated quantities in the field of solid state chemistry. First, we validate our model on one of the simplest properties of a solid, the geometry of its unit cell, which can be determined experimentally with high accuracy. In this case, our computational window is too large to provide a useful error bar. However, it provides computational materials scientists with a pointer on how much a given system is affected by the method of choice, i.e. how sensitive it is to parameter tuning and how much care should be taken in doing it. Then, we move to another quantity which has a greater experimental uncertainty, namely transition pressure, and show that our approach can lead to error bars comparable to experiment. Hence, experiment and theory can be compared on an even basis, taking into account the uncertainty introduced by the scientist, both in the measuring conditions and in the tuning of computational parameters.
Submitted 28 June, 2018;
originally announced June 2018.