-
Transferable Learning of Reaction Pathways from Geometric Priors
Authors:
Juno Nam,
Miguel Steiner,
Max Misterka,
Soojung Yang,
Avni Singhal,
Rafael Gómez-Bombarelli
Abstract:
Identifying minimum-energy paths (MEPs) is crucial for understanding chemical reaction mechanisms but remains computationally demanding. We introduce MEPIN, a scalable machine-learning method for efficiently predicting MEPs from reactant and product configurations, without relying on transition-state geometries or pre-optimized reaction paths during training. The task is defined as predicting deviations from geometric interpolations along reaction coordinates. We address this task with a continuous reaction path model based on a symmetry-broken equivariant neural network that generates a flexible number of intermediate structures. The model is trained using an energy-based objective, with efficiency enhanced by incorporating geometric priors from geodesic interpolation as initial interpolations or pre-training objectives. Our approach generalizes across diverse chemical reactions and achieves accurate alignment with reference intrinsic reaction coordinates, as demonstrated on various small molecule reactions and [3+2] cycloadditions. Our method enables the exploration of large chemical reaction spaces with efficient, data-driven predictions of reaction pathways.
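The idea of "predicting deviations from geometric interpolations" can be illustrated with a toy sketch. This is not the MEPIN model. It assumes a straight-line interpolation as a stand-in for geodesic interpolation, and an envelope `t * (1 - t)` that forces the deviation to vanish at both endpoints; that envelope and the names `path_point` and `toy_deviation` are hypothetical choices made only for illustration.

```python
from typing import Callable, List

Vec = List[float]


def linear_interpolation(reactant: Vec, product: Vec, t: float) -> Vec:
    """Point at progress t in [0, 1] on the straight line from reactant to product."""
    return [(1.0 - t) * r + t * p for r, p in zip(reactant, product)]


def path_point(reactant: Vec, product: Vec, t: float,
               deviation: Callable[[float], Vec]) -> Vec:
    """Interpolated point plus a deviation that is scaled to zero at t = 0 and t = 1."""
    base = linear_interpolation(reactant, product, t)
    scale = t * (1.0 - t)  # assumed envelope: keeps the endpoints fixed
    return [b + scale * d for b, d in zip(base, deviation(t))]


def toy_deviation(t: float) -> Vec:
    """Hypothetical stand-in for a learned model: a constant push along the second axis."""
    return [0.0, 1.0]


reactant = [0.0, 0.0]
product = [2.0, 0.0]

# Any number of intermediate structures can be sampled, since t is continuous.
path = [path_point(reactant, product, i / 4, toy_deviation) for i in range(5)]
```

Because the path is a function of a continuous parameter, the number of intermediate structures is chosen at evaluation time rather than fixed in advance.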
Submitted 21 April, 2025;
originally announced April 2025.
-
High-Throughput Transition-State Searches in Zeolite Nanopores
Authors:
Pau Ferri-Vicedo,
Alexander J. Hoffman,
Avni Singhal,
Rafael Gómez-Bombarelli
Abstract:
Zeolites are important for industrial catalytic processes involving organic molecules. Understanding molecular reaction mechanisms within the confined nanoporous environment can guide the selection of pore topologies, material compositions, and process conditions to maximize activity and selectivity. However, experimental mechanistic studies are time- and resource-intensive, and traditional molecular simulations rely heavily on expert intuition and hand manipulation of chemical structures, resulting in poor scalability.
Here, we present an automated computational pipeline for locating transition states (TS) in nanopores and exploring reaction energy landscapes of complex organic transformations in pores. Starting from the molecular structures of potential reactants and products, the Pore Transition State finder (PoTS) locates gas-phase transition states using DFT, docks them in favorable orientations near active sites in nanopores, and leverages the gas-phase reaction mode to seed condensed-phase DFT calculations using the dimer method. The approach sidesteps tedious manipulations, increases the success rate of TS searches, and eliminates the need for long path-following calculations.
This work presents the largest ensemble of zeolite-confined transition states computed at the DFT level to date, enabling rigorous analysis of mechanistic trends across frameworks, reactions, and reactant types. We demonstrate the applicability of PoTS by analyzing 644 individual reaction steps for transalkylation of diethylbenzene in BOG, IWV, UTL and FAU zeolites, and 162 individual reaction steps for skeletal isomerization in BEA, FER, FAU, MFI and MOR zeolites, finding good experimental agreement in both cases. Lastly, we propose a path to address the limitations we observe regarding unsuccessful TS searches and insufficient theory in other reactions, like alkene cracking.
Submitted 11 April, 2025;
originally announced April 2025.
-
Accelerating and enhancing thermodynamic simulations of electrochemical interfaces
Authors:
Xiaochen Du,
Mengren Liu,
Jiayu Peng,
Hoje Chun,
Alexander Hoffman,
Bilge Yildiz,
Lin Li,
Martin Z. Bazant,
Rafael Gómez-Bombarelli
Abstract:
Electrochemical interfaces are crucial in catalysis, energy storage, and corrosion, where their stability and reactivity depend on complex interactions between the electrode, adsorbates, and electrolyte. Predicting stable surface structures remains challenging, as traditional surface Pourbaix diagrams tend to either rely on expert knowledge or costly $\textit{ab initio}$ sampling, and neglect thermodynamic equilibration with the environment. Machine learning (ML) potentials can accelerate static modeling but often overlook dynamic surface transformations. Here, we extend the Virtual Surface Site Relaxation-Monte Carlo (VSSR-MC) method to autonomously sample surface reconstructions modeled under aqueous electrochemical conditions. Through fine-tuning foundational ML force fields, we accurately and efficiently predict surface energetics, recovering known Pt(111) phases and revealing new LaMnO$_\mathrm{3}$(001) surface reconstructions. By explicitly accounting for bulk-electrolyte equilibria, our framework enhances electrochemical stability predictions, offering a scalable approach to understanding and designing materials for electrochemical applications.
Submitted 22 March, 2025;
originally announced March 2025.
-
Known Unknowns: Out-of-Distribution Property Prediction in Materials and Molecules
Authors:
Nofit Segal,
Aviv Netanyahu,
Kevin P. Greenman,
Pulkit Agrawal,
Rafael Gomez-Bombarelli
Abstract:
Discovery of high-performance materials and molecules requires identifying extremes with property values that fall outside the known distribution. Therefore, the ability to extrapolate to out-of-distribution (OOD) property values is critical for both solid-state materials and molecular design. Our objective is to train predictor models that extrapolate zero-shot to higher ranges than in the training data, given the chemical compositions of solids or molecular graphs and their property values. We propose using a transductive approach to OOD property prediction, achieving improvements in prediction accuracy. In particular, the True Positive Rate (TPR) of OOD classification of materials and molecules improved by 3x and 2.5x, respectively, and precision improved by 2x and 1.5x compared to non-transductive baselines. Our method leverages analogical input-target relations in the training and test sets, enabling generalization beyond the training target support, and can be applied to any other material and molecular tasks.
Submitted 9 February, 2025;
originally announced February 2025.
-
Learning Mean First Passage Time: Chemical Short-Range Order and Kinetics of Diffusive Relaxation
Authors:
Hoje Chun,
Hao Tang,
Rafael Gomez-Bombarelli,
Ju Li
Abstract:
Long-timescale processes pose significant challenges in atomistic simulations, particularly for phenomena such as diffusion and phase transitions. We present a deep reinforcement learning (DRL)-based computational framework, combined with a temporal difference (TD) learning method, to simulate long-timescale atomic processes of diffusive relaxation. We apply it to study the emergence of chemical short-range order (SRO) in medium- and high-entropy alloys (MEAs/HEAs), which plays a crucial role in unlocking unique material properties, and find that the proposed method effectively maps the relationship between time, temperature, and SRO change. By accelerating both the sampling of lower-energy states and the simulation of transition kinetics, we identify the thermodynamic limit and the role of kinetic trapping in the SRO. Furthermore, learning the mean first passage time to a given, target SRO relaxation allows capturing realistic timescales in diffusive atomistic rearrangements. This method offers valuable guidelines for optimizing material processing and extends atomistic simulations to previously inaccessible timescales, facilitating the study of slow, thermally activated processes essential for understanding and engineering material properties.
Submitted 26 November, 2024;
originally announced November 2024.
-
Univariate Conditional Variational Autoencoder for Morphogenic Patterns Design in Frontal Polymerization-Based Manufacturing
Authors:
Qibang Liu,
Pengfei Cai,
Diab Abueidda,
Sagar Vyas,
Seid Koric,
Rafael Gomez-Bombarelli,
Philippe Geubelle
Abstract:
Under some initial and boundary conditions, the rapid reaction-thermal diffusion process taking place during frontal polymerization (FP) destabilizes the planar mode of front propagation, leading to spatially varying, complex hierarchical patterns in thermoset polymeric materials. Although modern reaction-diffusion models can predict the patterns resulting from unstable FP, the inverse design of patterns, which aims to retrieve process conditions that produce a desired pattern, remains an open challenge due to the non-unique and non-intuitive mapping between process conditions and manufactured patterns. In this work, we propose a probabilistic generative model named univariate conditional variational autoencoder (UcVAE) for the inverse design of hierarchical patterns in FP-based manufacturing. Unlike the cVAE, which encodes both the design space and the design target, the UcVAE encodes only the design space. In the encoder of the UcVAE, the number of training parameters is significantly reduced compared to the cVAE, resulting in a shorter training time while maintaining comparable performance. Given desired pattern images, the trained UcVAE can generate multiple process condition solutions that produce high-fidelity hierarchical patterns.
Submitted 31 October, 2024; v1 submitted 22 October, 2024;
originally announced October 2024.
-
Symmetry-Constrained Generation of Diverse Low-Bandgap Molecules with Monte Carlo Tree Search
Authors:
Akshay Subramanian,
James Damewood,
Juno Nam,
Kevin P. Greenman,
Avni P. Singhal,
Rafael Gómez-Bombarelli
Abstract:
Organic optoelectronic materials are a promising avenue for next-generation electronic devices due to their solution processability, mechanical flexibility, and tunable electronic properties. In particular, near-infrared (NIR) sensitive molecules have unique applications in night-vision equipment and biomedical imaging. Molecular engineering has played a crucial role in developing non-fullerene acceptors (NFAs) such as the Y-series molecules, which have significantly improved the power conversion efficiency (PCE) of solar cells and enhanced spectral coverage in the NIR region. However, systematically designing molecules with targeted optoelectronic properties while ensuring synthetic accessibility remains a challenge. To address this, we leverage structural priors from domain-focused, patent-mined datasets of organic electronic molecules using a symmetry-aware fragment decomposition algorithm and a fragment-constrained Monte Carlo Tree Search (MCTS) generator. Our approach generates candidates that retain symmetry constraints from the patent dataset, while also exhibiting red-shifted absorption, as validated by TD-DFT calculations.
Submitted 12 December, 2024; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Efficient Generation of Molecular Clusters with Dual-Scale Equivariant Flow Matching
Authors:
Akshay Subramanian,
Shuhui Qu,
Cheol Woo Park,
Sulin Liu,
Janghwan Lee,
Rafael Gómez-Bombarelli
Abstract:
Amorphous molecular solids offer a promising alternative to inorganic semiconductors, owing to their mechanical flexibility and solution processability. The packing structure of these materials plays a crucial role in determining their electronic and transport properties, which are key to enhancing the efficiency of devices like organic solar cells (OSCs). However, obtaining these optoelectronic properties computationally requires molecular dynamics (MD) simulations to generate a conformational ensemble, a process that can be computationally expensive due to the large system sizes involved. Recent advances have focused on using generative models, particularly flow-based models as Boltzmann generators, to improve the efficiency of MD sampling. In this work, we developed a dual-scale flow matching method that separates training and inference into coarse-grained and all-atom stages and enhances both the accuracy and efficiency of standard flow matching samplers. We demonstrate the effectiveness of this method on a dataset of Y6 molecular clusters obtained through MD simulations, and we benchmark its efficiency and accuracy against single-scale flow matching methods.
Submitted 9 October, 2024;
originally announced October 2024.
-
Think While You Generate: Discrete Diffusion with Planned Denoising
Authors:
Sulin Liu,
Juno Nam,
Andrew Campbell,
Hannes Stärk,
Yilun Xu,
Tommi Jaakkola,
Rafael Gómez-Bombarelli
Abstract:
Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduce Discrete Diffusion with Planned Denoising (DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such as text8, OpenWebText, and token-based image generation on ImageNet $256 \times 256$. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity. Code is available at https://github.com/liusulin/DDPD.
Submitted 9 April, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
Flow Matching for Accelerated Simulation of Atomic Transport in Materials
Authors:
Juno Nam,
Sulin Liu,
Gavin Winter,
KyuJung Jun,
Soojung Yang,
Rafael Gómez-Bombarelli
Abstract:
We introduce LiFlow, a generative framework to accelerate molecular dynamics (MD) simulations for crystalline materials that formulates the task as conditional generation of atomic displacements. The model uses flow matching, with a Propagator submodel to generate atomic displacements and a Corrector to locally correct unphysical geometries, and incorporates an adaptive prior based on the Maxwell-Boltzmann distribution to account for chemical and thermal conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four temperatures. The model obtains a consistent Spearman rank correlation of 0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen compositions. Furthermore, LiFlow generalizes from short training trajectories to larger supercells and longer simulations while maintaining high accuracy. With speed-ups of up to 600,000$\times$ compared to first-principles methods, LiFlow enables scalable simulations at significantly larger length and time scales.
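The evaluation metric mentioned above, a Spearman rank correlation between predicted and reference lithium MSD values, can be sketched as follows. This is a minimal illustration in pure Python, not the LiFlow evaluation code; the function names and the MSD values are made up for the example.

```python
from typing import List, Sequence


def mean_squared_displacement(start: Sequence[Sequence[float]],
                              end: Sequence[Sequence[float]]) -> float:
    """Squared displacement between two frames, averaged over atoms."""
    total = 0.0
    for a, b in zip(start, end):
        total += sum((y - x) ** 2 for x, y in zip(a, b))
    return total / len(start)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Made-up MSD values for four hypothetical compositions.
reference_msd = [0.1, 0.5, 2.0, 8.0]
predicted_msd = [0.4, 0.2, 3.0, 5.0]
rho = spearman(reference_msd, predicted_msd)
```

A rank-based metric of this kind rewards ordering candidates correctly by mobility, even when the absolute MSD values are off.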
Submitted 24 February, 2025; v1 submitted 2 October, 2024;
originally announced October 2024.
-
Learning Ordering in Crystalline Materials with Symmetry-Aware Graph Neural Networks
Authors:
Jiayu Peng,
James Damewood,
Jessica Karaguesian,
Jaclyn R. Lunger,
Rafael Gómez-Bombarelli
Abstract:
Graph convolutional neural networks (GCNNs) have become a machine learning workhorse for screening the chemical space of crystalline materials in fields such as catalysis and energy storage, by predicting properties from structures. Multicomponent materials, however, present a unique challenge since they can exhibit chemical (dis)order, where a given lattice structure can encompass a variety of elemental arrangements ranging from highly ordered structures to fully disordered solid solutions. Critically, properties like stability, strength, and catalytic performance depend not only on structures but also on orderings. To enable rigorous materials design, it is thus critical to ensure GCNNs are capable of distinguishing among atomic orderings. However, the ordering-aware capability of GCNNs has been poorly understood. Here, we benchmark various neural network architectures for capturing the ordering-dependent energetics of multicomponent materials in a custom-made dataset generated with high-throughput atomistic simulations. Conventional symmetry-invariant GCNNs were found unable to discern the structural difference between the diverse symmetrically inequivalent atomic orderings of the same material, while symmetry-equivariant model architectures could inherently preserve and differentiate the distinct crystallographic symmetries of various orderings.
Submitted 20 September, 2024;
originally announced September 2024.
-
Interpolation and differentiation of alchemical degrees of freedom in machine learning interatomic potentials
Authors:
Juno Nam,
Jiayu Peng,
Rafael Gómez-Bombarelli
Abstract:
Machine learning interatomic potentials (MLIPs) have become a workhorse of modern atomistic simulations, and recently published universal MLIPs, pre-trained on large datasets, have demonstrated remarkable accuracy and generalizability. However, the computational cost of MLIPs limits their applicability to chemically disordered systems requiring large simulation cells or to sample-intensive statistical methods. Here, we report the use of continuous and differentiable alchemical degrees of freedom in atomistic materials simulations, exploiting the fact that graph neural network MLIPs represent discrete elements as real-valued tensors. The proposed method introduces alchemical atoms with corresponding weights into the input graph, alongside modifications to the message-passing and readout mechanisms of MLIPs, and allows smooth interpolation between the compositional states of materials. The end-to-end differentiability of MLIPs enables efficient calculation of the gradient of energy with respect to the compositional weights. With this modification, we propose methodologies for optimizing the composition of solid solutions towards target macroscopic properties, characterizing order and disorder in multicomponent oxides, and conducting alchemical free energy simulations to quantify the free energy of vacancy formation and composition changes. The approach offers an avenue for extending the capabilities of universal MLIPs in the modeling of compositional disorder and characterizing the phase stability of complex materials systems.
Submitted 3 December, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Enhanced sampling of robust molecular datasets with uncertainty-based collective variables
Authors:
Aik Rui Tan,
Johannes C. B. Dietschreit,
Rafael Gomez-Bombarelli
Abstract:
Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine-learned interatomic potentials (MLIPs). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may not capture rare but highly informative configurations. In this study, we propose a method that leverages uncertainty as the collective variable (CV) to guide the acquisition of chemically relevant data points, focusing on regions of the configuration space where ML model predictions are most uncertain. This approach employs a Gaussian Mixture Model-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. The effectiveness of our approach in overcoming energy barriers and exploring unseen energy minima, thereby enhancing the data set in an active learning framework, is demonstrated on the alanine dipeptide benchmark system.
Submitted 6 February, 2024;
originally announced February 2024.
-
Learning Collective Variables with Synthetic Data Augmentation through Physics-Inspired Geodesic Interpolation
Authors:
Soojung Yang,
Juno Nam,
Johannes C. B. Dietschreit,
Rafael Gómez-Bombarelli
Abstract:
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but often hindered by the lack of information about the particular event, e.g., the transition from unfolded to folded conformation. We propose a simulation-free data augmentation strategy using physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without true transition state samples. This new data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by leveraging the interpolation progress parameter.
Submitted 19 July, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Learning a reactive potential for silica-water through uncertainty attribution
Authors:
Swagata Roy,
Johannes P. Dürholt,
Thomas S. Asche,
Federico Zipoli,
Rafael Gómez-Bombarelli
Abstract:
The reactivity of silicates in an aqueous solution is relevant to various chemistries ranging from silicate minerals in geology, to the C-S-H phase in cement, nanoporous zeolite catalysts, or highly porous precipitated silica. While simulations of chemical reactions can provide insight at the molecular level, balancing accuracy and scale in reactive simulations in the condensed phase is a challenge. Here, we demonstrate how a machine-learning reactive interatomic potential can accurately capture silicate-water reactivity. The model was trained on a new dataset comprising 400,000 energies and forces of molecular clusters at the $\omega$B97X-D/def2-TZVP level. To ensure the robustness of the model, we introduce a new and general active learning strategy based on the attribution of the model uncertainty, which automatically isolates uncertain regions of bulk simulations to be calculated as small-sized clusters. Our trained potential is found to reproduce static and dynamic properties of liquid water and solid crystalline silicates, despite having been trained exclusively on cluster data. Furthermore, we use enhanced sampling simulations to accurately recover the self-ionization reactivity of water and the acidity of silicate oligomers. Lastly, we study the silicate dimerization reaction in a water solution at neutral conditions and find that the reaction occurs through a flanking mechanism.
Submitted 4 July, 2023;
originally announced July 2023.
-
Atom-by-atom design of metal oxide catalysts for the oxygen evolution reaction with machine learning
Authors:
Jaclyn R. Lunger,
Jessica Karaguesian,
Hoje Chun,
Jiayu Peng,
Yitong Tseo,
Chung Hsuan Shan,
Byungchan Han,
Yang Shao-Horn,
Rafael Gomez-Bombarelli
Abstract:
Green hydrogen production is crucial for a sustainable future, but current catalysts for the oxygen evolution reaction (OER) suffer from slow kinetics, despite many efforts to produce optimal designs, particularly through the calculation of descriptors for activity. In this study, we develop a dataset of density functional theory calculations of bulk and surface perovskite oxides, and adsorption energies of OER intermediates, which includes compositions up to quaternary and facets up to (555). We demonstrate that per-site properties of perovskite oxides such as Bader charge or band center can be tuned through element substitution and faceting, and develop a machine learning model that accurately predicts these properties directly from the local chemical environment. We leverage these per-site properties to identify promising perovskites with high theoretical OER activity. The identified design principles and promising new materials provide a roadmap for closing the gap between current artificial catalysts and biological enzymes.
Submitted 31 May, 2023;
originally announced May 2023.
-
Effect of framework composition and NH3 on the diffusion of Cu+ in Cu-CHA catalysts predicted by machine-learning accelerated molecular dynamics
Authors:
Reisel Millan,
Estefania Bello-Jurado,
Manuel Moliner,
Mercedes Boronat,
Rafael Gomez-Bombarelli
Abstract:
Cu-exchanged zeolites rely on mobile solvated Cu+ cations for their catalytic activity, but the role of framework composition on transport is not fully understood. Ab initio molecular dynamics simulations can provide quantitative atomistic insight but are too computationally expensive to explore large length- and time-scales or diverse compositions. We report a machine-learning interatomic potential that accurately reproduces ab initio results and effectively generalizes to allow multi-nanosecond simulations of large supercells and diverse chemical compositions. Biased and unbiased simulations of [Cu(NH3)2]+ mobility show that aluminum pairing in eight-membered rings accelerates local hopping, and demonstrate that increased NH3 concentration enhances long-range diffusion. The probability of finding two [Cu(NH3)2]+ complexes in the same cage - key for SCR-NOx reaction - increases with Cu content and Al content, but does not correlate with the long-range mobility of Cu+. Supporting experimental evidence was obtained from reactivity tests of Cu-CHA catalysts with controlled chemical composition.
Submitted 22 May, 2023;
originally announced May 2023.
-
Machine-learning-accelerated simulations to enable automatic surface reconstruction
Authors:
Xiaochen Du,
James K. Damewood,
Jaclyn R. Lunger,
Reisel Millan,
Bilge Yildiz,
Lin Li,
Rafael Gómez-Bombarelli
Abstract:
Understanding material surfaces and interfaces is vital in applications like catalysis or electronics. By combining energies from electronic structure with statistical mechanics, ab initio simulations can in principle predict the structure of material surfaces as a function of thermodynamic variables. However, accurate energy simulations are prohibitive when coupled to the vast phase space that must be statistically sampled. Here, we present a bi-faceted computational loop to predict surface phase diagrams of multi-component materials that accelerates both the energy scoring and statistical sampling methods. Fast, scalable, and data-efficient machine learning interatomic potentials are trained on high-throughput density-functional theory calculations through closed-loop active learning. Markov-chain Monte Carlo sampling in the semi-grand canonical ensemble is enabled by using virtual surface sites. The predicted surfaces for GaN(0001), Si(111), and SrTiO3(001) are in agreement with past work and suggest that the proposed strategy can model complex material surfaces and discover previously unreported surface terminations.
Submitted 21 November, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Data-Driven, Physics-Informed Descriptors of Cation Ordering in Multicomponent Oxides
Authors:
Jiayu Peng,
James Damewood,
Rafael Gómez-Bombarelli
Abstract:
The structural tunability and compositional diversity of multicomponent perovskite oxides have enabled their various applications, including catalysis and electronics. The cation ordering in these oxides, ranging from disordered (i.e., high-entropy) to ordered (e.g., rocksalt), profoundly influences their properties. While computational design tools can typically predict properties associated with a particular ordering, inferring which ordering -- if any -- will be observed in synthesized oxides remains challenging. Here, we leveraged first-principles simulations and machine learning to develop data-driven, physics-informed descriptors of experimental ordering in multicomponent perovskites and compared them with traditional physicochemical descriptors, e.g., ionic radii and oxidation states. The fitted low-dimensional classification models correctly rank up to 93% of compositions in an experimental dataset of 190 perovskites between cation-ordered and disordered, offering a rigorous benchmark between theory and experiments. Furthermore, these descriptors accelerate high-throughput virtual screening of multicomponent oxides by predicting their dominant ordering to avoid costly, exhaustive simulations of cation arrangements.
Submitted 2 May, 2023;
originally announced May 2023.
-
Single-model uncertainty quantification in neural network potentials does not consistently outperform model ensembles
Authors:
Aik Rui Tan,
Shingo Urata,
Samuel Goldman,
Johannes C. B. Dietschreit,
Rafael Gómez-Bombarelli
Abstract:
Neural networks (NNs) often assign high confidence to their predictions, even for points far out-of-distribution, making uncertainty quantification (UQ) a challenge. When they are employed to model interatomic potentials in materials systems, this problem leads to unphysical structures that disrupt simulations, or to biased statistics and dynamics that do not reflect the true physics. Differentiable UQ techniques can find new informative data and drive active learning loops for robust potentials. However, a variety of UQ techniques, including newly developed ones, exist for atomistic simulations and there are no clear guidelines for which are most effective or suitable for a given case. In this work, we examine multiple UQ schemes for improving the robustness of NN interatomic potentials (NNIPs) through active learning. In particular, we compare incumbent ensemble-based methods against strategies that use single, deterministic NNs: mean-variance estimation, deep evidential regression, and Gaussian mixture models. We explore three datasets ranging from in-domain interpolative learning to more extrapolative out-of-domain generalization challenges: rMD17, ammonia inversion, and bulk silica glass. Performance is measured across multiple metrics relating model error to uncertainty. Our experiments show that none of the methods consistently outperformed each other across the various metrics. Ensembling remained better at generalization and for NNIP robustness; MVE only proved effective for in-domain interpolation, while GMM was better out-of-domain; and evidential regression, despite its promise, was not the preferable alternative in any of the cases. More broadly, cost-effective, single deterministic models cannot yet consistently match or outperform ensembling for uncertainty quantification in NNIPs.
Submitted 2 May, 2023;
originally announced May 2023.
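The uncertainty estimates compared in the entry above can be illustrated with a minimal sketch. It assumes the textbook forms of two of the schemes: the ensemble uncertainty is taken as the spread of per-model predictions, and mean-variance estimation (MVE) is trained with a Gaussian negative log-likelihood. The function names and the example energies are hypothetical and are not taken from the paper or its code.

```python
import math
from statistics import mean, pstdev


def ensemble_uncertainty(predictions):
    """Return (mean, standard deviation) over per-model predictions for one input."""
    return mean(predictions), pstdev(predictions)


def gaussian_nll(target, mu, var):
    """Per-sample Gaussian negative log-likelihood, a common MVE training loss."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (target - mu) ** 2 / var)


# Hypothetical energies (eV) from four ensemble members for a single structure.
energies = [-1.02, -0.98, -1.01, -0.99]
mu, sigma = ensemble_uncertainty(energies)
print(f"ensemble mean = {mu:.3f} eV, ensemble std = {sigma:.4f} eV")
```

A large ensemble spread flags a structure as a candidate for new reference calculations in an active-learning loop; an MVE model instead outputs `mu` and `var` from a single network.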
-
Entropy and Energy Profiles of Chemical Reactions
Authors:
Johannes C. B. Dietschreit,
Dennis J. Diestler,
Rafael Gómez-Bombarelli
Abstract:
The description of chemical processes at the molecular level is often facilitated by use of reaction coordinates, or collective variables (CVs). The CV measures the progress of the reaction and allows the construction of profiles that track the evolution of a specific property as the reaction progresses. Whereas CVs are routinely used, especially alongside enhanced sampling techniques, links between profiles and thermodynamic state functions and reaction rate constants are not rigorously exploited. Here, we report a unified treatment of such reaction profiles. Tractable expressions are derived for the free-energy, internal-energy, and entropy profiles as functions of only the CV. We demonstrate the ability of this treatment to extract quantitative insight from the entropy and internal-energy profiles of various real-world physicochemical processes, including intramolecular organic reactions, ionic transport in superionic electrolytes, and molecular transport in nanoporous materials.
Submitted 25 April, 2023; v1 submitted 20 April, 2023;
originally announced April 2023.
-
Automated patent extraction powers generative modeling in focused chemical spaces
Authors:
Akshay Subramanian,
Kevin P. Greenman,
Alexis Gervaix,
Tzuhsiung Yang,
Rafael Gómez-Bombarelli
Abstract:
Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice.
Submitted 24 July, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Chemically Transferable Generative Backmapping of Coarse-Grained Proteins
Authors:
Soojung Yang,
Rafael Gómez-Bombarelli
Abstract:
Coarse-graining (CG) accelerates molecular simulations of protein dynamics by simulating sets of atoms as singular beads. Backmapping is the opposite operation of bringing lost atomistic details back from the CG representation. While machine learning (ML) has produced accurate and efficient CG simulations of proteins, fast and reliable backmapping remains a challenge. Rule-based methods produce poor all-atom geometries, needing computationally costly refinement through additional simulations. Recently proposed ML approaches outperform traditional baselines but are not transferable between proteins and sometimes generate unphysical atom placements with steric clashes and implausible torsion angles. This work addresses both issues to build a fast, transferable, and reliable generative backmapping tool for CG protein representations. We achieve generalization and reliability through a combined set of innovations: representation based on internal coordinates; an equivariant encoder/prior; a custom loss function that helps ensure local structure, global structure, and physical constraints; and expert curation of high-quality out-of-equilibrium protein data for training. Our results pave the way for out-of-the-box backmapping of coarse-grained simulations for arbitrary proteins.
Submitted 2 March, 2023;
originally announced March 2023.
-
Mapping the space of photoswitchable ligands and photodruggable proteins with computational modeling
Authors:
Simon Axelrod,
Eugene Shakhnovich,
Rafael Gómez-Bombarelli
Abstract:
Light-activated drugs are a promising way to localize biological activity and minimize side effects. However, their development is complicated by the numerous photophysical and biological properties that must be simultaneously optimized. To accelerate the design of photoactive drugs, we describe a procedure that combines ligand-protein docking with chemical property prediction based on machine learning (ML). We apply this procedure to 58 proteins and 9,000 photo-drug candidates based on azobenzene cis-trans isomerism. We find that most proteins display a preference for trans isomers over cis, and that the binding affinities of nominally active/inactive pairs are in fact highly correlated. These findings have significant value for photopharmacology research, and reinforce the need for virtual screening to identify compounds with rare desirable properties. Further, we combine our procedure with quantum chemical validation to identify promising candidates for the photoactive inhibition of PARP1, an enzyme that is over-expressed in cancer cells. The top compounds are predicted to have long-lived active forms, differential bioactivity, and absorption in the near-infrared therapeutic window.
Submitted 22 February, 2023;
originally announced February 2023.
-
Representations of Materials for Machine Learning
Authors:
James Damewood,
Jessica Karaguesian,
Jaclyn R. Lunger,
Aik Rui Tan,
Mingrou Xie,
Jiayu Peng,
Rafael Gómez-Bombarelli
Abstract:
High-throughput data generation methods and machine learning (ML) algorithms have given rise to a new era of computational materials science by learning relationships among composition, structure, and properties and by exploiting such relations for design. However, to build these connections, materials data must be translated into a numerical form, called a representation, that can be processed by a machine learning model. Datasets in materials science vary in format (ranging from images to spectra), size, and fidelity. Predictive models vary in scope and properties of interest. Here, we review context-dependent strategies for constructing representations that enable the use of materials as inputs or outputs of machine learning models. Furthermore, we discuss how modern ML techniques can learn representations from data and transfer chemical and physical information between tasks. Finally, we outline high-impact questions that have not been fully resolved and thus require further investigation.
Submitted 20 January, 2023;
originally announced January 2023.
-
Differentiable Simulations for Enhanced Sampling of Rare Events
Authors:
Martin Šípka,
Johannes C. B. Dietschreit,
Lukáš Grajciar,
Rafael Gómez-Bombarelli
Abstract:
Simulating rare events, such as the transformation of a reactant into a product in a chemical reaction, typically requires enhanced sampling techniques that rely on heuristically chosen collective variables (CVs). We propose using differentiable simulations (DiffSim) for the discovery and enhanced sampling of chemical transformations without the need to resort to preselected CVs, using only a distance metric. Reaction path discovery and estimation of the biasing potential that enhances the sampling are merged into a single end-to-end problem that is solved by path-integral optimization. This is achieved by introducing multiple improvements over standard DiffSim, such as partial backpropagation and graph mini-batching, making DiffSim training stable and efficient. The potential of DiffSim is demonstrated in the successful discovery of transition paths for the Müller-Brown model potential as well as for a benchmark chemical system, alanine dipeptide.
Submitted 27 January, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Simulations with machine learning potentials identify the ion conduction mechanism mediating non-Arrhenius behavior in LGPS
Authors:
Gavin Winter,
Rafael Gómez-Bombarelli
Abstract:
Li$_{10}$Ge(PS$_6$)$_2$ (LGPS) is a highly concentrated solid electrolyte, in which Coulombic repulsion between neighboring cations is hypothesized as the underlying reason for concerted ion hopping, a mechanism common among superionic conductors such as Li$_7$La$_3$Zr$_2$O$_{12}$ (LLZO) and Li$_{1.3}$Al$_{0.3}$Ti$_{1.7}$(PO$_4$)$_3$ (LATP). While first principles simulations using molecular dynamics (MD) provide insight into the Li$^+$ transport mechanism, historically, there has been a gap in the temperature ranges studied in simulations and experiments. Here, we used a neural network (NN) potential trained on density functional theory (DFT) simulations, to run up to 40-nanosecond long MD simulations at DFT-like accuracy to characterize the ion conduction mechanisms across a range of temperatures that includes previous simulations and experimental studies. We have confirmed a Li$^+$ sublattice phase transition in LGPS around 400 K, below which the \textit{ab}-plane diffusivity $D^*_{ab}$ is drastically reduced. Concomitant with the sublattice phase transition near 400 K, there is less cation-cation (cross) correlation, as characterized by Haven ratios closer to 1, and the vibrations in the system are more harmonic at lower temperature. Intuitively, at high temperature, the collection of vibrational modes may be sufficient to drive concerted ion hops. However, near room temperature, the vibrational modes available may be insufficient to overcome electrostatic repulsion, thus resulting in less correlated ion motion and comparatively slower ion conduction. Such phenomena of a sublattice phase transition, below which concerted hopping plays a less significant role, may be extended to other highly concentrated solid electrolytes such as LLZO and LATP.
Submitted 27 November, 2022; v1 submitted 10 November, 2022;
originally announced November 2022.
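The Haven ratio mentioned in the entry above can be illustrated with a short sketch. It assumes the common definition H_R = D*/D_sigma, where the tracer diffusivity D* comes from the mean squared displacement of individual ions and the charge diffusivity D_sigma from the squared net displacement of all ions, per ion. The function name is hypothetical, and the estimate uses a single time window for brevity; a real analysis would average over many time origins and take the long-time limit.

```python
def haven_ratio(displacements):
    """Estimate H_R = D*/D_sigma from per-ion displacements over one time window.

    displacements: list of (dx, dy, dz) tuples, one per mobile ion.
    The common prefactor 1/(2 * d * t) cancels in the ratio, so it is omitted.
    Raises ZeroDivisionError if the net displacement is exactly zero.
    """
    n = len(displacements)
    # Tracer term: mean squared displacement of individual ions.
    tracer = sum(dx * dx + dy * dy + dz * dz for dx, dy, dz in displacements) / n
    # Collective term: squared net displacement of all ions, divided by ion count.
    sx = sum(d[0] for d in displacements)
    sy = sum(d[1] for d in displacements)
    sz = sum(d[2] for d in displacements)
    collective = (sx * sx + sy * sy + sz * sz) / n
    return tracer / collective


# Two ions hopping together in the same direction (fully correlated motion).
concerted = haven_ratio([(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
# Two ions hopping in perpendicular directions (no cross-correlation).
independent = haven_ratio([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
print(concerted, independent)
```

In this toy case concerted hops push the ratio below 1, while uncorrelated hops give a ratio of 1, which matches the abstract's reading of values closer to 1 as less cation-cation correlation.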
-
Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations
Authors:
Xiang Fu,
Zhenghao Wu,
Wujie Wang,
Tian Xie,
Sinan Keten,
Rafael Gomez-Bombarelli,
Tommi Jaakkola
Abstract:
Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Increasingly, machine learning (ML) force field (FF) models begin to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case would be to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for learned MD simulation. We curate representative MD systems, including water, organic molecules, a peptide, and materials, and design evaluation metrics corresponding to the scientific objectives of respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, along with offering directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate future work.
Submitted 26 August, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
Learning Pair Potentials using Differentiable Simulations
Authors:
Wujie Wang,
Zhenghao Wu,
Rafael Gómez-Bombarelli
Abstract:
Learning pair interactions from experimental or simulation data is of great interest for molecular simulations. We propose a general stochastic method for learning pair interactions from data using differentiable simulations (DiffSim). DiffSim defines a loss function based on structural observables, such as the radial distribution function, through molecular dynamics (MD) simulations. The interaction potentials are then learned directly by stochastic gradient descent, using backpropagation to calculate the gradient of the structural loss metric with respect to the interaction potential through the MD simulation. This gradient-based method is flexible and can be configured to simulate and optimize multiple systems simultaneously. For example, it is possible to simultaneously learn potentials for different temperatures or for different compositions. We demonstrate the approach by recovering simple pair potentials, such as Lennard-Jones systems, from radial distribution functions. We find that DiffSim can be used to probe a wider functional space of pair potentials compared to traditional methods like Iterative Boltzmann Inversion. We show that our methods can be used to simultaneously fit potentials for simulations at different compositions and temperatures to improve the transferability of the learned potentials.
Submitted 15 September, 2022;
originally announced September 2022.
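For context on the entry above, the following sketch shows the Lennard-Jones pair potential and one step of the traditional Iterative Boltzmann Inversion (IBI) baseline that the abstract compares against. It assumes the standard IBI update U_new(r) = U_old(r) + k_B T ln(g_current(r) / g_target(r)) on a shared radial grid. It is not the DiffSim method itself, and the function names and unit choice are illustrative assumptions.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones pair energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def ibi_update(u, g_current, g_target, temperature):
    """One IBI step on tabulated potentials and radial distribution functions.

    u, g_current, g_target: equal-length lists on the same r grid.
    All RDF values must be positive so that the logarithm is defined.
    """
    kt = K_B * temperature
    return [ui + kt * math.log(gc / gt) for ui, gc, gt in zip(u, g_current, g_target)]


# Tabulate a starting potential, then correct it where the simulated RDF
# (hypothetical values) overshoots the target at the first grid point.
grid = [1.0, 1.2, 1.5]
u0 = [lennard_jones(r, epsilon=1.0, sigma=1.0) for r in grid]
u1 = ibi_update(u0, g_current=[1.2, 1.0, 1.0], g_target=[1.0, 1.0, 1.0], temperature=300.0)
```

Where the simulated RDF exceeds the target, the update raises the potential and makes that separation less favorable; where the two agree, the potential is left unchanged.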
-
Examining graph neural networks for crystal structures: limitations and opportunities for capturing periodicity
Authors:
Sheng Gong,
Tian Xie,
Yang Shao-Horn,
Rafael Gomez-Bombarelli,
Jeffrey C. Grossman
Abstract:
Historically, materials informatics has relied on human-designed descriptors of materials structures. In recent years, graph neural networks (GNNs) have been proposed for learning representations of crystal structures from data end-to-end, producing vectorial embeddings that are optimized for downstream prediction tasks. However, a systematic scheme is lacking to analyze and understand the limits of GNNs for capturing crystal structures. In this work, we propose to use human-designed descriptors as a bank of human knowledge to test whether black-box GNNs can capture the knowledge of crystal structures. We find that current state-of-the-art GNNs cannot capture the periodicity of crystal structures well, and we analyze the limitations of the GNN models that result in this failure from three aspects: local expressive power, long-range information, and readout function. We propose an initial solution, hybridizing descriptors with GNNs, to improve the prediction of GNNs for materials properties, especially phonon internal energy and heat capacity with 90% lower errors, and we analyze the mechanisms for the improved prediction. All the analysis can be extended easily to other deep representation learning models, human-designed descriptors, and systems such as molecules and amorphous materials.
Submitted 27 March, 2023; v1 submitted 9 August, 2022;
originally announced August 2022.
-
Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential
Authors:
Simon Axelrod,
Eugene Shakhnovich,
Rafael Gomez-Bombarelli
Abstract:
Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
Submitted 12 January, 2023; v1 submitted 23 July, 2022;
originally announced July 2022.
-
From Free-Energy Profiles to Activation Free Energies
Authors:
Johannes C. B. Dietschreit,
Dennis J. Diestler,
Andreas Hulm,
Christian Ochsenfeld,
Rafael Gómez-Bombarelli
Abstract:
Given a chemical reaction going from reactant (R) to the product (P) on a potential energy surface (PES) and a collective variable (CV) that discriminates between R and P, one can define a free-energy profile (FEP) as the logarithm of the marginal Boltzmann distribution of the CV. The FEP is not a true free energy; however, it is common to treat the FEP as the free-energy analog of the minimum energy path on the PES and to take the activation free energy, $ΔF^\ddagger_\mathrm{RP}$, as the difference between the maximum of the FEP at the transition state and the minimum at R. We show that this approximation can result in large errors. Since the FEP depends on the CV, it is not unique, and different, discriminating CVs can yield different activation free energies for the same reaction. We derive an exact expression for the activation free energy that avoids this ambiguity with respect to the choice of CV. We find $ΔF^\ddagger_\mathrm{RP}$ to be a combination of the probability of the system being in the reactant state, the probability density at the transition state surface, and the thermal de Broglie wavelength associated with the transition from R to P. We then evaluate the activation free energies based on our formalism for simple analytic models and realistic chemical systems. The analytic models show that the widespread FEP-based approximation applies only at low temperatures for CVs for which the effective mass of the associated pseudo-particle is small. Most chemical reactions of practical interest involve polyatomic molecules with complex, high-dimensional PES that cannot be treated analytically and pose the added challenge of choosing a good CV, typically through heuristics. We study the influence of the choice of CV and find that, while the reaction free energy is largely unaffected, $ΔF^\ddagger_\mathrm{RP}$ is quite sensitive.
Submitted 20 April, 2023; v1 submitted 6 June, 2022;
originally announced June 2022.
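The free-energy profile defined in the entry above can be sketched numerically. The example assumes the usual convention F(s) = -k_B T ln p(s), with p(s) estimated by a histogram of sampled CV values. The function name, binning scheme, and sample values are illustrative assumptions, and the code shows only the profile itself, not the exact activation free-energy expression derived in the paper.

```python
import math


def free_energy_profile(samples, lo, hi, n_bins, kt):
    """Histogram estimate of F(s) = -kT * ln p(s) on [lo, hi).

    Returns one value per bin; empty bins get +inf because p(s) = 0 there.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    profile = []
    for c in counts:
        if c == 0:
            profile.append(math.inf)
        else:
            density = c / (total * width)  # normalized probability density
            profile.append(-kt * math.log(density))
    return profile


# Hypothetical CV samples: three in the lower bin, one in the upper bin.
profile = free_energy_profile([0.25, 0.25, 0.25, 0.75], lo=0.0, hi=1.0, n_bins=2, kt=1.0)
print(profile)
```

Taking the maximum of such a profile minus its reactant minimum gives the common FEP-based barrier estimate, which the abstract argues depends on the choice of CV and can carry large errors.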
-
Generative Coarse-Graining of Molecular Conformations
Authors:
Wujie Wang,
Minkai Xu,
Chen Cai,
Benjamin Kurt Miller,
Tess Smidt,
Yusu Wang,
Jian Tang,
Rafael Gómez-Bombarelli
Abstract:
Coarse-graining (CG) of molecular simulations simplifies the particle representation by grouping selected atoms into pseudo-beads and drastically accelerates simulation. However, such a CG procedure induces information loss, which makes accurate backmapping, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmapping transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we provide three comprehensive benchmarks based on molecular dynamics trajectories. Experiments show that our approach always recovers more realistic structures and outperforms existing data-driven methods by a significant margin.
Submitted 16 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Graph theory-based structural analysis on density anomaly of silica glass
Authors:
Aik Rui Tan,
Shingo Urata,
Masatsugu Yamada,
Rafael Gómez-Bombarelli
Abstract:
Analyzing the atomic structure of glassy materials is a tremendous challenge both experimentally and computationally, and the lack of direct, detailed insights into glass structure hinders our ability to navigate structure-property relationships. For instance, the structural origin of the density anomaly in silica glasses - the negative thermal expansion coefficient - is still poorly understood. Simulations based on molecular dynamics (MD) produce atomically resolved structures, but quantifying the role of disorder in the density anomaly is challenging. Here, we propose to use a graph-theoretical approach to assess topological differences between disordered structural arrangements from MD trajectories of silica glasses. A graph similarity metric quantifies the similarity between the covalent networks and can characterize the nature of the disordered solid by comparing it to reference crystalline solids or to glasses in different thermodynamic states. This approach involves casting all-atom glass configurations as networks, and subsequently applying a graph-similarity metric (D-measure). Calculated D-measure values are then taken as the topological distances between two configurations. By measuring the topological distances of silica glass configurations across a range of temperatures, distinct structural features could be observed at temperatures higher than the fictive temperature. In addition, we compared topological distances between local atomic environments in the glass and crystalline silica phases. This approach suggests that more coesite-like and quartz-like local structures emerge in silica glasses when the density is at a minimum during the heating process.
Submitted 23 August, 2022; v1 submitted 14 November, 2021;
originally announced November 2021.
-
Excited state, non-adiabatic dynamics of large photoswitchable molecules using a chemically transferable machine learning potential
Authors:
Simon Axelrod,
Eugene Shakhnovich,
Rafael Gómez-Bombarelli
Abstract:
Light-induced chemical processes are ubiquitous in nature and have widespread technological applications. For example, photoisomerization can allow a drug with a photo-switchable scaffold such as azobenzene to be activated with light. In principle, photoswitches with desired photophysical properties like high isomerization quantum yields can be identified through virtual screening with reactive simulations. In practice, these simulations are rarely used for screening, since they require hundreds of trajectories and expensive quantum chemical methods to account for non-adiabatic excited state effects. Here we introduce a diabatic artificial neural network (DANN) based on diabatic states to accelerate such simulations for azobenzene derivatives. The network is six orders of magnitude faster than the quantum chemistry method used for training. DANN is transferable to azobenzene molecules outside the training set, predicting quantum yields for unseen species that are correlated with experiment. We use the model to virtually screen 3,100 hypothetical molecules, and identify novel species with extremely high predicted quantum yields. The model predictions are confirmed using high accuracy non-adiabatic dynamics. Our results pave the way for fast and accurate virtual screening of photoactive compounds.
Submitted 16 March, 2022; v1 submitted 10 August, 2021;
originally announced August 2021.
-
Sampling Lattices in Semi-Grand Canonical Ensemble with Autoregressive Machine Learning
Authors:
James Damewood,
Daniel Schwalbe-Koda,
Rafael Gomez-Bombarelli
Abstract:
Calculating thermodynamic potentials and observables efficiently and accurately is key for the application of statistical mechanics simulations to materials science. However, naive Monte Carlo approaches, on which such calculations are often dependent, struggle to scale to complex materials in many state-of-the-art disciplines such as the design of high entropy alloys or multicomponent catalysts. To address this issue, we adapt sampling tools built upon machine-learning-based generative modeling to the materials space by transforming them into the semi-grand canonical ensemble. Furthermore, we show that the resulting models are transferable across wide ranges of thermodynamic conditions and can be implemented with any internal energy model U, allowing integration into many existing materials workflows. We demonstrate the applicability of this approach to the simulation of benchmark systems (AgPd, CuAu) that exhibit diverse thermodynamic behavior in their phase diagrams. Finally, we discuss remaining challenges in model development and promising research directions for future improvements.
Submitted 13 July, 2021; v1 submitted 11 July, 2021;
originally announced July 2021.
-
An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming
Authors:
Minkai Xu,
Wujie Wang,
Shitong Luo,
Chence Shi,
Yoshua Bengio,
Rafael Gomez-Bombarelli,
Jian Tang
Abstract:
Predicting molecular conformations (or 3D structures) from molecular graphs is a fundamental problem in many applications. Most existing approaches are usually divided into two steps by first predicting the distances between atoms and then generating a 3D structure through optimizing a distance geometry problem. However, the distances predicted with such two-stage approaches may not be able to consistently preserve the geometry of local atomic neighborhoods, making the generated structures unsatisfying. In this paper, we propose an end-to-end solution for molecular conformation prediction called ConfVAE based on the conditional variational autoencoder framework. Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program. Extensive experiments on several benchmark data sets prove the effectiveness of our proposed approach over existing state-of-the-art approaches. Code is available at https://github.com/MinkaiXu/ConfVAE-ICML21
Submitted 2 June, 2021; v1 submitted 15 May, 2021;
originally announced May 2021.
-
GLAMOUR: Graph Learning over Macromolecule Representations
Authors:
Somesh Mohapatra,
Joyce An,
Rafael Gómez-Bombarelli
Abstract:
The near-infinite chemical diversity of natural and artificial macromolecules arises from the vast range of possible component monomers, linkages, and polymer topologies. This enormous variety contributes to the ubiquity and indispensability of macromolecules but hinders the development of general machine learning methods with macromolecules as input. To address this, we developed GLAMOUR, a framework for chemistry-informed graph representation of macromolecules that enables quantifying structural similarity and interpretable supervised learning for macromolecules.
Submitted 23 August, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks
Authors:
Daniel Schwalbe-Koda,
Aik Rui Tan,
Rafael Gómez-Bombarelli
Abstract:
Neural network (NN) interatomic potentials provide fast prediction of potential energy surfaces, closely matching the accuracy of the electronic structure methods used to produce the training data. However, NN predictions are only reliable within well-learned training domains, and show volatile behavior when extrapolating. Uncertainty quantification approaches can flag atomic configurations for which prediction confidence is low, but arriving at such uncertain regions requires expensive sampling of the NN phase space, often using atomistic simulations. Here, we exploit automatic differentiation to drive atomistic systems towards high-likelihood, high-uncertainty configurations without the need for molecular dynamics simulations. By performing adversarial attacks on an uncertainty metric, informative geometries that expand the training domain of NNs are sampled. When combined with an active learning loop, this approach bootstraps and improves NN potentials while decreasing the number of calls to the ground truth method. This efficiency is demonstrated on sampling of kinetic barriers and collective variables in molecules, and can be extended to any NN potential architecture and materials system.
Submitted 28 March, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Accelerating amorphous polymer electrolyte screening by learning to reduce errors in molecular dynamics simulated properties
Authors:
Tian Xie,
Arthur France-Lanord,
Yanming Wang,
Jeffrey Lopez,
Michael Austin Stolberg,
Megan Hill,
Graham Michael Leverick,
Rafael Gomez-Bombarelli,
Jeremiah A. Johnson,
Yang Shao-Horn,
Jeffrey C. Grossman
Abstract:
Polymer electrolytes are promising candidates for the next generation lithium-ion battery technology. Large scale screening of polymer electrolytes is hindered by the significant cost of molecular dynamics (MD) simulation in amorphous systems: the amorphous structure of polymers requires multiple, repeated sampling to reduce noise and the slow relaxation requires long simulation time for convergence. Here, we accelerate the screening with a multi-task graph neural network that learns from a large amount of noisy, unconverged, short MD data and a small number of converged, long MD data. We achieve accurate predictions of 4 different converged properties and screen a space of 6247 polymers that is orders of magnitude larger than previous computational studies. Further, we extract several design principles for polymer electrolytes and provide an open dataset for the community. Our approach could be applicable to a broad class of material discovery problems that involve the simulation of complex, amorphous materials.
Submitted 15 March, 2022; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Molecular machine learning with conformer ensembles
Authors:
Simon Axelrod,
Rafael Gomez-Bombarelli
Abstract:
Virtual screening can accelerate drug discovery by identifying promising candidates for experimental evaluation. Machine learning is a powerful method for screening, as it can learn complex structure-property relationships from experimental data and make rapid predictions over virtual libraries. Molecules inherently exist as a three-dimensional ensemble and their biological action typically occurs through supramolecular recognition. However, most deep learning approaches to molecular property prediction use a 2D graph representation as input, and in some cases a single 3D conformation. Here we investigate how the 3D information of multiple conformers, traditionally known as 4D information in the cheminformatics community, can improve molecular property prediction in deep learning models. We introduce multiple deep learning models that expand upon key architectures such as ChemProp and SchNet, adding elements such as multiple-conformer inputs and conformer attention. We then benchmark the performance trade-offs of these models on 2D, 3D and 4D representations in the prediction of drug activity using a large training set of geometrically resolved molecules. The new architectures perform significantly better than 2D models, but their performance is often just as strong with a single conformer as with many. We also find that 4D deep learning models learn interpretable attention weights for each conformer.
Submitted 18 February, 2021; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Temperature-transferable coarse-graining of ionic liquids with dual graph convolutional neural networks
Authors:
Jurgis Ruza,
Wujie Wang,
Daniel Schwalbe-Koda,
Simon Axelrod,
William H. Harris,
Rafael Gomez-Bombarelli
Abstract:
Computer simulations can provide mechanistic insight into ionic liquids (ILs) and predict the properties of experimentally unrealized ion combinations. However, ILs suffer from a particularly large disparity in the time scales of atomistic and ensemble motion. Coarse-grained models are therefore used in place of costly atomistic simulations, allowing simulation of longer time scales and larger systems. Nevertheless, constructing the many-body potential of mean force that defines the structure and dynamics of a coarse-grained system can be complicated and computationally intensive. Machine learning shows great promise for the key coupled challenges of dimensionality reduction and learning the potential of mean force. To improve the coarse-graining of ILs, we present a neural network model trained on all-atom classical molecular dynamics simulations. The potential of mean force is expressed as two jointly-trained neural network interatomic potentials that learn the coupled short-range and the many-body long range molecular interactions. These interatomic potentials treat temperature as an explicit input variable to capture the temperature dependence of the potential of mean force. The model reproduces structural quantities with high fidelity, outperforms the temperature-independent baseline at capturing dynamics, generalizes to unseen temperatures, and incurs low simulation cost.
Submitted 8 November, 2020; v1 submitted 28 July, 2020;
originally announced July 2020.
-
GEOM: Energy-annotated molecular conformations for property prediction and molecular generation
Authors:
Simon Axelrod,
Rafael Gomez-Bombarelli
Abstract:
Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
Submitted 9 February, 2022; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Differentiable Molecular Simulations for Control and Learning
Authors:
Wujie Wang,
Simon Axelrod,
Rafael Gómez-Bombarelli
Abstract:
Molecular dynamics simulations use statistical mechanics at the atomistic scale to enable both the elucidation of fundamental mechanisms and the engineering of matter for desired tasks. The behavior of molecular systems at the microscale is typically simulated with differential equations parameterized by a Hamiltonian, or energy function. The Hamiltonian describes the state of the system and its interactions with the environment. In order to derive predictive microscopic models, one wishes to infer a molecular Hamiltonian that agrees with observed macroscopic quantities. From the perspective of engineering, one wishes to control the Hamiltonian to achieve desired simulation outcomes and structures, as in self-assembly and optical control, to then realize systems with the desired Hamiltonian in the lab. In both cases, the goal is to modify the Hamiltonian such that emergent properties of the simulated system match a given target. We demonstrate how this can be achieved using differentiable simulations where bulk target observables and simulation outcomes can be analytically differentiated with respect to Hamiltonians, opening up new routes for parameterizing Hamiltonians to infer macroscopic models and develop control protocols.
Submitted 23 December, 2020; v1 submitted 26 February, 2020;
originally announced March 2020.
-
Generative Models for Automatic Chemical Design
Authors:
Daniel Schwalbe-Koda,
Rafael Gómez-Bombarelli
Abstract:
Materials discovery is decisive for tackling urgent challenges related to energy, the environment, health care and many others. In chemistry, conventional methodologies for innovation usually rely on expensive and incremental strategies to optimize properties from molecular structures. On the other hand, inverse approaches map properties to structures, thus expediting the design of novel useful compounds. In this chapter, we examine the way in which current deep generative models are addressing the inverse chemical discovery paradigm. We begin by revisiting early inverse design algorithms. Then, we introduce generative models for molecular systems and categorize them according to their architecture and molecular representation. Using this classification, we review the evolution and performance of important molecular generation schemes reported in the literature. Finally, we conclude highlighting the prospects and challenges of generative models as cutting edge tools in materials discovery.
Submitted 2 July, 2019;
originally announced July 2019.
-
Coarse-Graining Auto-Encoders for Molecular Dynamics
Authors:
Wujie Wang,
Rafael Gómez-Bombarelli
Abstract:
Molecular dynamics simulations provide theoretical insight into the microscopic behavior of materials in condensed phase and, as a predictive tool, enable computational design of new compounds. However, because of the large temporal and spatial scales involved in thermodynamic and kinetic phenomena in materials, atomistic simulations are often computationally unfeasible. Coarse-graining methods allow simulating larger systems, by reducing the dimensionality of the simulation, and propagating longer timesteps, by averaging out fast motions. Coarse-graining involves two coupled learning problems; defining the mapping from an all-atom to a reduced representation, and the parametrization of a Hamiltonian over coarse-grained coordinates. Multiple statistical mechanics approaches have addressed the latter, but the former is generally a hand-tuned process based on chemical intuition. Here we present Autograin, an optimization framework based on auto-encoders to learn both tasks simultaneously. Autograin is trained to learn the optimal mapping between all-atom and reduced representation, using the reconstruction loss to facilitate the learning of coarse-grained variables. In addition, a force-matching method is applied to variationally determine the coarse-grained potential energy function. This procedure is tested on a number of model systems including single-molecule and bulk-phase periodic simulations.
Submitted 27 March, 2019; v1 submitted 6 December, 2018;
originally announced December 2018.
-
Graph similarity drives zeolite diffusionless transformations and intergrowth
Authors:
Daniel Schwalbe-Koda,
Zach Jensen,
Elsa Olivetti,
Rafael Gomez-Bombarelli
Abstract:
Predicting and directing polymorphic transformations is a critical challenge in zeolite synthesis. Although interzeolite transformations enable selective crystallization, their design lacks predictions to connect framework similarity and experimental observations. Here, computational and theoretical tools are combined to data-mine, analyze and explain interzeolite relations. It is observed that building units are weak predictors of topology interconversion and insufficient to explain intergrowth. By introducing a supercell-invariant metric that compares crystal structures using graph theory, we show that topotactic and reconstructive (diffusionless) transformations occur only between graph-similar pairs. Furthermore, all known instances of intergrowth occur between either structurally-similar or graph-similar frameworks. Backed with exhaustive literature results, we identify promising pairs for realizing novel diffusionless transformations and intergrowth. Hundreds of low-distance pairs are identified among known zeolites, and thousands of hypothetical frameworks are connected to known zeolite counterparts. The theory opens an avenue to understand and control zeolite polymorphism.
Submitted 10 March, 2021; v1 submitted 6 December, 2018;
originally announced December 2018.
-
Automatic chemical design using a data-driven continuous representation of molecules
Authors:
Rafael Gómez-Bombarelli,
Jennifer N. Wei,
David Duvenaud,
José Miguel Hernández-Lobato,
Benjamín Sánchez-Lengeling,
Dennis Sheberla,
Jorge Aguilera-Iparraguirre,
Timothy D. Hirzel,
Ryan P. Adams,
Alán Aspuru-Guzik
Abstract:
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in the set of molecules with fewer than nine heavy atoms.
Submitted 5 December, 2017; v1 submitted 7 October, 2016;
originally announced October 2016.
-
Photocell Optimisation Using Dark State Protection
Authors:
Amir Fruchtman,
Rafael Gómez-Bombarelli,
Brendon W. Lovett,
Erik M. Gauger
Abstract:
Conventional photocells suffer a fundamental efficiency threshold imposed by the principle of detailed balance, reflecting the fact that good absorbers must necessarily also be fast emitters. This limitation can be overcome by 'parking' the energy of an absorbed photon in a dark state which neither absorbs nor emits light. Here we argue that suitable dark states occur naturally as a consequence of the dipole-dipole interaction between two proximal optical dipoles for a wide range of realistic molecular dimers. We develop an intuitive model of a photocell comprising two light-absorbing molecules coupled to an idealised reaction centre, showing asymmetric dimers are capable of providing a significant enhancement of light-to-current conversion under ambient conditions. We conclude by describing a roadmap for identifying suitable molecular dimers for demonstrating this effect by screening a very large set of possible candidate molecules.
Submitted 5 August, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Convolutional Networks on Graphs for Learning Molecular Fingerprints
Authors:
David Duvenaud,
Dougal Maclaurin,
Jorge Aguilera-Iparraguirre,
Rafael Gómez-Bombarelli,
Timothy Hirzel,
Alán Aspuru-Guzik,
Ryan P. Adams
Abstract:
We introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.
Submitted 3 November, 2015; v1 submitted 30 September, 2015;
originally announced September 2015.