
The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models

Authors: Daniel S. Levine, Muhammed Shuaibi, Evan Walter Clark Spotte-Smith, Michael G. Taylor, Muhammad R. Hasyim, Kyle Michel, Ilyes Batatia, Gábor Csányi, Misko Dzamba, Peter Eastman, Nathan C. Frey, Xiang Fu, Vahe Gharakhanyan, Aditi S. Krishnapriyan, Joshua A. Rackers, Sanjeev Raja, Ammar Rizvi, Andrew S. Rosen, Zachary Ulissi, Santiago Vargas, C. Lawrence Zitnick, Samuel M. Blau, Brandon M. Wood

Abstract: Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughput, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training. Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy. To address this gap, Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the ωB97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity, including 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures. There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms. In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry.

Submitted 13 May, 2025; originally announced May 2025.

Comments: 60 pages, 8 figures

arXiv:2503.13352

Strain Problems got you in a Twist? Try StrainRelief: A Quantum-Accurate Tool for Ligand Strain Calculations

Authors: Ewan R. S. Wallace, Nathan C. Frey, Joshua A. Rackers

Abstract: Ligand strain energy, the energy difference between the bound and unbound conformations of a ligand, is an important component of structure-based small molecule drug design. A large majority of observed ligands in protein-small molecule co-crystal structures bind in low-strain conformations, making strain energy a useful filter for structure-based drug design. In this work we present a tool for calculating ligand strain with high accuracy. StrainRelief uses a MACE Neural Network Potential (NNP), trained on a large database of Density Functional Theory (DFT) calculations, to estimate ligand strain of neutral molecules with quantum accuracy. We show that this tool estimates strain energy differences relative to DFT to within 1.4 kcal/mol, more accurately than alternative NNPs. These results highlight the utility of NNPs in drug discovery, and provide a useful tool for drug discovery teams.

Submitted 10 June, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

arXiv:2410.14621

JAMUN: Transferable Molecular Conformational Ensemble Generation with Walk-Jump Sampling

Authors: Ameya Daigavane, Bodhi P. Vani, Saeed Saremi, Joseph Kleinhenz, Joshua Rackers

Abstract: Conformational ensembles of protein structures are immensely important both for understanding protein function and for drug discovery in novel modalities such as cryptic pockets. Current techniques for sampling ensembles are computationally inefficient, or do not transfer to systems outside their training data. We present walk-Jump Accelerated Molecular ensembles with Universal Noise (JAMUN), a step towards the goal of efficiently sampling the Boltzmann distribution of arbitrary proteins. By extending Walk-Jump Sampling to point clouds, JAMUN enables ensemble generation orders of magnitude faster than traditional molecular dynamics or state-of-the-art ML methods. Further, JAMUN is able to predict the stable basins of small peptides that were not seen during training.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2409.01931

doi 10.1063/5.0237876

On the design space between molecular mechanics and machine learning force fields

Authors: Yuanqing Wang, Kenichiro Takaba, Michael S. Chen, Marcus Wieder, Yuzhi Xu, Tong Zhu, John Z. H. Zhang, Arnav Nagle, Kuang Yu, Xinyan Wang, Daniel J. Cole, Joshua A. Rackers, Kyunghyun Cho, Joe G. Greener, Peter Eastman, Stefano Martiniani, Mark E. Tuckerman

Abstract: A force field as accurate as quantum mechanics (QM) and as fast as molecular mechanics (MM), with which one can simulate a biomolecular system efficiently enough and meaningfully enough to get quantitative insights, is among the most ardent dreams of biophysicists -- a dream, nevertheless, not to be fulfilled any time soon. Machine learning force fields (MLFFs) represent a meaningful endeavor in this direction, where differentiable neural functions are parametrized to fit ab initio energies, and furthermore forces through automatic differentiation. We argue that, as of now, the utility of MLFF models is no longer bottlenecked by accuracy but primarily by their speed (as well as stability and generalizability), as many recent variants, on limited chemical spaces, have long surpassed the chemical accuracy of $1$ kcal/mol -- the empirical threshold beyond which realistic chemical predictions are possible -- though they are still orders of magnitude slower than MM. Hoping to kindle explorations and designs of faster, albeit perhaps slightly less accurate, MLFFs, in this review we focus our attention on the design space (the speed-accuracy tradeoff) between MM and ML force fields. After a brief review of the building blocks of force fields of either kind, we discuss the desired properties and challenges now faced by the force field development community, survey the efforts to make MM force fields more accurate and ML force fields faster, and envision what the next generation of MLFF might look like.

Submitted 5 September, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

arXiv:2306.07473

3D molecule generation by denoising voxel grids

Authors: Pedro O. Pinheiro, Joshua Rackers, Joseph Kleinhenz, Michael Maser, Omar Mahmood, Andrew Martin Watkins, Stephen Ra, Vishnu Sresht, Saeed Saremi

Abstract: We propose a new score-based approach to generate 3D molecules represented as atomic densities on regular grids. First, we train a denoising neural network that learns to map from a smooth distribution of noisy molecules to the distribution of real molecules. Then, we follow the neural empirical Bayes framework (Saremi and Hyvarinen, 19) and generate molecules in two steps: (i) sample noisy density grids from a smooth distribution via underdamped Langevin Markov chain Monte Carlo, and (ii) recover the "clean" molecule by denoising the noisy grid with a single step. Our method, VoxMol, generates molecules in a fundamentally different way than the current state of the art (i.e., diffusion models applied to atom point clouds). It differs in terms of the data representation, the noise model, the network architecture and the generative modeling algorithm. Our experiments show that VoxMol captures the distribution of drug-like molecules better than the state of the art, while being faster to generate samples.

Submitted 8 March, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

arXiv:2210.04766

Hierarchical Learning in Euclidean Neural Networks

Authors: Joshua A. Rackers, Pranav Rao

Abstract: Equivariant machine learning methods have shown wide success at 3D learning applications in recent years. These models explicitly build in the reflection, translation and rotation symmetries of Euclidean space and have facilitated large advances in accuracy and data efficiency for a range of applications in the physical sciences. An outstanding question for equivariant models is why they achieve such larger-than-expected advances in these applications. To probe this question, we examine the role of higher order (non-scalar) features in Euclidean Neural Networks (e3nn). We focus on the previously studied application of e3nn to the problem of electron density prediction, which allows for a variety of non-scalar outputs, and examine whether the nature of the output (scalar $l=0$, vector $l=1$, or higher order $l>1$) is relevant to the effectiveness of non-scalar hidden features in the network. Further, we examine the behavior of non-scalar features throughout training, finding a natural hierarchy of features by $l$, reminiscent of a multipole expansion. We aim for our work to ultimately inform design principles and choices of domain applications for e3nn networks.

Submitted 10 October, 2022; originally announced October 2022.

Comments: 9 pages, 3 figures

arXiv:2207.03587

doi 10.1063/5.0130668

Accurate Hellmann-Feynman forces from density functional calculations with augmented Gaussian basis sets

Authors: Shivesh Pathak, Ignacio Ema López, Alex J. Lee, William P. Bricker, Rafael López Fernández, Susi Lehtola, Joshua A. Rackers

Abstract: The Hellmann-Feynman (HF) theorem provides a way to compute forces directly from the electron density, enabling efficient force calculations for large systems through machine learning (ML) models for the electron density. The main issue holding back the general acceptance of the HF approach for atom-centered basis sets is the well-known Pulay force which, if naively discarded, typically constitutes an error upwards of 10 eV/Å in forces. In this work, we demonstrate that if a suitably augmented Gaussian basis set is used for density functional calculations, the Pulay force can be suppressed and HF forces can be computed as accurately as analytical forces with state-of-the-art basis sets, allowing geometry optimization and molecular dynamics to be reliably performed with HF forces. Our results pave a clear path forward for the accurate and efficient simulation of large systems using ML densities and the HF theorem.

Submitted 19 December, 2022; v1 submitted 7 July, 2022; originally announced July 2022.

Journal ref: J. Chem. Phys. 158, 014104 (2023)

arXiv:2201.03726

Cracking the Quantum Scaling Limit with Machine Learned Electron Densities

Authors: Joshua A. Rackers, Lucas Tecot, Mario Geiger, Tess E. Smidt

Abstract: A long-standing goal of science is to accurately solve the Schrödinger equation for large molecular systems. The poor scaling of current quantum chemistry algorithms on classical computers imposes an effective limit of about a few dozen atoms for which we can calculate molecular electronic structure. We present a machine learning (ML) method to break through this scaling limit and make quantum chemistry calculations of very large systems possible. We show that Euclidean Neural Networks can be trained to predict the electron density with high fidelity from limited data. Learning the electron density allows us to train a machine learning model on small systems and make accurate predictions on large ones. We show that this ML electron density model can break through the quantum scaling limit and calculate the electron density of systems of thousands of atoms with quantum accuracy.

Submitted 10 February, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

arXiv:2106.13116

A Polarizable Water Potential Derived from a Model Electron Density

Authors: Joshua A. Rackers, Roseane R. Silva, Zhi Wang, Jay W. Ponder

Abstract: A new empirical potential for efficient, large scale molecular dynamics simulation of water is presented. The HIPPO (Hydrogen-like Intermolecular Polarizable POtential) force field is based upon the model electron density of a hydrogen-like atom. This framework is used to derive and parameterize individual terms describing charge penetration damped permanent electrostatics, damped polarization, charge transfer, anisotropic Pauli repulsion, and damped dispersion interactions. Initial parameter values were fit to Symmetry Adapted Perturbation Theory (SAPT) energy components for ten water dimer configurations, as well as the radial and angular dependence of the canonical dimer. The SAPT-based parameters were then systematically refined to extend the treatment to water bulk phases. The final HIPPO water model provides a balanced representation of a wide variety of properties of gas phase clusters, liquid water and ice polymorphs, across a range of temperatures and pressures. This water potential yields a rationalization of water structure, dynamics and thermodynamics explicitly correlated with an ab initio energy decomposition, while providing a level of accuracy comparable or superior to previous polarizable atomic multipole force fields. The HIPPO water model serves as a cornerstone around which similarly detailed physics-based models can be developed for additional molecular species.

Submitted 28 September, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: 76 pages, 16 figures

Showing 1–9 of 9 results for author: Rackers, J