mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model

Authors: Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Bowen Jin, Chetan Kumar Prasad, Sara Szymkuć, Bartosz A. Grzybowski, Ying Diao, Jiawei Han, Ge Liu, Hao Peng, Martin D. Burke, Heng Ji

Abstract: Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM guarantees to generate efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve the FDA-rejected drugs ("fallen angels") over multiple iterations to greatly improve their shortcomings.

Submitted 18 May, 2025; originally announced May 2025.

arXiv:2410.02082

FARM: Functional Group-Aware Representations for Small Molecules

Authors: Thao Nguyen, Kuan-Hao Huang, Ge Liu, Martin D. Burke, Ying Diao, Heng Ji

Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which directly incorporates functional group information into the representations. This strategic reduction in tokenization granularity is intentionally aligned with key drivers of functional properties (i.e., functional groups), enhancing the model's understanding of chemical language. By expanding the chemical lexicon, FARM more effectively bridges SMILES and natural language, ultimately advancing the model's capacity to predict molecular properties. FARM also represents molecules from two perspectives: by using masked language modeling to capture atom-level features and by employing graph neural networks to encode the whole molecule topology. By leveraging contrastive learning, FARM aligns these two views of representations into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks. These results highlight FARM's potential to improve molecular representation learning, with promising applications in drug discovery and pharmaceutical research.

Submitted 6 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: Preprint

arXiv:2311.17189

TorchAmi: Generalized CPU/GPU Implementation of Algorithmic Matsubara Integration

Authors: M. D. Burke, J. P. F. LeBlanc

Abstract: We present torchami, an advanced implementation of algorithmic Matsubara integration (AMI) that utilizes pytorch as a backend to provide easy parallelization and GPU support. AMI is a tool for analytically resolving the sequence of nested Matsubara integrals that arise in virtually all Feynman perturbative expansions. In this implementation we present a new AMI algorithm that creates a more natural symbolic representation of the Feynman integrands. In addition, we include peripheral tools that allow for import and labelling of simple graph structures and conversion to torchami input. The code is written in c++ with python bindings provided.

Submitted 28 November, 2023; originally announced November 2023.

Comments: 23pg, 5 figs. Code reference included

arXiv:2305.13650

Robust Model-Based Optimization for Challenging Fitness Landscapes

Authors: Saba Ghaffari, Ehsan Saleh, Alexander G. Schwing, Yu-Xiong Wang, Martin D. Burke, Saurabh Sinha

Abstract: Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.

Submitted 27 June, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2211.02453

doi 10.1103/PhysRevB.107.115151

Renormalized Perturbation Theory for Fast Evaluation of Feynman Diagrams on the Real Frequency Axis

Authors: M. D. Burke, Maxence Grandadam, J. P. F. LeBlanc

Abstract: We present a method to accelerate the numerical evaluation of spatial integrals of Feynman diagrams when expressed on the real frequency axis. This can be realized through use of a renormalized perturbation expansion with a constant but complex renormalization shift. The complex shift acts as a regularization parameter for the numerical integration of otherwise sharp functions. This results in an exponential speed up of stochastic numerical integration at the expense of evaluating additional counter-term diagrams. We provide proof of concept calculations within a difficult limit of the half-filled 2D Hubbard model on a square lattice.

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2109.09888

Chemical-Reaction-Aware Molecule Representation Learning

Authors: Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, Martin D. Burke

Abstract: Molecule representation learning (MRL) methods aim to embed molecules into a real vector space. However, existing SMILES-based (Simplified Molecular-Input Line-Entry System) or GNN-based (Graph Neural Networks) MRL methods either take SMILES strings as input that have difficulty in encoding molecule structure information, or over-emphasize the importance of GNN architectures but neglect their generalization ability. Here we propose using chemical reactions to assist learning molecule representation. The key idea of our approach is to preserve the equivalence of molecules with respect to chemical reactions in the embedding space, i.e., forcing the sum of reactant embeddings and the sum of product embeddings to be equal for each chemical equation. This constraint is proven effective to 1) keep the embedding space well-organized and 2) improve the generalization ability of molecule embeddings. Moreover, our model can use any GNN as the molecule encoder and is thus agnostic to GNN architectures. Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks, e.g., 17.4% absolute Hit@1 gain in chemical reaction prediction, 2.3% absolute AUC gain in molecule property prediction, and 18.5% relative RMSE gain in graph-edit-distance prediction, respectively, over the best baseline method. The code is available at https://github.com/hwwang55/MolR.

Submitted 22 September, 2021; v1 submitted 20 September, 2021; originally announced September 2021.

Showing 1–6 of 6 results for author: Burke, M D