Search | arXiv e-print repository

DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning

Authors: Shreyas Vinaya Sathyanarayana, Sharanabasava D. Hiremath, Rahil Shah, Rishikesh Panda, Rahul Jana, Riya Singh, Rida Irfan, Ashwin Murali, Bharath Ramsundar

Abstract: The synthesis of complex natural products remains one of the grand challenges of organic chemistry. We present DeepRetro, a major advancement in computational retrosynthesis that enables the discovery of viable synthetic routes for complex molecules typically considered beyond the reach of existing retrosynthetic methods. DeepRetro is a novel, open-source framework that tightly integrates large language models (LLMs), traditional retrosynthetic engines, and expert human feedback in an iterative design loop. Prior approaches rely solely on template-based methods or unconstrained LLM outputs. In contrast, DeepRetro combines the precision of template-based methods with the generative flexibility of LLMs, controlled by rigorous chemical validity checks and enhanced by recursive refinement. This hybrid system dynamically explores and revises synthetic pathways, guided by both algorithmic checks and expert chemist feedback through an interactive user interface. While DeepRetro achieves strong performance on standard retrosynthesis benchmarks, its true strength lies in its ability to propose novel, viable pathways to highly complex natural products, targets that have historically eluded automated planning. Through detailed case studies, we illustrate how this approach enables new routes for total synthesis and facilitates human-machine collaboration in organic chemistry. Beyond retrosynthesis, DeepRetro represents a working model for how to leverage LLMs in scientific discovery.
We provide a transparent account of the system's design, algorithms, and human-feedback loop, enabling broad adaptation across scientific domains. By releasing DeepRetro as an open-source tool, we aim to empower chemists to tackle increasingly ambitious synthetic targets, accelerating progress in drug discovery, materials design, and beyond.

Submitted 19 August, 2025; v1 submitted 7 July, 2025; originally announced July 2025.

Comments: 64 pages,

arXiv:2412.13519 [pdf, other]

Open-Source Protein Language Models for Function Prediction and Protein Design

Authors: Shivasankaran Vanaja Pandi, Bharath Ramsundar

Abstract: Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering. However, training these models from scratch requires significant computational resources, limiting their accessibility. To address this, we integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks. We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks. Additionally, we present an exploration of generating plastic-degrading enzyme candidates using the model's embeddings and latent space manipulation techniques. While the results suggest that further refinement is needed, this approach provides a foundation for future work in enzyme design. This study aims to facilitate the use of PLMs in research fields like synthetic biology and environmental sustainability, even for those with limited computational resources.

Submitted 18 December, 2024; originally announced December 2024.

Comments: To be published in 4th Annual AAAI workshop on AI to Accelerate Science and Engineering

arXiv:2412.08658 [pdf, other]

Open-source Polymer Generative Pipeline

Authors: Debasish Mohanty, V Shreyas, Akshaya Palai, Bharath Ramsundar

Abstract: Polymers play a crucial role in the development of engineering materials, with applications ranging from mechanical to biomedical fields. However, the limited polymerization processes constrain the variety of organic building blocks that can be experimentally tested. We propose an open-source computational generative pipeline that integrates neural-network-based discriminators, generators, and query-based filtration mechanisms to overcome this limitation and generate hypothetical polymers. The pipeline targets properties, such as ionization potential (IP), by aligning various representational formats to generate hypothetical polymer candidates. The discriminators demonstrate improvements over state-of-the-art models due to optimized architecture, while the generators produce novel polymers tailored to the desired property range. We conducted extensive evaluations to assess the generative performance of the pipeline components, focusing on the polymers' ionization potential (IP). The developed pipeline is integrated into the DeepChem framework, enhancing its accessibility and compatibility for various polymer generation studies.

Submitted 29 November, 2024; originally announced December 2024.

arXiv:2411.11513 [pdf, ps, other]

A Modular Open Source Framework for Genomic Variant Calling

Authors: Ankita Vaishnobi Bisoi, Shreyas V, Jose Siguenza, Bharath Ramsundar

Abstract: Variant calling is a fundamental task in genomic research, essential for detecting genetic variations such as single nucleotide polymorphisms (SNPs) and insertions or deletions (indels). This paper presents an enhancement to DeepChem, a widely used open-source drug discovery framework, through the integration of DeepVariant. In particular, we introduce a variant calling pipeline that leverages DeepVariant's convolutional neural network (CNN) architecture to improve the accuracy and reliability of variant detection. The implemented pipeline includes stages for realignment of sequencing reads, candidate variant detection, and pileup image generation, followed by variant classification using a modified Inception v3 model. Our work adds a modular and extensible variant calling framework to the DeepChem framework and enables future work integrating DeepChem's drug discovery infrastructure more tightly with bioinformatics pipelines.

Submitted 28 July, 2025; v1 submitted 18 November, 2024; originally announced November 2024.

arXiv:2409.08163 [pdf, other]

Open Source Infrastructure for Automatic Cell Segmentation

Authors: Aaron Rock Menezes, Bharath Ramsundar

Abstract: Automated cell segmentation is crucial for various biological and medical applications, facilitating tasks like cell counting, morphology analysis, and drug discovery. However, manual segmentation is time-consuming and prone to subjectivity, necessitating robust automated methods. This paper presents open-source infrastructure, utilizing the UNet model, a deep-learning architecture noted for its effectiveness in image segmentation tasks. This implementation is integrated into the open-source DeepChem package, enhancing accessibility and usability for researchers and practitioners. The resulting tool offers a convenient and user-friendly interface, reducing the barrier to entry for cell segmentation while maintaining high accuracy. Additionally, we benchmark this model against various datasets, demonstrating its robustness and versatility across different imaging conditions and cell types.

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2408.06261 [pdf, other]

Open-Source Molecular Processing Pipeline for Generating Molecules

Authors: V Shreyas, Jose Siguenza, Karan Bania, Bharath Ramsundar

Abstract: Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem [Ramsundar et al., 2019] library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch [Paszke et al., 2019] implementations of the Molecular Generative Adversarial Networks (MolGAN) [Cao and Kipf, 2022] and Normalizing Flows [Papamakarios et al., 2021]. Our implementations show strong performance comparable with past work [Kuznetsov and Polykovskiy, 2021, Cao and Kipf, 2022].

Submitted 28 November, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

Comments: Presented at the Molecular Machine Learning Conference 2024 (MoML 2024), BayLearn 2024 and the Machine Learning and Physical Sciences (ML4PS) Workshop at NeurIPS 2024

arXiv:2209.01712 [pdf, other]

ChemBERTa-2: Towards Chemical Foundation Models

Authors: Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

Abstract: Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks.

Submitted 4 September, 2022; originally announced September 2022.

Comments: ELLIS Machine Learning for Molecule Discovery Workshop

ACM Class: I.2.7; I.2.1; J.2; J.3

arXiv:2203.04698 [pdf, other]

Score-Based Generative Models for Molecule Generation

Authors: Dwaraknath Gnaneshwar, Bharath Ramsundar, Dhairya Gandhi, Rachel Kurchin, Venkatasubramanian Viswanathan

Abstract: Recent advances in generative models have made exploring design spaces easier for de novo molecule generation. However, popular generative models like GANs and normalizing flows face challenges such as training instabilities due to adversarial training and architectural constraints, respectively. Score-based generative models sidestep these challenges by modelling the gradient of the log probability density using a score function approximation, as opposed to modelling the density function directly, and sampling from it using annealed Langevin Dynamics. We believe that score-based generative models could open up new opportunities in molecule generation due to their architectural flexibility, such as replacing the score function with an SE(3) equivariant model. In this work, we lay the foundations by testing the efficacy of score-based models for molecule generation. We train a Transformer-based score function on Self-Referencing Embedded Strings (SELFIES) representations of 1.5 million samples from the ZINC dataset and use the Moses benchmarking framework to evaluate the generated samples on a suite of metrics.

Submitted 7 March, 2022; originally announced March 2022.

arXiv:2103.02723 [pdf]

doi 10.1128/mSystems.00233-21

Identification and Development of Therapeutics for COVID-19

Authors: Halie M. Rando, Nils Wellhausen, Soumita Ghosh, Alexandra J. Lee, Anna Ada Dattoli, Fengling Hu, James Brian Byrd, Diane N. Rafizadeh, Ronan Lordan, Yanjun Qi, Yuchen Sun, Christian Brueffer, Jeffrey M. Field, Marouen Ben Guebila, Nafisa M. Jadavji, Ashwin N. Skelly, Bharath Ramsundar, Jinhui Wang, Rishi Raj Goel, YoSon Park, the COVID-19 Review Consortium, Simina M. Boca, Anthony Gitter, Casey S. Greene

Abstract: After emerging in China in late 2019, the novel Severe acute respiratory syndrome-like coronavirus 2 (SARS-CoV-2) spread worldwide and as of early 2021, continues to significantly impact most countries. Only a small number of coronaviruses are known to infect humans, and only two are associated with the severe outcomes associated with SARS-CoV-2: Severe acute respiratory syndrome-related coronavirus, a closely related species of SARS-CoV-2 that emerged in 2002, and Middle East respiratory syndrome-related coronavirus, which emerged in 2012. Both of these previous epidemics were controlled fairly rapidly through public health measures, and no vaccines or robust therapeutic interventions were identified. However, previous insights into the immune response to coronaviruses gained during the outbreaks of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) have proved beneficial to identifying approaches to the treatment and prophylaxis of novel coronavirus disease 2019 (COVID-19). A number of potential therapeutics against SARS-CoV-2 and the resultant COVID-19 illness were rapidly identified, leading to a large number of clinical trials investigating a variety of possible therapeutic approaches being initiated early on in the pandemic. As a result, a small number of therapeutics have already been authorized by regulatory agencies such as the Food and Drug Administration (FDA) in the United States, and many other therapeutics remain under investigation.
Here, we describe a range of approaches for the treatment of COVID-19, along with their proposed mechanisms of action and the current status of clinical investigation into each candidate. The status of these investigations will continue to evolve, and this review will be updated as progress is made.

Submitted 10 September, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

arXiv:2010.09885 [pdf, other]

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

Authors: Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

Abstract: GNNs and chemical fingerprints are the predominant approaches to representing molecules for property prediction. However, in NLP, transformers have become the de-facto standard for representation learning thanks to their strong downstream task transfer. In parallel, the software ecosystem around transformers is maturing rapidly, with libraries like HuggingFace and BertViz enabling streamlined training and introspection. In this work, we make one of the first attempts to systematically evaluate transformers on molecular property prediction tasks via our ChemBERTa model. ChemBERTa scales well with pretraining dataset size, offering competitive downstream performance on MoleculeNet and useful attention-based visualization modalities. Our results suggest that transformers offer a promising avenue of future work for molecular representation learning and property prediction. To facilitate these efforts, we release a curated dataset of 77M SMILES from PubChem suitable for large-scale self-supervised pretraining.

Submitted 23 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: Submitted to NeurIPS 2020 ML for Molecules Workshop

ACM Class: I.2.7; I.2.1; J.2; J.3

arXiv:1911.05211 [pdf, other]

AMPL: A Data-Driven Modeling Pipeline for Drug Discovery

Authors: Amanda J. Minnich, Kevin McLoughlin, Margaret Tse, Jason Deng, Andrew Weber, Neha Murad, Benjamin D. Madej, Bharath Ramsundar, Tom Rush, Stacie Calad-Thomson, Jim Brase, Jonathan E. Allen

Abstract: One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of machine learning and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical datasets covering a wide range of parameters. As a result of these comprehensive experiments, we have found that physicochemical descriptors and deep learning-based graph representations significantly outperform traditional fingerprints in the characterization of molecular features. We have also found that dataset size is directly correlated to prediction performance, and that single-task deep learning models only outperform shallow learners if there is sufficient data. Likewise, dataset size has a direct impact on model predictivity, independent of comprehensive hyperparameter model tuning. Our findings point to the need for public dataset integration or multi-task/transfer learning approaches. Lastly, we found that uncertainty quantification (UQ) analysis may help identify model error; however, efficacy of UQ to filter predictions varies considerably between datasets and featurization/model types.
AMPL is open source and available for download at http://github.com/ATOMconsortium/AMPL.

Submitted 13 November, 2019; v1 submitted 12 November, 2019; originally announced November 2019.

arXiv:1706.01643 [pdf]

Retrosynthetic reaction prediction using neural sequence-to-sequence models

Authors: Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, Vijay Pande

Abstract: We describe a fully data driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture that consists of two recurrent neural networks, which has previously shown great success in solving other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step towards solving the challenging problem of computational retrosynthetic analysis.

Submitted 6 June, 2017; originally announced June 2017.

arXiv:1405.1444 [pdf, other]

Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models

Authors: Robert T. McGibbon, Bharath Ramsundar, Mohammad M. Sultan, Gert Kiss, Vijay S. Pande

Abstract: We present a machine learning framework for modeling protein dynamics. Our approach uses L1-regularized, reversible hidden Markov models to understand large protein datasets generated via molecular dynamics simulations. Our model is motivated by three design principles: (1) the requirement of massive scalability; (2) the need to adhere to relevant physical law; and (3) the necessity of providing accessible interpretations, critical for both cellular biology and rational drug design. We present an EM algorithm for learning and introduce a model selection criterion based on the physical notion of convergence in relaxation timescales. We contrast our model with standard methods in biophysics and demonstrate improved robustness. We implement our algorithm on GPUs and apply the method to two large protein simulation datasets generated respectively on the NCSA Bluewaters supercomputer and the Folding@Home distributed computing network. Our analysis identifies the conformational dynamics of the ubiquitin protein critical to cellular signaling, and elucidates the stepwise activation mechanism of the c-Src kinase protein.

Submitted 6 May, 2014; originally announced May 2014.

Journal ref: Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014

Showing 1–13 of 13 results for author: Ramsundar, B