-
LARC: Towards Human-level Constrained Retrosynthesis Planning through an Agentic Framework
Authors:
Frazier N. Baker,
Daniel Adu-Ampratwum,
Reza Averly,
Botao Yu,
Huan Sun,
Xia Ning
Abstract:
Large language model (LLM) agent evaluators leverage specialized tools to ground the rational decision-making of LLMs, making them well-suited to aid in scientific discoveries, such as constrained retrosynthesis planning. Constrained retrosynthesis planning is an essential, yet challenging, process within chemistry for identifying synthetic routes from commercially available starting materials to desired target molecules, subject to practical constraints. Here, we present LARC, the first LLM-based Agentic framework for Retrosynthesis planning under Constraints. LARC incorporates agentic constraint evaluation, through an Agent-as-a-Judge, directly into the retrosynthesis planning process, using agentic feedback grounded in tool-based reasoning to guide and constrain route generation. We rigorously evaluate LARC on a carefully curated set of 48 constrained retrosynthesis planning tasks across 3 constraint types. LARC achieves a 72.9% success rate on these tasks, vastly outperforming LLM baselines and approaching human expert-level success in substantially less time. The LARC framework is extensible, and serves as a first step towards an effective agentic tool or a co-scientist to human experts for constrained retrosynthesis.
Submitted 15 August, 2025;
originally announced August 2025.
-
LIDDIA: Language-based Intelligent Drug Discovery Agent
Authors:
Reza Averly,
Frazier N. Baker,
Ian A. Watson,
Xia Ning
Abstract:
Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDIA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDIA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDIA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA
Submitted 16 August, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Robustness tests for biomedical foundation models should tailor to specifications
Authors:
R. Patrick Xian,
Noah R. Baker,
Tom David,
Qiming Cui,
A. Jay Holmgren,
Stefan Bauer,
Madhumita Sushil,
Reza Abbasi-Asl
Abstract:
The rise of biomedical foundation models creates new hurdles in model testing and authorization, given their broad capabilities and susceptibility to complex distribution shifts. We suggest tailoring robustness tests according to task-dependent priorities and propose to integrate granular notions of robustness in a predefined specification to guide implementation. Our approach facilitates the standardization of robustness assessments in the model lifecycle and connects abstract AI regulatory frameworks with concrete testing procedures.
Submitted 14 August, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving
Authors:
Botao Yu,
Frazier N. Baker,
Ziru Chen,
Garrett Herb,
Boyu Gou,
Daniel Adu-Ampratwum,
Xia Ning,
Huan Sun
Abstract:
To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemToolAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemToolAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.
Submitted 26 May, 2025; v1 submitted 11 November, 2024;
originally announced November 2024.
-
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
Authors:
Ziru Chen,
Shijie Chen,
Yuting Ning,
Qianheng Zhang,
Boshi Wang,
Botao Yu,
Yifei Li,
Zeyi Liao,
Chen Wei,
Zitong Lu,
Vishal Dey,
Mingyi Xue,
Frazier N. Baker,
Benjamin Burns,
Daniel Adu-Ampratwum,
Xuhui Huang,
Xia Ning,
Song Gao,
Yu Su,
Huan Sun
Abstract:
The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using ScienceAgentBench, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1-preview with direct prompting and self-debug, which can boost the performance to 42.2%, demonstrating the effectiveness of increasing inference-time compute but with more than 10 times the cost of other LLMs. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
Submitted 31 March, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
Authors:
Botao Yu,
Frazier N. Baker,
Ziqi Chen,
Xia Ning,
Huan Sun
Abstract:
Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing research indicates that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin. To accomplish this, we propose SMolInstruct, a large-scale, comprehensive, and high-quality dataset for instruction tuning. It contains 14 selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Using SMolInstruct, we fine-tune a set of open-source LLMs, among which, we find that Mistral serves as the best base model for chemistry tasks. Our analysis further demonstrates the critical role of the proposed dataset in driving the performance improvements.
Submitted 10 August, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
RLSynC: Offline-Online Reinforcement Learning for Synthon Completion
Authors:
Frazier N. Baker,
Ziqi Chen,
Daniel Adu-Ampratwum,
Xia Ning
Abstract:
Retrosynthesis is the process of determining the set of reactant molecules that can react to form a desired product. Semi-template-based retrosynthesis methods, which imitate the reverse logic of synthesis reactions, first predict the reaction centers in the products, and then complete the resulting synthons back into reactants. We develop a new offline-online reinforcement learning method RLSynC for synthon completion in semi-template-based methods. RLSynC assigns one agent to each synthon, all of which complete the synthons by conducting actions step by step in a synchronized fashion. RLSynC learns the policy from both offline training episodes and online interactions, which allows RLSynC to explore new reaction spaces. RLSynC uses a standalone forward synthesis model to evaluate the likelihood of the predicted reactants in synthesizing a product, and thus guides the action search. Our results demonstrate that RLSynC can outperform state-of-the-art synthon completion methods with improvements as high as 14.9%, highlighting its potential in synthesis planning.
Submitted 29 March, 2024; v1 submitted 5 September, 2023;
originally announced September 2023.
-
ECP SOLLVE: Validation and Verification Testsuite Status Update and Compiler Insight for OpenMP
Authors:
Thomas Huber,
Swaroop Pophale,
Nolan Baker,
Michael Carr,
Nikhil Rao,
Jaydon Reap,
Kristina Holsapple,
Joshua Hoke Davis,
Tobias Burnus,
Seyong Lee,
David E. Bernholdt,
Sunita Chandrasekaran
Abstract:
The OpenMP language continues to evolve with every new specification release, as does the need to validate and verify the new features that have been introduced. With the release of OpenMP 5.0 and OpenMP 5.1, plenty of new target offload and host-based features have been introduced to the programming model. While OpenMP continues to grow in maturity, there is an observable growth in the number of compiler and hardware vendors that support OpenMP. In this manuscript, we focus on evaluating the conformity and implementation progress of various compiler vendors such as Cray, IBM, GNU, Clang/LLVM, NVIDIA, Intel and AMD. We specifically address the 4.5, 5.0, and 5.1 versions of the specification.
Submitted 14 November, 2022; v1 submitted 28 August, 2022;
originally announced August 2022.
-
Validation and Transparency in AI systems for pharmacovigilance: a case study applied to the medical literature monitoring of adverse events
Authors:
Bruno Ohana,
Jack Sullivan,
Nicole Baker
Abstract:
Recent advances in artificial intelligence applied to biomedical text are opening exciting opportunities for improving pharmacovigilance activities currently burdened by the ever growing volumes of real world data. To fully realize these opportunities, existing regulatory guidance and industry best practices should be taken into consideration in order to increase the overall trustworthiness of the system and enable broader adoption. In this paper we present a case study on how to operationalize existing guidance for validated AI systems in pharmacovigilance focusing on the specific task of medical literature monitoring (MLM) of adverse events from the scientific literature. We describe an AI system designed with the goal of reducing effort in MLM activities built in close collaboration with subject matter experts and considering guidance for validated systems in pharmacovigilance and AI transparency. In particular we make use of public disclosures as a useful risk control measure to mitigate system misuse and earn user trust. In addition we present experimental results showing the system can significantly remove screening effort while maintaining high levels of recall (filtering 55% of irrelevant articles on average, for a target recall of 0.99 on suspected adverse articles) and provide a robust method for tuning the desired recall to suit a particular risk profile.
Submitted 21 December, 2021;
originally announced January 2022.
-
Q# and NWChem: Tools for Scalable Quantum Chemistry on Quantum Computers
Authors:
Guang Hao Low,
Nicholas P. Bauman,
Christopher E. Granade,
Bo Peng,
Nathan Wiebe,
Eric J. Bylaska,
Dave Wecker,
Sriram Krishnamoorthy,
Martin Roetteler,
Karol Kowalski,
Matthias Troyer,
Nathan A. Baker
Abstract:
Fault-tolerant quantum computation promises to solve outstanding problems in quantum chemistry within the next decade. Realizing this promise requires scalable tools that allow users to translate descriptions of electronic structure problems to optimized quantum gate sequences executed on physical hardware, without requiring specialized quantum computing knowledge. To this end, we present a quantum chemistry library, under the open-source MIT license, that implements and enables straightforward use of state-of-art quantum simulation algorithms. The library is implemented in Q#, a language designed to express quantum algorithms at scale, and interfaces with NWChem, a leading electronic structure package. We define a standardized schema for this interface, Broombridge, that describes second-quantized Hamiltonians, along with metadata required for effective quantum simulation, such as trial wavefunction ansatzes. This schema is generated for arbitrary molecules by NWChem, conveniently accessible, for instance, through Docker containers and a recently developed web interface EMSL Arrows. We illustrate use of the library with various examples, including ground- and excited-state calculations for LiH, H$_{10}$, and C$_{20}$ with an active-space simplification, and automatically obtain resource estimates for classically intractable examples.
Submitted 1 April, 2019;
originally announced April 2019.
-
How Much Chemistry Does a Deep Neural Network Need to Know to Make Accurate Predictions?
Authors:
Garrett B. Goh,
Charles Siegel,
Abhinav Vishnu,
Nathan O. Hodas,
Nathan Baker
Abstract:
The meteoric rise of deep learning models in computer vision research, having achieved human-level accuracy in image recognition tasks is firm evidence of the impact of representation learning of deep neural networks. In the chemistry domain, recent advances have also led to the development of similar CNN models, such as Chemception, that is trained to predict chemical properties using images of molecular drawings. In this work, we investigate the effects of systematically removing and adding localized domain-specific information to the image channels of the training data. By augmenting images with only 3 additional basic information, and without introducing any architectural changes, we demonstrate that an augmented Chemception (AugChemception) outperforms the original model in the prediction of toxicity, activity, and solvation free energy. Then, by altering the information content in the images, and examining the resulting model's performance, we also identify two distinct learning patterns in predicting toxicity/activity as compared to solvation free energy. These patterns suggest that Chemception is learning about its tasks in the manner that is consistent with established knowledge. Thus, our work demonstrates that advanced chemical knowledge is not a pre-requisite for deep learning models to accurately predict complex chemical properties.
Submitted 18 March, 2018; v1 submitted 5 October, 2017;
originally announced October 2017.
-
Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expert-developed QSAR/QSPR Models
Authors:
Garrett B. Goh,
Charles Siegel,
Abhinav Vishnu,
Nathan O. Hodas,
Nathan Baker
Abstract:
In the last few years, we have seen the transformative impact of deep learning in many applications, particularly in speech recognition and computer vision. Inspired by Google's Inception-ResNet deep convolutional neural network (CNN) for image classification, we have developed "Chemception", a deep CNN for the prediction of chemical properties, using just the images of 2D drawings of molecules. We develop Chemception without providing any additional explicit chemistry knowledge, such as basic concepts like periodicity, or advanced features like molecular descriptors and fingerprints. We then show how Chemception can serve as a general-purpose neural network architecture for predicting toxicity, activity, and solvation properties when trained on a modest database of 600 to 40,000 compounds. When compared to multi-layer perceptron (MLP) deep neural networks trained with ECFP fingerprints, Chemception slightly outperforms in activity and solvation prediction and slightly underperforms in toxicity prediction. Having matched the performance of expert-developed QSAR/QSPR deep learning models, our work demonstrates the plausibility of using deep neural networks to assist in computational chemistry research, where the feature engineering process is performed primarily by a deep learning algorithm.
Submitted 20 June, 2017;
originally announced June 2017.
-
ChinMotion Rapidly Enables 3D Computer Interaction after Tetraplegia
Authors:
Ferran Galán,
Stuart N. Baker,
Monica A. Perez
Abstract:
Individuals with severe paralysis require hands-free interfaces to control assistive devices that can improve their quality of life. We present ChinMotion, an interface that noninvasively harnesses preserved chin, lip and tongue sensorimotor function after tetraplegia to convey intuitive control commands. After two hours of practice, ChinMotion enables superior point-and-click performance over existing interfaces and it facilitates accurate 3D control of a virtual robotic arm.
Submitted 8 June, 2016;
originally announced June 2016.
-
Data-driven parameterization of the generalized Langevin equation
Authors:
Huan Lei,
Nathan Baker,
Xiantao Li
Abstract:
We present a data-driven approach to determine the memory kernel and random noise in generalized Langevin equations. To facilitate practical implementations, we parameterize the kernel function in the Laplace domain by a rational function, with coefficients directly linked to the equilibrium statistics of the coarse-grain variables. We show that such an approximation can be constructed to arbitrarily high order and the resulting generalized Langevin dynamics can be embedded in an extended stochastic model without explicit memory. We demonstrate how to introduce the stochastic noise so that the second fluctuation-dissipation theorem is exactly satisfied. Results from several numerical tests are presented to demonstrate the effectiveness of the proposed method.
Submitted 9 June, 2016; v1 submitted 8 June, 2016;
originally announced June 2016.
-
Context-Aware Service Utilisation in the Clouds and Energy Conservation
Authors:
Saad Liaquat Kiani,
Ashiq Anjum,
Nick Antonopoulos,
Michael Knappmeyer,
Nigel Baker,
Richard McClatchey
Abstract:
Ubiquitous computing environments are characterised by smart, interconnected artefacts embedded in our physical world that are projected to provide useful services to human inhabitants unobtrusively. Mobile devices are becoming the primary tools of human interaction with these embedded artefacts and utilisation of services available in smart computing environments such as clouds. Advancements in capabilities of mobile devices allow a number of user and environment related context consumers to be hosted on these devices. Without a coordinating component, these context consumers and providers are a potential burden on device resources; specifically the effect of uncoordinated computation and communication with cloud-enabled services can negatively impact the battery life. Therefore energy conservation is a major concern in realising the collaboration and utilisation of mobile device based context-aware applications and cloud based services. This paper presents the concept of a context-brokering component to aid in coordination and communication of context information between mobile devices and services deployed in a cloud infrastructure. A prototype context broker is experimentally analysed for effects on energy conservation when accessing and coordinating with cloud services on a smart device, with results signifying reduction in energy consumption.
Submitted 24 February, 2012;
originally announced February 2012.