-
A Large Encoder-Decoder Family of Foundation Models For Chemical Language
Authors:
Eduardo Soares,
Victor Shirasuna,
Emilio Vital Brazil,
Renato Cerqueira,
Dmitry Zubarev,
Kristin Schmidt
Abstract:
Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening the understanding of chemical language representations. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offers flexibility with two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capacity of the proposed model to provide state-of-the-art results for different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art, with few-shot learning capabilities.
Submitted 24 July, 2024;
originally announced July 2024.
-
Quantum-centric Supercomputing for Materials Science: A Perspective on Challenges and Future Directions
Authors:
Yuri Alexeev,
Maximilian Amsler,
Paul Baity,
Marco Antonio Barroca,
Sanzio Bassini,
Torey Battelle,
Daan Camps,
David Casanova,
Young Jai Choi,
Frederic T. Chong,
Charles Chung,
Chris Codella,
Antonio D. Corcoles,
James Cruise,
Alberto Di Meglio,
Jonathan Dubois,
Ivan Duran,
Thomas Eckl,
Sophia Economou,
Stephan Eidenbenz,
Bruce Elmegreen,
Clyde Fare,
Ismael Faro,
Cristina Sanz Fernández,
Rodrigo Neumann Barros Ferreira
, et al. (102 additional authors not shown)
Abstract:
Computational models are an essential tool for the design, characterization, and discovery of novel materials. Hard computational tasks in materials science stretch the limits of existing high-performance supercomputing centers, consuming much of their simulation, analysis, and data resources. Quantum computing, on the other hand, is an emerging technology with the potential to accelerate many of the computational tasks needed for materials science. In order to do that, the quantum technology must interact with conventional high-performance computing in several ways: approximate results validation, identification of hard problems, and synergies in quantum-centric supercomputing. In this paper, we provide a perspective on how quantum-centric supercomputing can help address critical computational problems in materials science, the challenges to face in order to solve representative use cases, and new suggested directions.
Submitted 19 September, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Formulation Graphs for Mapping Structure-Composition of Battery Electrolytes to Device Performance
Authors:
Vidushi Sharma,
Maxwell Giammona,
Dmitry Zubarev,
Andy Tek,
Khanh Nugyuen,
Linda Sundberg,
Daniele Congiu,
Young-Hye La
Abstract:
Advanced computational methods are being actively sought to address the challenges associated with the discovery and development of new combinatorial materials such as formulations. A widely adopted approach involves domain-informed high-throughput screening of individual components that can be combined into a formulation. This accelerates the discovery of new compounds for a target application but still leaves the process of identifying the right 'formulation' from the shortlisted chemical space largely a laboratory experiment-driven process. We report a deep learning model, Formulation Graph Convolution Network (F-GCN), that can map the structure-composition relationships of the individual components to the property of the liquid formulation as a whole. Multiple GCNs are assembled in parallel that featurize formulation constituents domain-intuitively on the fly. The resulting molecular descriptors are scaled by each constituent's molar percentage in the formulation and then combined into a single descriptor that represents the complete formulation to an external learning architecture. The use case of the proposed formulation learning model is demonstrated for battery electrolytes by training and testing it on two exemplary datasets representing electrolyte formulations vs battery performance: one dataset is sourced from literature about Li/Cu half-cells, while the other is obtained from lab experiments related to lithium-iodide full-cell chemistry. The model is shown to predict performance metrics such as Coulombic Efficiency (CE) and specific capacity of new electrolyte formulations with the lowest reported errors. The best performing F-GCN model uses molecular descriptors derived from molecular graphs that are informed with HOMO-LUMO and electric moment properties of the molecules using a knowledge transfer technique.
Submitted 28 September, 2023; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction
Authors:
Eduardo Soares,
Emilio Vital Brazil,
Karen Fiorela Aquino Gutierrez,
Renato Cerqueira,
Dan Sanders,
Kristin Schmidt,
Dmitry Zubarev
Abstract:
We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MOLFORMER. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate a superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, in predicting complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field.
Submitted 22 June, 2023;
originally announced June 2023.
-
Human-AI Co-Creation Approach to Find Forever Chemicals Replacements
Authors:
Juliana Jansen Ferreira,
Vinícius Segura,
Joana G. R. Souza,
Gabriel D. J. Barbosa,
João Gallas,
Renato Cerqueira,
Dmitry Zubarev
Abstract:
Generative models are a powerful tool in AI for material discovery. We are designing a software framework that supports a human-AI co-creation process to accelerate finding replacements for "forever chemicals": chemicals that enable our modern lives but are harmful to the environment and human health. Our approach combines AI capabilities with the domain-specific tacit knowledge of subject matter experts to accelerate material discovery. Our co-creation process starts with the interaction between the subject matter experts and a generative model that can generate new molecule designs. In this position paper, we discuss our hypothesis that these subject matter experts can benefit from a more iterative interaction with the generative model, asking for smaller samples and "guiding" the exploration of the discovery space with their knowledge.
Submitted 11 April, 2023;
originally announced April 2023.
-
Domain-agnostic and Multi-level Evaluation of Generative Models
Authors:
Girmaw Abebe Tadesse,
Jannis Born,
Celia Cintas,
William Ogallo,
Dmitry Zubarev,
Matteo Manica,
Komminist Weldemariam
Abstract:
While the capabilities of generative models have improved substantially in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.
Submitted 20 January, 2023;
originally announced January 2023.
-
Toward Human-AI Co-creation to Accelerate Material Discovery
Authors:
Dmitry Zubarev,
Carlos Raoni Mendes,
Emilio Vital Brazil,
Renato Cerqueira,
Kristin Schmidt,
Vinicius Segura,
Juliana Jansen Ferreira,
Dan Sanders
Abstract:
There is an increasing need in our society to achieve faster advances in science to tackle urgent problems such as climate change, environmental hazards, sustainable energy systems, and pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing risks of the proposed novel solutions before moving to the experimental stage. Despite several recent advances in Machine Learning and AI to address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates into low viability of the generated candidates. In this work, we propose a workbench framework that aims at enabling human-AI co-creation to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and user-interaction components to acquire knowledge and advise the SMEs. Currently, the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment.
Submitted 5 November, 2022;
originally announced November 2022.
-
Topology-Driven Generative Completion of Lacunae in Molecular Data
Authors:
Dmitry Yu. Zubarev,
Petar Ristoski
Abstract:
We introduce an approach to the targeted completion of lacunae in molecular data sets that is driven by topological data analysis, such as the Mapper algorithm. Lacunae are filled in using scaffold-constrained generative models trained with different scoring functions. The approach enables the addition of links and vertices to skeletonized representations of the data, such as the Mapper graph, and falls in the broad category of network completion methods. We illustrate the application of the topology-driven data completion strategy by creating a lacuna in a data set of onium cations extracted from USPTO patents, and repairing it.
Submitted 29 July, 2022;
originally announced August 2022.
-
Sample-Efficient Generation of Novel Photo-acid Generator Molecules using a Deep Generative Model
Authors:
Samuel C. Hoffman,
Vijil Chenthamarakshan,
Dmitry Yu. Zubarev,
Daniel P. Sanders,
Payel Das
Abstract:
Photo-acid generators (PAGs) are compounds that release acids ($H^+$ ions) when exposed to light. These compounds are critical components of the photolithography processes that are used in the manufacture of semiconductor logic and memory chips. The exponential increase in the demand for semiconductors has highlighted the need for discovering novel photo-acid generators. While de novo molecule design using deep generative models has been widely employed for drug discovery and material design, its application to the creation of novel photo-acid generators poses several unique challenges, such as lack of property labels. In this paper, we highlight these challenges and propose a generative modeling approach that utilizes conditional generation from a pre-trained deep autoencoder and expert-in-the-loop techniques. The validity of the proposed approach was evaluated with the help of subject matter experts, indicating the promise of such an approach for applications beyond the creation of novel photo-acid generators.
Submitted 2 December, 2021;
originally announced December 2021.
-
Molecule Generation Experience: An Open Platform of Material Design for Public Users
Authors:
Seiji Takeda,
Toshiyuki Hama,
Hsiang-Han Hsu,
Akihiro Kishimoto,
Makoto Kogoh,
Takumi Hongo,
Kumiko Fujieda,
Hideaki Nakashika,
Dmitry Zubarev,
Daniel P. Sanders,
Jed W. Pitera,
Junta Fuchiwaki,
Daiju Nakano
Abstract:
Artificial Intelligence (AI)-driven material design has been attracting great attention as a groundbreaking technology across a wide spectrum of industries. Molecular design is particularly important owing to its broad application domains and the boundless creativity attributed to progress in generative models. The recent maturity of molecular generative models has stimulated expectations for practical use among potential users who are not necessarily familiar with coding or scripting, such as experimental engineers and students in chemical domains. However, most of the existing molecular generative models are Python libraries on GitHub that are accessible only to IT-savvy users. To fill this gap, we developed a graphical user interface (GUI)-based web application of molecular generative models, Molecule Generation Experience, that is open to the general public. This is the first web application of molecular generative models enabling users to work with built-in datasets to carry out molecular design. In this paper, we describe the background technology extended from our previous work. Our new online evaluation and structural filtering algorithms improved the generation speed by 30 to 1,000 times with a wider structural variety, satisfying chemical stability and synthetic reality. We also describe in detail our Kubernetes-based scalable cloud architecture and user-oriented GUI, which are necessary components to achieve a public service. Finally, we present actual use cases in industrial research to design new photoacid generators (PAGs) as well as release cases in educational events.
Submitted 6 August, 2021;
originally announced August 2021.
-
Molecular Inverse-Design Platform for Material Industries
Authors:
Seiji Takeda,
Toshiyuki Hama,
Hsiang-Han Hsu,
Victoria A. Piunova,
Dmitry Zubarev,
Daniel P. Sanders,
Jed W. Pitera,
Makoto Kogoh,
Takumi Hongo,
Yenwei Cheng,
Wolf Bocanett,
Hideaki Nakashika,
Akihiro Fujita,
Yuta Tsuchiya,
Katsuhiko Hino,
Kentaro Yano,
Shuichi Hirose,
Hiroki Toda,
Yasumitsu Orii,
Daiju Nakano
Abstract:
The discovery of new materials has been the essential force that brings discontinuous improvement to industrial products' performance. However, the extremely vast combinatorial design space of material structures exceeds human experts' capability to explore it fully, thereby hampering material development. In this paper, we present a material industry-oriented web platform of an AI-driven molecular inverse-design system, which automatically designs brand new molecular structures rapidly and diversely. Unlike existing inverse-design solutions, in this system the combination of substructure-based feature encoding and molecular graph generation algorithms gives the user a high-speed, interpretable, and customizable design process. Also, a hierarchical data structure and user-oriented UI provide a flexible and intuitive workflow. The system is deployed on IBM's and our client's cloud servers and has been used by 5 partner companies. To illustrate actual industrial use cases, we exhibit inverse-design of sugar and dye molecules that were carried out by experimental chemists in those client companies. Compared to a typical human chemist's standard performance, the molecular design speed was accelerated more than 10 times, and greatly increased variety was observed in the inverse-designed molecules without loss of chemical realism.
Submitted 16 May, 2020; v1 submitted 23 April, 2020;
originally announced April 2020.
-
AI-driven Inverse Design System for Organic Molecules
Authors:
Seiji Takeda,
Toshiyuki Hama,
Hsiang-Han Hsu,
Toshiyuki Yamane,
Koji Masuda,
Victoria A. Piunova,
Dmitry Zubarev,
Jed Pitera,
Daniel P. Sanders,
Daiju Nakano
Abstract:
Designing novel materials that possess desired properties is a central need across many manufacturing industries. Driven by that industrial need, a variety of algorithms and tools have been developed that combine AI (machine learning and analytics) with domain knowledge in physics, chemistry, and materials science. AI-driven materials design can be divided into two main stages: the first is the modeling stage, where the goal is to build an accurate regression or classification model to predict material properties (e.g. glass transition temperature) or attributes (e.g. toxic/non-toxic). The next stage is design, where the goal is to assemble or tune material structures so that they can achieve user-demanded target property values based on a prediction model that is trained in the modeling stage. For maximum benefit, these two stages should be architected to form a coherent workflow. Today there are several emerging services and tools for AI-driven material design; however, most of them provide only partial technical components (e.g. data analyzer, regression model, structure generator, etc.) that are useful for specific purposes, whereas comprehensive material design requires those components to be orchestrated appropriately. Our material design system provides an end-to-end solution to this problem, with a workflow that consists of data input, feature encoding, prediction modeling, solution search, and structure generation. The system builds a regression model to predict properties, solves an inverse problem on the trained model, and generates novel chemical structure candidates that satisfy the target properties. In this paper we introduce the methodology of our system and demonstrate a simple example of inverse design generating new chemical structures that satisfy targeted physical property values.
Submitted 20 January, 2020;
originally announced January 2020.
-
Data Infrastructure and Approaches for Ontology-Based Drug Repurposing
Authors:
Stephen Boyer,
Thomas Griffin,
Sarath Swaminathan,
Kenneth L. Clarkson,
Dmitry Zubarev
Abstract:
We report the development of a data infrastructure for drug repurposing that takes advantage of two currently available chemical ontologies. The data infrastructure includes a database of compound-target associations augmented with molecular ontological labels. It also contains two computational tools for prediction of new associations. We describe two drug-repurposing systems: one, Nascent Ontological Information Retrieval for Drug Repurposing (NOIR-DR), based on an information retrieval strategy, and another, based on non-negative matrix factorization together with compound similarity, that was inspired by recommender systems. We report the performance of both tools on a drug-repurposing task.
Submitted 12 July, 2018;
originally announced July 2018.
-
Diagnostics of Data-Driven Models: Uncertainty Quantification of PM7 Semi-Empirical Quantum Chemical Method
Authors:
James Oreluk,
Zhenyuan Liu,
Arun Hegde,
Wenyu Li,
Andrew Packard,
Michael Frenklach,
Dmitry Zubarev
Abstract:
We report an evaluation of a semi-empirical quantum chemical method PM7 from the perspective of uncertainty quantification. Specifically, we apply Bound-to-Bound Data Collaboration, an uncertainty quantification framework, to characterize a) variability of PM7 model parameter values consistent with the uncertainty in the training data, and b) uncertainty propagation from the training data to the model predictions. Experimental heats of formation of a homologous series of linear alkanes are used as the property of interest. The training data are chemically accurate, i.e., they have very low uncertainty by the standards of computational chemistry. The analysis does not find evidence of PM7 consistency with the entire data set considered as no single set of parameter values is found that captures the experimental uncertainties of all training data. Nevertheless, PM7 is found to be consistent for subsets of the training data. In such cases, uncertainty propagation from the chemically accurate training data to the predicted values preserves error within bounds of chemical accuracy if predictions are made for the molecules of comparable size. Otherwise, the error grows linearly with the relative size of the molecules.
Submitted 16 June, 2018; v1 submitted 12 June, 2018;
originally announced June 2018.
-
Sustainability of Transient Kinetic Regimes and Origins of Death
Authors:
Dmitry Yu. Zubarev,
Leonardo A. Pachón
Abstract:
It is generally recognized that a distinguishing feature of life is its peculiar capability to avoid equilibration. The origin of this capability and its evolution along the timeline of abiogenesis is not yet understood. We propose to study an analog of this phenomenon that could emerge in non-biological systems. To this end, we introduce the concept of sustainability of transient kinetic regimes. This concept is illustrated via investigation of cooperative effects in an extended system of compartmentalized chemical oscillators under batch and semi-batch conditions. The computational study of a model system shows robust enhancement of lifetimes of the decaying oscillations which translates into the evolution of the survival function of the transient non-equilibrium regime. This model does not rely on any form of replication. Rather, it explores the role of a structured effective environment as a contributor to the system-bath interactions that define non-equilibrium regimes. We implicate the noise produced by the effective environment of a compartmentalized oscillator as the cause of the lifetime extension.
Submitted 11 January, 2016; v1 submitted 22 July, 2015;
originally announced July 2015.