-
Precursor recommendation for inorganic synthesis by machine learning materials similarity from scientific literature
Authors: Tanjin He, Haoyan Huo, Christopher J. Bartel, Zheren Wang, Kevin Cruse, Gerbrand Ceder
Abstract:
Synthesis prediction is a key accelerator for the rapid design of advanced materials. However, determining synthesis variables such as the choice of precursor materials is challenging for inorganic materials because the sequence of reactions during heating is not well understood. In this work, we use a knowledge base of 29,900 solid-state synthesis recipes, text-mined from the scientific literature, to automatically learn which precursors to recommend for the synthesis of a novel target material. The data-driven approach learns chemical similarity of materials and refers the synthesis of a new target to precedent synthesis procedures of similar materials, mimicking human synthesis design. When proposing five precursor sets for each of 2,654 unseen test target materials, the recommendation strategy achieves a success rate of at least 82%. Our approach captures decades of heuristic synthesis data in a mathematical form, making it accessible for use in recommendation engines and autonomous laboratories.
Submitted 19 May, 2023; v1 submitted 4 February, 2023; originally announced February 2023.
-
Text-mined dataset of gold nanoparticle synthesis procedures, morphologies, and size entities
Authors: Kevin Cruse, Amalie Trewartha, Sanghoon Lee, Zheren Wang, Haoyan Huo, Tanjin He, Olga Kononova, Anubhav Jain, Gerbrand Ceder
Abstract:
Gold nanoparticles are highly desired for a range of technological applications due to their tunable properties, which are dictated by the size and shape of the constituent particles. Many heuristic methods for controlling the morphological characteristics of gold nanoparticles are well known. However, the underlying mechanisms controlling their size and shape remain poorly understood, partly due to the immense range of possible combinations of synthesis parameters. Data-driven methods can offer insight to help guide understanding of these underlying mechanisms, so long as sufficient synthesis data are available. To facilitate data mining in this direction, we have constructed and made publicly available a dataset of codified gold nanoparticle synthesis protocols and outcomes extracted directly from the nanoparticle materials science literature using natural language processing and text-mining techniques. This dataset contains 5,154 data records, each representing a single gold nanoparticle synthesis article, filtered from a database of 4,973,165 publications. Each record contains codified synthesis protocols and extracted morphological information from a total of 7,608 experimental and 12,519 characterization paragraphs.
Submitted 21 April, 2022; originally announced April 2022.
-
ULSA: Unified Language of Synthesis Actions for Representation of Synthesis Protocols
Authors: Zheren Wang, Kevin Cruse, Yuxing Fei, Ann Chia, Yan Zeng, Haoyan Huo, Tanjin He, Bowen Deng, Olga Kononova, Gerbrand Ceder
Abstract:
Applying AI power to predict syntheses of novel materials requires high-quality, large-scale datasets. Extraction of synthesis information from scientific publications is still challenging, especially for extracting synthesis actions, because of the lack of a comprehensive labeled dataset using a solid, robust, and well-established ontology for describing synthesis procedures. In this work, we propose the first Unified Language of Synthesis Actions (ULSA) for describing ceramics synthesis procedures. We created a dataset of 3,040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme. To demonstrate the capabilities of ULSA, we built a neural network-based model to map arbitrary ceramics synthesis paragraphs into ULSA and used it to construct synthesis flowcharts for synthesis procedures. Analysis for the flowcharts showed that (a) ULSA covers essential vocabulary used by researchers when describing synthesis procedures and (b) it can capture important features of synthesis protocols. This work is an important step towards creating a synthesis ontology and a solid foundation for autonomous robotic synthesis.
Submitted 23 January, 2022; originally announced January 2022.
-
Dataset of gold nanoparticle sizes and morphologies extracted from literature-mined microscopy images
Authors: Akshay Subramanian, Kevin Cruse, Amalie Trewartha, Xingzhi Wang, A. Paul Alivisatos, Gerbrand Ceder
Abstract:
The factors controlling the size and morphology of nanoparticles have so far been poorly understood. Data-driven techniques are an exciting avenue to explore this field through the identification of trends and correlations in data. However, for these techniques to be utilized, large datasets annotated with the structural attributes of nanoparticles are required. While experimental SEM/TEM images collected from controlled experiments are reliable sources of this information, large-scale collection of these images across a variety of experimental conditions is expensive and infeasible. Published scientific literature, which provides a vast source of high-quality figures including SEM/TEM images, can provide a large amount of data at a lower cost if effectively mined. In this work, we develop an automated pipeline to retrieve and analyse microscopy images from gold nanoparticle literature and provide a dataset of 4361 SEM/TEM images of gold nanoparticles along with automatically extracted size and morphology information. The dataset can be queried to obtain information about the physical attributes of gold nanoparticles and their statistical distributions.
Submitted 6 January, 2022; v1 submitted 2 December, 2021; originally announced December 2021.
-
Dataset of Solution-based Inorganic Materials Synthesis Recipes Extracted from the Scientific Literature
Authors: Zheren Wang, Olga Kononova, Kevin Cruse, Tanjin He, Haoyan Huo, Yuxing Fei, Yan Zeng, Yingzhi Sun, Zijian Cai, Wenhao Sun, Gerbrand Ceder
Abstract:
The development of a materials synthesis route is usually based on heuristics and experience. A possible new approach would be to apply data-driven approaches to learn the patterns of synthesis from past experience and use them to predict the syntheses of novel materials. However, this route is impeded by the lack of a large-scale database of synthesis formulations. In this work, we applied advanced machine learning and natural language processing techniques to construct a dataset of 35,675 solution-based synthesis "recipes" extracted from the scientific literature. Each recipe contains essential synthesis information including the precursors and target materials, their quantities, and the synthesis actions and corresponding attributes. Every recipe is also augmented with the reaction formula. Through this work, we are making freely available the first large dataset of solution-based inorganic materials synthesis recipes.
Submitted 21 November, 2021; originally announced November 2021.