-
A GPT-4 Reticular Chemist for Guiding MOF Discovery
Authors:
Zhiling Zheng,
Zichao Rong,
Nakul Rampal,
Christian Borgs,
Jennifer T. Chayes,
Omar M. Yaghi
Abstract:
We present a new framework integrating the AI model GPT-4 into the iterative process of reticular chemistry experimentation, leveraging a cooperative workflow of interaction between AI and a human researcher. This GPT-4 Reticular Chemist is an integrated system composed of three phases. Each of these utilizes GPT-4 in various capacities, wherein GPT-4 provides detailed instructions for chemical ex…
▽ More
We present a new framework integrating the AI model GPT-4 into the iterative process of reticular chemistry experimentation, leveraging a cooperative workflow of interaction between AI and a human researcher. This GPT-4 Reticular Chemist is an integrated system composed of three phases. Each of these utilizes GPT-4 in various capacities, wherein GPT-4 provides detailed instructions for chemical experimentation and the human provides feedback on the experimental outcomes, including both success and failures, for the in-context learning of AI in the next iteration. This iterative human-AI interaction enabled GPT-4 to learn from the outcomes, much like an experienced chemist, by a prompt-learning strategy. Importantly, the system is based on natural language for both development and operation, eliminating the need for coding skills, and thus, make it accessible to all chemists. Our collaboration with GPT-4 Reticular Chemist guided the discovery of an isoreticular series of MOFs, with each synthesis fine-tuned through iterative feedback and expert suggestions. This workflow presents a potential for broader applications in scientific research by harnessing the capability of large language models like GPT-4 to enhance the feasibility and efficiency of research activities.
△ Less
Submitted 3 October, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
Authors:
Zhiling Zheng,
Oufan Zhang,
Christian Borgs,
Jennifer T. Chayes,
Omar M. Yaghi
Abstract:
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions from diverse formats and styles of the scientific literature. This effectively mitigates ChatGPT's tendency to hallucinate information -- an issue that previously made the use of Large Language Models (LLMs) in scientific fields challenging. Our approach involves the…
▽ More
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions from diverse formats and styles of the scientific literature. This effectively mitigates ChatGPT's tendency to hallucinate information -- an issue that previously made the use of Large Language Models (LLMs) in scientific fields challenging. Our approach involves the development of a workflow implementing three different processes for text mining, programmed by ChatGPT itself. All of them enable parsing, searching, filtering, classification, summarization, and data unification with different tradeoffs between labor, speed, and accuracy. We deploy this system to extract 26,257 distinct synthesis parameters pertaining to approximately 800 MOFs sourced from peer-reviewed research articles. This process incorporates our ChemPrompt Engineering strategy to instruct ChatGPT in text mining, resulting in impressive precision, recall, and F1 scores of 90-99%. Furthermore, with the dataset built by text mining, we constructed a machine-learning model with over 86% accuracy in predicting MOF experimental crystallization outcomes and preliminarily identifying important factors in MOF crystallization. We also developed a reliable data-grounded MOF chatbot to answer questions on chemical reactions and synthesis procedures. Given that the process of using ChatGPT reliably mines and tabulates diverse MOF synthesis information in a unified format, while using only narrative language requiring no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be very useful across various other chemistry sub-disciplines.
△ Less
Submitted 19 July, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials
Authors:
Shengchao Liu,
Weitao Du,
Yanjing Li,
Zhuoxinran Li,
Zhiling Zheng,
Chenru Duan,
Zhiming Ma,
Omar Yaghi,
Anima Anandkumar,
Christian Borgs,
Jennifer Chayes,
Hongyu Guo,
Jian Tang
Abstract:
Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their g…
▽ More
Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, & biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 46 diverse datasets, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications.
△ Less
Submitted 15 June, 2023;
originally announced June 2023.
-
Graphons: A Nonparametric Method to Model, Estimate, and Design Algorithms for Massive Networks
Authors:
Christian Borgs,
Jennifer T. Chayes
Abstract:
Many social and economic systems are naturally represented as networks, from off-line and on-line social networks, to bipartite networks, like Netflix and Amazon, between consumers and products. Graphons, developed as limits of graphs, form a natural, nonparametric method to describe and estimate large networks like Facebook and LinkedIn. Here we describe the development of the theory of graphons,…
▽ More
Many social and economic systems are naturally represented as networks, from off-line and on-line social networks, to bipartite networks, like Netflix and Amazon, between consumers and products. Graphons, developed as limits of graphs, form a natural, nonparametric method to describe and estimate large networks like Facebook and LinkedIn. Here we describe the development of the theory of graphons, for both dense and sparse networks, over the last decade. We also review theorems showing that we can consistently estimate graphons from massive networks in a wide variety of models. Finally, we show how to use graphons to estimate missing links in a sparse network, which has applications from estimating social and information networks in development economics, to rigorously and efficiently doing collaborative filtering with applications to movie recommendations in Netflix and product suggestions in Amazon.
△ Less
Submitted 4 June, 2017;
originally announced June 2017.
-
Maximizing Social Influence in Nearly Optimal Time
Authors:
Christian Borgs,
Michael Brautbar,
Jennifer Chayes,
Brendan Lucier
Abstract:
Diffusion is a fundamental graph process, underpinning such phenomena as epidemic disease contagion and the spread of innovation by word-of-mouth. We address the algorithmic problem of finding a set of k initial seed nodes in a network so that the expected size of the resulting cascade is maximized, under the standard independent cascade model of network diffusion. Runtime is a primary considerati…
▽ More
Diffusion is a fundamental graph process, underpinning such phenomena as epidemic disease contagion and the spread of innovation by word-of-mouth. We address the algorithmic problem of finding a set of k initial seed nodes in a network so that the expected size of the resulting cascade is maximized, under the standard independent cascade model of network diffusion. Runtime is a primary consideration for this problem due to the massive size of the relevant input networks.
We provide a fast algorithm for the influence maximization problem, obtaining the near-optimal approximation factor of (1 - 1/e - epsilon), for any epsilon > 0, in time O((m+n)k log(n) / epsilon^2). Our algorithm is runtime-optimal (up to a logarithmic factor) and substantially improves upon the previously best-known algorithms which run in time Omega(mnk POLY(1/epsilon)). Furthermore, our algorithm can be modified to allow early termination: if it is terminated after O(beta(m+n)k log(n)) steps for some beta < 1 (which can depend on n), then it returns a solution with approximation factor O(beta). Finally, we show that this runtime is optimal (up to logarithmic factors) for any beta and fixed seed size k.
△ Less
Submitted 21 June, 2016; v1 submitted 4 December, 2012;
originally announced December 2012.
-
The Power of Local Information in Social Networks
Authors:
Christian Borgs,
Michael Brautbar,
Jennifer Chayes,
Sanjeev Khanna,
Brendan Lucier
Abstract:
We study the power of \textit{local information algorithms} for optimization problems on social networks. We focus on sequential algorithms for which the network topology is initially unknown and is revealed only within a local neighborhood of vertices that have been irrevocably added to the output set. The distinguishing feature of this setting is that locality is necessitated by constraints on t…
▽ More
We study the power of \textit{local information algorithms} for optimization problems on social networks. We focus on sequential algorithms for which the network topology is initially unknown and is revealed only within a local neighborhood of vertices that have been irrevocably added to the output set. The distinguishing feature of this setting is that locality is necessitated by constraints on the network information visible to the algorithm, rather than being desirable for reasons of efficiency or parallelizability. In this sense, changes to the level of network visibility can have a significant impact on algorithm design.
We study a range of problems under this model of algorithms with local information. We first consider the case in which the underlying graph is a preferential attachment network. We show that one can find the node of maximum degree in the network in a polylogarithmic number of steps, using an opportunistic algorithm that repeatedly queries the visible node of maximum degree. This addresses an open question of Bollob{รก}s and Riordan. In contrast, local information algorithms require a linear number of queries to solve the problem on arbitrary networks.
Motivated by problems faced by recruiters in online networks, we also consider network coverage problems such as finding a minimum dominating set. For this optimization problem we show that, if each node added to the output set reveals sufficient information about the set's neighborhood, then it is possible to design randomized algorithms for general networks that nearly match the best approximations possible even with full access to the graph structure. We show that this level of visibility is necessary.
We conclude that a network provider's decision of how much structure to make visible to its users can have a significant effect on a user's ability to interact strategically with the network.
△ Less
Submitted 13 October, 2013; v1 submitted 27 February, 2012;
originally announced February 2012.