-
Chemical classification program synthesis using generative artificial intelligence
Authors:
Christopher J. Mungall,
Adnan Malik,
Daniel R. Korn,
Justin T. Reese,
Noel M. O'Boyle,
Noel,
Janna Hastings
Abstract:
Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches either rely on manually constructed classification rules, or the use of deep learning methods that lack explainability.
This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO).
We validated our approach against the ChEBI database, and compared our results against state-of-the-art deep learning models. We also demonstrate the use of C3PO to classify out-of-distribution examples taken from metabolomics repositories and natural product databases. We also demonstrate the potential use of our approach to find systematic classification errors in existing chemical databases, and show how an ensemble artificial intelligence approach combining generated ontologies, automated literature search, and multimodal vision models can be used to pinpoint potential errors requiring expert validation.
Submitted 23 May, 2025;
originally announced May 2025.
-
CurateGPT: A flexible language-model assisted biocuration tool
Authors:
Harry Caufield,
Carlo Kroll,
Shawn T O'Neil,
Justin T Reese,
Marcin P Joachimiak,
Harshad Hegde,
Nomi L Harris,
Madan Krishnamurthy,
James A McLaughlin,
Damian Smedley,
Melissa A Haendel,
Peter N Robinson,
Christopher J Mungall
Abstract:
Effective data-driven biomedical discovery requires data curation: a time-consuming process of finding, organizing, distilling, integrating, interpreting, annotating, and validating diverse information into a structured form suitable for databases and knowledge bases. Accurate and efficient curation of these digital assets is critical to ensuring that they are FAIR, trustworthy, and sustainable. Unfortunately, expert curators face significant time and resource constraints. The rapid pace of new information being published daily is exceeding their capacity for curation. Generative AI, exemplified by instruction-tuned large language models (LLMs), has opened up new possibilities for assisting human-driven curation. The design philosophy of agents combines the emerging abilities of generative AI with more precise methods. A curator's tasks can be aided by agents for performing reasoning, searching ontologies, and integrating knowledge across external sources, all tasks that otherwise require extensive manual effort. Our LLM-driven annotation tool, CurateGPT, melds the power of generative AI together with trusted knowledge bases and literature sources. CurateGPT streamlines the curation process, enhancing collaboration and efficiency in common workflows. Compared to direct interaction with an LLM, CurateGPT's agents enable access to information beyond that in the LLM's training data and they provide direct links to the data supporting each claim. This helps curators, researchers, and engineers scale up curation efforts to keep pace with the ever-increasing volume of scientific data.
Submitted 29 October, 2024;
originally announced November 2024.
-
The Vertebrate Breed Ontology: Towards Effective Breed Data Standardization
Authors:
Kathleen R. Mullen,
Imke Tammen,
Nicolas A. Matentzoglu,
Marius Mather,
James P. Balhoff,
Elizabeth Esdaile,
Gregoire Leroy,
Carissa A. Park,
Halie M. Rando,
Nadia T. Saklou,
Tracy L. Webb,
Nicole A. Vasilevsky,
Christopher J. Mungall,
Melissa A. Haendel,
Frank W. Nicholas,
Sabrina Toro
Abstract:
Background: Limited universally-adopted data standards in veterinary medicine hinder data interoperability and therefore integration and comparison; this ultimately impedes the application of existing information-based tools to support advancement in diagnostics, treatments, and precision medicine.
Objectives: A single, coherent, logic-based standard for documenting breed names in health, production, and research-related records will improve data use capabilities in veterinary and comparative medicine.
Methods: The Vertebrate Breed Ontology (VBO) was created from breed names and related information compiled from the Food and Agriculture Organization of the United Nations, breed registries, communities, and experts, using manual and computational approaches. Each breed is represented by a VBO term that includes breed information and provenance as metadata. VBO terms are classified using description logic to allow computational applications and Artificial Intelligence-readiness.
Results: VBO is an open, community-driven ontology representing over 19,500 livestock and companion animal breed concepts covering 49 species. Breeds are classified based on community and expert conventions (e.g., cattle breed) and supported by relations to the breed's genus and species indicated by National Center for Biotechnology Information (NCBI) Taxonomy terms. Relationships between VBO terms (e.g., relating breeds to their foundation stock) provide additional context to support advanced data analytics. VBO term metadata includes synonyms, breed identifiers/codes, and attributed cross-references to other databases.
Conclusion and clinical importance: The adoption of VBO as a source of standard breed names in databases and veterinary electronic health records can enhance veterinary data interoperability and computability.
Submitted 24 January, 2025; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation
Authors:
Oluwamayowa O. Amusat,
Harshad Hegde,
Christopher J. Mungall,
Anna Giannakou,
Neil P. Byers,
Dan Gunter,
Kjiersten Fagnan,
Lavanya Ramakrishnan
Abstract:
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lacks the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining datasets.
In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly-specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
Submitted 8 November, 2023;
originally announced November 2023.
-
Gene Set Summarization using Large Language Models
Authors:
Marcin P. Joachimiak,
J. Harry Caufield,
Nomi L. Harris,
Hyeongsik Kim,
Christopher J. Mungall
Abstract:
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB.
We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval.
We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.
Submitted 3 July, 2024; v1 submitted 20 May, 2023;
originally announced May 2023.
-
KG-Hub -- Building and Exchanging Biological Knowledge Graphs
Authors:
J Harry Caufield,
Tim Putman,
Kevin Schaper,
Deepak R Unni,
Harshad Hegde,
Tiffany J Callahan,
Luca Cappelletti,
Sierra AT Moxon,
Vida Ravanmehr,
Seth Carbon,
Lauren E Chan,
Katherina Cortes,
Kent A Shefchek,
Glass Elsarboukh,
James P Balhoff,
Tommaso Fontana,
Nicolas Matentzoglu,
Richard M Bruskiewich,
Anne E Thessen,
Nomi L Harris,
Monica C Munoz-Torres,
Melissa A Haendel,
Peter N Robinson,
Marcin P Joachimiak,
Christopher J Mungall
, et al. (1 additional author not shown)
Abstract:
Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of knowledge graphs is lacking. Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of knowledge graphs. Features include a simple, modular extract-transform-load (ETL) pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate knowledge graphs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph machine learning, including node embeddings and training of models for link prediction and node classification.
Submitted 31 January, 2023;
originally announced February 2023.
-
Perspectives for self-driving labs in synthetic biology
Authors:
Hector Garcia Martin,
Tijana Radivojevic,
Jeremy Zucker,
Kristofer Bouchard,
Jess Sustarich,
Sean Peisert,
Dan Arnold,
Nathan Hillson,
Gyorgy Babnigg,
Jose Manuel Marti,
Christopher J. Mungall,
Gregg T. Beckham,
Lucas Waldburger,
James Carothers,
ShivShankar Sundaram,
Deb Agarwal,
Blake A. Simmons,
Tyler Backman,
Deepanwita Banerjee,
Deepti Tanjore,
Lavanya Ramakrishnan,
Anup Singh
Abstract:
Self-driving labs (SDLs) combine fully automated experiments with artificial intelligence (AI) that decides the next set of experiments. Taken to their ultimate expression, SDLs could usher in a new paradigm of scientific research, where the world is probed, interpreted, and explained by machines for human benefit. While there are functioning SDLs in the fields of chemistry and materials science, we contend that synthetic biology provides a unique opportunity since the genome provides a single target for affecting the incredibly wide repertoire of biological cell behavior. However, the level of investment required for the creation of biological SDLs is only warranted if directed towards solving difficult and enabling biological questions. Here, we discuss challenges and opportunities in creating SDLs for synthetic biology.
Submitted 1 November, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Creation and unification of development and life stage ontologies for animals
Authors:
Anne Niknejad,
Christopher J. Mungall,
David Osumi-Sutherland,
Marc Robinson-Rechavi,
Frederic B. Bastian
Abstract:
With the new era of genomics, an increasing number of animal species are amenable to large-scale data generation. This has led to the emergence of new multi-species ontologies to annotate and organize these data. While anatomy and cell types are well covered by these efforts, information regarding development and life stages is also critical in the annotation of animal data. Its lack can hamper our ability to answer comparative biology questions and to interpret functional results. We present here a collection of development and life stage ontologies for 21 animal species, and their merging into a common multi-species ontology. This work has allowed the integration and comparison of transcriptomics data in 52 animal species.
Submitted 24 June, 2022;
originally announced June 2022.
-
Guidelines for reporting cell types: the MIRACL standard
Authors:
Tiago Lubiana,
Paola Roncaglia,
Christopher J. Mungall,
Ellen M. Quardokus,
Joshua D. Fortriede,
David Osumi-Sutherland,
Alexander D. Diehl
Abstract:
Cell types are at the root of modern biology, and describing them is a core task of the Human Cell Atlas project. Surprisingly, there are no standards for reporting new cell types, leading to a gap between classes mentioned in biomedical literature and the Cell Ontology, the primary registry of cell types. Here we introduce the Minimal Information Reporting About a CelL (MIRACL) standard, a guideline for describing cell types alongside scientific articles. In a MIRACL sheet, authors organize a label, a diagnostic description, a taxon, an anatomical structure, and a parent cell class for each cell type of interest. The MIRACL standard bridges the gap between wet-lab researchers and ontologists, facilitating the integration of biomedical knowledge into ontologies and artificial intelligence systems.
Submitted 25 May, 2022; v1 submitted 18 April, 2022;
originally announced April 2022.
-
Recommendations for extending the GFF3 specification for improved interoperability of genomic data
Authors:
Surya Saha,
Scott Cain,
Ethalinda K. S. Cannon,
Nathan Dunn,
Andrew Farmer,
Zhi-Liang Hu,
Gareth Maslen,
Sierra Moxon,
Christopher J Mungall,
Rex Nelson,
Monica F. Poelchau
Abstract:
The GFF3 format is a common, flexible tab-delimited format representing the structure and function of genes or other mapped features (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). However, with increasing re-use of annotation data, this flexibility has become an obstacle for standardized downstream processing. Common software packages that export annotations in GFF3 format model the same data and metadata in different notations, which puts the burden on end-users to interpret the data model. The AgBioData consortium is a group of genomics, genetics and breeding databases and partners working towards shared practices and standards. Providing concrete guidelines for generating GFF3, and creating a standard representation of the most common biological data types would provide a major increase in efficiency for AgBioData databases and the genomics research community that uses the GFF3 format in its daily operations. The AgBioData GFF3 working group has developed recommendations to solve common problems in the GFF3 format. We suggest improvements for each of the GFF3 fields, as well as the special cases of modeling functional annotations, and standard protein-coding genes. We welcome further discussion of these recommendations. We ask the genomics and bioinformatics community to use the GitHub repository (https://github.com/NAL-i5K/AgBioData_GFF3_recommendation) to provide feedback via issues or pull requests.
Submitted 15 February, 2022;
originally announced February 2022.
-
GOTaxon: Representing the evolution of biological functions in the Gene Ontology
Authors:
Haiming Tang,
Christopher J Mungall,
Huaiyu Mi,
Paul D Thomas
Abstract:
The Gene Ontology aims to define the universe of functions known for gene products, at the molecular, cellular and organism levels. While the ontology is designed to cover all aspects of biology in a "species independent manner", the fact remains that many if not most biological functions are restricted in their taxonomic range. This is simply because functions evolve, i.e. like other biological characteristics they are gained and lost over evolutionary time. Here we introduce a general method of representing the evolutionary gain and loss of biological functions within the Gene Ontology. We then apply a variety of techniques, including manual curation, logical reasoning over the ontology structure, and previously published "taxon constraints" to assign evolutionary gain and loss events to the majority of terms in the GO. These gain and loss events now almost triple the number of terms with taxon constraints, and currently cover a total of 76% of GO terms, including 40% of molecular function terms, 78% of cellular component terms, and 89% of biological process terms.
Database URL: GOTaxon is freely available at https://github.com/haimingt/GOTaxonConstraint
Submitted 16 February, 2018;
originally announced February 2018.