-
AI-VERDE: A Gateway for Egalitarian Access to Large Language Model-Based Resources For Educational Institutions
Authors:
Paul Mithun,
Enrique Noriega-Atala,
Nirav Merchant,
Edwin Skidmore
Abstract:
We present AI-VERDE, a unified LLM-as-a-platform service designed to facilitate seamless integration of commercial, cloud-hosted, and on-premise open LLMs in academic settings. AI-VERDE streamlines access management for instructional and research groups by providing features such as robust access control, privacy-preserving mechanisms, native Retrieval-Augmented Generation (RAG) support, budget management for third-party LLM services, and both a conversational web interface and API access. In a pilot deployment at a large public university, AI-VERDE demonstrated significant engagement across diverse educational and research groups, enabling activities that would typically require substantial budgets for commercial LLM services with limited user and team management capabilities. To the best of our knowledge, AI-VERDE is the first platform to address both academic and research needs for LLMs within a higher education institutional framework.
Submitted 11 February, 2025;
originally announced February 2025.
-
Variable Extraction for Model Recovery in Scientific Literature
Authors:
Chunwei Liu,
Enrique Noriega-Atala,
Adarsh Pyarelal,
Clayton T Morrison,
Mike Cafarella
Abstract:
The global output of academic publications exceeds 5 million articles per year, making it difficult for humans to keep up with even a tiny fraction of scientific output. We need methods to navigate and interpret the artifacts -- texts, graphs, charts, code, models, and datasets -- that make up the literature. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as ``infection rate ($α$),'' ``recovery rate ($γ$),'' and ``mortality rate ($μ$).'' Variable extraction appears to be a basic task, but plays a pivotal role in recovering models from scientific literature. Once extracted, we can use these variables for automatic mathematical modeling, simulation, and replication of published results.
We introduce a benchmark dataset comprising manually-annotated variable descriptions and variable values extracted from scientific papers. Based on this dataset, we present several baseline methods for variable extraction based on Large Language Models (LLMs) and rule-based information extraction systems. Our analysis shows that LLM-based solutions perform the best. Despite the incremental benefits of combining rule-based extraction outputs with LLMs, the leap in performance attributed to the transfer-learning and instruction-tuning capabilities of LLMs themselves is far more significant. This investigation demonstrates the potential of LLMs to enhance automatic comprehension of scientific artifacts and for automatic model recovery and simulation.
Submitted 21 November, 2024;
originally announced November 2024.
-
When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context
Authors:
Enrique Noriega-Atala,
Robert Vacareanu,
Salena Torres Ashton,
Adarsh Pyarelal,
Clayton T. Morrison,
Mihai Surdeanu
Abstract:
We introduce a neural architecture fine-tuned for the task of scenario context generation: the relevant location and time of an event or entity mentioned in text. Contextualizing information extraction helps to scope the validity of automated findings when aggregating them as knowledge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an encoder-decoder architecture. We also explored the use of data augmentation techniques during training. Our findings suggest that a relatively small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers at accurately predicting the relevant scenario information of a particular entity or event.
Submitted 20 October, 2024; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Learning Open Domain Multi-hop Search Using Reinforcement Learning
Authors:
Enrique Noriega-Atala,
Mihai Surdeanu,
Clayton T. Morrison
Abstract:
We propose a method to teach an automated agent to learn how to search for multi-hop paths of relations between entities in an open domain. The method learns a policy for directing existing information retrieval and machine reading resources to focus on relevant regions of a corpus. The approach formulates the learning problem as a Markov decision process with a state representation that encodes the dynamics of the search process and a reward structure that minimizes the number of documents that must be processed while still finding multi-hop paths. We implement the method in an actor-critic reinforcement learning algorithm and evaluate it on a dataset of search problems derived from a subset of English Wikipedia. The algorithm finds a family of policies that succeeds in extracting the desired information while processing fewer documents compared to several baseline heuristic algorithms.
Submitted 30 May, 2022;
originally announced May 2022.
-
Neural Architectures for Biological Inter-Sentence Relation Extraction
Authors:
Enrique Noriega-Atala,
Peter M. Lovett,
Clayton T. Morrison,
Mihai Surdeanu
Abstract:
We introduce a family of deep-learning architectures for inter-sentence relation extraction, i.e., relations where the participants are not necessarily in the same sentence. We apply these architectures to an important use case in the biomedical domain: assigning biological context to biochemical events. In this work, biological context is defined as the type of biological system within which the biochemical event is observed. The neural architectures encode and aggregate multiple occurrences of the same candidate context mention to determine whether it is the correct context for a particular event mention. We propose two broad types of architectures: the first type aggregates multiple instances that correspond to the same candidate context with respect to an event mention before emitting a classification; the second type independently classifies each instance and uses the results to vote for the final class, akin to an ensemble approach. Our experiments show that the proposed neural classifiers are competitive, and some achieve better performance than previous state-of-the-art traditional machine learning methods without the need for feature engineering. Our analysis shows that the neural methods particularly improve precision compared to traditional machine learning classifiers and also demonstrates how the difficulty of inter-sentence relation extraction increases as the distance between the event and context mentions increases.
Submitted 16 December, 2021;
originally announced December 2021.
-
Inter-sentence Relation Extraction for Associating Biological Context with Events in Biomedical Texts
Authors:
Enrique Noriega-Atala,
Paul D. Hein,
Shraddha S. Thumsi,
Zechy Wong,
Xia Wang,
Clayton T. Morrison
Abstract:
We present an analysis of the problem of identifying biological context and associating it with biochemical events in biomedical texts. This constitutes a non-trivial, inter-sentential relation extraction task. We focus on biological context as descriptions of the species, tissue type and cell type that are associated with biochemical events. We describe the properties of an annotated corpus of context-event relations and present and evaluate several classifiers for context-event association trained on syntactic, distance and frequency features.
Submitted 14 December, 2018;
originally announced December 2018.
-
Learning what to read: Focused machine reading
Authors:
Enrique Noriega-Atala,
Marco A. Valenzuela-Escarcega,
Clayton T. Morrison,
Mihai Surdeanu
Abstract:
Recent efforts in bioinformatics have achieved tremendous progress in the machine reading of biomedical literature, and the assembly of the extracted biochemical interactions into large-scale models such as protein signaling pathways. However, batch machine reading of literature at today's scale (PubMed alone indexes over 1 million papers per year) is infeasible due to both cost and processing overhead. In this work, we introduce a focused reading approach that guides the machine reading of biomedical literature toward the documents that should be read to answer a biomedical query as efficiently as possible. We introduce a family of algorithms for focused reading, including an intuitive, strong baseline, and a second approach which uses a reinforcement learning (RL) framework that learns when to explore (widen the search) or exploit (narrow it). We demonstrate that the RL approach is capable of answering more queries than the baseline, while being more efficient, i.e., reading fewer documents.
Submitted 1 September, 2017;
originally announced September 2017.