-
Assessing Large Language Models on Climate Information
Authors:
Jannis Bulian,
Mike S. Schäfer,
Afra Amini,
Heidi Lam,
Massimiliano Ciaramita,
Ben Gaiarin,
Michelle Chen Hübscher,
Christian Buck,
Niels G. Mede,
Markus Leippold,
Nadine Strauß
Abstract:
As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
Submitted 28 May, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Decoding a Neural Retriever's Latent Space for Query Suggestion
Authors:
Leonard Adolphs,
Michelle Chen Huebscher,
Christian Buck,
Sertan Girgin,
Olivier Bachem,
Massimiliano Ciaramita,
Thomas Hofmann
Abstract:
Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation of a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful to understand "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
Submitted 21 October, 2022;
originally announced October 2022.
-
Zero-Shot Retrieval with Search Agents and Hybrid Environments
Authors:
Michelle Chen Huebscher,
Christian Buck,
Massimiliano Ciaramita,
Sascha Rothe
Abstract:
Learning to search is the task of building artificial agents that learn to autonomously use a search box to find information. So far, it has been shown that current language models can learn symbolic query reformulation policies, in combination with traditional term-based retrieval, but fall short of outperforming neural retrievers. We extend the previous learning to search setup to a hybrid environment, which accepts discrete query refinement operations, after a first-pass retrieval step via a dual encoder. Experiments on the BEIR task show that search agents, trained via behavioral cloning, outperform the underlying search system based on a combined dual encoder retriever and cross encoder reranker. Furthermore, we find that simple heuristic Hybrid Retrieval Environments (HRE) can improve baseline performance by several nDCG points. The search agent based on HRE (HARE) matches state-of-the-art performance, balanced in both zero-shot and in-domain evaluations, via interpretable actions, and at twice the speed.
Submitted 29 March, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Boosting Search Engines with Interactive Agents
Authors:
Leonard Adolphs,
Benjamin Boerschinger,
Christian Buck,
Michelle Chen Huebscher,
Massimiliano Ciaramita,
Lasse Espeholt,
Thomas Hofmann,
Yannic Kilcher,
Sascha Rothe,
Pier Giuseppe Sessa,
Lierni Sestorain Saralegui
Abstract:
This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.
Submitted 7 June, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Meta Answering for Machine Reading
Authors:
Benjamin Borschinger,
Jordan Boyd-Graber,
Christian Buck,
Jannis Bulian,
Massimiliano Ciaramita,
Michelle Chen Huebscher,
Wojciech Gajewski,
Yannic Kilcher,
Rodrigo Nogueira,
Lierni Sestorain Saralegui
Abstract:
We investigate a framework for machine reading, inspired by real-world information-seeking problems, where a meta question answering system interacts with a black box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of a meta-answerer. With just a small snippet of text around an answer, humans can outperform the machine reader, improving recall. Similarly, a simple machine meta-answerer outperforms the environment, improving both precision and recall on the Natural Questions dataset. The system relies on joint training of answer scoring and the selection of conditioning information.
Submitted 30 April, 2020; v1 submitted 11 November, 2019;
originally announced November 2019.
-
Controlled Experiments with Student Participants in Software Engineering: Preliminary Results from a Systematic Mapping Study
Authors:
Marian Daun,
Carolin Hübscher,
Thorsten Weyer
Abstract:
[Context] In software engineering research, emphasis is given to sound evaluations of new approaches. While industry surveys or industrial case studies are preferred to evaluate industrial applicability, controlled experiments with student participants are commonly used to determine measurements such as effectiveness and efficiency of a proposed approach. [Objectives] In this paper, we elaborate on the current state of the art of controlled experiments using student participants. As student participants are commonly only reluctantly accepted in scientific communities and threats regarding the generalizability are quite obvious, we want to determine how widespread controlled experiments with student participants are and in which settings they are used. [Methods] This paper reports on a systematic mapping study using high-quality journals and conferences from the software engineering field as data sources. We scanned all papers published between 2010 and 2014 and investigated all papers reporting student experiments in detail. [Results] Of the 2788 papers under investigation, 175 report results from controlled experiments. Of these, 109 (62.29%) were conducted with student participants. Most experiments used undergraduate student participants, recruited students on a voluntary basis, and set them tasks to measure their comprehension. However, many experiments lack information regarding the students' recruitment and other important factors. [Conclusions] In conclusion, student participation in software engineering experiments can be seen as a common evaluation approach. In contrast, there seems to be little knowledge about the threats to validity in student experiments, as major drivers such as the recruitment are not reported at all.
Submitted 15 August, 2017;
originally announced August 2017.