-
CUB: Benchmarking Context Utilisation Techniques for Language Models
Authors:
Lovisa Hagström,
Youna Kim,
Haeun Yu,
Sang-goo Lee,
Richard Johansson,
Hyunsoo Cho,
Isabelle Augenstein
Abstract:
Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been pro…
▽ More
Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
Language Model Re-rankers are Fooled by Lexical Similarities
Authors:
Lovisa Hagström,
Ercong Nie,
Ruben Halifa,
Helmut Schmid,
Richard Johansson,
Alexander Junge
Abstract:
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on…
▽ More
Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
△ Less
Submitted 24 June, 2025; v1 submitted 24 February, 2025;
originally announced February 2025.
-
A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Authors:
Lovisa Hagström,
Sara Vera Marjanović,
Haeun Yu,
Arnav Arora,
Christina Lioma,
Maria Maistro,
Pepa Atanasova,
Isabelle Augenstein
Abstract:
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context has been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with re…
▽ More
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context has been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
△ Less
Submitted 29 May, 2025; v1 submitted 22 December, 2024;
originally announced December 2024.
-
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Authors:
Denitsa Saynova,
Lovisa Hagström,
Moa Johansson,
Richard Johansson,
Marco Kuhlmann
Abstract:
Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no difference is made between whether the prediction was based on knowin…
▽ More
Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe - PrISM - for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.
△ Less
Submitted 1 July, 2025; v1 submitted 18 October, 2024;
originally announced October 2024.
-
The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models
Authors:
Lovisa Hagström,
Denitsa Saynova,
Tobias Norlund,
Moa Johansson,
Richard Johansson
Abstract:
Large Language Models (LLMs) make natural interfaces to factual knowledge, but their usefulness is limited by their tendency to deliver inconsistent answers to semantically equivalent questions. For example, a model might predict both "Anne Redpath passed away in Edinburgh." and "Anne Redpath's life ended in London." In this work, we identify potential causes of inconsistency and evaluate the effe…
▽ More
Large Language Models (LLMs) make natural interfaces to factual knowledge, but their usefulness is limited by their tendency to deliver inconsistent answers to semantically equivalent questions. For example, a model might predict both "Anne Redpath passed away in Edinburgh." and "Anne Redpath's life ended in London." In this work, we identify potential causes of inconsistency and evaluate the effectiveness of two mitigation strategies: up-scaling and augmenting the LM with a retrieval corpus. Our results on the LLaMA and Atlas models show that both strategies reduce inconsistency while retrieval augmentation is considerably more efficient. We further consider and disentangle the consistency contributions of different components of Atlas. For all LMs evaluated we find that syntactical form and other evaluation task artifacts impact consistency. Taken together, our results provide a better understanding of the factors affecting the factual consistency of language models.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Authors:
Lovisa Hagström,
Richard Johansson
Abstract:
Current language models have been criticised for learning language from text alone without connection between words and their meaning. Consequently, multimodal training has been proposed as a way for creating models with better language understanding by providing the lacking connection. We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on t…
▽ More
Current language models have been criticised for learning language from text alone without connection between words and their meaning. Consequently, multimodal training has been proposed as a way for creating models with better language understanding by providing the lacking connection. We focus on pre-trained multimodal vision-and-language (VL) models for which there already are some results on their language understanding capabilities. An unresolved issue with evaluating the linguistic skills of these models, however, is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty. To find the best approach, we investigate and compare seven possible methods for adapting three different pre-trained VL models to text-only input. Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be put into adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks. We also find that the adaptation methods perform differently for different models and that unimodal model counterparts perform on par with the VL models regardless of adaptation, indicating that current VL models do not necessarily gain better language understanding from their multimodal training.
△ Less
Submitted 19 September, 2022;
originally announced September 2022.
-
What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge
Authors:
Lovisa Hagström,
Richard Johansson
Abstract:
There are limitations in learning language from text alone. Therefore, recent focus has been on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training. We hypothesize that training on a visual modality should improve on the visual commonsense knowledge in language models. Therefore, we introduce two evaluation…
▽ More
There are limitations in learning language from text alone. Therefore, recent focus has been on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training. We hypothesize that training on a visual modality should improve on the visual commonsense knowledge in language models. Therefore, we introduce two evaluation tasks for measuring visual commonsense knowledge in language models and use them to evaluate different multimodal models and unimodal baselines. Primarily, we find that the visual commonsense knowledge is not significantly different between the multimodal models and unimodal baseline models trained on visual text data.
△ Less
Submitted 14 May, 2022;
originally announced May 2022.
-
Writer Recognition Using Off-line Handwritten Single Block Characters
Authors:
Adrian Leo Hagström,
Rustam Stanikzai,
Josef Bigun,
Fernando Alonso-Fernandez
Abstract:
Block characters are often used when filling paper forms for a variety of purposes. We investigate if there is biometric information contained within individual digits of handwritten text. In particular, we use personal identity numbers consisting of the six digits of the date of birth, DoB. We evaluate two recognition approaches, one based on handcrafted features that compute contour directional…
▽ More
Block characters are often used when filling paper forms for a variety of purposes. We investigate if there is biometric information contained within individual digits of handwritten text. In particular, we use personal identity numbers consisting of the six digits of the date of birth, DoB. We evaluate two recognition approaches, one based on handcrafted features that compute contour directional measurements, and another based on deep features from a ResNet50 model. We use a self-captured database of 317 individuals and 4920 written DoBs in total. Results show the presence of identity-related information in a piece of handwritten information as small as six digits with the DoB. We also analyze the impact of the amount of enrolment samples, varying its number between one and ten. Results with such small amount of data are promising. With ten enrolment samples, the Top-1 accuracy with deep features is around 94%, and reaches nearly 100% by Top-10. The verification accuracy is more modest, with EER>20%with any given feature and enrolment set size, showing that there is still room for improvement.
△ Less
Submitted 7 March, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
Authors:
Tobias Norlund,
Lovisa Hagström,
Richard Johansson
Abstract:
Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complements the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large langua…
▽ More
Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complements the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps, 1) a novel task querying for knowledge of memory colors, i.e. typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.
△ Less
Submitted 30 September, 2021; v1 submitted 23 September, 2021;
originally announced September 2021.