-
Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation
Authors:
Mete Sertkan,
Sophia Althammer,
Sebastian Hofstätter
Abstract:
In this paper, we introduce Ranger - a toolkit to facilitate the easy use of effect-size-based meta-analysis for multi-task evaluation in NLP and IR. We observed that our communities often face the challenge of aggregating results over incomparable metrics and scenarios, which makes conclusions and take-away messages less reliable. With Ranger, we aim to address this issue by providing a task-agnostic toolkit that combines the effect of a treatment on multiple tasks into one statistical evaluation, allowing for comparison of metrics and computation of an overall summary effect. Our toolkit produces publication-ready forest plots that enable clear communication of evaluation results over multiple tasks. Our goal with the ready-to-use Ranger toolkit is to promote robust, effect-size-based evaluation and improve evaluation standards in the community. We provide two case studies for common IR and NLP settings to highlight Ranger's benefits.
Submitted 24 May, 2023;
originally announced May 2023.
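The idea of combining per-task effects into one summary effect can be illustrated with a minimal sketch. It assumes a fixed-effect, inverse-variance-weighted meta-analysis and a normal-approximation 95% confidence interval. The function name and inputs are hypothetical and are not Ranger's API, and the toolkit itself may use a different (for example random-effects) model.

```python
import math


def fixed_effect_summary(effects, variances):
    """Combine per-task effect sizes into one summary effect.

    Hypothetical helper (not part of Ranger): each task contributes an
    effect size and its sampling variance; tasks with smaller variance
    receive a larger weight (inverse-variance weighting).
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("need one variance per effect, and at least one task")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    summary = sum(w * e for w, e in zip(weights, effects)) / total_weight

    # Standard error of the weighted mean under the fixed-effect model.
    std_err = math.sqrt(1.0 / total_weight)
    ci = (summary - 1.96 * std_err, summary + 1.96 * std_err)
    return summary, ci


# Two tasks with equal variance: the summary is the plain average.
summary, (low, high) = fixed_effect_summary([0.2, 0.4], [0.01, 0.01])
print(summary, low, high)
```

If the confidence interval excludes zero, the treatment shows an overall effect across tasks under these assumptions; a forest plot draws one such interval per task plus the summary.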
-
The Role of Bias in News Recommendation in the Perception of the Covid-19 Pandemic
Authors:
Thomas Elmar Kolb,
Irina Nalis,
Mete Sertkan,
Julia Neidhardt
Abstract:
News recommender systems (NRs) have been shown to shape public discourse and to enforce behaviors that have a critical, oftentimes detrimental effect on democracies. Earlier research on media bias has revealed its strong impact on opinions and preferences. Responsible NRs are supposed to have depolarizing capacities, once they go beyond accuracy measures. We performed sequence prediction using the BERT4Rec algorithm to investigate the interplay of news coverage and user behavior. Based on live data and training on a large data set from one news outlet, "event bursts", the "rally around the flag" effect, and "filter bubbles" were investigated in our interdisciplinary approach between data science and psychology. Potentials for fair NRs that go beyond accuracy measures are outlined via training of the models with a large data set of articles, keywords, and user behavior. The development of the news coverage and user behavior of the COVID-19 pandemic from primarily medical to broader political content and debates was traced. Our study provides first insights for the future development of responsible news recommendation that acknowledges user preferences while stimulating diversity and accountability instead of accuracy only.
Submitted 15 September, 2022;
originally announced September 2022.
-
Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction
Authors:
Sebastian Hofstätter,
Omar Khattab,
Sophia Althammer,
Mete Sertkan,
Allan Hanbury
Abstract:
Recent progress in neural information retrieval has demonstrated large gains in effectiveness, while often sacrificing the efficiency and interpretability of the neural model compared to classical approaches. This paper proposes ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer's reductions dramatically lower ColBERT's storage requirements while simultaneously improving the interpretability of its token-matching scores. To this end, ColBERTer fuses single-vector retrieval, multi-vector refinement, and optional lexical matching components into one model. For its multi-vector component, ColBERTer reduces the number of stored vectors per document by learning unique whole-word representations for the terms in each document and learning to identify and remove word representations that are not essential to effective scoring. We employ an explicit multi-task, multi-stage training to facilitate using very small vector dimensions. Results on the MS MARCO and TREC-DL collection show that ColBERTer can reduce the storage footprint by up to 2.5x, while maintaining effectiveness. With just one dimension per token in its smallest setting, ColBERTer achieves index storage parity with the plaintext size, with very strong effectiveness results. Finally, we demonstrate ColBERTer's robustness on seven high-quality out-of-domain collections, yielding statistically significant gains over traditional retrieval baselines.
Submitted 24 March, 2022;
originally announced March 2022.
-
PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval
Authors:
Sophia Althammer,
Sebastian Hofstätter,
Mete Sertkan,
Suzan Verberne,
Allan Hanbury
Abstract:
Dense passage retrieval (DPR) models show great effectiveness gains in first-stage retrieval for the web domain. However, the web domain is a setting with large amounts of training data and a query-to-passage or a query-to-document retrieval task. In this paper, we investigate dense document-to-document retrieval with limited labelled target data for training, in particular legal case retrieval. In order to use DPR models for document-to-document retrieval, we propose a Paragraph Aggregation Retrieval Model (PARM) which liberates DPR models from their limited input length. PARM retrieves documents on the paragraph level: for each query paragraph, relevant documents are retrieved based on their paragraphs. Then the relevant results per query paragraph are aggregated into one ranked list for the whole query document. For the aggregation we propose vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings. Experimental results show that VRRF outperforms rank-based aggregation strategies for dense document-to-document retrieval with PARM. We compare PARM to document-level retrieval and demonstrate higher retrieval effectiveness of PARM for lexical and dense first-stage retrieval on two different legal case retrieval collections. We investigate how to train the dense retrieval model for PARM on limited target data with labels on the paragraph or the document level. In addition, we analyze the differences between the retrieved results of lexical and dense retrieval with PARM.
Submitted 14 August, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
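The paragraph-level aggregation step can be pictured with plain reciprocal rank fusion (RRF), the rank-based strategy that VRRF builds on. The sketch below is an illustration under stated assumptions: it is not the PARM implementation, it omits the vector-based weighting that distinguishes VRRF, and the function name, the document ids, and the smoothing constant k=60 (a common default for RRF) are choices made here rather than taken from the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Hypothetical helper: each inner list holds the documents retrieved for
    one query paragraph, best first. A document earns 1 / (k + rank) from
    every list it appears in, and documents are sorted by the summed score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One ranked list per query paragraph of the same query document.
per_paragraph_results = [
    ["case_a", "case_b", "case_c"],
    ["case_b", "case_a"],
    ["case_b", "case_d"],
]
print(reciprocal_rank_fusion(per_paragraph_results))
```

Documents retrieved for many query paragraphs accumulate score even when they are never ranked first, which is the behaviour a purely rank-based aggregation provides.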
-
Establishing Strong Baselines for TripClick Health Retrieval
Authors:
Sebastian Hofstätter,
Sophia Althammer,
Mete Sertkan,
Allan Hanbury
Abstract:
We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the - originally too noisy - training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking task of TripClick, which were not achieved with the original baselines. Furthermore, we study the impact of different domain-specific pre-trained models on TripClick. Finally, we show that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures.
Submitted 2 January, 2022;
originally announced January 2022.
-
A Time-Optimized Content Creation Workflow for Remote Teaching
Authors:
Sebastian Hofstätter,
Sophia Althammer,
Mete Sertkan,
Allan Hanbury
Abstract:
We describe our workflow to create an engaging remote learning experience for a university course, while minimizing the post-production time of the educators. We make use of ubiquitous and commonly free services and platforms, so that our workflow is inclusive for all educators and provides polished experiences for students. Our learning materials provide for each lecture: 1) a recorded video, uploaded on YouTube, with exact slide timestamp indices, which enables an enhanced navigation UI; and 2) a high-quality flow-text automated transcript of the narration with proper punctuation and capitalization, improved with a student participation workflow on GitHub. All these results could be created by hand in a time-consuming and costly way. However, this would generally exceed the time available for creating course materials. Our main contribution is to automate the transformation and post-production between raw narrated slides and our published materials with a custom toolchain. Furthermore, we describe our complete workflow: from content creation to transformation and distribution. Our students gave us overwhelmingly positive feedback and especially liked our use of ubiquitous platforms. The most used feature was YouTube's chapter UI, enabled through our automatically generated timestamps. The majority of students who started using the transcripts continued to do so. Every single transcript was corrected by students, with an average word-change of 6%. We conclude from the positive feedback that our enhanced content formats are much appreciated and utilized. Importantly for educators, our low-overhead production workflow was sustainable throughout a busy semester.
Submitted 13 October, 2021; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Authors:
Sebastian Hofstätter,
Sophia Althammer,
Michael Schröder,
Mete Sertkan,
Allan Hanbury
Abstract:
Retrieval and ranking models are the backbone of many applications such as web search, open domain QA, or text-based recommender systems. The latency of neural ranking models at query time is largely dependent on the architecture and deliberate choices by their designers to trade off effectiveness for higher efficiency. This focus on low query latency in a rising number of efficient ranking architectures makes them feasible for production deployment. In machine learning, an increasingly common approach to close the effectiveness gap of more efficient models is to apply knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores in different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT passage ranking architectures. We apply the teachable information as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure of distilling knowledge from state-of-the-art concatenated BERT models to four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot product model). We show that across our evaluated architectures our Margin-MSE knowledge distillation significantly improves re-ranking effectiveness without compromising their efficiency. Additionally, we show that our general distillation method improves nearest-neighbor-based index retrieval with the BERT dot product model, offering competitive results with specialized and much more costly training methods. To benefit the community, we publish the teacher-score training files in a ready-to-use package.
Submitted 22 January, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
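A margin-focused loss sidesteps the differing score magnitudes by comparing score differences rather than raw scores. The sketch below assumes the usual reading of Margin-MSE: for each training triple (query, relevant passage, non-relevant passage), the student's margin between the two passages is regressed onto the teacher's margin with a mean squared error. The abstract does not spell out the formula, so treat the function below, written in plain Python with hypothetical names, as an illustration rather than the authors' code.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list with one score per training triple:
    *_pos for the relevant passage, *_neg for the non-relevant one.
    Only the margin (pos - neg) is compared, so the student is free
    to produce scores on a different scale than the teacher.
    """
    n = len(student_pos)
    if n == 0 or not (n == len(student_neg) == len(teacher_pos) == len(teacher_neg)):
        raise ValueError("all score lists must be non-empty and equally long")

    squared_errors = [
        ((s_pos - s_neg) - (t_pos - t_neg)) ** 2
        for s_pos, s_neg, t_pos, t_neg in zip(
            student_pos, student_neg, teacher_pos, teacher_neg
        )
    ]
    return sum(squared_errors) / n


# Same margin (1.0) at very different magnitudes gives zero loss.
print(margin_mse([2.0], [1.0], [12.0], [11.0]))
```

In a training loop the same expression would be written with a differentiable tensor library so that gradients flow into the student model; the teacher scores are precomputed constants.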
-
Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering
Authors:
Sebastian Hofstätter,
Markus Zlabinger,
Mete Sertkan,
Michael Schröder,
Allan Hanbury
Abstract:
There are many existing retrieval and question answering datasets. However, most of them either focus on ranked list evaluation or single-candidate question answering. This divide makes it challenging to properly evaluate approaches concerned with ranking documents and providing snippets or answers for a given query. In this work, we present FiRA: a novel dataset of Fine-Grained Relevance Annotations. We extend the ranked retrieval annotations of the Deep Learning track of TREC 2019 with passage and word level graded relevance annotations for all relevant documents. We use our newly created data to study the distribution of relevance in long documents, as well as the attention of annotators to specific positions of the text. As an example, we evaluate the recently introduced TKL document ranking model. We find that although TKL exhibits state-of-the-art retrieval results for long documents, it misses many relevant passages.
Submitted 12 August, 2020;
originally announced August 2020.
-
Eliciting Touristic Profiles: A User Study on Picture Collections
Authors:
Mete Sertkan,
Julia Neidhardt,
Hannes Werthner
Abstract:
Eliciting the preferences and needs of tourists is challenging, since people often have difficulty expressing them explicitly, especially in the initial phase of travel planning. Recommender systems employed at the early stage of planning can therefore be very beneficial to the general satisfaction of a user. Previous studies have explored pictures as a tool of communication and as a way to implicitly deduce a traveller's preferences and needs. In this paper, we conduct a user study to verify previous claims and conceptual work on the feasibility of modelling travel interests from a selection of a user's pictures. We utilize fine-tuned convolutional neural networks to compute a vector representation of a picture, where each dimension corresponds to a travel behavioural pattern from the traditional Seven-Factor model. In our study, we followed strict privacy principles and did not save uploaded pictures after computing their vector representation. We aggregate the representations of the pictures of a user into a single user representation, i.e., a touristic profile, using different strategies. In our user study with 81 participants, we let users adjust the predicted touristic profile and confirm the usefulness of our approach. Our results show that, given a collection of pictures, the touristic profile of a user can be determined.
Submitted 9 June, 2020;
originally announced June 2020.
-
DEXA: Supporting Non-Expert Annotators with Dynamic Examples from Experts
Authors:
Markus Zlabinger,
Marta Sabou,
Sebastian Hofstätter,
Mete Sertkan,
Allan Hanbury
Abstract:
The success of crowdsourcing-based annotation of text corpora depends on ensuring that crowdworkers are sufficiently well-trained to perform the annotation task accurately. To that end, a frequent approach to train annotators is to provide instructions and a few example cases that demonstrate how the task should be performed (referred to as the CONTROL approach). These globally defined "task-level examples", however, (i) often only cover the common cases that are encountered during an annotation task; and (ii) require effort from crowdworkers during the annotation process to find the most relevant example for the currently annotated sample. To overcome these limitations, we propose to support workers not only with task-level examples but also with "task-instance level" examples that are semantically similar to the currently annotated data sample (referred to as Dynamic Examples for Annotation, DEXA). Such dynamic examples can be retrieved from collections previously labeled by experts, which are usually available as gold-standard datasets. We evaluate DEXA on a complex task of annotating participants, interventions, and outcomes (known as PIO) in sentences of medical studies. The dynamic examples are retrieved using BioSent2Vec, an unsupervised semantic sentence similarity method specific to the biomedical domain. Results show that (i) workers using the DEXA approach reach on average much higher agreement (Cohen's Kappa) with experts than workers using the CONTROL approach (avg. of 0.68 in DEXA vs. 0.40 in CONTROL); (ii) three annotations of the DEXA approach, aggregated by majority voting, already reach substantial agreement with experts of 0.78/0.75/0.69 for P/I/O (in CONTROL 0.73/0.58/0.46). Finally, (iii) we acquire explicit feedback from workers and show that in the majority of cases (avg. 72%) workers find the dynamic examples useful.
Submitted 17 May, 2020;
originally announced May 2020.
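The agreement figures above are Cohen's Kappa values, which correct the raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of that computation for two label sequences follows; the function name and the toy P/I/O labels are illustrative assumptions, not data or code from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    p_o is the observed fraction of items with identical labels;
    p_e is the agreement expected if both annotators labelled at
    random according to their own label frequencies.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")

    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy example: a crowdworker vs. an expert on four tokens.
worker = ["P", "P", "I", "O"]
expert = ["P", "I", "I", "O"]
print(cohens_kappa(worker, expert))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why 0.68 versus 0.40 marks a substantial difference between the two approaches.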