-
As easy as PIE: understanding when pruning causes language models to disagree
Authors:
Pietro Tropeano,
Maria Maistro,
Tuukka Ruotsalo,
Christina Lioma
Abstract:
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when…
▽ More
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Joint Evaluation of Fairness and Relevance in Recommender Systems with Pareto Frontier
Authors:
Theresia Veronika Rampisela,
Tuukka Ruotsalo,
Maria Maistro,
Christina Lioma
Abstract:
Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different…
▽ More
Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different best models: one for fairness and another for relevance. Approach (ii) is also problematic because these measures tend to be ad-hoc and do not relate well to traditional relevance measures, like NDCG. Motivated by this, we present a new approach for jointly evaluating fairness and relevance in RSs: Distance to Pareto Frontier (DPFR). Given some user-item interaction data, we compute their Pareto frontier for a pair of existing relevance and fairness measures, and then use the distance from the frontier as a measure of the jointly achievable fairness and relevance. Our approach is modular and intuitive as it can be computed with existing measures. Experiments with 4 RS models, 3 re-ranking strategies, and 6 datasets show that existing metrics have inconsistent associations with our Pareto-optimal solution, making DPFR a more robust and theoretically well-founded joint measure for assessing fairness and relevance. Our code: https://github.com/theresiavr/DPFR-recsys-evaluation
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Are Representation Disentanglement and Interpretability Linked in Recommendation Models? A Critical Review and Reproducibility Study
Authors:
Ervin Dervishaj,
Tuukka Ruotsalo,
Maria Maistro,
Christina Lioma
Abstract:
Unsupervised learning of disentangled representations has been closely tied to enhancing the representation intepretability of Recommender Systems (RSs). This has been achieved by making the representation of individual features more distinctly separated, so that it is easier to attribute the contribution of features to the model's predictions. However, such advantages in interpretability and feat…
▽ More
Unsupervised learning of disentangled representations has been closely tied to enhancing the representation intepretability of Recommender Systems (RSs). This has been achieved by making the representation of individual features more distinctly separated, so that it is easier to attribute the contribution of features to the model's predictions. However, such advantages in interpretability and feature attribution have mainly been explored qualitatively. Moreover, the effect of disentanglement on the model's recommendation performance has been largely overlooked. In this work, we reproduce the recommendation performance, representation disentanglement and representation interpretability of five well-known recommendation models on four RS datasets. We quantify disentanglement and investigate the link of disentanglement with recommendation effectiveness and representation interpretability. While several existing work in RSs have proposed disentangled representations as a gateway to improved effectiveness and interpretability, our findings show that disentanglement is not necessarily related to effectiveness but is closely related to representation interpretability. Our code and results are publicly available at https://github.com/edervishaj/disentanglement-interpretability-recsys.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Authors:
Lovisa Hagström,
Sara Vera Marjanović,
Haeun Yu,
Arnav Arora,
Christina Lioma,
Maria Maistro,
Pepa Atanasova,
Isabelle Augenstein
Abstract:
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context has been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with re…
▽ More
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context has been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
△ Less
Submitted 29 May, 2025; v1 submitted 22 December, 2024;
originally announced December 2024.
-
Joint Extraction and Classification of Danish Competences for Job Matching
Authors:
Qiuchi Li,
Christina Lioma
Abstract:
The matching of competences, such as skills, occupations or knowledges, is a key desiderata for candidates to be fit for jobs. Automatic extraction of competences from CVs and Jobs can greatly promote recruiters' productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competence from Danish job postings. Different from…
▽ More
The matching of competences, such as skills, occupations or knowledges, is a key desiderata for candidates to be fit for jobs. Automatic extraction of competences from CVs and Jobs can greatly promote recruiters' productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competence from Danish job postings. Different from existing works on skill extraction and skill classification, our model is trained on a large volume of annotated Danish corpora and is capable of extracting a wide range of Danish competences, including skills, occupations and knowledges of different categories. More importantly, as a single BERT-like architecture for joint extraction and classification, our model is lightweight and efficient at inference. On a real-scenario job matching dataset, our model beats the state-of-the-art models in the overall performance of Danish competence extraction and classification, and saves over 50% time at inference.
△ Less
Submitted 29 October, 2024;
originally announced October 2024.
-
DYNAMICQA: Tracing Internal Knowledge Conflicts in Language Models
Authors:
Sara Vera Marjanović,
Haeun Yu,
Pepa Atanasova,
Maria Maistro,
Christina Lioma,
Isabelle Augenstein
Abstract:
Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. However, conflicting knowledge can be present in the LM's parameters, termed intra-memory conflict, which can affect a model's propensity to accept contextual knowledge. To study the effect of intra-memory conf…
▽ More
Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. However, conflicting knowledge can be present in the LM's parameters, termed intra-memory conflict, which can affect a model's propensity to accept contextual knowledge. To study the effect of intra-memory conflict on an LM's ability to accept relevant context, we utilize two knowledge conflict measures and a novel dataset containing inherently conflicting data, DynamicQA. This dataset includes facts with a temporal dynamic nature where facts can change over time and disputable dynamic facts, which can change depending on the viewpoint. DynamicQA is the first to include real-world knowledge conflicts and provide context to study the link between the different types of knowledge conflicts. We also evaluate several measures on their ability to reflect the presence of intra-memory conflict: semantic entropy and a novel coherent persuasion score. With our extensive experiments, we verify that LMs exhibit a greater degree of intra-memory conflict with dynamic facts compared to facts that have a single truth value. Furthermore, we reveal that facts with intra-memory conflict are harder to update with context, suggesting that retrieval-augmented generation will struggle with the most commonly adapted facts.
△ Less
Submitted 7 October, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
Can We Trust Recommender System Fairness Evaluation? The Role of Fairness and Relevance
Authors:
Theresia Veronika Rampisela,
Tuukka Ruotsalo,
Maria Maistro,
Christina Lioma
Abstract:
Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and…
▽ More
Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We empirically study for the first time the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank position changes than relevance- and fairness-only measures, meaning that they are less granular than traditional RS measures; and iii) tend to compress scores at the low end of their range, meaning that they are not very expressive. We counter the above limitations with a set of guidelines on the appropriate usage of such measures, i.e., they should be used with caution due to their tendency to contradict each other and of having a very small empirical range.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Recommending Target Actions Outside Sessions in the Data-poor Insurance Domain
Authors:
Simone Borg Bruun,
Christina Lioma,
Maria Maistro
Abstract:
Providing personalized recommendations for insurance products is particularly challenging due to the intrinsic and distinctive features of the insurance domain. First, unlike more traditional domains like retail, movie etc., a large amount of user feedback is not available and the item catalog is smaller. Second, due to the higher complexity of products, the majority of users still prefer to compl…
▽ More
Providing personalized recommendations for insurance products is particularly challenging due to the intrinsic and distinctive features of the insurance domain. First, unlike more traditional domains like retail, movie etc., a large amount of user feedback is not available and the item catalog is smaller. Second, due to the higher complexity of products, the majority of users still prefer to complete their purchases over the phone instead of online. We present different recommender models to address such data scarcity in the insurance domain. We use recurrent neural networks with 3 different types of loss functions and architectures (cross-entropy, censored Weibull, attention). Our models cope with data scarcity by learning from multiple sessions and different types of user actions. Moreover, differently from previous session-based models, our models learn to predict a target action that does not happen within the session. Our models outperform state-of-the-art baselines on a real-world insurance dataset, with ca. 44K users, 16 items, 54K purchases and 117K sessions. Moreover, combining our models with demographic data boosts the performance. Analysis shows that considering multiple sessions and several types of actions are both beneficial for the models, and that our models are not unfair with respect to age, gender and income.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
Query Augmentation by Decoding Semantics from Brain Signals
Authors:
Ziyi Ye,
Jingtao Zhan,
Qingyao Ai,
Yiqun Liu,
Maarten de Rijke,
Christina Lioma,
Tuukka Ruotsalo
Abstract:
Query augmentation is a crucial technique for refining semantically imprecise queries. Traditionally, query augmentation relies on extracting information from initially retrieved, potentially relevant documents. If the quality of the initially retrieved documents is low, then the effectiveness of query augmentation would be limited as well. We propose Brain-Aug, which enhances a query by incorpora…
▽ More
Query augmentation is a crucial technique for refining semantically imprecise queries. Traditionally, query augmentation relies on extracting information from initially retrieved, potentially relevant documents. If the quality of the initially retrieved documents is low, then the effectiveness of query augmentation would be limited as well. We propose Brain-Aug, which enhances a query by incorporating semantic information decoded from brain signals. BrainAug generates the continuation of the original query with a prompt constructed with brain signal information and a ranking-oriented inference approach. Experimental results on fMRI (functional magnetic resonance imaging) datasets show that Brain-Aug produces semantically more accurate queries, leading to improved document ranking performance. Such improvement brought by brain signals is particularly notable for ambiguous queries.
△ Less
Submitted 3 March, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Investigating the Impact of Model Instability on Explanations and Uncertainty
Authors:
Sara Vera Marjanović,
Isabelle Augenstein,
Christina Lioma
Abstract:
Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have…
▽ More
Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn't necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models.
△ Less
Submitted 4 June, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
Digital-analog hybrid matrix multiplication processor for optical neural networks
Authors:
Xiansong Meng,
Deming Kong,
Kwangwoong Kim,
Qiuchi Li,
Po Dong,
Ingemar J. Cox,
Christina Lioma,
Hao Hu
Abstract:
The computational demands of modern AI have spurred interest in optical neural networks (ONNs) which offer the potential benefits of increased speed and lower power consumption. However, current ONNs face various challenges,most significantly a limited calculation precision (typically around 4 bits) and the requirement for high-resolution signal format converters (digital-to-analogue conversions (…
▽ More
The computational demands of modern AI have spurred interest in optical neural networks (ONNs) which offer the potential benefits of increased speed and lower power consumption. However, current ONNs face various challenges,most significantly a limited calculation precision (typically around 4 bits) and the requirement for high-resolution signal format converters (digital-to-analogue conversions (DACs) and analogue-to-digital conversions (ADCs)). These challenges are inherent to their analog computing nature and pose significant obstacles in practical implementation. Here, we propose a digital-analog hybrid optical computing architecture for ONNs, which utilizes digital optical inputs in the form of binary words. By introducing the logic levels and decisions based on thresholding, the calculation precision can be significantly enhanced. The DACs for input data can be removed and the resolution of the ADCs can be greatly reduced. This can increase the operating speed at a high calculation precision and facilitate the compatibility with microelectronics. To validate our approach, we have fabricated a proof-of-concept photonic chip and built up a hybrid optical processor (HOP) system for neural network applications. We have demonstrated an unprecedented 16-bit calculation precision for high-definition image processing, with a pixel error rate (PER) as low as $1.8\times10^{-3}$ at an signal-to-noise ratio (SNR) of 18.2 dB. We have also implemented a convolutional neural network for handwritten digit recognition that shows the same accuracy as the one achieved by a desktop computer. The concept of the digital-analog hybrid optical computing architecture offers a methodology that could potentially be applied to various ONN implementations and may intrigue new research into efficient and accurate domain-specific optical computing architectures for neural networks.
△ Less
Submitted 26 January, 2024;
originally announced January 2024.
-
Language Generation from Brain Recordings
Authors:
Ziyi Ye,
Qingyao Ai,
Yiqun Liu,
Maarten de Rijke,
Min Zhang,
Christina Lioma,
Tuukka Ruotsalo
Abstract:
Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. Currently, however, generating language via BCIs has been previously successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical sema…
▽ More
Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. Currently, however, generating language via BCIs has been previously successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical semantic representation. Inspired by recent research that revealed associations between the brain and the large computational language models, we propose a generative language BCI that utilizes the capacity of a large language model (LLM) jointly with a semantic brain decoder to directly generate language from functional magnetic resonance imaging (fMRI) input. The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli perceived, without prior knowledge of any pre-generated candidates. We compare the language generated from the presented model with a random control, pre-generated language selection approach, and a standard LLM, which generates common coherent text solely based on the next word likelihood according to statistical language training data. The proposed model is found to generate language that is more aligned with semantic stimulus in response to which brain input is sampled. Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
△ Less
Submitted 11 March, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study
Authors:
Theresia Veronika Rampisela,
Maria Maistro,
Tuukka Ruotsalo,
Christina Lioma
Abstract:
Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are…
▽ More
Fairness is an emerging and challenging topic in recommender systems. In recent years, various ways of evaluating and therefore improving fairness have emerged. In this study, we examine existing evaluation measures of fairness in recommender systems. Specifically, we focus solely on exposure-based fairness measures of individual items that aim to quantify the disparity in how individual items are recommended to users, separate from item relevance to users. We gather all such measures and we critically analyse their theoretical properties. We identify a series of limitations in each of them, which collectively may render the affected measures hard or impossible to interpret, to compute, or to use for comparing recommendations. We resolve these limitations by redefining or correcting the affected measures, or we argue why certain limitations cannot be resolved. We further perform a comprehensive empirical analysis of both the original and our corrected versions of these fairness measures, using real-world and synthetic datasets. Our analysis provides novel insights into the relationship between measures based on different fairness concepts, and different levels of measure sensitivity and strictness. We conclude with practical suggestions of which fairness measures should be used and when. Our code is publicly available. To our knowledge, this is the first critical comparison of individual item fairness measures in recommender systems.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
Faithfulness Tests for Natural Language Explanations
Authors:
Pepa Atanasova,
Oana-Maria Camburu,
Christina Lioma,
Thomas Lukasiewicz,
Jakob Grue Simonsen,
Isabelle Augenstein
Abstract:
Explanations of neural models aim to reveal a model's decision-making process for its predictions. However, recent work shows that current methods giving explanations such as saliency maps or counterfactuals can be misleading, as they are prone to present reasons that are unfaithful to the model's inner workings. This work explores the challenging question of evaluating the faithfulness of natural…
▽ More
Explanations of neural models aim to reveal a model's decision-making process for its predictions. However, recent work shows that current methods giving explanations such as saliency maps or counterfactuals can be misleading, as they are prone to present reasons that are unfaithful to the model's inner workings. This work explores the challenging question of evaluating the faithfulness of natural language explanations (NLEs). To this end, we present two tests. First, we propose a counterfactual input editor for inserting reasons that lead to counterfactual predictions but are not reflected by the NLEs. Second, we reconstruct inputs from the reasons stated in the generated NLEs and check how often they lead to the same predictions. Our tests can evaluate emerging NLE models, proving a fundamental tool in the development of faithful NLEs.
△ Less
Submitted 30 June, 2023; v1 submitted 29 May, 2023;
originally announced May 2023.
-
Adapting Pre-trained Language Models for Quantum Natural Language Processing
Authors:
Qiuchi Li,
Benyou Wang,
Yudong Zhu,
Christina Lioma,
Qun Liu
Abstract:
The emerging classical-quantum transfer learning paradigm has brought a decent performance to quantum computational models in many tasks, such as computer vision, by enabling a combination of quantum models and classical pre-trained neural networks. However, using quantum computing with pre-trained models has yet to be explored in natural language processing (NLP). Due to the high linearity constr…
▽ More
The emerging classical-quantum transfer learning paradigm has brought a decent performance to quantum computational models in many tasks, such as computer vision, by enabling a combination of quantum models and classical pre-trained neural networks. However, using quantum computing with pre-trained models has yet to be explored in natural language processing (NLP). Due to the high linearity constraints of the underlying quantum computing infrastructures, existing Quantum NLP models are limited in performance on real tasks. We fill this gap by pre-training a sentence state with complex-valued BERT-like architecture, and adapting it to the classical-quantum transfer learning scheme for sentence classification. On quantum simulation experiments, the pre-trained representation can bring 50\% to 60\% increases to the capacity of end-to-end quantum models.
△ Less
Submitted 24 February, 2023;
originally announced February 2023.
-
Graph-based Recommendation for Sparse and Heterogeneous User Interactions
Authors:
Simone Borg Bruun,
Kacper Kenji Lesniak,
Mirko Biasini,
Vittorio Carmignani,
Panagiotis Filianos,
Christina Lioma,
Maria Maistro
Abstract:
Recommender system research has oftentimes focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and i…
▽ More
Recommender system research has oftentimes focused on approaches that operate on large-scale datasets containing millions of user interactions. However, many small businesses struggle to apply state-of-the-art models due to their very limited availability of data. We propose a graph-based recommender model which utilizes heterogeneous interactions between users and content of different types and is able to operate well on small-scale datasets. A genetic algorithm is used to find optimal weights that represent the strength of the relationship between users and content. Experiments on two real-world datasets (which we make available to the research community) show promising results (up to 7% improvement), in comparison with other state-of-the-art methods for low-data environments. These improvements are statistically significant and consistent across different data samples.
△ Less
Submitted 26 January, 2023;
originally announced January 2023.
-
Template-based Recruitment Email Generation For Job Recommendation
Authors:
Qiuchi Li,
Christina Lioma
Abstract:
Text generation has long been a popular research topic in NLP. However, the task of generating recruitment emails from recruiters to candidates in the job recommendation scenario has received little attention by the research community. This work aims at defining the topic of automatic email generation for job recommendation, identifying the challenges, and providing a baseline template-based solut…
▽ More
Text generation has long been a popular research topic in NLP. However, the task of generating recruitment emails from recruiters to candidates in the job recommendation scenario has received little attention by the research community. This work aims at defining the topic of automatic email generation for job recommendation, identifying the challenges, and providing a baseline template-based solution for Danish jobs. Evaluation by human experts shows that our method is effective. We wrap up by discussing the future research directions for better solving this task.
△ Less
Submitted 6 December, 2022;
originally announced December 2022.
-
Principled Multi-Aspect Evaluation Measures of Rankings
Authors:
Maria Maistro,
Lucas Chaves Lima,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) using a single measure (multi-aspect evaluation).…
▽ More
Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) using a single measure (multi-aspect evaluation). However, these methods either are (i) tailor-made for specific aspects and do not extend to other types or numbers of aspects, or (ii) have theoretical anomalies, e.g. assign maximum score to a ranking where all documents are labelled with the lowest grade with respect to all aspects (e.g., not relevant, not credible, etc.).
We present a theoretically principled multi-aspect evaluation method that can be used for any number, and any type, of aspects. A thorough empirical evaluation using up to 5 aspects and a total of 425 runs officially submitted to 10 TREC tracks shows that our method is more discriminative than the state-of-the-art and overcomes theoretical limitations of the state-of-the-art.
△ Less
Submitted 1 December, 2022;
originally announced December 2022.
-
Learning Recommendations from User Actions in the Item-poor Insurance Domain
Authors:
Simone Borg Bruun,
Maria Maistro,
Christina Lioma
Abstract:
While personalised recommendations are successful in domains like retail, where large volumes of user feedback on items are available, the generation of automatic recommendations in data-sparse domains, like insurance purchasing, is an open problem. The insurance domain is notoriously data-sparse because the number of products is typically low (compared to retail) and they are usually purchased to…
▽ More
While personalised recommendations are successful in domains like retail, where large volumes of user feedback on items are available, the generation of automatic recommendations in data-sparse domains, like insurance purchasing, is an open problem. The insurance domain is notoriously data-sparse because the number of products is typically low (compared to retail) and they are usually purchased to last for a long time. Also, many users still prefer the telephone over the web for purchasing products, reducing the amount of web-logged user interactions. To address this, we present a recurrent neural network recommendation model that uses past user sessions as signals for learning recommendations. Learning from past user sessions allows dealing with the data scarcity of the insurance domain. Specifically, our model learns from several types of user actions that are not always associated with items, and unlike all prior session-based recommendation models, it models relationships between input sessions and a target action (purchasing insurance) that does not take place within the input sessions. Evaluation on a real-world dataset from the insurance domain (ca. 44K users, 16 items, 54K purchases, and 117K sessions) against several state-of-the-art baselines shows that our model outperforms the baselines notably. Ablation analysis shows that this is mainly due to the learning of dependencies across sessions in our model. We contribute the first ever session-based model for insurance recommendation, and make available our dataset to the research community.
△ Less
Submitted 28 November, 2022;
originally announced November 2022.
-
Fact Checking with Insufficient Evidence
Authors:
Pepa Atanasova,
Jakob Grue Simonsen,
Christina Lioma,
Isabelle Augenstein
Abstract:
Automating the fact checking (FC) process relies on information obtained from external sources. In this work, we posit that it is crucial for FC models to make veracity predictions only when there is sufficient evidence and otherwise indicate when it is not enough. To this end, we are the first to study what information FC models consider sufficient by introducing a novel task and advancing it wit…
▽ More
Automating the fact checking (FC) process relies on information obtained from external sources. In this work, we posit that it is crucial for FC models to make veracity predictions only when there is sufficient evidence and otherwise indicate when it is not enough. To this end, we are the first to study what information FC models consider sufficient by introducing a novel task and advancing it with three main contributions. First, we conduct an in-depth empirical analysis of the task with a new fluency-preserving method for omitting information from the evidence at the constituent and sentence level. We identify when models consider the remaining evidence (in)sufficient for FC, based on three trained models with different Transformer architectures and three FC datasets. Second, we ask annotators whether the omitted evidence was important for FC, resulting in a novel diagnostic dataset, SufficientFacts, for FC with omitted evidence. We find that models are least successful in detecting missing evidence when adverbial modifiers are omitted (21% accuracy), whereas it is easiest for omitted date modifiers (63% accuracy). Finally, we propose a novel data augmentation strategy for contrastive self-learning of missing evidence by employing the proposed omission method combined with tri-training. It improves performance for Evidence Sufficiency Prediction by up to 17.8 F1 score, which in turn improves FC performance by up to 2.6 F1 score.
△ Less
Submitted 5 April, 2022;
originally announced April 2022.
-
Diagnostics-Guided Explanation Generation
Authors:
Pepa Atanasova,
Jakob Grue Simonsen,
Christina Lioma,
Isabelle Augenstein
Abstract:
Explanations shed light on a machine learning model's rationales and can aid in identifying deficiencies in its reasoning process. Explanation generation models are typically trained in a supervised way given human explanations. When such annotations are not available, explanations are often selected as those portions of the input that maximise a downstream task's performance, which corresponds to…
▽ More
Explanations shed light on a machine learning model's rationales and can aid in identifying deficiencies in its reasoning process. Explanation generation models are typically trained in a supervised way given human explanations. When such annotations are not available, explanations are often selected as those portions of the input that maximise a downstream task's performance, which corresponds to optimising an explanation's Faithfulness to a given model. Faithfulness is one of several so-called diagnostic properties, which prior work has identified as useful for gauging the quality of an explanation without requiring annotations. Other diagnostic properties are Data Consistency, which measures how similar explanations are for similar input instances, and Confidence Indication, which shows whether the explanation reflects the confidence of the model. In this work, we show how to directly optimise for these diagnostic properties when training a model to generate sentence-level explanations, which markedly improves explanation quality, agreement with human rationales, and downstream task performance on three complex reasoning tasks.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.
-
Unsupervised Multi-Index Semantic Hashing
Authors:
Christian Hansen,
Casper Hansen,
Jakob Grue Simonsen,
Stephen Alstrup,
Christina Lioma
Abstract:
Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster al…
▽ More
Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.
△ Less
Submitted 26 March, 2021;
originally announced March 2021.
-
Projected Hamming Dissimilarity for Bit-Level Importance Coding in Collaborative Filtering
Authors:
Christian Hansen,
Casper Hansen,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
When reasoning about tasks that involve large amounts of data, a common approach is to represent data items as objects in the Hamming space where operations can be done efficiently and effectively. Object similarity can then be computed by learning binary representations (hash codes) of the objects and computing their Hamming distance. While this is highly efficient, each bit dimension is equally…
▽ More
When reasoning about tasks that involve large amounts of data, a common approach is to represent data items as objects in the Hamming space where operations can be done efficiently and effectively. Object similarity can then be computed by learning binary representations (hash codes) of the objects and computing their Hamming distance. While this is highly efficient, each bit dimension is equally weighted, which means that potentially discriminative information of the data is lost. A more expressive alternative is to use real-valued vector representations and compute their inner product; this allows varying the weight of each dimension but is many magnitudes slower. To fix this, we derive a new way of measuring the dissimilarity between two objects in the Hamming space with binary weighting of each dimension (i.e., disabling bits): we consider a field-agnostic dissimilarity that projects the vector of one object onto the vector of the other. When working in the Hamming space, this results in a novel projected Hamming dissimilarity, which by choice of projection, effectively allows a binary importance weighting of the hash code of one object through the hash code of the other. We propose a variational hashing model for learning hash codes optimized for this projected Hamming dissimilarity, and experimentally evaluate it in collaborative filtering experiments. The resultant hash codes lead to effectiveness gains of up to +7% in NDCG and +14% in MRR compared to state-of-the-art hashing-based collaborative filtering baselines, while requiring no additional storage and no computational overhead compared to using the Hamming distance.
△ Less
Submitted 26 March, 2021;
originally announced March 2021.
-
Quantum-inspired Multimodal Fusion for Video Sentiment Analysis
Authors:
Qiuchi Li,
Dimitris Gkoumas,
Christina Lioma,
Massimo Melucci
Abstract:
We tackle the crucial challenge of fusing different modalities of features for multimodal sentiment analysis. Mainly based on neural networks, existing approaches largely model multimodal interactions in an implicit and hard-to-understand manner. We address this limitation with inspirations from quantum theory, which contains principled methods for modeling complicated interactions and correlation…
▽ More
We tackle the crucial challenge of fusing different modalities of features for multimodal sentiment analysis. Mainly based on neural networks, existing approaches largely model multimodal interactions in an implicit and hard-to-understand manner. We address this limitation with inspirations from quantum theory, which contains principled methods for modeling complicated interactions and correlations. In our quantum-inspired framework, the word interaction within a single modality and the interaction across modalities are formulated with superposition and entanglement respectively at different stages. The complex-valued neural network implementation of the framework achieves comparable results to state-of-the-art systems on two benchmarking video sentiment analysis datasets. In the meantime, we produce the unimodal and bimodal sentiment directly from the model to interpret the entangled decision.
△ Less
Submitted 22 March, 2021; v1 submitted 18 March, 2021;
originally announced March 2021.
-
Multi-Head Self-Attention with Role-Guided Masks
Authors:
Dongsheng Wang,
Casper Hansen,
Lucas Chaves Lima,
Christian Hansen,
Maria Maistro,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors.…
▽ More
The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input, such that different heads are designed to play different roles. Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
△ Less
Submitted 22 December, 2020;
originally announced December 2020.
-
Denmark's Participation in the Search Engine TREC COVID-19 Challenge: Lessons Learned about Searching for Precise Biomedical Scientific Information on COVID-19
Authors:
Lucas Chaves Lima,
Casper Hansen,
Christian Hansen,
Dongsheng Wang,
Maria Maistro,
Birger Larsen,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
This report describes the participation of two Danish universities, University of Copenhagen and Aalborg University, in the international search engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the U.S. National Institute of Standards and Technology (NIST) and its Text Retrieval Conference (TREC) division. The aim of the competition was to find the best search engine str…
▽ More
This report describes the participation of two Danish universities, University of Copenhagen and Aalborg University, in the international search engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the U.S. National Institute of Standards and Technology (NIST) and its Text Retrieval Conference (TREC) division. The aim of the competition was to find the best search engine strategy for retrieving precise biomedical scientific information on COVID-19 from the largest, at that point in time, dataset of curated scientific literature on COVID-19 -- the COVID-19 Open Research Dataset (CORD-19). CORD-19 was the result of a call to action to the tech community by the U.S. White House in March 2020, and was shortly thereafter posted on Kaggle as an AI competition by the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University's Center for Security and Emerging Technology, Microsoft, and the National Library of Medicine at the US National Institutes of Health. CORD-19 contained over 200,000 scholarly articles (of which more than 100,000 were with full text) about COVID-19, SARS-CoV-2, and related coronaviruses, gathered from curated biomedical sources. The TREC-COVID challenge asked for the best way to (a) retrieve accurate and precise scientific information, in response to some queries formulated by biomedical experts, and (b) rank this information decreasingly by its relevance to the query.
In this document, we describe the TREC-COVID competition setup, our participation to it, and our resulting reflections and lessons learned about the state-of-art technology when faced with the acute task of retrieving precise scientific information from a rapidly growing corpus of literature, in response to highly specialised queries, in the middle of a pandemic.
△ Less
Submitted 26 November, 2020; v1 submitted 25 November, 2020;
originally announced November 2020.
-
A Diagnostic Study of Explainability Techniques for Text Classification
Authors:
Pepa Atanasova,
Jakob Grue Simonsen,
Christina Lioma,
Isabelle Augenstein
Abstract:
Recent developments in machine learning have introduced models that approach human performance at the cost of increased architectural complexity. Efforts to make the rationales behind the models' predictions transparent have inspired an abundance of new explainability techniques. Provided with an already trained model, they compute saliency scores for the words of an input instance. However, there…
▽ More
Recent developments in machine learning have introduced models that approach human performance at the cost of increased architectural complexity. Efforts to make the rationales behind the models' predictions transparent have inspired an abundance of new explainability techniques. Provided with an already trained model, they compute saliency scores for the words of an input instance. However, there exists no definitive guide on (i) how to choose such a technique given a particular application task and model architecture, and (ii) the benefits and drawbacks of using each such technique. In this paper, we develop a comprehensive list of diagnostic properties for evaluating existing explainability techniques. We then employ the proposed list to compare a set of diverse explainability techniques on downstream text classification tasks and neural network architectures. We also compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions to find relations between a model's performance and the agreement of its rationales with human ones. Overall, we find that the gradient-based explanations perform best across tasks and model architectures, and we present further insights into the properties of the reviewed explainability techniques.
△ Less
Submitted 25 September, 2020;
originally announced September 2020.
-
Unsupervised Semantic Hashing with Pairwise Reconstruction
Authors:
Casper Hansen,
Christian Hansen,
Jakob Grue Simonsen,
Stephen Alstrup,
Christina Lioma
Abstract:
Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by t…
▽ More
Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search.
△ Less
Submitted 1 July, 2020;
originally announced July 2020.
-
Factuality Checking in News Headlines with Eye Tracking
Authors:
Christian Hansen,
Casper Hansen,
Jakob Grue Simonsen,
Birger Larsen,
Stephen Alstrup,
Christina Lioma
Abstract:
We study whether it is possible to infer if a news headline is true or false using only the movement of the human eyes when reading news headlines. Our study with 55 participants who are eye-tracked when reading 108 news headlines (72 true, 36 false) shows that false headlines receive statistically significantly less visual attention than true headlines. We further build an ensemble learner that p…
▽ More
We study whether it is possible to infer if a news headline is true or false using only the movement of the human eyes when reading news headlines. Our study with 55 participants who are eye-tracked when reading 108 news headlines (72 true, 36 false) shows that false headlines receive statistically significantly less visual attention than true headlines. We further build an ensemble learner that predicts news headline factuality using only eye-tracking measurements. Our model yields a mean AUC of 0.688 and is better at detecting false than true headlines. Through a model analysis, we find that eye-tracking 25 users when reading 3-6 headlines is sufficient for our ensemble learner.
△ Less
Submitted 17 June, 2020;
originally announced June 2020.
-
Content-aware Neural Hashing for Cold-start Recommendation
Authors:
Casper Hansen,
Christian Hansen,
Jakob Grue Simonsen,
Stephen Alstrup,
Christina Lioma
Abstract:
Content-aware recommendation approaches are essential for providing meaningful recommendations for \textit{new} (i.e., \textit{cold-start}) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user…
▽ More
Content-aware recommendation approaches are essential for providing meaningful recommendations for \textit{new} (i.e., \textit{cold-start}) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user-item relevance. NeuHash-CF is modelled as an autoencoder architecture, consisting of two joint hashing components for generating user and item hash codes. Inspired from semantic hashing, the item hashing component generates a hash code directly from an item's content information (i.e., it generates cold-start and seen item hash codes in the same manner). This contrasts existing state-of-the-art models, which treat the two item cases separately. The user hash codes are generated directly based on user id, through learning a user embedding matrix. We show experimentally that NeuHash-CF significantly outperforms state-of-the-art baselines by up to 12\% NDCG and 13\% MRR in cold-start recommendation settings, and up to 4\% in both NDCG and MRR in standard settings where all items are present while training. Our approach uses 2-4x shorter hash codes, while obtaining the same or better performance compared to the state of the art, thus consequently also enabling a notable storage reduction.
△ Less
Submitted 31 May, 2020;
originally announced June 2020.
-
Generating Fact Checking Explanations
Authors:
Pepa Atanasova,
Jakob Grue Simonsen,
Christina Lioma,
Isabelle Augenstein
Abstract:
Most existing work on automated fact checking is concerned with predicting the veracity of claims based on metadata, social network spread, language used in claims, and, more recently, evidence supporting or denying claims. A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process -- generating justifications for verdicts on claims.…
▽ More
Most existing work on automated fact checking is concerned with predicting the veracity of claims based on metadata, social network spread, language used in claims, and, more recently, evidence supporting or denying claims. A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process -- generating justifications for verdicts on claims. This paper provides the first study of how these explanations can be generated automatically based on available claim context, and how this task can be modelled jointly with veracity prediction. Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system. The results of a manual evaluation further suggest that the informativeness, coverage and overall quality of the generated explanations are also improved in the multi-task model.
△ Less
Submitted 13 April, 2020;
originally announced April 2020.
-
Encoding word order in complex embeddings
Authors:
Benyou Wang,
Donghao Zhao,
Christina Lioma,
Qiuchi Li,
Peng Zhang,
Jakob Grue Simonsen
Abstract:
Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. We present a novel and principled solution for modeling both t…
▽ More
Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. We present a novel and principled solution for modeling both the global absolute positions of words and their order relationships. Our solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions. Hence, word representations in different positions can correlate with each other in a continuous function. The general solution of these functions is extended to complex-valued domain due to richer representations. We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
△ Less
Submitted 28 June, 2020; v1 submitted 27 December, 2019;
originally announced December 2019.
-
Modelling End-of-Session Actions in Educational Systems
Authors:
Christian Hansen,
Casper Hansen,
Stephen Alstrup,
Christina Lioma
Abstract:
In this paper we consider the problem of modelling when students end their session in an online mathematics educational system. Being able to model this accurately will help us optimize the way content is presented and consumed. This is done by modelling the probability of an action being the last in a session, which we denote as the End-of-Session probability. We use log data from a system where…
▽ More
In this paper we consider the problem of modelling when students end their session in an online mathematics educational system. Being able to model this accurately will help us optimize the way content is presented and consumed. This is done by modelling the probability of an action being the last in a session, which we denote as the End-of-Session probability. We use log data from a system where students can learn mathematics through various kinds of learning materials, as well as multiple types of exercises, such that a student session can consist of many different activities. We model the End-of-Session probability by a deep recurrent neural network in order to utilize the long term temporal aspect, which we experimentally show is central for this task. Using a large scale dataset of more than 70 million student actions, we obtain an AUC of 0.81 on an unseen collection of students. Through a detailed error analysis, we observe that our model is robust across different session structures and across varying session lengths.
△ Less
Submitted 15 September, 2019;
originally announced September 2019.
-
MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims
Authors:
Isabelle Augenstein,
Christina Lioma,
Dongsheng Wang,
Lucas Chaves Lima,
Casper Hansen,
Christian Hansen,
Jakob Grue Simonsen
Abstract:
We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Furthe…
▽ More
We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Further, we present results for automatic veracity prediction, both with established baselines and with a novel method for joint ranking of evidence pages and predicting veracity that outperforms all baselines. Significant performance increases are achieved by encoding evidence, and by modelling metadata. Our best-performing model achieves a Macro F1 of 49.2%, showing that this is a challenging testbed for claim veracity prediction.
△ Less
Submitted 21 October, 2019; v1 submitted 7 September, 2019;
originally announced September 2019.
-
Contextually Propagated Term Weights for Document Representation
Authors:
Casper Hansen,
Christian Hansen,
Stephen Alstrup,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
Word embeddings predict a word from its neighbours by learning small, dense embedding vectors. In practice, this prediction corresponds to a semantic score given to the predicted word (or term weight). We present a novel model that, given a target word, redistributes part of that word's weight (that has been computed with word embeddings) across words occurring in similar contexts as the target wo…
▽ More
Word embeddings predict a word from its neighbours by learning small, dense embedding vectors. In practice, this prediction corresponds to a semantic score given to the predicted word (or term weight). We present a novel model that, given a target word, redistributes part of that word's weight (that has been computed with word embeddings) across words occurring in similar contexts as the target word. Thus, our model aims to simulate how semantic meaning is shared by words occurring in similar contexts, which is incorporated into bag-of-words document representations. Experimental evaluation in an unsupervised setting against 8 state of the art baselines shows that our model yields the best micro and macro F1 scores across datasets of increasing difficulty.
△ Less
Submitted 3 June, 2019;
originally announced June 2019.
-
Unsupervised Neural Generative Semantic Hashing
Authors:
Casper Hansen,
Christian Hansen,
Jakob Grue Simonsen,
Stephen Alstrup,
Christina Lioma
Abstract:
Fast similarity search is a key component in large-scale information retrieval, where semantic hashing has become a popular strategy for representing documents as binary hash codes. Recent advances in this area have been obtained through neural network based models: generative models trained by learning to reconstruct the original documents. We present a novel unsupervised generative semantic hash…
▽ More
Fast similarity search is a key component in large-scale information retrieval, where semantic hashing has become a popular strategy for representing documents as binary hash codes. Recent advances in this area have been obtained through neural network based models: generative models trained by learning to reconstruct the original documents. We present a novel unsupervised generative semantic hashing approach, \textit{Ranking based Semantic Hashing} (RBSH) that consists of both a variational and a ranking based component. Similarly to variational autoencoders, the variational component is trained to reconstruct the original document conditioned on its generated hash code, and as in prior work, it only considers documents individually. The ranking component solves this limitation by incorporating inter-document similarity into the hash code generation, modelling document ranking through a hinge loss. To circumvent the need for labelled data to compute the hinge loss, we use a weak labeller and thus keep the approach fully unsupervised.
Extensive experimental evaluation on four publicly available datasets against traditional baselines and recent state-of-the-art methods for semantic hashing shows that RBSH significantly outperforms all other methods across all evaluated hash code lengths. In fact, RBSH hash codes are able to perform similarly to state-of-the-art hash codes while using 2-4x fewer bits.
△ Less
Submitted 3 June, 2019;
originally announced June 2019.
-
Neural Speed Reading with Structural-Jump-LSTM
Authors:
Christian Hansen,
Casper Hansen,
Stephen Alstrup,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
Recurrent neural networks (RNNs) can model natural language by sequentially 'reading' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as 'neural speed reading', either ignore or skim…
▽ More
Recurrent neural networks (RNNs) can model natural language by sequentially 'reading' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as 'neural speed reading', either ignore or skim over part of the input. We present Structural-Jump-LSTM: the first neural speed reading model to both skip and jump text during inference. The model consists of a standard LSTM and two agents: one capable of skipping single words when reading, and one capable of exploiting punctuation structure (sub-sentence separators (,:), sentence end symbols (.!?), or end of text markers) to jump ahead after reading a word. A comprehensive experimental evaluation of our model against all five state-of-the-art neural reading models shows that Structural-Jump-LSTM achieves the best overall floating point operations (FLOP) reduction (hence is faster), while keeping the same accuracy or even improving it compared to a vanilla LSTM that reads the whole text.
△ Less
Submitted 2 April, 2019; v1 submitted 20 March, 2019;
originally announced April 2019.
-
Modelling Sequential Music Track Skips using a Multi-RNN Approach
Authors:
Christian Hansen,
Casper Hansen,
Stephen Alstrup,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
Modelling sequential music skips provides streaming companies the ability to better understand the needs of the user base, resulting in a better user experience by reducing the need to manually skip certain music tracks. This paper describes the solution of the University of Copenhagen DIKU-IR team in the 'Spotify Sequential Skip Prediction Challenge', where the task was to predict the skip behavi…
▽ More
Modelling sequential music skips provides streaming companies the ability to better understand the needs of the user base, resulting in a better user experience by reducing the need to manually skip certain music tracks. This paper describes the solution of the University of Copenhagen DIKU-IR team in the 'Spotify Sequential Skip Prediction Challenge', where the task was to predict the skip behaviour of the second half in a music listening session conditioned on the first half. We model this task using a Multi-RNN approach consisting of two distinct stacked recurrent neural networks, where one network focuses on encoding the first half of the session and the other network focuses on utilizing the encoding to make sequential skip predictions. The encoder network is initialized by a learned session-wide music encoding, and both of them utilize a learned track embedding. Our final model consists of a majority voted ensemble of individually trained models, and ranked 2nd out of 45 participating teams in the competition with a mean average accuracy of 0.641 and an accuracy on the first skip prediction of 0.807. Our code is released at https://github.com/Varyn/WSDM-challenge-2019-spotify.
△ Less
Submitted 20 March, 2019;
originally announced March 2019.
-
Neural Check-Worthiness Ranking with Weak Supervision: Finding Sentences for Fact-Checking
Authors:
Casper Hansen,
Christian Hansen,
Stephen Alstrup,
Jakob Grue Simonsen,
Christina Lioma
Abstract:
Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weigh…
▽ More
Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weighted terms in both check-worthy and non-check-worthy sentences are actually overlapping [15]. Motivated by this, we present a neural check-worthiness sentence ranking model that represents each word in a sentence by \textit{both} its embedding (aiming to capture its semantics) and its syntactic dependencies (aiming to capture its role in modifying the semantics of other terms in the sentence). Our model is an end-to-end trainable neural network for check-worthiness ranking, which is trained on large amounts of unlabelled data through weak supervision. Thorough experimental evaluation against state of the art baselines, with and without weak supervision, shows our model to be superior at all times (+13% in MAP and +28% at various Precision cut-offs from the best baseline with statistical significance). Empirical analysis of the use of weak supervision, word embedding pretraining on domain-specific data, and the use of syntactic dependencies of our model reveals that check-worthy sentences contain notably more identical syntactic dependencies than non-check-worthy sentences.
△ Less
Submitted 20 March, 2019;
originally announced March 2019.
-
Contextual Compositionality Detection with External Knowledge Bases andWord Embeddings
Authors:
Dongsheng Wang,
Quichi Li,
Lucas Chaves Lima,
Jakob grue Simonsen,
Christina Lioma
Abstract:
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats ph…
▽ More
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase `green card' is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.
△ Less
Submitted 20 March, 2019;
originally announced March 2019.
-
Predicting antimicrobial drug consumption using web search data
Authors:
Niels Dalum Hansen,
Kåre Mølbak,
Ingemar Cox,
Christina Lioma
Abstract:
Consumption of antimicrobial drugs, such as antibiotics, is linked with antimicrobial resistance. Surveillance of antimicrobial drug consumption is therefore an important element in dealing with antimicrobial resistance. Many countries lack sufficient surveillance systems. Usage of web mined data therefore has the potential to improve current surveillance methods. To this end, we study how well an…
▽ More
Consumption of antimicrobial drugs, such as antibiotics, is linked with antimicrobial resistance. Surveillance of antimicrobial drug consumption is therefore an important element in dealing with antimicrobial resistance. Many countries lack sufficient surveillance systems. Usage of web mined data therefore has the potential to improve current surveillance methods. To this end, we study how well antimicrobial drug consumption can be predicted based on web search queries, compared to historical purchase data of antimicrobial drugs. We present two prediction models (linear Elastic Net, and non-linear Gaussian Processes), which we train and evaluate on almost 6 years of weekly antimicrobial drug consumption data from Denmark and web search data from Google Health Trends. We present a novel method of selecting web search queries by considering diseases and drugs linked to antimicrobials, as well as professional and layman descriptions of antimicrobial drugs, all of which we mine from the open web. We find that predictions based on web search data are marginally more erroneous but overall on a par with predictions based on purchases of antimicrobial drugs. This marginal difference corresponds to $<1$\% point mean absolute error in weekly usage. Best predictions are reported when combining both web search and purchase data.
This study contributes a novel alternative solution to the real-life problem of predicting (and hence monitoring) antimicrobial drug consumption, which is particularly valuable in countries/states lacking centralised and timely surveillance systems.
△ Less
Submitted 9 March, 2018;
originally announced March 2018.
-
Seasonal Web Search Query Selection for Influenza-Like Illness (ILI) Estimation
Authors:
Niels Dalum Hansen,
Kåre Mølbak,
Ingemar J. Cox,
Christina Lioma
Abstract:
Influenza-like illness (ILI) estimation from web search data is an important web analytics task. The basic idea is to use the frequencies of queries in web search logs that are correlated with past ILI activity as features when estimating current ILI activity. It has been noted that since influenza is seasonal, this approach can lead to spurious correlations with features/queries that also exhibit…
▽ More
Influenza-like illness (ILI) estimation from web search data is an important web analytics task. The basic idea is to use the frequencies of queries in web search logs that are correlated with past ILI activity as features when estimating current ILI activity. It has been noted that since influenza is seasonal, this approach can lead to spurious correlations with features/queries that also exhibit seasonality, but have no relationship with ILI. Spurious correlations can, in turn, degrade performance. To address this issue, we propose modeling the seasonal variation in ILI activity and selecting queries that are correlated with the residual of the seasonal model and the observed ILI signal. Experimental results show that re-ranking queries obtained by Google Correlate based on their correlation with the residual strongly favours ILI-related queries.
△ Less
Submitted 19 February, 2018;
originally announced February 2018.
-
To Phrase or Not to Phrase - Impact of User versus System Term Dependence Upon Retrieval
Authors:
Christina Lioma,
Birger Larsen,
Peter Ingwersen
Abstract:
When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in querie…
▽ More
When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in queries. Most IR systems use both user and algorithmic approaches. It is not however clear whether and to what extent user-defined term dependence agrees with algorithmic estimates of term dependence, nor which of the two may fetch higher performance gains. Simply put, is it better to trust users or the system to detect term dependence in queries? To answer this question, we experiment with 101 crowdsourced search engine users and 334 queries (52 train and 282 test TREC queries) and we record 10 assessments per query. We find that (i) user assessments of term dependence differ significantly from algorithmic assessments of term dependence (their overlap is approximately 30%); (ii) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (iii) the potential retrieval gain that can be fetched by treating term dependence (both user- and system-defined) over a bag of words baseline is reserved to a small subset (approxi-mately 8%) of the queries, and is much higher for low-depth than deep preci-sion measures. Points (ii) and (iii) constitute novel insights into term dependence.
△ Less
Submitted 5 March, 2018; v1 submitted 7 February, 2018;
originally announced February 2018.
-
Dependencies: Formalising Semantic Catenae for Information Retrieval
Authors:
Christina Lioma
Abstract:
Building machines that can understand text like humans is an AI-complete problem. A great deal of research has already gone into this, with astounding results, allowing everyday people to discuss with their telephones, or have their reading materials analysed and classified by computers. A prerequisite for processing text semantics, common to the above examples, is having some computational repres…
▽ More
Building machines that can understand text like humans is an AI-complete problem. A great deal of research has already gone into this, with astounding results, allowing everyday people to discuss with their telephones, or have their reading materials analysed and classified by computers. A prerequisite for processing text semantics, common to the above examples, is having some computational representation of text as an abstract object. Operations on this representation practically correspond to making semantic inferences, and by extension simulating understanding text. The complexity and granularity of semantic processing that can be realised is constrained by the mathematical and computational robustness, expressiveness, and rigour of the tools used.
This dissertation contributes a series of such tools, diverse in their mathematical formulation, but common in their application to model semantic inferences when machines process text. These tools are principally expressed in nine distinct models that capture aspects of semantic dependence in highly interpretable and non-complex ways. This dissertation further reflects on present and future problems with the current research paradigm in this area, and makes recommendations on how to overcome them.
The amalgamation of the body of work presented in this dissertation advances the complexity and granularity of semantic inferences that can be made automatically by machines.
△ Less
Submitted 12 September, 2017;
originally announced September 2017.
-
Evaluation Measures for Relevance and Credibility in Ranked Lists
Authors:
Christina Lioma,
Jakob Grue Simonsen,
Birger Larsen
Abstract:
Recent discussions on alternative facts, fake news, and post truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented to them by information retrieval systems. Whereas technology is in place for filtering information according to relevance and/or credibility, no single measure c…
▽ More
Recent discussions on alternative facts, fake news, and post truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented to them by information retrieval systems. Whereas technology is in place for filtering information according to relevance and/or credibility, no single measure currently exists for evaluating the accuracy or precision (and more generally effectiveness) of both the relevance and the credibility of retrieved results. One obvious way of doing so is to measure relevance and credibility effectiveness separately, and then consolidate the two measures into one. There at least two problems with such an approach: (I) it is not certain that the same criteria are applied to the evaluation of both relevance and credibility (and applying different criteria introduces bias to the evaluation); (II) many more and richer measures exist for assessing relevance effectiveness than for assessing credibility effectiveness (hence risking further bias).
Motivated by the above, we present two novel types of evaluation measures that are designed to measure the effectiveness of both relevance and credibility in ranked lists of retrieval results. Experimental evaluation on a small human-annotated dataset (that we make freely available to the research community) shows that our measures are expressive and intuitive in their interpretation.
△ Less
Submitted 23 August, 2017;
originally announced August 2017.
-
Smart City Analytics: Ensemble-Learned Prediction of Citizen Home Care
Authors:
Casper Hansen,
Christian Hansen,
Stephen Alstrup,
Christina Lioma
Abstract:
We present an ensemble learning method that predicts large increases in the hours of home care received by citizens. The method is supervised, and uses different ensembles of either linear (logistic regression) or non-linear (random forests) classifiers. Experiments with data available from 2013 to 2017 for every citizen in Copenhagen receiving home care (27,775 citizens) show that prediction can…
▽ More
We present an ensemble learning method that predicts large increases in the hours of home care received by citizens. The method is supervised, and uses different ensembles of either linear (logistic regression) or non-linear (random forests) classifiers. Experiments with data available from 2013 to 2017 for every citizen in Copenhagen receiving home care (27,775 citizens) show that prediction can achieve state of the art performance as reported in similar health related domains (AUC=0.715). We further find that competitive results can be obtained by using limited information for training, which is very useful when full records are not accessible or available. Smart city analytics does not necessarily require full city records.
To our knowledge this preliminary study is the first to predict large increases in home care for smart city analytics.
△ Less
Submitted 21 August, 2017;
originally announced August 2017.
-
Sequence Modelling For Analysing Student Interaction with Educational Systems
Authors:
Christian Hansen,
Casper Hansen,
Niklas Hjuler,
Stephen Alstrup,
Christina Lioma
Abstract:
The analysis of log data generated by online educational systems is an important task for improving the systems, and furthering our knowledge of how students learn. This paper uses previously unseen log data from Edulab, the largest provider of digital learning for mathematics in Denmark, to analyse the sessions of its users, where 1.08 million student sessions are extracted from a subset of their…
▽ More
The analysis of log data generated by online educational systems is an important task for improving the systems, and furthering our knowledge of how students learn. This paper uses previously unseen log data from Edulab, the largest provider of digital learning for mathematics in Denmark, to analyse the sessions of its users, where 1.08 million student sessions are extracted from a subset of their data. We propose to model students as a distribution of different underlying student behaviours, where the sequence of actions from each session belongs to an underlying student behaviour. We model student behaviour as Markov chains, such that a student is modelled as a distribution of Markov chains, which are estimated using a modified k-means clustering algorithm. The resulting Markov chains are readily interpretable, and in a qualitative analysis around 125,000 student sessions are identified as exhibiting unproductive student behaviour. Based on our results this student representation is promising, especially for educational systems offering many different learning usages, and offers an alternative to common approaches like modelling student behaviour as a single Markov chain often done in the literature.
△ Less
Submitted 14 August, 2017;
originally announced August 2017.
-
Fixed versus Dynamic Co-Occurrence Windows in TextRank Term Weights for Information Retrieval
Authors:
Wei Lu,
Qikai Cheng,
Christina Lioma
Abstract:
TextRank is a variant of PageRank typically used in graphs that represent documents, and where vertices denote terms and edges denote relations between terms. Quite often the relation between terms is simple term co-occurrence within a fixed window of k terms. The output of TextRank when applied iteratively is a score for each vertex, i.e. a term weight, that can be used for information retrieval…
▽ More
TextRank is a variant of PageRank typically used in graphs that represent documents, and where vertices denote terms and edges denote relations between terms. Quite often the relation between terms is simple term co-occurrence within a fixed window of k terms. The output of TextRank when applied iteratively is a score for each vertex, i.e. a term weight, that can be used for information retrieval (IR) just like conventional term frequency based term weights. So far, when computing TextRank term weights over co- occurrence graphs, the window of term co-occurrence is al- ways ?xed. This work departs from this, and considers dy- namically adjusted windows of term co-occurrence that fol- low the document structure on a sentence- and paragraph- level. The resulting TextRank term weights are used in a ranking function that re-ranks 1000 initially returned search results in order to improve the precision of the ranking. Ex- periments with two IR collections show that adjusting the vicinity of term co-occurrence when computing TextRank term weights can lead to gains in early precision.
△ Less
Submitted 6 April, 2017;
originally announced April 2017.
-
Report on TBAS 2012: Workshop on Task-Based and Aggregated Search
Authors:
Birger Larsen,
Christina Lioma,
Arjen de Vries
Abstract:
The ECIR half-day workshop on Task-Based and Aggregated Search (TBAS) was held in Barcelona, Spain on 1 April 2012. The program included a keynote talk by Professor Jarvelin, six full paper presentations, two poster presentations, and an interactive discussion among the approximately 25 participants. This report overviews the aims and contents of the workshop and outlines the major outcomes.
The ECIR half-day workshop on Task-Based and Aggregated Search (TBAS) was held in Barcelona, Spain on 1 April 2012. The program included a keynote talk by Professor Jarvelin, six full paper presentations, two poster presentations, and an interactive discussion among the approximately 25 participants. This report overviews the aims and contents of the workshop and outlines the major outcomes.
△ Less
Submitted 6 April, 2017;
originally announced April 2017.
-
Part of Speech Based Term Weighting for Information Retrieval
Authors:
Christina Lioma,
Roi Blanco
Abstract:
Automatic language processing tools typically assign to terms so-called weights corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a…
▽ More
Automatic language processing tools typically assign to terms so-called weights corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval, by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF & BM25 as baselines, show that integrating our POS-based term weights to retrieval always leads to gains (up to +33.7% from the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight, show retrieval gains always and consistently across the whole smoothing range of the baseline.
△ Less
Submitted 5 April, 2017;
originally announced April 2017.