Search | arXiv e-print repository

Metadata Might Make Language Models Better

Authors: Kaspar Beelen, Daniel van Strien

Abstract: This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al., 2022 and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT on enhance… ▽ More This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al., 2022 and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT on enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models. △ Less

Submitted 18 November, 2022; originally announced November 2022.

arXiv:2111.15592 [pdf, other]

MapReader: A Computer Vision Pipeline for the Semantic Exploration of Maps at Scale

Authors: Kasra Hosseini, Daniel C. S. Wilson, Kaspar Beelen, Katherine McDonough

Abstract: We present MapReader, a free, open-source software library written in Python for analyzing large map collections (scanned or born-digital). This library transforms the way historians can use maps by turning extensive, homogeneous map sets into searchable primary sources. MapReader allows users with little or no computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divid… ▽ More We present MapReader, a free, open-source software library written in Python for analyzing large map collections (scanned or born-digital). This library transforms the way historians can use maps by turning extensive, homogeneous map sets into searchable primary sources. MapReader allows users with little or no computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divide them into patches; iii) annotate patches; iv) train, fine-tune, and evaluate deep neural network models; and v) create structured data about map content. We demonstrate how MapReader enables historians to interpret a collection of $\approx$16K nineteenth-century Ordnance Survey map sheets ($\approx$30.5M patches), foregrounding the challenge of translating visual markers into machine-readable data. We present a case study focusing on British rail infrastructure and buildings as depicted on these maps. We also show how the outputs from the MapReader pipeline can be linked to other, external datasets, which we use to evaluate as well as enrich and interpret the results. We release $\approx$62K manually annotated patches used here for training and evaluating the models. △ Less

Submitted 30 November, 2021; originally announced November 2021.

Comments: 13 pages, 9 figures

arXiv:2105.11321 [pdf, other]

Neural Language Models for Nineteenth-Century English

Authors: Kasra Hosseini, Kaspar Beelen, Giovanni Colavizza, Mariona Coll Ardanuy

Abstract: We present four types of neural language models trained on a large historical dataset of books in English, published between 1760-1900 and comprised of ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate i… ▽ More We present four types of neural language models trained on a large historical dataset of books in English, published between 1760-1900 and comprised of ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks where they consistently improved performance. In this paper, we describe how the models have been created and outline their reuse potential. △ Less

Submitted 24 May, 2021; originally announced May 2021.

Comments: 5 pages, 1 figure

arXiv:2005.11140 [pdf, other]

Living Machines: A study of atypical animacy

Authors: Mariona Coll Ardanuy, Federico Nanni, Kaspar Beelen, Kasra Hosseini, Ruth Ahnert, Jon Lawrence, Katherine McDonough, Giorgia Tolfo, Daniel CS Wilson, Barbara McGillivray

Abstract: This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on n… ▽ More This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on nineteenth-century sentences in English, with machines represented as either animate or inanimate. Our method builds on recent innovations in language modeling, specifically BERT contextualized word embeddings, to better capture fine-grained contextual properties of words. We present a fully unsupervised pipeline, which can be easily adapted to different contexts, and report its performance on an established animacy dataset and our newly introduced resource. We show that our method provides a substantially more accurate characterization of atypical animacy, especially when applied to highly complex forms of language use. △ Less

Submitted 19 November, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

Comments: 12 pages, 1 figures

arXiv:1711.05603 [pdf, other]

Words are Malleable: Computing Semantic Shifts in Political and Media Discourse

Authors: Hosein Azarbonyad, Mostafa Dehghani, Kaspar Beelen, Alexandra Arkut, Maarten Marx, Jaap Kamps

Abstract: Recently, researchers started to pay attention to the detection of temporal shifts in the meaning of words. However, most (if not all) of these approaches restricted their efforts to uncovering change over time, thus neglecting other valuable dimensions such as social or political variability. We propose an approach for detecting semantic shifts between different viewpoints--broadly defined as a s… ▽ More Recently, researchers started to pay attention to the detection of temporal shifts in the meaning of words. However, most (if not all) of these approaches restricted their efforts to uncovering change over time, thus neglecting other valuable dimensions such as social or political variability. We propose an approach for detecting semantic shifts between different viewpoints--broadly defined as a set of texts that share a specific metadata feature, which can be a time-period, but also a social entity such as a political party. For each viewpoint, we learn a semantic space in which each word is represented as a low dimensional neural embedded vector. The challenge is to compare the meaning of a word in one space to its meaning in another space and measure the size of the semantic shifts. We compare the effectiveness of a measure based on optimal transformations between the two spaces with a measure based on the similarity of the neighbors of the word in the respective spaces. Our experiments demonstrate that the combination of these two performs best. We show that the semantic shifts not only occur over time, but also along different viewpoints in a short period of time. For evaluation, we demonstrate how this approach captures meaningful semantic shifts and can help improve other tasks such as the contrastive viewpoint summarization and ideology detection (measured as classification accuracy) in political texts. We also show that the two laws of semantic change which were empirically shown to hold for temporal shifts also hold for shifts across viewpoints. These laws state that frequent words are less likely to shift meaning while words with many senses are more likely to do so. △ Less

Submitted 15 November, 2017; originally announced November 2017.

Comments: In Proceedings of the 26th ACM International on Conference on Information and Knowledge Management (CIKM2017)

arXiv:1710.01127 [pdf, other]

Finding Talk About the Past in the Discourse of Non-Historians

Authors: Alex Olieman, Kaspar Beelen, Jaap Kamps

Abstract: A heightened interest in the presence of the past has given rise to the new field of memory studies, but there is a lack of search and research tools to support studying how and why the past is evoked in diachronic discourses. Searching for temporal references is not straightforward. It entails bridging the gap between conceptually-based information needs on one side, and term-based inverted index… ▽ More A heightened interest in the presence of the past has given rise to the new field of memory studies, but there is a lack of search and research tools to support studying how and why the past is evoked in diachronic discourses. Searching for temporal references is not straightforward. It entails bridging the gap between conceptually-based information needs on one side, and term-based inverted indexes on the other. Our approach enables the search for references to (intersubjective) historical periods in diachronic corpora. It consists of a semantically-enhanced search engine that is able to find references to many entities at a time, which is combined with a novel interface that invites its user to actively sculpt the search result set. Until now we have been concerned mostly with user-friendly retrieval and selection of sources, but our tool can also contribute to existing efforts to create reusable linked data from and for research in the humanities. △ Less

Submitted 3 October, 2017; originally announced October 2017.

Comments: Presented at Drift-a-LOD 2017

arXiv:1708.01162 [pdf, other]

Good Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities

Authors: Alex Olieman, Kaspar Beelen, Milan van Lange, Jaap Kamps, Maarten Marx

Abstract: Over the last decade we have made great progress in entity linking (EL) systems, but performance may vary depending on the context and, arguably, there are even principled limitations preventing a "perfect" EL system. This also suggests that there may be applications for which current "imperfect" EL is already very useful, and makes finding the "right" application as important as building the "rig… ▽ More Over the last decade we have made great progress in entity linking (EL) systems, but performance may vary depending on the context and, arguably, there are even principled limitations preventing a "perfect" EL system. This also suggests that there may be applications for which current "imperfect" EL is already very useful, and makes finding the "right" application as important as building the "right" EL system. We investigate the Digital Humanities use case, where scholars spend a considerable amount of time selecting relevant source texts. We developed WideNet; a semantically-enhanced search tool which leverages the strengths of (imperfect) EL without getting in the way of its expert users. We evaluate this tool in two historical case-studies aiming to collect a set of references to historical periods in parliamentary debates from the last two decades; the first targeted the Dutch Golden Age, and the second World War II. The case-studies conclude with a critical reflection on the utility of WideNet for this kind of research, after which we outline how such a real-world application can help to improve EL technology in general. △ Less

Submitted 3 August, 2017; originally announced August 2017.

Comments: Accepted for presentation at SEMANTiCS '17

arXiv:1706.07643 [pdf, other]

Computational Controversy

Authors: Benjamin Timmermans, Tobias Kuhn, Kaspar Beelen, Lora Aroyo

Abstract: Climate change, vaccination, abortion, Trump: Many topics are surrounded by fierce controversies. The nature of such heated debates and their elements have been studied extensively in the social science literature. More recently, various computational approaches to controversy analysis have appeared, using new data sources such as Wikipedia, which help us now better understand these phenomena. How… ▽ More Climate change, vaccination, abortion, Trump: Many topics are surrounded by fierce controversies. The nature of such heated debates and their elements have been studied extensively in the social science literature. More recently, various computational approaches to controversy analysis have appeared, using new data sources such as Wikipedia, which help us now better understand these phenomena. However, compared to what social sciences have discovered about such debates, the existing computational approaches mostly focus on just a few of the many important aspects around the concept of controversies. In order to link the two strands, we provide and evaluate here a controversy model that is both, rooted in the findings of the social science literature and at the same time strongly linked to computational methods. We show how this model can lead to computational controversy analytics that have full coverage over all the crucial aspects that make up a controversy. △ Less

Submitted 30 August, 2017; v1 submitted 23 June, 2017; originally announced June 2017.

Comments: In Proceedings of the 9th International Conference on Social Informatics (SocInfo) 2017

Showing 1–8 of 8 results for author: Beelen, K