Showing 1–29 of 29 results for author: Ginter, F

  1. arXiv:2506.07960

    cs.CV

    Creating a Historical Migration Dataset from Finnish Church Records, 1800-1920

    Authors: Ari Vesalainen, Jenna Kanerva, Aida Nitsch, Kiia Korsu, Ilari Larkiola, Laura Ruotsalainen, Filip Ginter

    Abstract: This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries e…

    Submitted 9 June, 2025; originally announced June 2025.

    ACM Class: I.4.6; J.5

  2. Interaction Analysis by Humans and AI: A Comparative Perspective

    Authors: Maryam Teimouri, Filip Ginter, Tomi "bgt" Suovuo

    Abstract: This paper explores how Mixed Reality (MR) and 2D video conferencing influence children's communication during a gesture-based guessing game. Finnish-speaking participants engaged in a short collaborative task using two different setups: Microsoft HoloLens MR and Zoom. Audio-video recordings were transcribed and analyzed using Large Language Models (LLMs), enabling iterative correction, translatio…

    Submitted 9 June, 2025; originally announced June 2025.

  3. arXiv:2502.13566

    cs.CL

    Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs

    Authors: Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa, Filip Ginter

    Abstract: We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indi…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: Published at Proceedings of Fifth Conference on Computational Humanities Research (CHR'2024), December 2024 https://ceur-ws.org/Vol-3834/paper52.pdf

  4. arXiv:2502.01205

    cs.CL

    OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

    Authors: Jenna Kanerva, Cassandra Ledins, Siiri Käpyaho, Filip Ginter

    Abstract: Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text conti…

    Submitted 3 February, 2025; originally announced February 2025.

    Comments: To be published in RESOURCEFUL 2025

  5. arXiv:2501.07314

    cs.CL

    FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

    Authors: Erik Henriksson, Otto Tarkka, Filip Ginter

    Abstract: Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels…

    Submitted 13 January, 2025; originally announced January 2025.

    Comments: 11 pages, 4 figures, 4 tables. To be published in NoDaLiDa/Baltic-HLT 2025 proceedings

  6. arXiv:2501.05963

    cs.CL

    Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

    Authors: Emil Nuutinen, Iiro Rastas, Filip Ginter

    Abstract: We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset and more generally the MT method through direct evaluation,…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: NoDaLiDa 2025

  7. arXiv:2405.15290

    cond-mat.mtrl-sci

    Question Answering models for information extraction from perovskite materials science literature

    Authors: M. Sipilä, F. Mehryary, S. Pyysalo, F. Ginter, Milica Todorović

    Abstract: Scientific text is a promising source of data in materials science, with ongoing research into utilising textual data for materials discovery. In this study, we developed and tested a novel approach to extract material-property relationships from scientific publications using the Question Answering (QA) method. QA performance was evaluated for information extraction of perovskite bandgaps based on…

    Submitted 13 September, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: The following article has been submitted to npj Computational Materials

  8. arXiv:2311.05640

    cs.CL

    FinGPT: Large Generative Models for a Small Language

    Authors: Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo

    Abstract: Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: 17 pages (10 main), 7 figures, 5 tables

  9. arXiv:2305.11016

    cs.CL

    Silver Syntax Pre-training for Cross-Domain Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has shown…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted in Findings of the Association for Computational Linguistics: ACL 2023

  10. arXiv:2305.10985

    cs.CL

    Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at NoDaLiDa 2023

  11. arXiv:2211.12504

    cs.CL cs.AI

    Identifying gender bias in blockbuster movies through the lens of machine learning

    Authors: Muhammad Junaid Haris, Aanchal Upreti, Melih Kurtaran, Filip Ginter, Sebastien Lafond, Sepinoud Azimi

    Abstract: The problem of gender bias is highly prevalent and well known. In this paper, we have analysed the portrayal of gender roles in English movies, a medium that effectively influences society in shaping people's beliefs and opinions. First, we gathered scripts of films from different genres and derived sentiments and emotions using natural language processing techniques. Afterwards, we converted the…

    Submitted 21 November, 2022; originally announced November 2022.

  12. arXiv:2206.11249

    cs.CL cs.AI cs.LG

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

    Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, et al. (52 additional authors not shown)

    Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an…

    Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  13. arXiv:2204.10621

    cs.CL

    Out-of-Domain Evaluation of Finnish Dependency Parsing

    Authors: Jenna Kanerva, Filip Ginter

    Abstract: The prevailing practice in the academia is to evaluate the model performance on in-domain evaluation data typically set aside from the training corpus. However, in many real world applications the data on which the model is applied may very substantially differ from the characteristics of the training data. In this paper, we focus on Finnish out-of-domain parsing by introducing a novel UD Finnish-…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: Accepted at LREC 2022

  14. Semantic Search as Extractive Paraphrase Span Detection

    Authors: Jenna Kanerva, Hanna Kitti, Li-Hsin Chang, Teemu Vahtola, Mathias Creutz, Filip Ginter

    Abstract: In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection, i.e. given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs including thei…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: Language Resources and Evaluation (2024)

  15. arXiv:2108.13653

    cs.CL cs.AI

    Explaining Classes through Word Attribution

    Authors: Samuel Rönnqvist, Amanda Myntti, Aki-Juhani Kyröläinen, Sampo Pyysalo, Veronika Laippala, Filip Ginter

    Abstract: In recent years, several methods have been proposed for explaining individual predictions of deep learning models, yet there has been little study of how to aggregate these predictions to explain how such models view classes as a whole in text classification tasks. In this work, we propose a method for explaining classes using deep learning models and the Integrated Gradients feature attribution t…

    Submitted 31 August, 2021; originally announced August 2021.

  16. arXiv:2108.07499

    cs.CL

    Annotation Guidelines for the Turku Paraphrase Corpus

    Authors: Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka

    Abstract: This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. These guidelines were developed together with the corpus annotation, revising and extending the guidelines regularly during the annotation work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases), while labels 3 and 4 are pa…

    Submitted 19 August, 2021; v1 submitted 17 August, 2021; originally announced August 2021.

    Comments: The Turku Paraphrase Corpus is available at https://turkunlp.org/paraphrase.html

  17. arXiv:2105.02477

    cs.CL

    Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases

    Authors: Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

    Abstract: In this paper, we present a quantitative evaluation of differences between alternative translations in a large recently released Finnish paraphrase corpus focusing in particular on non-trivial variation in translation. We combine a series of automatic steps detecting systematic variation with manual analysis to reveal regularities and identify categories of translation differences. We find the par…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: Accepted to Workshop on MOdelling TRAnslation: Translatology in the Digital Age

  18. arXiv:2104.11556

    cs.CL

    Deep learning for sentence clustering in essay grading support

    Authors: Li-Hsin Chang, Iiro Rastas, Sampo Pyysalo, Filip Ginter

    Abstract: Essays as a form of assessment test student knowledge on a deeper level than short answer and multiple-choice questions. However, the manual evaluation of essays is time- and labor-consuming. Automatic clustering of essays, or their fragments, prior to manual evaluation presents a possible solution to reducing the effort required in the evaluation process. Such clustering presents numerous challen…

    Submitted 23 April, 2021; originally announced April 2021.

    Comments: Accepted to EDM 2021

  19. arXiv:2103.13103

    cs.CL

    Finnish Paraphrase Corpus

    Authors: Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Jenna Saarni, Maija Sevón, Otto Tarkka

    Abstract: In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demo…

    Submitted 24 March, 2021; originally announced March 2021.

    Comments: Accepted to NoDaLiDa 2021, data: https://github.com/TurkuNLP/Turku-paraphrase-corpus

  20. arXiv:2010.11639

    cs.CL

    Towards Fully Bilingual Deep Language Modeling

    Authors: Li-Hsin Chang, Sampo Pyysalo, Jenna Kanerva, Filip Ginter

    Abstract: Language models based on deep neural networks have facilitated great advances in natural language processing and understanding tasks in recent years. While models covering a large number of languages have been introduced, their multilinguality has come at a cost in terms of monolingual performance, and the best-performing models at most tasks not involving cross-lingual transfer remain monolingual…

    Submitted 22 October, 2020; originally announced October 2020.

  21. arXiv:2006.01538

    cs.CL cs.LG

    WikiBERT models: deep transfer learning for many languages

    Authors: Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter

    Abstract: Deep neural language models such as BERT have enabled substantial recent advances in many natural language processing tasks. Due to the effort and computational cost involved in their pre-training, language-specific models are typically introduced only for a small number of high-resource languages such as English. While multilingual models covering large numbers of languages are available, recent…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: 7 pages, 1 figure

  22. arXiv:2004.10643

    cs.CL

    Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

    Authors: Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

    Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on…

    Submitted 22 April, 2020; originally announced April 2020.

    Comments: LREC 2020

  23. arXiv:1912.07076

    cs.CL

    Multilingual is not enough: BERT for Finnish

    Authors: Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo

    Abstract: Deep learning-based language models pretrained on large unannotated text corpora have been demonstrated to allow efficient transfer learning for natural language processing, with recent approaches such as the transformer-based BERT model advancing the state of the art across a variety of tasks. While most work on these models has focused on high-resource languages, in particular English, a number…

    Submitted 15 December, 2019; originally announced December 2019.

  24. arXiv:1912.00991

    cs.CL

    Morphological Tagging and Lemmatization of Albanian: A Manually Annotated Corpus and Neural Models

    Authors: Nelda Kote, Marenglen Biba, Jenna Kanerva, Samuel Rönnqvist, Filip Ginter

    Abstract: In this paper, we present the first publicly available part-of-speech and morphologically tagged corpus for the Albanian language, as well as a neural morphological tagger and lemmatizer trained on it. There is currently a lack of available NLP resources for Albanian, and its complex grammar and morphology present challenges to their development. We have created an Albanian part-of-speech corpus b…

    Submitted 2 December, 2019; originally announced December 2019.

  25. arXiv:1910.03806

    cs.CL cs.LG

    Is Multilingual BERT Fluent in Language Generation?

    Authors: Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, Filip Ginter

    Abstract: The multilingual BERT model is trained on 104 languages and meant to serve as a universal language model and tool for encoding sentences. We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a nat…

    Submitted 9 October, 2019; originally announced October 2019.

    Journal ref: In proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing (2019)

  26. arXiv:1910.01863

    cs.CL

    Template-free Data-to-Text Generation of Finnish Sports News

    Authors: Jenna Kanerva, Samuel Rönnqvist, Riina Kekki, Tapio Salakoski, Filip Ginter

    Abstract: News articles such as sports game reports are often thought to closely follow the underlying game statistics, but in practice they contain a notable amount of background knowledge, interpretation, insight into the game, and quotes that are not present in the official statistics. This poses a challenge for automated data-to-text news generation with real-world news corpora as training data. We repo…

    Submitted 4 October, 2019; originally announced October 2019.

    Comments: NoDaLiDa 2019 (https://www.aclweb.org/anthology/W19-6125/)

  27. arXiv:1906.10907

    cs.CL

    Leveraging Text Repetitions and Denoising Autoencoders in OCR Post-correction

    Authors: Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, Filip Ginter

    Abstract: A common approach for improving OCR quality is a post-processing step based on models correcting misdetected characters and tokens. These models are typically trained on aligned pairs of OCR read text and their manually corrected counterparts. In this paper we show that the requirement of manually corrected training data can be alleviated by estimating the OCR errors from repeating text spans foun…

    Submitted 26 June, 2019; originally announced June 2019.

  28. arXiv:1902.00972

    cs.CL

    Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks

    Authors: Jenna Kanerva, Filip Ginter, Tapio Salakoski

    Abstract: In this paper we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window co…

    Submitted 15 April, 2020; v1 submitted 3 February, 2019; originally announced February 2019.

    Comments: Accepted to the Journal of Natural Language Engineering

  29. An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)

    Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our a…

    Submitted 2 January, 2016; originally announced January 2016.

    Comments: Submitted to Genome Biology