Showing 1–8 of 8 results for author: Loftsson, H

Searching in archive cs.
  1. arXiv:2504.18180  [pdf, other]

    cs.CL cs.AI cs.LG

    Aligning Language Models for Icelandic Legal Text Summarization

    Authors: Þórir Hrafn Harðarson, Hrafn Loftsson, Stefán Ólafsson

    Abstract: The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Hu…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Published at NoDaLiDa 2025

    Journal ref: Proceedings of the 25th Nordic Conference on Computational Linguistics (NoDaLiDa 2025). Tallinn, Estonia

  2. arXiv:2311.08982  [pdf, other]

    cs.CL

    SentAlign: Accurate and Scalable Sentence Alignment

    Authors: Steinþór Steingrímsson, Hrafn Loftsson, Andy Way

    Abstract: We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE b…

    Submitted 15 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023 System Demonstration paper

  3. arXiv:2206.05014  [pdf, other]

    cs.CL

    Building an Icelandic Entity Linking Corpus

    Authors: Steinunn Rut Friðriksdóttir, Valdimar Ágúst Eggertsson, Benedikt Geir Jóhannesson, Hjalti Daníelsson, Hrafn Loftsson, Hafsteinn Einarsson

    Abstract: In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach of using a multilingual entity linking model (mGENRE) in combination with Wikipedia API Search (WAPIS) to label our data and compare it to an approach using WAPIS only. We find that our combined method reaches 53.9% coverage on our corpus, compared to 30.9% using only WAPIS. We analyze our results and…

    Submitted 10 June, 2022; originally announced June 2022.

    Comments: 9 pages, 5 figures, submitted to Dataset Creation for Lower-Resourced Languages, an LREC 2022 Workshop, 9am-1pm June 24th, 2022

  4. arXiv:2205.10088  [pdf, other]

    cs.CL cs.LG stat.ML

    Semi-self-supervised Automated ICD Coding

    Authors: Hlynur D. Hlynsson, Steindór Ellertsson, Jón F. Daðason, Emil L. Sigurdsson, Hrafn Loftsson

    Abstract: Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of…

    Submitted 18 August, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

    Comments: Re-upload comment: added a baseline comparison as well as an analysis of the features

  5. arXiv:2004.07776  [pdf, other]

    cs.CL

    Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic

    Authors: Jón Friðrik Daðason, David Erik Mollberg, Hrafn Loftsson, Kristín Bjarnadóttir

    Abstract: In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affects the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our…

    Submitted 16 April, 2020; originally announced April 2020.

    Comments: Accepted at LREC 2020

  6. arXiv:2003.09244  [pdf, ps, other]

    cs.CL

    Language Technology Programme for Icelandic 2019-2023

    Authors: Anna Björk Nikulásdóttir, Jón Guðnason, Anton Karl Ingason, Hrafn Loftsson, Eiríkur Rögnvaldsson, Einar Freyr Sigurðsson, Steinþór Steingrímsson

    Abstract: In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities,…

    Submitted 20 March, 2020; originally announced March 2020.

    Comments: Accepted at LREC 2020

  7. arXiv:1907.11907  [pdf, ps, other]

    cs.CL

    Nefnir: A high accuracy lemmatizer for Icelandic

    Authors: Svanhvít Lilja Ingólfsdóttir, Hrafn Loftsson, Jón Friðrik Daðason, Kristín Bjarnadóttir

    Abstract: Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for…

    Submitted 27 July, 2019; originally announced July 2019.

    Comments: Presented at NoDaLiDa 2019, Turku, Finland

  8. arXiv:1907.09038  [pdf, other]

    cs.CL cs.LG

    Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step

    Authors: Steinþór Steingrímsson, Örvar Kárason, Hrafn Loftsson

    Abstract: Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any previously published tagger not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we…

    Submitted 21 July, 2019; originally announced July 2019.

    Comments: Accepted by RANLP 2019