
Showing 1–6 of 6 results for author: Kostiuk, Y

  1. arXiv:2501.09164  [pdf, other]

    cs.CL cs.AI

    The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching

    Authors: Yevhen Kostiuk, Oxana Vitman, Łukasz Gagała, Artur Kiulian

    Abstract: In this work, we address the challenge of evaluating large language models (LLMs) on the short answer matching task for the Latvian and Lithuanian languages. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each question-answer pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small b…

    Submitted 15 January, 2025; originally announced January 2025.

  2. arXiv:2501.09154  [pdf, other]

    cs.CL cs.AI

    Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History

    Authors: Yevhen Kostiuk, Oxana Vitman, Łukasz Gagała, Artur Kiulian

    Abstract: In this work, we evaluated Lithuanian and general history knowledge of multilingual Large Language Models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess the knowledge sharing from culturally and historically connec…

    Submitted 15 January, 2025; originally announced January 2025.

  3. arXiv:2410.18836  [pdf, other]

    cs.CL cs.AI

    From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

    Authors: Artur Kiulian, Anton Polishko, Mykola Khandoga, Yevhen Kostiuk, Guillermo Gabrielli, Łukasz Gagała, Fadi Zaraket, Qusai Abu Obaida, Hrishikesh Garud, Wendy Wing Yee Mak, Dmytro Chaplynskyi, Selma Belhadj Amor, Grigol Peradze

    Abstract: In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our ap…

    Submitted 24 October, 2024; originally announced October 2024.

  4. arXiv:2311.04189  [pdf]

    cs.CL

    SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish

    Authors: Yevhen Kostiuk, Grigori Sidorov, Olga Kolesnikova

    Abstract: In natural language processing (NLP), a lexical function is a concept, first introduced in Meaning-Text Theory, for unambiguously representing semantic and syntactic features of words and phrases in text. Hierarchical classification of lexical functions involves organizing these features into a tree-like hierarchy of categories or labels. This is a challenging task as it requires a good understanding of…

    Submitted 7 November, 2023; originally announced November 2023.

  5. arXiv:2306.01261  [pdf, other]

    cs.CL

    Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts

    Authors: Yevhen Kostiuk, Atnafu Lambebo Tonja, Grigori Sidorov, Olga Kolesnikova

    Abstract: In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research in the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this st…

    Submitted 2 June, 2023; originally announced June 2023.

  6. arXiv:2211.13014  [pdf, other]

    cs.CL cs.LG

    Sarcasm Detection Framework Using Context, Emotion and Sentiment Features

    Authors: Oxana Vitman, Yevhen Kostiuk, Grigori Sidorov, Alexander Gelbukh

    Abstract: Sarcasm detection is an essential task that can help identify the actual sentiment in user-generated data, such as discussion forums or tweets. Sarcasm is a sophisticated form of linguistic expression because its surface meaning usually contradicts its inner, deeper meaning. Such incongruity is the essential component of sarcasm; however, it makes sarcasm detection quite a challenging task. In thi…

    Submitted 4 January, 2023; v1 submitted 23 November, 2022; originally announced November 2022.