Showing 1–23 of 23 results for author: Simko, J

  1. arXiv:2506.08564

    cs.CL cs.SD eess.AS

    Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?

    Authors: Tuukka Törö, Antti Suni, Juraj Šimko

    Abstract: Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on e…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 27 pages, 11 figures (+5 supplementary), submitted to PLOS One

  2. arXiv:2504.20668

    cs.CL

    A Generative-AI-Driven Claim Retrieval System Capable of Detecting and Retrieving Claims from Social Media Platforms in Multiple Languages

    Authors: Ivan Vykopal, Martin Hyben, Robert Moro, Michal Gregor, Jakub Simko

    Abstract: Online disinformation poses a global challenge, placing significant demands on fact-checkers who must verify claims efficiently to prevent the spread of false information. A major issue in this process is the redundant verification of already fact-checked claims, which increases workload and delays responses to newly emerging claims. This research introduces an approach that retrieves previously f…

    Submitted 29 April, 2025; originally announced April 2025.

  3. arXiv:2501.09556

    cs.LG

    Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

    Authors: Jakub Kopal, Michal Gregor, Santiago de Leon-Martinez, Jakub Simko

    Abstract: Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at current model weights before taking a step and updating the model. Rather than calculating gradient at the current model weights, Overshoot ca…

    Submitted 16 January, 2025; originally announced January 2025.

  4. arXiv:2410.10756

    cs.CL

    Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification

    Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky

    Abstract: The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage the few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly and a compreh…

    Submitted 14 October, 2024; originally announced October 2024.

  5. arXiv:2408.16502

    cs.CL

    LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

    Authors: Jan Cegin, Jakub Simko, Peter Brusilovsky

    Abstract: The generative large language models (LLMs) are increasingly being used for data augmentation tasks, where text samples are LLM-paraphrased and then used for classifier fine-tuning. However, a research that would confirm a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) is the LLM-based augmentation advantageous, we compare…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: 20 pages

  6. arXiv:2408.06847

    cs.CY cs.AI

    AI Research is not Magic, it has to be Reproducible and Responsible: Challenges in the AI field from the Perspective of its PhD Students

    Authors: Andrea Hrckova, Jennifer Renoux, Rafael Tolosana Calasanz, Daniela Chuda, Martin Tamajka, Jakub Simko

    Abstract: With the goal of uncovering the challenges faced by European AI students during their research endeavors, we surveyed 28 AI doctoral candidates from 13 European countries. The outcomes underscore challenges in three key areas: (1) the findability and quality of AI resources such as datasets, models, and experiments; (2) the difficulties in replicating the experiments in AI papers; (3) and the lack…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: 8 pages, 4 figures, 1 appendix (interview questions)

  7. Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

    Authors: Branislav Pecher, Jan Cegin, Robert Belanec, Jakub Simko, Ivan Srba, Maria Bielikova

    Abstract: While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational…

    Submitted 3 October, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to the Findings of the EMNLP'24 Conference

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2024

  8. Authorship Obfuscation in Multilingual Machine-Generated Text Detection

    Authors: Dominik Macko, Robert Moro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, Maria Bielikova

    Abstract: High-quality text generation capability of recent Large Language Models (LLMs) causes concerns about their misuse (e.g., in massive generation/spread of disinformation). Machine-generated text (MGT) detection is important to cope with such threats. However, it is susceptible to authorship obfuscation (AO) methods, such as paraphrasing, which can cause MGTs to evade detection. So far, this was eval…

    Submitted 4 October, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted to EMNLP 2024 Findings

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2024

  9. Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

    Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky

    Abstract: The latest generative large language models (LLMs) have found their application in data augmentation tasks, where small numbers of text samples are LLM-paraphrased and then used to fine-tune downstream models. However, more research is needed to assess how different prompts, seed data selection strategies, filtering methods, or model settings affect the quality of paraphrased data (and downstream…

    Submitted 18 August, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: ACL'24 version, 24 pages

    Journal ref: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024

  10. arXiv:2311.06121

    cs.CL

    Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection

    Authors: Martin Hyben, Sebastian Kula, Ivan Srba, Robert Moro, Jakub Simko

    Abstract: This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection. For the purpose of the comparison we composed a multilingual and multi-topical dataset comprising texts of various sources and styles. Building on this, we performed a benchmark analysis to determine the most general multilingual and multi-topical claim d…

    Submitted 11 October, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

    Comments: 21 pages, 10 figures, 18 tables

  11. MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

    Authors: Dominik Macko, Robert Moro, Adaku Uchendu, Jason Samuel Lucas, Michiharu Yamashita, Matúš Pikuliak, Ivan Srba, Thai Le, Dongwon Lee, Jakub Simko, Maria Bielikova

    Abstract: There is a lack of research into capabilities of recent LLMs to generate convincing text in languages other than English and into performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE,…

    Submitted 20 October, 2023; originally announced October 2023.

    Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  12. arXiv:2306.09814

    eess.AS cs.CL

    Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody

    Authors: Sofoklis Kakouros, Juraj Šimko, Martti Vainio, Antti Suni

    Abstract: This paper investigates the use of word surprisal, a measure of the predictability of a word in a given context, as a feature to aid speech synthesis prosody. We explore how word surprisal extracted from large language models (LLMs) correlates with word prominence, a signal-based measure of the salience of a word in a given discourse. We also examine how context length and LLM size affect the resu…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted at SSW 2023

  13. arXiv:2305.16040

    eess.AS

    The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech

    Authors: Martti Vainio, Antti Suni, Juraj Šimko, Sofoklis Kakouros

    Abstract: Parliamentary recordings provide a rich source of data for studying how politicians use speech to convey their messages and influence their audience. This provides a unique context for studying how politicians use speech, especially prosody, to achieve their goals. Here we analyzed a corpus of parliamentary speeches in the Finnish parliament between the years 2008-2020 and highlight methodological…

    Submitted 25 May, 2023; originally announced May 2023.

  14. arXiv:2305.12947

    cs.CL

    ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness

    Authors: Jan Cegin, Jakub Simko, Peter Brusilovsky

    Abstract: The emergence of generative large language models (LLMs) raises the question: what will be its impact on crowdsourcing? Traditionally, crowdsourcing has been used for acquiring solutions to a wide variety of human-intelligence tasks, including ones involving text generation, modification or evaluation. For some of these tasks, models like ChatGPT can potentially substitute human workers. In this s…

    Submitted 19 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Long paper accepted to EMNLP 2023 conference main track, 17 pages, 9 figures

  15. Multilingual Previously Fact-Checked Claim Retrieval

    Authors: Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melisek, Ivan Vykopal, Jakub Simko, Juraj Podrouzek, Maria Bielikova

    Abstract: Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper introduces a new multilingual dataset -- MultiClaim -- for previously fact-checked claim retrieval. We collected 28k posts in 27 languages from social media, 206k fact-checks in 39 l…

    Submitted 13 October, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023

    Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  16. arXiv:2212.06336

    eess.IV cs.CV cs.LG q-bio.TO

    Mixed Supervision of Histopathology Improves Prostate Cancer Classification from MRI

    Authors: Abhejit Rajagopal, Antonio C. Westphalen, Nathan Velarde, Tim Ullrich, Jeffry P. Simko, Hao Nguyen, Thomas A. Hope, Peder E. Z. Larson, Kirti Magudia

    Abstract: Non-invasive prostate cancer detection from MRI has the potential to revolutionize patient care by providing early detection of clinically-significant disease (ISUP grade group >= 2), but has thus far shown limited positive predictive value. To address this, we present an MRI-based deep learning method for predicting clinically significant prostate cancer applicable to a patient population with su…

    Submitted 12 December, 2022; originally announced December 2022.

  17. arXiv:2211.12143

    cs.CY cs.AI cs.HC

    Autonomation, not Automation: Activities and Needs of Fact-checkers as a Basis for Designing Human-Centered AI Systems

    Authors: Andrea Hrckova, Robert Moro, Ivan Srba, Jakub Simko, Maria Bielikova

    Abstract: To mitigate the negative effects of false information more effectively, the development of Artificial Intelligence (AI) systems assisting fact-checkers is needed. Nevertheless, the lack of focus on the needs of these stakeholders results in their limited acceptance and skepticism toward automating the whole fact-checking process. In this study, we conducted semi-structured in-depth interviews with…

    Submitted 13 August, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: 37 pages, 14 figures, 2 annexes

  18. arXiv:2210.10085

    cs.IR cs.LG cs.SI

    Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles

    Authors: Ivan Srba, Robert Moro, Matus Tomlein, Branislav Pecher, Jakub Simko, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, Adrian Gavornik, Maria Bielikova

    Abstract: In this paper, we present results of an auditing study performed over YouTube aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to "burst the bubble", i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve into misinformation filter bubbles by watching misi…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Just accepted to ACM Transactions on Recommender Systems (ACM TORS). arXiv admin note: substantial text overlap with arXiv:2203.13769

    Journal ref: ACM Transactions on Recommender Systems. 1, 1, Article 6 (March 2023), 33 pages

  19. arXiv:2204.12294

    cs.CL cs.CY cs.IR cs.LG

    Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims

    Authors: Ivan Srba, Branislav Pecher, Matus Tomlein, Robert Moro, Elena Stefancova, Jakub Simko, Maria Bielikova

    Abstract: False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 5…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 11 pages, 4 figures, SIGIR 2022 Resource paper track

    Journal ref: ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)

  20. An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes

    Authors: Matus Tomlein, Branislav Pecher, Jakub Simko, Ivan Srba, Robert Moro, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, Maria Bielikova

    Abstract: The negative effects of misinformation filter bubbles in adaptive systems have been known to researchers for some time. Several studies investigated, most prominently on YouTube, how fast a user can get into a misinformation filter bubble simply by selecting wrong choices from the items offered. Yet, no studies so far have investigated what it takes to burst the bubble, i.e., revert the bubble enc…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: RecSys '21: Fifteenth ACM Conference on Recommender System

    Journal ref: RecSys '21: Fifteenth ACM Conference on Recommender Systems, 2021

  21. arXiv:2109.12523

    cs.HC cs.CY cs.LG cs.SI

    A Study of Fake News Reading and Annotating in Social Media Context

    Authors: Jakub Simko, Patrik Racsko, Matus Tomlein, Martin Hanakova, Robert Moro, Maria Bielikova

    Abstract: The online spreading of fake news is a major issue threatening entire societies. Much of this spreading is enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to answer this by characterizing the fake news and devising automated methods for detecting them. The detection methods had so far only limited success, mostly due to th…

    Submitted 26 April, 2022; v1 submitted 26 September, 2021; originally announced September 2021.

    ACM Class: H.5.2; H.5.4; K.4.2; H.3.1

    Journal ref: New Review of Hypermedia and Multimedia. pages 1-31 (2021)

  22. Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

    Authors: Antti Suni, Sofoklis Kakouros, Martti Vainio, Juraj Šimko

    Abstract: Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variat…

    Submitted 29 June, 2020; originally announced June 2020.

  23. Dialect Identification of Spoken North Sámi Language Varieties Using Prosodic Features

    Authors: Sofoklis Kakouros, Katri Hiovain, Martti Vainio, Juraj Šimko

    Abstract: This work explores the application of various supervised classification approaches using prosodic information for the identification of spoken North Sámi language varieties. Dialects are language varieties that enclose characteristics specific for a given region or community. These characteristics reflect segmental and suprasegmental (prosodic) differences but also high-level properties such as le…

    Submitted 23 March, 2020; originally announced March 2020.