Showing 1–28 of 28 results for author: Srba, I

  1. arXiv:2505.10740

    cs.CL cs.IR

    SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval

    Authors: Qiwei Peng, Robert Moro, Michal Gregor, Ivan Srba, Simon Ostermann, Marian Simko, Juraj Podroužek, Matúš Mesarčík, Jaroslav Kopčan, Anders Søgaard

    Abstract: The rapid spread of online disinformation presents a global challenge, and machine learning has been widely explored as a potential solution. However, multilingual settings and low-resource languages are often neglected in this field. To address this gap, we conducted a shared task on multilingual claim retrieval at SemEval 2025, aimed at identifying fact-checked claims that match newly encountere…

    Submitted 15 May, 2025; originally announced May 2025.

  2. Revisiting Algorithmic Audits of TikTok: Poor Reproducibility and Short-term Validity of Findings

    Authors: Matej Mosnar, Adam Skurla, Branislav Pecher, Matus Tibensky, Jan Jakubcik, Adrian Bindas, Peter Sakalik, Ivan Srba

    Abstract: Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits as this shift leads to enclosing users in filter bubbles and leading them to more problematic content. An important aspect of such audits is the reproducibility and generali…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: ACM SIGIR 2025. 10 pages

  3. arXiv:2503.23242

    cs.CL cs.AI

    Beyond speculation: Measuring the growing presence of LLM-generated texts in multilingual disinformation

    Authors: Dominik Macko, Aashish Anantha Ramakrishnan, Jason Samuel Lucas, Robert Moro, Ivan Srba, Adaku Uchendu, Dongwon Lee

    Abstract: Increased sophistication of large language models (LLMs) and the consequent quality of generated multilingual text raises concerns about potential disinformation misuse. While humans struggle to distinguish LLM-generated content from human-written texts, the scholarly debate about their impact remains divided. Some argue that heightened fears are overblown due to natural ecosystem limitations, whi…

    Submitted 29 March, 2025; originally announced March 2025.

  4. arXiv:2503.15128

    cs.CL cs.AI

    Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors

    Authors: Dominik Macko, Robert Moro, Ivan Srba

    Abstract: Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means…

    Submitted 19 March, 2025; originally announced March 2025.

  5. arXiv:2412.13666

    cs.CL cs.AI cs.CY

    Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

    Authors: Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarina Marcincinova, Matus Mesarcik

    Abstract: The capabilities of recent large language models (LLMs) to generate high-quality content indistinguishable by humans from human-written texts raise many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused for generating disinformation news articles following predefined narratives. Their capabilities to generate personalized (in various aspects) content…

    Submitted 18 December, 2024; originally announced December 2024.

  6. arXiv:2410.21360

    cs.CL

    A Survey on Automatic Credibility Assessment of Textual Credibility Signals in the Era of Large Language Models

    Authors: Ivan Srba, Olesya Razuvayevskaya, João A. Leite, Robert Moro, Ipek Baris Schlicht, Sara Tonelli, Francisco Moreno García, Santiago Barrio Lottmann, Denis Teyssou, Valentin Porcellini, Carolina Scarton, Kalina Bontcheva, Maria Bielikova

    Abstract: In the current era of social media and generative AI, an ability to automatically assess the credibility of online social media content is of tremendous importance. Credibility assessment is fundamentally based on aggregating credibility signals, which refer to small units of information, such as content factuality, bias, or a presence of persuasion techniques, into an overall credibility score. C…

    Submitted 28 October, 2024; originally announced October 2024.

  7. arXiv:2410.10756

    cs.CL

    Use Random Selection for Now: Investigation of Few-Shot Selection Strategies in LLM-based Text Augmentation for Classification

    Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky

    Abstract: The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage the few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the samples are mostly selected randomly and a compreh…

    Submitted 14 October, 2024; originally announced October 2024.

  8. arXiv:2408.01119

    cs.CL

    Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

    Authors: Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova

    Abstract: Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar a…

    Submitted 3 July, 2025; v1 submitted 2 August, 2024; originally announced August 2024.

  9. arXiv:2406.12549

    cs.CL cs.AI

    MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts

    Authors: Dominik Macko, Jakub Kopal, Robert Moro, Ivan Srba

    Abstract: Recent LLMs are able to generate high-quality multilingual texts, indistinguishable for humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical erro…

    Submitted 18 June, 2024; originally announced June 2024.

  10. Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation

    Authors: Branislav Pecher, Jan Cegin, Robert Belanec, Jakub Simko, Ivan Srba, Maria Bielikova

    Abstract: While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational…

    Submitted 3 October, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to the Findings of the EMNLP'24 Conference

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2024

  11. arXiv:2402.12819

    cs.CL cs.AI cs.LG

    Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

    Authors: Branislav Pecher, Ivan Srba, Maria Bielikova

    Abstract: When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question -- how many labelled samples are required for the specialised small models to outperform general large models, while taking the performa…

    Submitted 19 May, 2025; v1 submitted 20 February, 2024; originally announced February 2024.

  12. On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices

    Authors: Branislav Pecher, Ivan Srba, Maria Bielikova

    Abstract: While learning with limited labelled data can improve performance when the labels are lacking, it is also sensitive to the effects of uncontrolled randomness introduced by so-called randomness factors (e.g., varying order of data). We propose a method to systematically investigate the effects of randomness factors while taking the interactions between them into consideration. To measure the true e…

    Submitted 3 October, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to the EMNLP'24 Main Conference

    Journal ref: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

  13. arXiv:2402.03038

    cs.LG cs.AI cs.CL

    Automatic Combination of Sample Selection Strategies for Few-Shot Learning

    Authors: Branislav Pecher, Ivan Srba, Maria Bielikova, Joaquin Vanschoren

    Abstract: In few-shot learning, such as meta-learning, few-shot fine-tuning or in-context learning, the limited number of samples used to train a model have a significant impact on the overall success. Although a large number of sample selection strategies exist, their impact on the performance of few-shot learning is not extensively known, as most of them have been so far evaluated in typical supervised se…

    Submitted 5 February, 2024; originally announced February 2024.

  14. Authorship Obfuscation in Multilingual Machine-Generated Text Detection

    Authors: Dominik Macko, Robert Moro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, Maria Bielikova

    Abstract: High-quality text generation capability of recent Large Language Models (LLMs) causes concerns about their misuse (e.g., in massive generation/spread of disinformation). Machine-generated text (MGT) detection is important to cope with such threats. However, it is susceptible to authorship obfuscation (AO) methods, such as paraphrasing, which can cause MGTs to evade detection. So far, this was eval…

    Submitted 4 October, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted to EMNLP 2024 Findings

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2024

  15. Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation

    Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky

    Abstract: The latest generative large language models (LLMs) have found their application in data augmentation tasks, where small numbers of text samples are LLM-paraphrased and then used to fine-tune downstream models. However, more research is needed to assess how different prompts, seed data selection strategies, filtering methods, or model settings affect the quality of paraphrased data (and downstream…

    Submitted 18 August, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: ACL'24 version, 24 pages

    Journal ref: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024

  16. arXiv:2312.01082

    cs.LG cs.AI cs.CL

    A Survey on Stability of Learning with Limited Labelled Data and its Sensitivity to the Effects of Randomness

    Authors: Branislav Pecher, Ivan Srba, Maria Bielikova

    Abstract: Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-learning or few-shot learning, aims to effectively train a model using only a small amount of labelled samples. However, these approaches have been observed to be excessively sensitive to the effects of uncontrolled randomness caused by non-determinism in the training process. The randomness negatively a…

    Submitted 3 September, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted to ACM Comput. Surv. 2024

    Journal ref: ACM Computing Surveys, Volume 57, Issue 1, 2024

  17. Disinformation Capabilities of Large Language Models

    Authors: Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, Maria Bielikova

    Abstract: Automated disinformation generation is often listed as an important risk associated with large language models (LLMs). The theoretical ability to flood the information space with disinformation content might have dramatic consequences for societies around the world. This paper presents a comprehensive study of the disinformation capabilities of the current generation of LLMs to generate false news…

    Submitted 23 February, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Journal ref: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024

  18. arXiv:2311.08374

    cs.CL

    A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts

    Authors: Nafis Irtiza Tripto, Saranya Venkatraman, Dominik Macko, Robert Moro, Ivan Srba, Adaku Uchendu, Thai Le, Dongwon Lee

    Abstract: In the realm of text manipulation and linguistic transformation, the question of authorship has been a subject of fascination and philosophical inquiry. Much like the Ship of Theseus paradox, which ponders whether a ship remains the same when each of its original planks is replaced, our research delves into an intriguing question: Does a text retain its original authorship when it undergoes numero…

    Submitted 6 June, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: To appear in Association for Computational Linguistics (ACL 2024)

  19. arXiv:2311.06121

    cs.CL

    Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection

    Authors: Martin Hyben, Sebastian Kula, Ivan Srba, Robert Moro, Jakub Simko

    Abstract: This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection. For the purpose of the comparison we composed a multilingual and multi-topical dataset comprising texts of various sources and styles. Building on this, we performed a benchmark analysis to determine the most general multilingual and multi-topical claim d…

    Submitted 11 October, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

    Comments: 21 pages, 10 figures, 18 tables

  20. MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

    Authors: Dominik Macko, Robert Moro, Adaku Uchendu, Jason Samuel Lucas, Michiharu Yamashita, Matúš Pikuliak, Ivan Srba, Thai Le, Dongwon Lee, Jakub Simko, Maria Bielikova

    Abstract: There is a lack of research into capabilities of recent LLMs to generate convincing text in languages other than English and into performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE,…

    Submitted 20 October, 2023; originally announced October 2023.

    Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  21. Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

    Authors: Olesya Razuvayevskaya, Ben Wu, Joao A. Leite, Freddy Heppell, Ivan Srba, Carolina Scarton, Kalina Bontcheva, Xingyi Song

    Abstract: Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence the classification performance and computation…

    Submitted 8 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Journal ref: PLOS ONE 2024

  22. Multilingual Previously Fact-Checked Claim Retrieval

    Authors: Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melisek, Ivan Vykopal, Jakub Simko, Juraj Podrouzek, Maria Bielikova

    Abstract: Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper introduces a new multilingual dataset -- MultiClaim -- for previously fact-checked claim retrieval. We collected 28k posts in 27 languages from social media, 206k fact-checks in 39 l…

    Submitted 13 October, 2023; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023

    Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

  23. KInITVeraAI at SemEval-2023 Task 3: Simple yet Powerful Multilingual Fine-Tuning for Persuasion Techniques Detection

    Authors: Timo Hromadka, Timotej Smolen, Tomas Remis, Branislav Pecher, Ivan Srba

    Abstract: This paper presents the best-performing solution to the SemEval 2023 Task 3 on the subtask 3 dedicated to persuasion techniques detection. Due to a high multilingual character of the input data and a large number of 23 predicted labels (causing a lack of labelled data for some language-label combinations), we opted for fine-tuning pre-trained transformer-based language models. Conducting multiple…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: System paper within SemEval 2023 Task 3 on the subtask 3

    Journal ref: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

  24. arXiv:2211.12143

    cs.CY cs.AI cs.HC

    Autonomation, not Automation: Activities and Needs of Fact-checkers as a Basis for Designing Human-Centered AI Systems

    Authors: Andrea Hrckova, Robert Moro, Ivan Srba, Jakub Simko, Maria Bielikova

    Abstract: To mitigate the negative effects of false information more effectively, the development of Artificial Intelligence (AI) systems assisting fact-checkers is needed. Nevertheless, the lack of focus on the needs of these stakeholders results in their limited acceptance and skepticism toward automating the whole fact-checking process. In this study, we conducted semi-structured in-depth interviews with…

    Submitted 13 August, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: 37 pages, 14 figures, 2 annexes

  25. arXiv:2210.10085

    cs.IR cs.LG cs.SI

    Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles

    Authors: Ivan Srba, Robert Moro, Matus Tomlein, Branislav Pecher, Jakub Simko, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, Adrian Gavornik, Maria Bielikova

    Abstract: In this paper, we present results of an auditing study performed over YouTube aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to "burst the bubble", i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve into misinformation filter bubbles by watching misi…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Just accepted to ACM Transactions on Recommender Systems (ACM TORS). arXiv admin note: substantial text overlap with arXiv:2203.13769

    Journal ref: ACM Transactions on Recommender Systems. 1, 1, Article 6 (March 2023), 33 pages

  26. arXiv:2204.12294

    cs.CL cs.CY cs.IR cs.LG

    Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims

    Authors: Ivan Srba, Branislav Pecher, Matus Tomlein, Robert Moro, Elena Stefancova, Jakub Simko, Maria Bielikova

    Abstract: False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blogs and 3.5k fact-checked claims. It also contains 5…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 11 pages, 4 figures, SIGIR 2022 Resource paper track

    Journal ref: ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)

  27. An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes

    Authors: Matus Tomlein, Branislav Pecher, Jakub Simko, Ivan Srba, Robert Moro, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, Maria Bielikova

    Abstract: The negative effects of misinformation filter bubbles in adaptive systems have been known to researchers for some time. Several studies investigated, most prominently on YouTube, how fast a user can get into a misinformation filter bubble simply by selecting wrong choices from the items offered. Yet, no studies so far have investigated what it takes to burst the bubble, i.e., revert the bubble enc…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: RecSys '21: Fifteenth ACM Conference on Recommender Systems

    Journal ref: RecSys '21: Fifteenth ACM Conference on Recommender Systems, 2021

  28. Addressing Hate Speech with Data Science: An Overview from Computer Science Perspective

    Authors: Ivan Srba, Gabriele Lenzini, Matus Pikuliak, Samuel Pecar

    Abstract: From a computer science perspective, addressing on-line hate speech is a challenging task that is attracting the attention of both industry (mainly social media platform owners) and academia. In this chapter, we provide an overview of state-of-the-art data-science approaches - how they define hate speech, which tasks they solve to mitigate the phenomenon, and how they address these tasks. We limit…

    Submitted 18 March, 2021; originally announced March 2021.

    Journal ref: Wachs S., Koch-Priewe B., Zick A. (eds) Hate Speech - Multidisziplinare Analysen und Handlungsoptionen. Springer VS, Wiesbaden. 2021