-
ClonEval: An Open Voice Cloning Benchmark
Authors:
Iwona Christop,
Tomasz Kuczyński,
Marek Kubis
Abstract:
We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and presents a detailed description of the evaluation procedure. The usage of the software library is explained, along with the organization of results on the leaderboard.
Submitted 28 May, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
LLMzSzŁ: a comprehensive LLM benchmark for Polish
Authors:
Krzysztof Jassem,
Michał Ciesiółka,
Filip Graliński,
Piotr Jabłoński,
Jakub Pokrywka,
Marek Kubis,
Monika Jabłońska,
Ryszard Staruch
Abstract:
This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains. Altogether, it consists of almost 19k closed-ended questions. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify LLMs' abilities to transfer knowledge between languages. Also, the correlation between LLMs and humans at model accuracy and exam pass rate levels is examined. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.
Submitted 4 January, 2025;
originally announced January 2025.
-
Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment
Authors:
Łukasz Grzybowski,
Jakub Pokrywka,
Michał Ciesiółka,
Jeremi I. Kaczmarek,
Marek Kubis
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.
Submitted 30 November, 2024;
originally announced December 2024.
-
POLygraph: Polish Fake News Dataset
Authors:
Daniel Dzienisiewicz,
Filip Graliński,
Piotr Jabłoński,
Marek Kubis,
Paweł Skórzewski,
Piotr Wierzchoń
Abstract:
This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.
Submitted 1 July, 2024;
originally announced July 2024.
-
Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech
Authors:
Mateusz Czyżnikiewicz,
Łukasz Bondaruk,
Jakub Kubiak,
Adam Wiącek,
Łukasz Degórski,
Marek Kubis,
Paweł Skórzewski
Abstract:
In this paper, we study the impact of augmenting spoken language corpora with domain-specific synthetic samples for the purpose of training a speech recognition system. Using both a conventional neural TTS system and a zero-shot one with voice cloning ability, we generate speech corpora that vary in the number of voices. We compare speech recognition models trained with the addition of different amounts of synthetic data generated using these two methods with a baseline model trained solely on voice recordings. We show that while the quality of the voice-cloned dataset is lower, its larger number of voices makes it much more effective than the one with only a few voices synthesized with a conventional neural TTS system. Furthermore, our experiments indicate that using low-variability synthetic speech quickly leads to saturation in ASR quality, whereas high-variability speech provides improvement even when increasing the total amount of training data by 30%.
Submitted 29 July, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Two Approaches to Diachronic Normalization of Polish Texts
Authors:
Kacper Dudzic,
Filip Graliński,
Krzysztof Jassem,
Marek Kubis,
Piotr Wierzchoń
Abstract:
This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization solutions. A quantitative and qualitative analysis is made. It is shown that at the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on 3 out of 4 variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.
Submitted 2 February, 2024;
originally announced February 2024.
-
Back Transcription as a Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors
Authors:
Marek Kubis,
Paweł Skórzewski,
Marcin Sowański,
Tomasz Ziętkiewicz
Abstract:
In a spoken dialogue system, an NLU model is preceded by a speech recognition system that can deteriorate the performance of natural language understanding. This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. The proposed method combines the back transcription procedure with a fine-grained technique for categorizing the errors that affect the performance of NLU models. The method relies on the usage of synthesized speech for NLU evaluation. We show that the use of synthesized speech in place of audio recording does not change the outcomes of the presented technique in a significant way.
Submitted 25 October, 2023;
originally announced October 2023.
-
Open Challenge for Correcting Errors of Speech Recognition Systems
Authors:
Marek Kubis,
Zygmunt Vetulani,
Mikołaj Wypych,
Tomasz Ziętkiewicz
Abstract:
The paper announces a new long-term challenge for improving the performance of automatic speech recognition systems. The goal of the challenge is to investigate methods of correcting recognition results on the basis of errors previously made by the speech processing system. The dataset prepared for the task is described and the evaluation criteria are presented.
Submitted 9 January, 2020;
originally announced January 2020.
-
Application of flash method in the measurements of interfacial thermal resistance in layered and particulate composite materials
Authors:
Karol Pietrak,
Tomasz S. Wiśniewski,
Michał Kubiś
Abstract:
The presented study concerns the possibility of evaluating interfacial thermal resistance (ITR) between the constituents of composite materials using the flash technique. Two variants of such a measurement are considered. The first is the measurement of ITR between two bonded layers of different materials, which has been studied before by various researchers. The second is aimed at determining ITR in particulate composites with low and moderate filler content based on their effective thermal conductivity. A method for such a measurement is proposed and tested on two cases of particle-filled polymer composites. Positive verification results were obtained for a polymer/glass composite in which the difference between the thermal conductivities of the matrix and the filler is low. For a polymer filled with aluminum particles, the evaluation of the average ITR in the samples was impossible, as the effective medium models applied in the method strongly underestimated the thermal conductivity of that material. The investigation confirmed the need for more accurate methods of predicting macroscopic thermal properties of composite media with a high contrast between the thermal conductivities of the constituents. An extended literature study suggests that the method can be applicable to selected classes of engineering materials.
Submitted 16 May, 2017;
originally announced May 2017.