-
A Hypernetwork-Based Approach to KAN Representation of Audio Signals
Authors:
Patryk Marszałek,
Maciej Rut,
Piotr Kawa,
Przemysław Spurek,
Piotr Syga
Abstract:
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, ac…
▽ More
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
△ Less
Submitted 6 June, 2025; v1 submitted 4 March, 2025;
originally announced March 2025.
-
Counteracting temporal attacks in Video Copy Detection
Authors:
Katarzyna Fojcik,
Piotr Syga
Abstract:
Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection…
▽ More
Video Copy Detection (VCD) plays a crucial role in copyright protection and content verification by identifying duplicates and near-duplicates in large-scale video databases. The META AI Challenge on video copy detection provided a benchmark for evaluating state-of-the-art methods, with the Dual-level detection approach emerging as a winning solution. This method integrates Video Editing Detection and Frame Scene Detection to handle adversarial transformations and large datasets efficiently. However, our analysis reveals significant limitations in the VED component, particularly in its ability to handle exact copies. Moreover, Dual-level detection shows vulnerability to temporal attacks. To address it, we propose an improved frame selection strategy based on local maxima of interframe differences, which enhances robustness against adversarial temporal modifications while significantly reducing computational overhead. Our method achieves an increase of 1.4 to 5.8 times in efficiency over the standard 1 FPS approach. Compared to Dual-level detection method, our approach maintains comparable micro-average precision ($μ$AP) while also demonstrating improved robustness against temporal attacks. Given 56\% reduced representation size and the inference time of more than 2 times faster, our approach is more suitable to real-world resource restriction.
△ Less
Submitted 19 January, 2025;
originally announced January 2025.
-
Are audio DeepFake detection models polyglots?
Authors:
Bartłomiej Marek,
Piotr Kawa,
Piotr Syga
Abstract:
Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as in…
▽ More
Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for the multilingual audio DF detection challenge by evaluating various adaptation strategies. Our experiments focus on analyzing models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variations in detection efficacy, highlighting the difficulties of multilingual settings. We show that limiting the dataset to English negatively impacts the efficacy, while stressing the importance of the data in the target language.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.
-
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Authors:
Nicolas M. Müller,
Piotr Kawa,
Wei Herng Choong,
Edresson Casanova,
Eren Gölge,
Thorsten Müller,
Piotr Syga,
Philip Sperl,
Konstantin Böttinger
Abstract:
Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models is inherently dependent on the quality and diversity of their training data. Presently, the availabl…
▽ More
Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models is inherently dependent on the quality and diversity of their training data. Presently, the available datasets are heavily skewed towards English and Chinese audio, which limits the global applicability of these anti-spoofing systems.
To address this limitation, this paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), created using 91 TTS models, comprising 42 different architectures, to generate 420.7 hours of synthetic voice in 38 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance over comparable datasets like InTheWild and Fake-Or-Real when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing MLAAD and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.
△ Less
Submitted 26 April, 2025; v1 submitted 17 January, 2024;
originally announced January 2024.
-
Improved DeepFake Detection Using Whisper Features
Authors:
Piotr Kawa,
Marcin Plata,
Michał Czuba,
Piotr Szymański,
Piotr Syga
Abstract:
With a recent influx of voice generation methods, the threat introduced by audio DeepFake (DF) is ever-increasing. Several different detection methods have been presented as a countermeasure. Many methods are based on so-called front-ends, which, by transforming the raw audio, emphasize features crucial for assessing the genuineness of the audio sample. Our contribution contains investigating the…
▽ More
With a recent influx of voice generation methods, the threat introduced by audio DeepFake (DF) is ever-increasing. Several different detection methods have been presented as a countermeasure. Many methods are based on so-called front-ends, which, by transforming the raw audio, emphasize features crucial for assessing the genuineness of the audio sample. Our contribution contains investigating the influence of the state-of-the-art Whisper automatic speech recognition model as a DF detection front-end. We compare various combinations of Whisper and well-established front-ends by training 3 detection models (LCNN, SpecRNet, and MesoNet) on a widely used ASVspoof 2021 DF dataset and later evaluating them on the DF In-The-Wild dataset. We show that using Whisper-based features improves the detection for each model and outperforms recent results on the In-The-Wild dataset by reducing Equal Error Rate by 21%.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
Defense Against Adversarial Attacks on Audio DeepFake Detection
Authors:
Piotr Kawa,
Marcin Plata,
Piotr Syga
Abstract:
Audio DeepFakes (DF) are artificially generated utterances created using deep learning, with the primary aim of fooling the listeners in a highly convincing manner. Their quality is sufficient to pose a severe threat in terms of security and privacy, including the reliability of news or defamation. Multiple neural network-based methods to detect generated speech have been proposed to prevent the t…
▽ More
Audio DeepFakes (DF) are artificially generated utterances created using deep learning, with the primary aim of fooling the listeners in a highly convincing manner. Their quality is sufficient to pose a severe threat in terms of security and privacy, including the reliability of news or defamation. Multiple neural network-based methods to detect generated speech have been proposed to prevent the threats. In this work, we cover the topic of adversarial attacks, which decrease the performance of detectors by adding superficial (difficult to spot by a human) changes to input data. Our contribution contains evaluating the robustness of 3 detection architectures against adversarial attacks in two scenarios (white-box and using transferability) and enhancing it later by using adversarial training performed by our novel adaptive training. Moreover, one of the investigated architectures is RawNet3, which, to the best of our knowledge, we adapted for the first time to DeepFake detection.
△ Less
Submitted 10 June, 2023; v1 submitted 30 December, 2022;
originally announced December 2022.
-
SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection
Authors:
Piotr Kawa,
Marcin Plata,
Piotr Syga
Abstract:
Audio DeepFakes are utterances generated with the use of deep neural networks. They are highly misleading and pose a threat due to use in fake news, impersonation, or extortion. In this work, we focus on increasing accessibility to the audio DeepFake detection methods by providing SpecRNet, a neural network architecture characterized by a quick inference time and low computational requirements. Ou…
▽ More
Audio DeepFakes are utterances generated with the use of deep neural networks. They are highly misleading and pose a threat due to use in fake news, impersonation, or extortion. In this work, we focus on increasing accessibility to the audio DeepFake detection methods by providing SpecRNet, a neural network architecture characterized by a quick inference time and low computational requirements. Our benchmark shows that SpecRNet, requiring up to about 40% less time to process an audio sample, provides performance comparable to LCNN architecture - one of the best audio DeepFake detection models. Such a method can not only be used by online multimedia services to verify a large bulk of content uploaded daily but also, thanks to its low requirements, by average citizens to evaluate materials on their devices. In addition, we provide benchmarks in three unique settings that confirm the correctness of our model. They reflect scenarios of low-resource datasets, detection on short utterances and limited attacks benchmark in which we take a closer look at the influence of particular attacks on given architectures.
△ Less
Submitted 12 October, 2022;
originally announced October 2022.
-
Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection
Authors:
Piotr Kawa,
Marcin Plata,
Piotr Syga
Abstract:
Audio DeepFakes allow the creation of high-quality, convincing utterances and therefore pose a threat due to its potential applications such as impersonation or fake news. Methods for detecting these manipulations should be characterized by good generalization and stability leading to robustness against attacks conducted with techniques that are not explicitly included in the training. In this wor…
▽ More
Audio DeepFakes allow the creation of high-quality, convincing utterances and therefore pose a threat due to its potential applications such as impersonation or fake news. Methods for detecting these manipulations should be characterized by good generalization and stability leading to robustness against attacks conducted with techniques that are not explicitly included in the training. In this work, we introduce Attack Agnostic Dataset - a combination of two audio DeepFakes and one anti-spoofing datasets that, thanks to the disjoint use of attacks, can lead to better generalization of detection methods. We present a thorough analysis of current DeepFake detection methods and consider different audio features (front-ends). In addition, we propose a model based on LCNN with LFCC and mel-spectrogram front-end, which not only is characterized by a good generalization and stability results but also shows improvement over LFCC-based mode - we decrease standard deviation on all folds and EER in two folds by up to 5%.
△ Less
Submitted 21 July, 2022; v1 submitted 27 June, 2022;
originally announced June 2022.
-
A Note on Deepfake Detection with Low-Resources
Authors:
Piotr Kawa,
Piotr Syga
Abstract:
Deepfakes are videos that include changes, quite often substituting face of a portrayed individual with a different face using neural networks. Even though the technology gained its popularity as a carrier of jokes and parodies it raises a serious threat to ones security - via biometric impersonation or besmearing. In this paper we present two methods that allow detecting Deepfakes for a user with…
▽ More
Deepfakes are videos that include changes, quite often substituting face of a portrayed individual with a different face using neural networks. Even though the technology gained its popularity as a carrier of jokes and parodies it raises a serious threat to ones security - via biometric impersonation or besmearing. In this paper we present two methods that allow detecting Deepfakes for a user without significant computational power. In particular, we enhance MesoNet by replacing the original activation functions allowing a nearly 1% improvement as well as increasing the consistency of the results. Moreover, we introduced and verified a new activation function - Pish that at the cost of slight time overhead allows even higher consistency.
Additionally, we present a preliminary results of Deepfake detection method based on Local Feature Descriptors (LFD), that allows setting up the system even faster and without resorting to GPU computation. Our method achieved Equal Error Rate of 0.28, with both accuracy and recall exceeding 0.7.
△ Less
Submitted 9 June, 2020;
originally announced June 2020.
-
Robust watermarking with double detector-discriminator approach
Authors:
Marcin Plata,
Piotr Syga
Abstract:
In this paper we present a novel deep framework for a watermarking - a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after imposing some digital processing operations on it. Our framework…
▽ More
In this paper we present a novel deep framework for a watermarking - a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after imposing some digital processing operations on it. Our framework outperforms recent methods in the context of robustness against not only spectrum of attacks (e.g. rotation, resizing, Gaussian smoothing) but also against compression, especially JPEG. The bit accuracy of our method is at least 0.86 for all types of distortions. We also achieved 0.90 bit accuracy for JPEG while recent methods provided at most 0.83. Our method retains high transparency and capacity as well. Moreover, we present our double detector-discriminator approach - a scheme to detect and discriminate if the image contains the embedded message or not, which is crucial for real-life watermarking systems and up to now was not investigated using neural networks. With this, we design a testing formula to validate our extended approach and compared it with a common procedure. We also present an alternative method of balancing between image quality and robustness on attacks which is easily applicable to the framework.
△ Less
Submitted 6 June, 2020;
originally announced June 2020.
-
Robust Spatial-spread Deep Neural Image Watermarking
Authors:
Marcin Plata,
Piotr Syga
Abstract:
Watermarking is an operation of embedding an information into an image in a way that allows to identify ownership of the image despite applying some distortions on it. In this paper, we presented a novel end-to-end solution for embedding and recovering the watermark in the digital image using convolutional neural networks. The method is based on spreading the message over the spatial domain of the…
▽ More
Watermarking is an operation of embedding an information into an image in a way that allows to identify ownership of the image despite applying some distortions on it. In this paper, we presented a novel end-to-end solution for embedding and recovering the watermark in the digital image using convolutional neural networks. The method is based on spreading the message over the spatial domain of the image, hence reducing the "local bits per pixel" capacity. To obtain the model we used adversarial training and applied noiser layers between the encoder and the decoder. Moreover, we broadened the spectrum of typically considered attacks on the watermark and by grouping the attacks according to their scope, we achieved high general robustness, most notably against JPEG compression, Gaussian blurring, subsampling or resizing. To help us in the models training we also proposed a precise differentiable approximation of JPEG.
△ Less
Submitted 4 November, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
Practical Fault-Tolerant Data Aggregation
Authors:
Krzysztof Grining,
Marek Klonowski,
Piotr Syga
Abstract:
During Financial Cryptography 2012 Chan et al. presented a novel privacy-protection fault-tolerant data aggregation protocol. Comparing to previous work, their scheme guaranteed provable privacy of individuals and could work even if some number of users refused to participate. In our paper we demonstrate that despite its merits, their method provides unacceptably low accuracy of aggregated data fo…
▽ More
During Financial Cryptography 2012 Chan et al. presented a novel privacy-protection fault-tolerant data aggregation protocol. Comparing to previous work, their scheme guaranteed provable privacy of individuals and could work even if some number of users refused to participate. In our paper we demonstrate that despite its merits, their method provides unacceptably low accuracy of aggregated data for a wide range of assumed parameters and cannot be used in majority of real-life systems. To show this we use both precise analytic and experimental methods. Additionally, we present a precise data aggregation protocol that provides provable level of security even facing massive failures of nodes. Moreover, the protocol requires significantly less computation (limited exploiting of heavy cryptography) than most of currently known fault tolerant aggregation protocols and offers better security guarantees that make it suitable for systems of limited resources (including sensor networks). To obtain our result we relax however the model and allow some limited communication between the nodes.
△ Less
Submitted 31 May, 2016; v1 submitted 12 February, 2016;
originally announced February 2016.