Search | arXiv e-print repository

Comparative Analysis of ASR Methods for Speech Deepfake Detection

Authors: Davide Salvi, Amit Kumar Singh Yadav, Kratika Bhagtani, Viola Negroni, Paolo Bestagini, Edward J. Delp

Abstract: Recent techniques for speech deepfake detection often rely on pre-trained self-supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved their ability to offer a meaningful representation of speech signals, which can benefit various tasks, including deepfake detection. In this context, pre-trained models serve as feature extractors and are used to ex… ▽ More Recent techniques for speech deepfake detection often rely on pre-trained self-supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved their ability to offer a meaningful representation of speech signals, which can benefit various tasks, including deepfake detection. In this context, pre-trained models serve as feature extractors and are used to extract embeddings from input speech, which are then fed to a binary speech deepfake detector. The remarkable accuracy achieved through this approach underscores a potential relationship between ASR and speech deepfake detection. However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. We consider two different pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, and adapt them for the speech deepfake detection task. These models have been released in multiple versions, with increasing number of parameters and enhanced ASR performance. We investigate whether performance improvements in ASR correlate with improvements in speech deepfake detection. Our results provide insights into the relationship between these two tasks and offer valuable guidance for the development of more effective speech deepfake detectors. △ Less

Submitted 26 November, 2024; originally announced November 2024.

Comments: Published at Asilomar Conference on Signals, Systems, and Computers 2024

arXiv:2409.13049 [pdf, other]

DiffSSD: A Diffusion-Based Dataset For Speech Forensics

Authors: Kratika Bhagtani, Amit Kumar Singh Yadav, Paolo Bestagini, Edward J. Delp

Abstract: Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets which do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors traine… ▽ More Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets which do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 diffusion-based open-source and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated from recent open-source and commercial speech generators. △ Less

Submitted 2 October, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

Comments: Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

arXiv:2404.10989 [pdf, other]

FairSSD: Understanding Bias in Synthetic Speech Detectors

Authors: Amit Kumar Singh Yadav, Kratika Bhagtani, Davide Salvi, Paolo Bestagini, Edward J. Delp

Abstract: Methods that can generate synthetic speech which is perceptually indistinguishable from speech recorded by a human speaker, are easily available. Several incidents report misuse of synthetic speech generated from these methods to commit fraud. To counter such misuse, many methods have been proposed to detect synthetic speech. Some of these detectors are more interpretable, can generalize to detect… ▽ More Methods that can generate synthetic speech which is perceptually indistinguishable from speech recorded by a human speaker, are easily available. Several incidents report misuse of synthetic speech generated from these methods to commit fraud. To counter such misuse, many methods have been proposed to detect synthetic speech. Some of these detectors are more interpretable, can generalize to detect synthetic speech in the wild and are robust to noise. However, limited work has been done on understanding bias in these detectors. In this work, we examine bias in existing synthetic speech detectors to determine if they will unfairly target a particular gender, age and accent group. We also inspect whether these detectors will have a higher misclassification rate for bona fide speech from speech-impaired speakers w.r.t fluent speakers. Extensive experiments on 6 existing synthetic speech detectors using more than 0.9 million speech signals demonstrate that most detectors are gender, age and accent biased, and future work is needed to ensure fairness. To support future research, we release our evaluation dataset, models used in our study and source code at https://gitlab.com/viper-purdue/fairssd. △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: Accepted at CVPR 2024 (WMF)

arXiv:2402.14205 [pdf, other]

Compression Robust Synthetic Speech Detection Using Patched Spectrogram Transformer

Authors: Amit Kumar Singh Yadav, Ziyue Xiang, Kratika Bhagtani, Paolo Bestagini, Stefano Tubaro, Edward J. Delp

Abstract: Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has caused financial fraud, impersonation of people, and misinformation to spread. For this reason forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset and their performance reduces substantially in practical scenarios such as detect… ▽ More Many deep learning synthetic speech generation tools are readily available. The use of synthetic speech has caused financial fraud, impersonation of people, and misinformation to spread. For this reason forensic methods that can detect synthetic speech have been proposed. Existing methods often overfit on one dataset and their performance reduces substantially in practical scenarios such as detecting synthetic speech shared on social platforms. In this paper we propose, Patched Spectrogram Synthetic Speech Detection Transformer (PS3DT), a synthetic speech detector that converts a time domain speech signal to a mel-spectrogram and processes it in patches using a transformer neural network. We evaluate the detection performance of PS3DT on ASVspoof2019 dataset. Our experiments show that PS3DT performs well on ASVspoof2019 dataset compared to other approaches using spectrogram for synthetic speech detection. We also investigate generalization performance of PS3DT on In-the-Wild dataset. PS3DT generalizes well than several existing methods on detecting synthetic speech from an out-of-distribution dataset. We also evaluate robustness of PS3DT to detect telephone quality synthetic speech and synthetic speech shared on social platforms (compressed speech). PS3DT is robust to compression and can detect telephone quality synthetic speech better than several existing methods. △ Less

Submitted 21 February, 2024; originally announced February 2024.

Comments: Accepted as long oral paper at ICMLA 2023

arXiv:2304.03323 [pdf, other]

DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Authors: Amit Kumar Singh Yadav, Kratika Bhagtani, Ziyue Xiang, Paolo Bestagini, Stefano Tubaro, Edward J. Delp

Abstract: Tools to generate high quality synthetic speech signal that is perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make. This limits the interpretability of these approach… ▽ More Tools to generate high quality synthetic speech signal that is perceptually indistinguishable from speech recorded from human speakers are easily available. Several approaches have been proposed for detecting synthetic speech. Many of these approaches use deep learning methods as a black box without providing reasoning for the decisions they make. This limits the interpretability of these approaches. In this paper, we propose Disentangled Spectrogram Variational Auto Encoder (DSVAE) which is a two staged trained variational autoencoder that processes spectrograms of speech using disentangled representation learning to generate interpretable representations of a speech signal for detecting synthetic speech. DSVAE also creates an activation map to highlight the spectrogram regions that discriminate synthetic and bona fide human speech signals. We evaluated the representations obtained from DSVAE using the ASVspoof2019 dataset. Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers. We also visualize the representation obtained from DSVAE for 17 different speech synthesizers and verify that they are indeed interpretable and discriminate bona fide and synthetic speech from each of the synthesizers. △ Less

Submitted 28 July, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

arXiv:2211.14609 [pdf, other]

BEAMERS: Brain-Engaged, Active Music-based Emotion Regulation System

Authors: Jiyang Li, Wei Wang, Kratika Bhagtani, Yincheng Jin, Zhanpeng Jin

Abstract: With the increasing demands of emotion comprehension and regulation in our daily life, a customized music-based emotion regulation system is introduced by employing current EEG information and song features, which predicts users' emotion variation in the valence-arousal model before recommending music. The work shows that: (1) a novel music-based emotion regulation system with a commercial EEG dev… ▽ More With the increasing demands of emotion comprehension and regulation in our daily life, a customized music-based emotion regulation system is introduced by employing current EEG information and song features, which predicts users' emotion variation in the valence-arousal model before recommending music. The work shows that: (1) a novel music-based emotion regulation system with a commercial EEG device is designed without employing deterministic emotion recognition models for daily usage; (2) the system considers users' variant emotions towards the same song, and by which calculate user's emotion instability and it is in accordance with Big Five Personality Test; (3) the system supports different emotion regulation styles with users' designation of desired emotion variation, and achieves an accuracy of over $0.85$ with 2-seconds EEG data; (4) people feel easier to report their emotion variation comparing with absolute emotional states, and would accept a more delicate music recommendation system for emotion regulation according to the questionnaire. △ Less

Submitted 26 November, 2022; originally announced November 2022.

arXiv:2204.12067 [pdf, other]

An Overview of Recent Work in Media Forensics: Methods and Threats

Authors: Kratika Bhagtani, Amit Kumar Singh Yadav, Emily R. Bartusiak, Ziyue Xiang, Ruiting Shao, Sriram Baireddy, Edward J. Delp

Abstract: In this paper, we review recent work in media forensics for digital images, video, audio (specifically speech), and documents. For each data modality, we discuss synthesis and manipulation techniques that can be used to create and modify digital media. We then review technological advancements for detecting and quantifying such manipulations. Finally, we consider open issues and suggest directions… ▽ More In this paper, we review recent work in media forensics for digital images, video, audio (specifically speech), and documents. For each data modality, we discuss synthesis and manipulation techniques that can be used to create and modify digital media. We then review technological advancements for detecting and quantifying such manipulations. Finally, we consider open issues and suggest directions for future research. △ Less

Submitted 12 May, 2022; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: This is a longer version of a paper accepted to the 2022 IEEE International Conference on Multimedia Information Processing and Retrieval entitled "An Overview of Recent Work in Multimedia Forensics"

Showing 1–7 of 7 results for author: Bhagtani, K