Search | arXiv e-print repository

arXiv:2505.19203 [pdf, other]

EnvSDD: Benchmarking Environmental Sound Deepfake Detection

Authors: Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley

Abstract: Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.

Submitted 25 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.14874 [pdf, ps, other]

Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Authors: Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.

Submitted 30 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

Comments: 5 pages, 1 figure, Accepted to Interspeech 2025

arXiv:2505.14601 [pdf, ps, other]

Listen, Analyze, and Adapt to Learn New Attacks: An Exemplar-Free Class Incremental Learning Method for Audio Deepfake Source Tracing

Authors: Yang Xiao, Rohan Kumar Das

Abstract: As deepfake speech becomes common and hard to detect, it is vital to trace its source. Recent work on audio deepfake source tracing (ST) aims to find the origins of synthetic or manipulated speech. However, ST models must adapt to learn new deepfake attacks while retaining knowledge of the previous ones. A major challenge is catastrophic forgetting, where models lose the ability to recognize previously learned attacks. Some continual learning methods help with deepfake detection, but multi-class tasks such as ST introduce additional challenges as the number of classes grows. To address this, we propose an analytic class incremental learning method called AnaST. When new attacks appear, the feature extractor remains fixed, and the classifier is updated with a closed-form analytical solution in one epoch. This approach ensures data privacy, optimizes memory usage, and is suitable for online training. The experiments carried out in this work show that our method outperforms the baselines.

Submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.14600 [pdf, ps, other]

AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation

Authors: Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das

Abstract: Spoken keyword spotting (KWS) aims to identify keywords in audio for wide applications, especially on edge devices. Current small-footprint KWS systems focus on efficient model designs. However, their inference performance can decline in unseen environments or noisy backgrounds. Test-time adaptation (TTA) helps models adapt to test samples without needing the original training data. In this study, we present AdaKWS, the first TTA method for robust KWS to the best of our knowledge. Specifically, 1) We initially optimize the model's confidence by selecting reliable samples based on prediction entropy minimization and adjusting the normalization statistics in each batch. 2) We introduce pseudo-keyword consistency (PKC) to identify critical, reliable features without overfitting to noise. Our experiments show that AdaKWS outperforms other methods across various conditions, including Gaussian noise and real-scenario noises. The code will be released in due course.

Submitted 20 May, 2025; originally announced May 2025.

Comments: Accepted by Interspeech 2025

arXiv:2505.11817 [pdf, ps, other]

AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting

Authors: Yang Xiao, Tianyi Peng, Rohan Kumar Das, Yuchen Hu, Huiping Zhuang

Abstract: Keyword spotting (KWS) offers a vital mechanism to identify spoken commands in voice-enabled systems, where user demands often shift, requiring models to learn new keywords continually over time. However, a major problem is catastrophic forgetting, where models lose their ability to recognize earlier keywords. Although several continual learning methods have proven their usefulness for reducing forgetting, most existing approaches depend on storing and revisiting old data to combat catastrophic forgetting. Though effective, these methods face two practical challenges: 1) privacy risks from keeping user data and 2) large memory and time consumption that limit deployment on small devices. To address these issues, we propose an exemplar-free Analytic Continual Learning (AnalyticKWS) method that updates model parameters without revisiting earlier data. Inspired by efficient learning principles, AnalyticKWS computes a closed-form analytical solution for model updates and requires only a single epoch of adaptation for incoming keywords. AnalyticKWS demands fewer computational resources by avoiding gradient-based updates and does not store old data. By eliminating the need for back-propagation during incremental learning, the model remains lightweight and efficient. As a result, AnalyticKWS meets the challenges mentioned earlier and suits resource-limited settings well. Extensive experiments on various datasets and settings show that AnalyticKWS consistently outperforms existing continual learning methods.

Submitted 16 May, 2025; originally announced May 2025.

Comments: Accepted by ACL 2025

arXiv:2504.05657 [pdf, other]

Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at https://github.com/Liu-Tianchi/Nes2Net.

Submitted 8 April, 2025; originally announced April 2025.

Comments: This manuscript has been submitted for peer review

arXiv:2503.22808 [pdf]

How to set up a psychedelic study: Unique considerations for research involving human participants

Authors: Marcus J. Glennon, Catherine I. V. Bird, Prateek Yadav, Patrick Kleine, Shayam Suseelan, Christina Boman-Markaki, Vasileia Kotoula, Matt Butler, Robert Leech, Leor Roseman, David Erritzoe, Deepak P. Srivastava, Celia Morgan, Christopher Timmermann, Greg Cooper, Jeremy I. Skipper, James Rucker, Sunjeev K. Kamboj, Mitul A. Mehta, Ravi K. Das, Anjali Bhat

Abstract: Setting up a psychedelic study can be a long, arduous, and kafkaesque process. This rapidly-developing field poses several unique challenges for researchers, necessitating a range of considerations that have not yet been standardised. Many of the complexities inherent to psychedelic research also challenge existing assumptions around, for example, approaches to psychiatric prescribing, the conceptual framing of the placebo effect, and definitions of selfhood. This review paper brings together several of the major psychedelic research teams across the United Kingdom to formalise these unique considerations, identify continuing areas of debate, and provide a practical, experience-based guide, with recommendations for policymakers and future researchers intending to set up a psychedelic research study or clinical trial. We approach this such that the paper can either be read end to end, or treated as a manual: readers can dip into relevant sections as needed.

Submitted 18 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.17878 [pdf, other]

Non-Commutative fluid: an alternative source of cosmic acceleration

Authors: Arpan Krishna Mitra, Raj Kumar Das

Abstract: We have developed a Hubble function based on Newtonian Cosmology using non-commutative fluid equations. Our Hubble function contains cosmic fluids with the signature of a new cosmological parameter $σ$, motivated by a non-commutative Poisson bracket structure. Interestingly, this Hubble function does not include any external fluid content related to dark energy or the Cosmological constant; the parameter $σ$ acts as the source of accelerated expansion. In this work, we aim to explain the phenomenon of the accelerating expansion of the universe without "dark energy". Additionally, we have verified the observational bounds for $σ$ to assess its potential in explaining the accelerated expansion.

Submitted 22 March, 2025; originally announced March 2025.

Comments: 9 pages, 13 figures, 5 tables

arXiv:2501.18649 [pdf, other]

Fake News Detection After LLM Laundering: Measurement and Explanation

Authors: Rupak Kumar Das, Jonathan Dodge

Abstract: With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text. (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub.

Submitted 29 January, 2025; originally announced January 2025.

arXiv:2501.06530 [pdf, other]

Multi-modal Speech Enhancement with Limited Electromyography Channels

Authors: Fuyuan Feng, Longting Xu, Rohan Kumar Das

Abstract: Speech enhancement (SE) aims to improve the clarity, intelligibility, and quality of speech signals for various speech enabled applications. However, air-conducted (AC) speech is highly susceptible to ambient noise, particularly in low signal-to-noise ratio (SNR) and non-stationary noise environments. Incorporating multi-modal information has shown promise in enhancing speech in such challenging scenarios. Electromyography (EMG) signals, which capture muscle activity during speech production, offer noise-resistant properties beneficial for SE in adverse conditions. Most previous EMG-based SE methods required 35 EMG channels, limiting their practicality. To address this, we propose a novel method that considers only 8-channel EMG signals with acoustic signals using a modified SEMamba network with added cross-modality modules. Our experiments demonstrate substantial improvements in speech quality and intelligibility over traditional approaches, especially in extremely low SNR settings. Notably, compared to the SE (AC) approach, our method achieves a significant PESQ gain of 0.235 under matched low SNR conditions and 0.527 under mismatched conditions, highlighting its robustness.

Submitted 11 January, 2025; originally announced January 2025.

Comments: Accepted by ICASSP 2025

arXiv:2411.10027 [pdf, other]

XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection

Authors: Yang Xiao, Rohan Kumar Das

Abstract: Transformers and their variants have achieved great success in speech processing. However, their multi-head self-attention mechanism is computationally expensive. Therefore, one novel selective state space model, Mamba, has been proposed as an alternative. Building on its success in automatic speech recognition, we apply Mamba for spoofing attack detection. Mamba is well-suited for this task as it can capture the artifacts in spoofed speech signals by handling long-length sequences. However, Mamba's performance may suffer when it is trained with limited labeled data. To mitigate this, we propose combining a new structure of Mamba based on a dual-column architecture with self-supervised learning, using the pre-trained wav2vec 2.0 model. The experiments show that our proposed approach achieves competitive results and faster inference on the ASVspoof 2021 LA and DF datasets, and on the more challenging In-the-Wild dataset, it emerges as the strongest candidate for spoofing attack detection. The code has been publicly released in https://github.com/swagshaw/XLSR-Mamba.

Submitted 1 March, 2025; v1 submitted 15 November, 2024; originally announced November 2024.

Comments: Accepted by IEEE Signal Processing Letters

arXiv:2411.01174 [pdf, other]

Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection

Authors: Han Yin, Yang Xiao, Jisheng Bai, Rohan Kumar Das

Abstract: Sound Event Detection (SED) is challenging in noisy environments where overlapping sounds obscure target events. Language-queried audio source separation (LASS) aims to isolate the target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly in noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of large language models (LLMs) to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model is applied to predict clip-wise event predictions as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs in noise-robust SED and suggests a promising direction for handling overlapping events in SED. Codes and pretrained models are available at https://github.com/apple-yinhan/Noise-robust-SED.

Submitted 12 January, 2025; v1 submitted 2 November, 2024; originally announced November 2024.

Comments: Accepted by ICASSP 2025 Workshop

arXiv:2409.13292 [pdf, other]

Exploring Text-Queried Sound Event Detection with Audio Source Separation

Authors: Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

Abstract: In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks corresponding to different events from the input audio. Then, multiple target SED branches are employed to detect individual events. AudioSep is a state-of-the-art LASS model, but has limitations in extracting dynamic audio information because of its pure convolutional structure for separation. To address this, we integrate a dual-path recurrent neural network block into the model. We refer to this structure as AudioSep-DP, which achieves the first place in DCASE 2024 Task 9 on language-queried audio source separation (objective single model track). Experimental results show that TQ-SED can significantly improve the SED performance, with an improvement of 7.22% on F1 score over the conventional framework. Additionally, we set up comprehensive experiments to explore the impact of model complexity. The source code and pre-trained model are released at https://github.com/apple-yinhan/TQ-SED.

Submitted 10 January, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

Comments: Accepted by ICASSP 2025

arXiv:2409.05034 [pdf, other]

TF-Mamba: A Time-Frequency Network for Sound Source Localization

Authors: Yang Xiao, Rohan Kumar Das

Abstract: Sound source localization (SSL) determines the position of sound sources using multi-channel audio data. It is commonly used to improve speech enhancement and separation. Extracting spatial features is crucial for SSL, especially in challenging acoustic environments. Recently, a novel structure referred to as Mamba demonstrated notable performance across various sequence-based modalities. This study introduces the Mamba for SSL tasks. We consider the Mamba-based model to analyze spatial features from speech signals by fusing both time and frequency features, and we develop an SSL system called TF-Mamba. This system integrates time and frequency fusion, with Bidirectional Mamba managing both time-wise and frequency-wise processing. We conduct the experiments on the simulated and real datasets. Experiments show that TF-Mamba significantly outperforms other advanced methods. The code will be publicly released in due course.

Submitted 20 May, 2025; v1 submitted 8 September, 2024; originally announced September 2024.

Comments: Accepted by Interspeech 2025

arXiv:2409.00069 [pdf, other]

How to Measure Human-AI Prediction Accuracy in Explainable AI Systems

Authors: Sujay Koujalgi, Andrew Anderson, Iyadunni Adenuga, Shikha Soneji, Rupika Dikkala, Teresita Guzman Nader, Leo Soccio, Sourav Panda, Rupak Kumar Das, Margaret Burnett, Jonathan Dodge

Abstract: Assessing an AI system's behavior, particularly in Explainable AI Systems, is sometimes done empirically, by measuring people's abilities to predict the agent's next move, but how to perform such measurements? In empirical studies with humans, an obvious approach is to frame the task as binary (i.e., prediction is either right or wrong), but this does not scale. As output spaces increase, so do floor effects, because the ratio of right answers to wrong answers quickly becomes very small. The crux of the problem is that the binary framing is failing to capture the nuances of the different degrees of "wrongness." To address this, we begin by proposing three mathematical bases upon which to measure "partial wrongness." We then use these bases to perform two analyses on sequential decision-making domains: the first is an in-lab study with 86 participants on a size-36 action space; the second is a re-analysis of a prior study on a size-4 action space. Other researchers adopting our operationalization of the prediction task and analysis methodology will improve the rigor of user studies conducted with that task, which is particularly important when the domain features a large output space.

Submitted 23 August, 2024; originally announced September 2024.

ACM Class: D.2.8

arXiv:2407.03661 [pdf, other]

Where's That Voice Coming? Continual Learning for Sound Source Localization

Authors: Yang Xiao, Rohan Kumar Das

Abstract: Sound source localization (SSL) is essential for many speech-processing applications. Deep learning models have achieved high performance, but often fail when the training and inference environments differ. Adapting SSL models to dynamic acoustic conditions faces a major challenge: catastrophic forgetting. In this work, we propose an exemplar-free continual learning strategy for SSL (CL-SSL) to address such a forgetting phenomenon. CL-SSL applies task-specific sub-networks to adapt across diverse acoustic environments while retaining previously learned knowledge. It also uses a scaling mechanism to limit parameter growth, ensuring consistent performance across incremental tasks. We evaluated CL-SSL on simulated data with varying microphone distances and real-world data with different noise levels. The results demonstrate CL-SSL's ability to maintain high accuracy with minimal parameter increase, offering an efficient solution for SSL applications.

Submitted 20 March, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted to ICME 2025

arXiv:2407.03657 [pdf, other]

UCIL: An Unsupervised Class Incremental Learning Approach for Sound Event Detection

Authors: Yang Xiao, Rohan Kumar Das

Abstract: This work explores class-incremental learning (CIL) for sound event detection (SED), advancing adaptability towards real-world scenarios. CIL's success in domains like computer vision inspired our SED-tailored method, addressing the unique challenges of diverse and complex audio environments. Our approach employs an independent unsupervised learning framework with a distillation loss function to integrate new sound classes while preserving the SED model consistency across incremental tasks. We further enhance this framework with a sample selection strategy for unlabeled data and a balanced exemplar update mechanism, ensuring varied and illustrative sound representations. Evaluating various continual learning methods on the DCASE 2023 Task 4 dataset, we find that our research offers insights into each method's applicability for real-world SED systems that can have newly added sound classes. The findings also delineate future directions of CIL in dynamic audio settings.

Submitted 11 January, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted by ICASSP 2025

arXiv:2407.03656 [pdf, other]

WildDESED: An LLM-Powered Dataset for Wild Domestic Environment Sound Event Detection System

Authors: Yang Xiao, Rohan Kumar Das

Abstract: This work aims to advance sound event detection (SED) research by presenting a new large language model (LLM)-powered dataset namely wild domestic environment sound event detection (WildDESED). It is crafted as an extension to the original DESED dataset to reflect diverse acoustic variability and complex noises in home settings. We leveraged LLMs to generate eight different domestic scenarios based on target sound categories of the DESED dataset. Then we enriched the scenarios with a carefully tailored mixture of noises selected from AudioSet and ensured no overlap with target sound. We consider widely popular convolutional neural recurrent network to study WildDESED dataset, which depicts its challenging nature. We then apply curriculum learning by gradually increasing noise complexity to enhance the model's generalization capabilities across various noise levels. Our results with this approach show improvements within the noisy environment, validating the effectiveness on the WildDESED dataset promoting noise-robust SED advancements.

Submitted 30 October, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: DCASE WS 2024

arXiv:2407.03654 [pdf, other]

Mixstyle based Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data, while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on DCASE 2024 Challenge Task 4 dataset, measuring polyphonic SED score (PSDS) on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.

Submitted 29 August, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Submitted to ICASSP 2025

arXiv:2407.00291 [pdf, other]

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we consider training loss of our model specific to each datasets for their corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Technical report for DCASE 2024 Challenge Task 4

arXiv:2406.02483 [pdf, other]

How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

Authors: Tianchi Liu, Lin Zhang, Rohan Kumar Das, Yi Ma, Ruijie Tao, Haizhou Li

Abstract: Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artifacts of transition regions created when concatenating bona fide and spoofed audio. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Our further investigation explains the varying nature of CMs' focus while making correct or incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation of interpretability in the field of partial spoofed audio detection that has not been well explored previously.

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2404.17280 [pdf, other]

Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks

Authors: Mingrui He, Longting Xu, Han Wang, Mingjun Zhang, Rohan Kumar Das

Abstract: The most common spoofing attacks on automatic speaker verification systems are replay speech attacks. Detection of replay speech heavily relies on replay configuration information. Previous studies have shown that graph Fourier transform-derived features can effectively detect replay speech but ignore device and environmental noise effects. In this work, we propose a new feature, the graph frequency device cepstral coefficient, derived from the graph frequency domain using a device-related linear transformation. We also introduce two novel representations: graph frequency logarithmic coefficient and graph frequency logarithmic device coefficient. We evaluate our methods using traditional Gaussian mixture model and light convolutional neural network systems as classifiers. On the ASVspoof 2017 V2, ASVspoof 2019 physical access, and ASVspoof 2021 physical access datasets, our proposed features outperform known front-ends, demonstrating their effectiveness for replay speech detection.

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.09342 [pdf, other]

Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.

Submitted 22 July, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACM Multimedia Conference - Grand Challenge

arXiv:2404.00861 [pdf, other]

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training

Authors: Ruijie Tao, Xinyuan Qian, Rohan Kumar Das, Xiaoxue Gao, Jiadong Wang, Haizhou Li

Abstract: Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selective listening ability are short of effectively filtering out disruptive voice components from mixed audio inputs. In this paper, we propose a Multi-modal Speaker Extraction-to-Detection framework named `MuSED', which is pre-trained with audio-visual target speaker extraction to learn the denoising ability, then it is fine-tuned with the AV-ASD task. Meanwhile, to better capture the multi-modal information and deal with real-world problems such as missing modality, MuSED is modelled on the time domain directly and integrates the multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms the state-of-the-art AV-ASD methods and achieves 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset, respectively. We will publicly release the code in due course.

Submitted 31 March, 2024; originally announced April 2024.

Comments: 10 pages

arXiv:2402.02781 [pdf, other]

Dual Knowledge Distillation for Efficient Sound Event Detection

Authors: Yang Xiao, Rohan Kumar Das

Abstract: Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals. This becomes challenging particularly for on-device applications, where computational resources are limited. To address this issue, we introduce a novel framework referred to as dual knowledge distillation for developing efficient SED systems in this work. Our proposed dual knowledge distillation commences with temporal-averaging knowledge distillation (TAKD), utilizing a mean student model derived from the temporal averaging of the student model's parameters. This allows the student model to indirectly learn from a pre-trained teacher model, ensuring a stable knowledge distillation. Subsequently, we introduce embedding-enhanced feature distillation (EEFD), which involves incorporating an embedding distillation layer within the student model to bolster contextual learning. On DCASE 2023 Task 4A public evaluation dataset, our proposed SED system with dual knowledge distillation having merely one-third of the baseline model's parameters, demonstrates superior performance in terms of PSDS1 and PSDS2. This highlights the importance of proposed dual knowledge distillation for compact SED systems, which can be ideal for edge devices.

Submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024 (Deep Neural Network Model Compression Workshop)

arXiv:2401.07944 [pdf, ps, other]

SemEval-2017 Task 4: Sentiment Analysis in Twitter using BERT

Authors: Rupak Kumar Das, Ted Pedersen

Abstract: This paper uses the BERT model, which is a transformer-based architecture, to solve task 4A, English Language, Sentiment Analysis in Twitter of SemEval2017. BERT is a very powerful large language model for classification tasks when the amount of training data is small. For this experiment, we have used the BERT(BASE) model, which has 12 hidden layers. This model provides better accuracy, precision, recall, and f1 score than the Naive Bayes baseline model. It performs better in binary classification subtasks than the multi-class classification subtasks. We also considered all kinds of ethical issues during this experiment, as Twitter data contains personal and sensible information. The dataset and code used in our experiment can be found in this GitHub repository.

Submitted 19 June, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.04953 [pdf, other]

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Authors: Jichen Yang, Fangfan Chen, Rohan Kumar Das, Zhengyu Zhu, Shunsi Zhang

Abstract: Traditional vision transformer consists of two parts: transformer encoder and multi-layer perception (MLP). The former plays the role of feature learning to obtain better representation, while the latter plays the role of classification. Here, the MLP is constituted of two fully connected (FC) layers, average value computing, FC layer and softmax layer. However, due to the use of average value computing module, some useful information may get lost, which we plan to preserve by the use of alternative framework. In this work, we propose a novel vision transformer referred to as adaptive-avg-pooling based attention vision transformer (AAViT) that uses modules of adaptive average pooling and attention to replace the module of average value computing. We explore the proposed AAViT for the studies on face anti-spoofing using Replay-Attack database. The experiments show that the AAViT outperforms vision transformer in face anti-spoofing by producing a reduced equal error rate. In addition, we found that the proposed AAViT can perform much better than some commonly used neural networks such as ResNet and some other known systems on the Replay-Attack corpus.

Submitted 10 January, 2024; originally announced January 2024.

Comments: Accepted for Publication in IEEE ICASSP 2024

arXiv:2401.00959 [pdf, other]

Creating an Intelligent Dementia-Friendly Living Space: A Feasibility Study Integrating Assistive Robotics, Wearable Sensors, and Spatial Technology

Authors: Arshia A Khan, Rupak Kumar Das, Anna Martin, Dale Dowling, Rana Imtiaz

Abstract: This study investigates the integration of assistive therapeutic robotics, wearable sensors, and spatial sensors within an intelligent environment tailored for dementia care. The feasibility study aims to assess the collective impact of these technologies in enhancing care giving by seamlessly integrating supportive technology in the background. The wearable sensors track physiological data, while spatial sensors monitor geo-spatial information, integrated into a system supporting residents without necessitating technical expertise. The designed space fosters various activities, including robot interactions, medication delivery, physical exercises like walking on a treadmill (Bruce protocol), entertainment, and household tasks, promoting cognitive stimulation through puzzles. Physiological data revealed significant participant engagement during robot interactions, indicating the potential effectiveness of robot-assisted activities in enhancing the quality of life for residents.

Submitted 1 January, 2024; originally announced January 2024.

arXiv:2310.09177 [pdf, ps, other]

doi 10.3390/s24082509

Future Industrial Applications: Exploring LPWAN-Driven IoT Protocols

Authors: Mahbubul Islam, Hossain Md. Mubashshir Jamil, Samiul Ahsan Pranto, Rupak Kumar Das, Al Amin, Arshia Khan

Abstract: The Internet of Things (IoT) will bring about the next industrial revolution in Industry 4.0. The communication aspect of IoT devices is one of the most critical factors in choosing the suitable device for the suitable usage. So far, the IoT physical layer communication challenges have been met with various communications protocols that provide varying strengths and weaknesses. Moreover, most of them are wireless protocols due to the sheer number of device requirements for IoT. This paper summarizes the network architectures of some of the most popular IoT wireless communications protocols. It also presents a comparative analysis of critical features, including power consumption, coverage, data rate, security, cost, and Quality of Service (QoS). This comparative study shows that Low Power Wide Area Network (LPWAN) based IoT protocols (LoRa, Sigfox, NB-IoT, LTE-M) are more suitable for future industrial applications because of their energy efficiency, high coverage, and cost efficiency. In addition, the study also presents an industrial Internet of Things (IIoT) application perspective on the suitability of LPWAN protocols in a particular scenario and addresses some open issues that need to be researched. Thus, this study can assist in deciding the most suitable protocol for an industrial and production field.

Submitted 19 January, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Report number: s24082509

Journal ref: Sensors 2024, 24, 2509

arXiv:2305.15901 [pdf, other]

Consistent Optimal Transport with Empirical Conditional Measures

Authors: Piyushi Manupriya, Rachit Keerti Das, Sayantan Biswas, Saketha Nath Jagarlapudi

Abstract: Given samples from two joint distributions, we consider the problem of Optimal Transportation (OT) between them when conditioned on a common variable. We focus on the general setting where the conditioned variable may be continuous, and the marginals of this variable in the two joint distributions may not be the same. In such settings, standard OT variants cannot be employed, and novel estimation techniques are necessary. Since the main challenge is that the conditional distributions are not explicitly available, the key idea in our OT formulation is to employ kernelized-least-squares terms computed over the joint samples, which implicitly match the transport plan's marginals with the empirical conditionals. Under mild conditions, we prove that our estimated transport plans, as a function of the conditioned variable, are asymptotically optimal. For finite samples, we show that the deviation in terms of our regularized objective is bounded by $O(1/m^{1/4})$, where $m$ is the number of samples. We also discuss how the conditional transport plan could be modelled using explicit probabilistic models as well as using implicit generative ones. We empirically verify the consistency of our estimator on synthetic datasets, where the optimal plan is analytically known. When employed in applications like prompt learning for few-shot classification and conditional-generation in the context of predicting cell responses to treatment, our methodology improves upon state-of-the-art methods.

Submitted 10 June, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.12533 [pdf, ps, other]

Unified framework for Fiedler-like strong linearizations of polynomial and rational matrices

Authors: Ranjan Kumar Das, Harish K. Pillai

Abstract: Linearization is a widely used method for solving polynomial eigenvalue problems (PEPs) and rational eigenvalue problem (REPs) in which the PEP/REP is transformed to a generalized eigenproblem and then solve this generalized eigenproblem with algorithms available in the literature. Fiedler-like pencils (Fiedler pencils (FPs), generalized Fiedler pencils (GFPs), Fiedler pencils with repetition (FPRs) and generalized Fiedler pencils with repetition (GFPRs)) are well known classes of strong linearizations. GFPs are an intriguing family of linearizations, and GF pencils are the fundamental building blocks of FPRs and GFPRs. As a result, FPRs and GFPRs have distinctive features and they provide structure-preserving linearizations for structured matrix polynomials. But GFPRs do not use the full potential of GF pencils. Indeed, not all the GFPs are FPRs or GFPRs, and vice versa. The main aim of this paper is two-fold. First, to build a unified framework for all the Fiedler-like pencils FPs, GFPs, FPRs and GFPRs. To that end, we construct a new family of strong linearizations (named as EGFPs) of a matrix polynomial $P(\lam)$ that subsumes all the Fiedler-like linearizations. A salient feature of the EGFPs family is that it allows the construction of structured preserving banded linearizations with low bandwidth for structured (symmetric, Hermitian, palindromic) matrix polynomial. Low bandwidth structured linearizations may be useful for numerical computations. Second, to utilize EGFPs directly to form a family of Rosenbrock strong linearizations of an $n \times n$ rational matrix $G(\lam)$ associated with a realization. We describe the formulas for the construction of low bandwidth linearizations for $P(\lam)$ and $G(\lam)$. We show that the eigenvectors, minimal bases/indices of $P(\lam)$ and $G(\lam)$ can be easily recovered from those of the linearizations of $P(\lam)$ and $G(\lam)$.

Submitted 21 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2008.00427

MSC Class: 65F15; 15A57; 15A18; 65F35

arXiv:2305.10729 [pdf, other]

A Multi-Task Learning Framework for Sound Event Detection using High-level Acoustic Characteristics of Sounds

Authors: Tanmay Khandelwal, Rohan Kumar Das

Abstract: Sound event detection (SED) entails identifying the type of sound and estimating its temporal boundaries from acoustic signals. These events are uniquely characterized by their spatio-temporal features, which are determined by the way they are produced. In this study, we leverage some distinctive high-level acoustic characteristics of various sound events to assist the SED model training, without requiring additional labeled data. Specifically, we use the DCASE Task 4 2022 dataset and categorize the 10 classes into four subcategories based on their high-level acoustic characteristics. We then introduce a novel multi-task learning framework that jointly trains the SED and high-level acoustic characteristics classification tasks, using shared layers and weighted loss. Our method significantly improves the performance of the SED system, achieving a 36.3% improvement in terms of the polyphonic sound event detection score compared to the baseline on the DCASE 2022 Task 4 validation set.

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted for Publication at INTERSPEECH 2023

arXiv:2304.12688 [pdf, other]

Leveraging Audio-Tagging Assisted Sound Event Detection using Weakified Strong Labels and Frequency Dynamic Convolutions

Authors: Tanmay Khandelwal, Rohan Kumar Das, Andrew Koh, Eng Siong Chng

Abstract: Jointly learning from a small labeled set and a larger unlabeled set is an active research topic under semi-supervised learning (SSL). In this paper, we propose a novel SSL method based on a two-stage framework for leveraging a large unlabeled in-domain set. Stage-1 of our proposed framework focuses on audio-tagging (AT), which assists the sound event detection (SED) system in Stage-2. The AT system is trained utilizing a strongly labeled set converted into weak predictions referred to as weakified set, a weakly labeled set, and an unlabeled set. This AT system then infers on the unlabeled set to generate reliable pseudo-weak labels, which are used with the strongly and weakly labeled set to train a frequency dynamic convolutional recurrent neural network-based SED system at Stage-2 in a supervised manner. Our system outperforms the baseline by 45.5% in terms of polyphonic sound detection score on the DESED real validation set.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted for Publication in IEEE-Statistical Signal Processing (SSP) Workshop 2023

arXiv:2304.03803 [pdf, other]

doi 10.1016/j.jheap.2024.07.011

Cosmology in $R^2$-gravity: Effects of a Higher Derivative Scalar Condensate Background

Authors: Raj Kumar Das, Aurindam Mondal, Subir Ghosh, Supriya Pan

Abstract: A well known extension of Einstein General Relativity is the addition of an $R^2$-term, which is free of ghost excitations and in the linearized framework, reduces Einstein General Relativity and an additional higher derivative scalar. According to \cite{Chakraborty:2020ktp}, the above scalar sector can sustain a Time Crystal-like minimum energy state, with non-trivial time dependence. Exploiting previous result that the scalar can sustain modes with periodic time dependence in its lowest energy, we consider this condensate as a source and study the Friedmann-Lemaître-Robertson-Walker (FLRW) cosmology in this background. The effect of the $R^2$-term is interpreted as a back reaction. A remarkable consequence of the condensate is that, irrespective of open or close geometry of the Universe, for an appropriate choice of parameter window, the condensate can induce a decelerating phase before the accelerated expansion starts and again, in some cases, it can help to avoid the singularity in the deceleration parameter (that is present in conventional FLRW Cosmology).

Submitted 8 August, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: 10 pages including references, 4 compound figures; published version in JHEAp

Journal ref: JHEAp 43 (2024) 231-238

arXiv:2211.01091 [pdf, ps, other]

I4U System Description for NIST SRE'20 CTS Challenge

Authors: Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Deldago, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, et al. (1 additional author not shown)

Abstract: This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U's submission was resulted from active collaboration among researchers across eight research teams - I$^2$R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (China). The submission was based on the fusion of top performing sub-systems and sub-fusion systems contributed by individual teams. Efforts have been spent on the use of common development and validation sets, submission schedule and milestone, minimizing inconsistency in trial list and score file format across sites.

Submitted 2 November, 2022; originally announced November 2022.

Comments: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021

arXiv:2210.15385 [pdf, other]

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

Abstract: We study a novel neural architecture and its training strategies of speaker encoder for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-size speaker embedding from a spoken utterance of varying length. Contrastive learning is a typical self-supervised learning technique. However, the quality of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common that we sample a positive pair of segments from the same utterance. Unfortunately, such poor-man's positive pairs (PPP) lack necessary diversity for the training of a robust encoder. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we study a method that finds diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89\%, 3.17\% and 6.27\% under the proposed progressive clustering strategy, and an EER of 1.44\%, 1.77\% and 3.27\% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin and, at the same time, achieves comparable results with the supervised learning counterpart. We also evaluate our self-supervised learning technique on LRS2 and LRW datasets, where the speaker information is unknown. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.

Submitted 27 October, 2022; originally announced October 2022.

Comments: 13 pages

arXiv:2202.01624 [pdf, other]

MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances

Authors: Tianchi Liu, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

Abstract: The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification. However, such networks require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and computation complexity. Further, the MFA mechanism is found to be effective for speaker verification with short test utterances.

Submitted 15 February, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2112.04573 [pdf]

Application of Artificial Intelligence and Machine Learning in Libraries: A Systematic Review

Authors: Rajesh Kumar Das, Mohammad Sharif Ul Islam

Abstract: As the concept and implementation of cutting-edge technologies like artificial intelligence and machine learning have become relevant, academics, researchers and information professionals have become involved in research in this area. The objective of this systematic literature review is to provide a synthesis of empirical studies exploring application of artificial intelligence and machine learning in libraries. To achieve the objectives of the study, a systematic literature review was conducted based on the original guidelines proposed by Kitchenham et al. (2009). Data was collected from Web of Science, Scopus, LISA and LISTA databases. Following the rigorous, established selection process, a total of thirty-two articles were finally selected, reviewed and analyzed to summarize the application of the AI and ML domain and techniques which are most often used in libraries. Findings show that the current state of the AI and ML research that is relevant to the LIS domain mainly focuses on theoretical works. However, some researchers also emphasized implementation projects or case studies. This study will provide a panoramic view of AI and ML in libraries for researchers, practitioners and educators for furthering the more technology-oriented approaches, and anticipating future innovation pathways.

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.06671 [pdf, ps, other]

HLT-NUS SUBMISSION FOR 2020 NIST Conversational Telephone Speech SRE

Authors: Rohan Kumar Das, Ruijie Tao, Haizhou Li

Abstract: This work provides a brief description of the Human Language Technology (HLT) Laboratory, National University of Singapore (NUS) system submission for the 2020 NIST conversational telephone speech (CTS) speaker recognition evaluation (SRE). The challenge focuses on evaluation under CTS data containing multilingual speech. The systems developed at HLT-NUS consider time-delay neural network (TDNN) x-vector and ECAPA-TDNN systems. We also perform domain adaptation of the probabilistic linear discriminant analysis (PLDA) model and adaptive s-norm on our systems. The score level fusion of TDNN x-vector and ECAPA-TDNN systems is carried out, which improves the final system performance of our submission to 2020 NIST CTS SRE.

Submitted 12 November, 2021; originally announced November 2021.

Comments: 3 pages

arXiv:2110.03869 [pdf, other]

Self-supervised Speaker Recognition with Loss-gated Learning

Authors: Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, Haizhou Li

Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn't always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a $46.3\%$ performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of $1.66\%$ on the VoxCeleb1 original test set. Code has been made available at: https://github.com/TaoRuijie/Loss-Gated-Learning.

Submitted 14 July, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures

arXiv:2110.00797 [pdf, other]

Significance of Data Augmentation for Improving Cleft Lip and Palate Speech Recognition

Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S. R. Mahadeva Prasanna

Abstract: The automatic recognition of pathological speech, particularly from children with any articulatory impairment, is a challenging task due to various reasons. The lack of available domain specific data is one such obstacle that hinders its usage for different speech-based applications targeting pathological speakers. In line with the challenge, in this work, we investigate a few data augmentation techniques to simulate training data for improving children's speech recognition considering the case of cleft lip and palate (CLP) speech. The augmentation techniques explored in this study include vocal tract length perturbation (VTLP), reverberation, speaking rate, pitch modification, and speech feature modification using cycle consistent adversarial networks (CycleGAN). Our study finds that the data augmentation methods significantly improve the CLP speech recognition performance, which is more evident when we use feature modification using CycleGAN, VTLP and reverberation based methods. More specifically, the results from this study show that our systems produce an improved phone error rate compared to the systems without data augmentation.

Submitted 2 October, 2021; originally announced October 2021.

arXiv:2109.08007 [pdf, other]

Graph Fourier Transform based Audio Zero-watermarking

Authors: Longting Xu, Daiyu Huang, Syed Faham Ali Zaidi, Abdul Rauf, Rohan Kumar Das

Abstract: The frequent exchange of multimedia information in the present era projects an increasing demand for copyright protection. In this work, we propose a novel audio zero-watermarking technology based on graph Fourier transform for enhancing the robustness with respect to copyright protection. In this approach, the combined shift operator is used to construct the graph signal, upon which the graph Fourier analysis is performed. The selected maximum absolute graph Fourier coefficients representing the characteristics of the audio segment are then encoded into a feature binary sequence using the K-means algorithm. Finally, the resultant feature binary sequence is XOR-ed with the watermark binary sequence to realize the embedding of the zero-watermarking. The experimental studies show that the proposed approach performs more effectively in resisting common or synchronization attacks than the existing state-of-the-art methods.

Submitted 16 September, 2021; originally announced September 2021.

arXiv:2107.06592 [pdf, other]

doi 10.1145/3474085.3475587

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Authors: Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li

Abstract: Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike the prior work where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.

Submitted 25 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

Comments: ACM Multimedia 2021

arXiv:2102.06332 [pdf, ps, other]

Data Augmentation with Signal Companding for Detection of Logical Access Attacks

Authors: Rohan Kumar Das, Jichen Yang, Haizhou Li

Abstract: The recent advances in voice conversion (VC) and text-to-speech (TTS) make it possible to produce natural sounding speech that poses a threat to automatic speaker verification (ASV) systems. To this end, research on spoofing countermeasures has gained attention to protect ASV systems from such attacks. While the advanced spoofing countermeasures are able to detect known nature of spoofing attacks, they are not that effective under unknown attacks. In this work, we propose a novel data augmentation technique using a-law and mu-law based signal companding. We believe that the proposed method has an edge over traditional data augmentation by adding small perturbation or quantization noise. The studies are conducted on the ASVspoof 2019 logical access corpus using a light convolutional neural network based system. We find that the proposed data augmentation technique based on signal companding outperforms the state-of-the-art spoofing countermeasures, showing ability to handle unknown nature of attacks.

Submitted 11 February, 2021; originally announced February 2021.

Comments: 5 pages, Accepted for publication in International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2021

arXiv:2102.00270 [pdf, other]

Enhancing the Intelligibility of Cleft Lip and Palate Speech using Cycle-consistent Adversarial Networks

Authors: Protima Nomo Sudro, Rohan Kumar Das, Rohit Sinha, S R Mahadeva Prasanna

Abstract: Cleft lip and palate (CLP) refers to a congenital craniofacial condition that causes various speech-related disorders. As a result of structural and functional deformities, the affected subjects' speech intelligibility is significantly degraded, limiting the accessibility and usability of speech-controlled devices. Towards addressing this problem, it is desirable to improve the CLP speech intelligibility. Moreover, it would be useful during speech therapy. In this study, the cycle-consistent adversarial network (CycleGAN) method is exploited for improving CLP speech intelligibility. The model is trained on native Kannada-speaking children's speech data. The effectiveness of the proposed approach is also measured using automatic speech recognition performance. Further, subjective evaluation is performed, and those results also confirm the intelligibility improvement in the enhanced speech over the original.

Submitted 30 January, 2021; originally announced February 2021.

Comments: 8 pages, 4 figures, IEEE spoken language and technology workshop

arXiv:2011.00699 [pdf, other]

Transformer-based Arabic Dialect Identification

Authors: Wanqiu Lin, Maulik Madhavi, Rohan Kumar Das, Haizhou Li

Abstract: This paper presents a dialect identification (DID) system based on the transformer neural network architecture. The conventional convolutional neural network (CNN)-based systems use shorter receptive fields. We believe that long range information is equally important for language and DID, and the self-attention mechanism in the transformer captures the long range dependencies. In addition, to reduce the computational complexity, self-attention with downsampling is used to process the acoustic features. This process extracts sparse, yet informative features. Our experimental results show that the transformer outperforms CNN-based networks on the Arabic dialect identification (ADI) dataset. We also report that the score-level fusion of CNN and transformer-based systems obtains an overall accuracy of 86.29% on the ADI17 database.

Submitted 1 November, 2020; originally announced November 2020.

Comments: Accepted for publication in International Conference on Asian Language Processing (IALP) 2020

arXiv:2010.03909 [pdf, other]

Emotion Invariant Speaker Embeddings for Speaker Identification with Emotional Speech

Authors: Biswajit Dev Sarma, Rohan Kumar Das

Abstract: The emotional state of a speaker is found to have a significant effect on speech production, which can deviate speech from that arising from a neutral state. This makes identifying speakers with different emotions a challenging task as generally the speaker models are trained using neutral speech. In this work, we propose to overcome this problem by creating an emotion invariant speaker embedding. We learn an extractor network that maps the test embeddings with different emotions obtained using an i-vector based system to an emotion invariant space. The resultant test embeddings thus become emotion invariant and thereby compensate for the mismatch between various emotional states. The studies are conducted using four different emotion classes from the IEMOCAP database. We obtain an absolute improvement of 2.6% in accuracy for speaker identification studies using emotion invariant speaker embedding against average speaker model based framework with different emotions.

Submitted 8 October, 2020; originally announced October 2020.

Comments: Accepted for publication in APSIPA ASC 2020

arXiv:2010.03907 [pdf, ps, other]

Classification of Speech with and without Face Mask using Acoustic Features

Authors: Rohan Kumar Das, Haizhou Li

Abstract: The understanding and interpretation of speech can be affected by various external factors. The use of face masks is one such factor that can create obstruction to speech while communicating. This may lead to degradation of speech processing and affect humans perceptually. Knowing whether a speaker wears a mask may be useful for modeling speech for different applications. With this motivation, finding whether a speaker wears a face mask from a given speech is included as a task in Computational Paralinguistics Evaluation (ComParE) 2020. We study novel acoustic features based on linear filterbanks, instantaneous phase and long-term information that can capture the artifacts for classification of speech with and without face mask. These acoustic features are used along with the state-of-the-art baselines of ComParE functionals, bag-of-audio-words, DeepSpectrum and auDeep features for ComParE 2020. The studies reveal the effectiveness of acoustic features, and their score level fusion with the ComParE 2020 baselines leads to an unweighted average recall of 73.50% on the test set.

Submitted 8 October, 2020; originally announced October 2020.

Comments: Accepted for publication in APSIPA ASC 2020

arXiv:2010.03905 [pdf, other]

HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

Authors: Rohan Kumar Das, Ruijie Tao, Jichen Yang, Wei Rao, Cheng Yu, Haizhou Li

Abstract: This work describes the speaker verification system developed by Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for the 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). Multimedia research has gained attention across a wide range of applications and speaker recognition is no exception to it. In contrast to the previous NIST SREs, the latest edition focuses on a multimedia track to recognize speakers with both audio and visual information. We developed separate systems for audio and visual inputs followed by a score level fusion of the systems from the two modalities to collectively use their information. The audio systems are based on x-vector based speaker embedding, whereas the face recognition systems are based on ResNet and InsightFace based face embeddings. With post evaluation studies and refinements, we obtain an equal error rate (EER) of 0.88% and an actual detection cost function (actDCF) of 0.026 on the evaluation set of 2019 NIST multimedia SRE corpus.

Submitted 8 October, 2020; originally announced October 2020.

Comments: Accepted for publication in APSIPA ASC 2020

arXiv:2009.09637 [pdf, other]

Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks

Authors: Zhenzong Wu, Rohan Kumar Das, Jichen Yang, Haizhou Li

Abstract: Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during training. However, their performance degrades in the face of attacks of an unseen nature. In comparison to the synthetic speech created by a wide range of TTS and VC methods, genuine speech has a more consistent distribution. We believe that the difference between the distribution of synthetic and genuine speech is an important discriminative feature between the two classes. In this regard, we propose a novel method referred to as feature genuinization that learns a transformer with convolutional neural network (CNN) using the characteristics of only genuine speech. We then use this genuinization transformer with a light CNN classifier. The ASVspoof 2019 logical access corpus is used to evaluate the proposed method. The studies show that the proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, depicting its effectiveness for detection of synthetic speech attacks.

Submitted 21 September, 2020; originally announced September 2020.

Comments: Accepted for publication in Interspeech 2020

Showing 1–50 of 78 results for author: Das, R K