
Showing 1–50 of 78 results for author: Kinnunen, T

Searching in archive eess.
  1. arXiv:2506.00402

    cs.CL cs.SD eess.AS

    Causal Structure Discovery for Error Diagnostics of Children's ASR

    Authors: Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen

    Abstract: Children's automatic speech recognition (ASR) often underperforms compared to that of adults due to a confluence of interdependent factors: physiological (e.g., smaller vocal tracts), cognitive (e.g., underdeveloped pronunciation), and extrinsic (e.g., vocabulary limitations, background noise). Existing analysis methods examine the impact of these factors in isolation, neglecting interdependencies…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Interspeech 2025

  2. arXiv:2505.20216

    eess.AS

    Continuous Learning for Children's ASR: Overcoming Catastrophic Forgetting with Elastic Weight Consolidation and Synaptic Intelligence

    Authors: Edem Ahadzi, Vishwanath Pratap Singh, Tomi Kinnunen, Ville Hautamaki

    Abstract: In this work, we present the first study addressing automatic speech recognition (ASR) for children in an online learning setting. This is particularly important for both child-centric applications and the privacy protection of minors, where training models with sequentially arriving data is critical. The conventional approach of model fine-tuning often suffers from catastrophic forgetting. To tac…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025. 5 pages

  3. arXiv:2505.19644

    cs.SD cs.AI cs.CR eess.AS

    STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution

    Authors: Anton Firc, Manasi Chibber, Jagabandhu Mishra, Vishwanath Pratap Singh, Tomi Kinnunen, Kamil Malinka

    Abstract: A key research area in deepfake speech detection is source tracing - determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation-specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and me…

    Submitted 5 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025 conference

    MSC Class: 68T45; 68T10; 94A08 ACM Class: I.2.7; I.5.4; K.4.1

  4. arXiv:2502.08857

    eess.AS

    ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

    Authors: Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, et al. (4 additional authors not shown)

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier…

    Submitted 24 April, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Database link: https://zenodo.org/records/14498691, Database mirror link: https://huggingface.co/datasets/jungjee/asvspoof5, ASVspoof 5 Challenge Workshop Proceeding: https://www.isca-archive.org/asvspoof_2024/index.html

  5. arXiv:2502.08587

    eess.AS

    Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors

    Authors: Vishwanath Pratap Singh, Md. Sahidullah, Tomi Kinnunen

    Abstract: The increasing use of children's automatic speech recognition (ASR) systems has spurred research efforts to improve the accuracy of models designed for children's speech in recent years. The current approach either uses open-source speech foundation models (SFMs) directly or fine-tunes them with children's speech data. These SFMs, whether open-source or fine-tuned for children, often exhibit…

    Submitted 12 February, 2025; originally announced February 2025.

    Comments: Submitted to Computer Speech & Language

  6. arXiv:2502.04049

    eess.AS

    Towards Explainable Spoofed Speech Attribution and Detection: a Probabilistic Approach for Characterizing Speech Synthesizer Components

    Authors: Jagabandhu Mishra, Manasi Chhibber, Hye-jin Shim, Tomi H. Kinnunen

    Abstract: We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding value…

    Submitted 1 June, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted in Computer Speech and Language

  7. arXiv:2412.18191

    cs.SD eess.AS

    Explaining Speaker and Spoof Embeddings via Probing

    Authors: Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen

    Abstract: This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker…

    Submitted 24 December, 2024; originally announced December 2024.

    Comments: To appear in IEEE ICASSP 2025

  8. arXiv:2410.20578

    eess.AS cs.AI cs.SD

    Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

    Authors: Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki

    Abstract: Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to unseen attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn at…

    Submitted 31 October, 2024; v1 submitted 27 October, 2024; originally announced October 2024.

    Comments: 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

  9. arXiv:2409.11027

    eess.AS cs.SD

    An Explainable Probabilistic Attribute Embedding Approach for Spoofed Speech Characterization

    Authors: Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen

    Abstract: We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high-dimensional raw embeddings extracted from a spoofing countermeasure (CM) whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of sub-components that make up a specific spoofing attack. These att…

    Submitted 17 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP-2025

  10. arXiv:2408.08739

    eess.AS cs.AI cs.SD

    ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

    Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogat…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  11. arXiv:2407.04034

    eess.AS

    Optimizing a-DCF for Spoofing-Robust Speaker Verification

    Authors: Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi

    Abstract: Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. We propose a spoofing-robust ASV system optimized directly for the recently introduced architecture-agnostic detection cost function (a-DCF), which allows targeting a desired trade-off between the contradicting aims of user convenience and robustness to spoofing. We combine a-DCF and binary cross-entropy (BCE) with a…

    Submitted 3 March, 2025; v1 submitted 4 July, 2024; originally announced July 2024.

  12. arXiv:2406.17246

    cs.SD cs.AI eess.AS

    Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

    Authors: Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

    Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend c…

    Submitted 26 August, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables, ISCA Interspeech 2024 SynData4GenAI Workshop

  13. arXiv:2406.10836

    eess.AS cs.SD

    Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

    Authors: Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

    Abstract: Fusing outputs from automatic speaker verification (ASV) and spoofing countermeasure (CM) is expected to make an integrated system robust to zero-effort imposters and synthesized spoofing attacks. Many score-level fusion methods have been proposed, but many remain heuristic. This paper revisits score-level fusion using tools from decision theory and presents three main findings. First, fusion by s…

    Submitted 24 September, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: Proceedings of Interspeech, DOI: 10.21437/Interspeech.2024-422. Code: https://github.com/nii-yamagishilab/SpeechSPC-mini

  14. arXiv:2406.09999

    eess.AS

    ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

    Authors: Vishwanath Pratap Singh, Federico Malato, Ville Hautamaki, Md. Sahidullah, Tomi Kinnunen

    Abstract: While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approaches associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fix…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted: Interspeech 2024

    Journal ref: Interspeech 2024

  15. arXiv:2403.01355

    eess.AS cs.LG

    a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

    Authors: Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen, Nicholas Evans, Jean-Francois Bonastre, Itshak Lapidot

    Abstract: Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, w…

    Submitted 15 April, 2025; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: published at ISCA Speaker Odyssey 2024

  16. arXiv:2402.15214

    eess.AS cs.SD

    ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

    Authors: Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

    Abstract: The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: The following article has been accepted by The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at https://pubs.aip.org/asa/jasa

  17. arXiv:2401.11156

    cs.CR cs.AI cs.SD eess.AS

    Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

    Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

    Abstract: It is now well known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to protect ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input as either a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization effo…

    Submitted 27 January, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

  18. arXiv:2309.12237

    cs.CR cs.LG cs.SD eess.AS eess.IV stat.CO

    t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

    Authors: Tomi Kinnunen, Kong Aik Lee, Hemlata Tak, Nicholas Evans, Andreas Nautsch

    Abstract: Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. W…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For associated codes, see https://github.com/TakHemlata/T-EER (Github) and https://colab.research.google.com/drive/1ga7eiKFP11wOFMuZjThLJlkBcwEG6_4m?usp=sharing (Google Colab)

  19. arXiv:2306.07501

    eess.AS cs.SD

    Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

    Authors: Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

    Abstract: In this paper, we study the impact of ageing on modern deep speaker embedding based automatic speaker verification (ASV) systems. We have selected two different datasets to examine ageing on the state-of-the-art ECAPA-TDNN system. The first dataset, used for addressing short-term ageing (up to 10 years' time difference between enrollment and test) under uncontrolled conditions, is VoxCeleb. The…

    Submitted 12 June, 2023; originally announced June 2023.

    Journal ref: Interspeech 2023

  20. arXiv:2306.00044

    cs.LG cs.CR cs.SD eess.AS

    How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

    Authors: Hye-jin Shim, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

    Abstract: Shortcut learning, or the 'Clever Hans effect', refers to situations where a learning agent (e.g., deep neural networks) learns spurious correlations present in data, resulting in biased models. We focus on finding shortcuts in deep learning based spoofing countermeasures (CMs) that predict whether a given utterance is spoofed or not. While prior work has addressed specific data artifacts, such as sile…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Interspeech 2023

  21. arXiv:2305.19953

    cs.SD cs.LG eess.AS

    Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing

    Authors: Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen

    Abstract: Audio anti-spoofing for automatic speaker verification aims to safeguard users' identities from spoofing attacks. Although state-of-the-art spoofing countermeasure (CM) models perform well on specific datasets, they lack generalization when evaluated with different datasets. To address this limitation, previous studies have explored large pre-trained models, which require significant resources and…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  22. arXiv:2305.19051

    eess.AS cs.AI cs.SD

    Towards single integrated spoofing-aware speaker verification embeddings

    Authors: Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

    Abstract: This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outpe…

    Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023. Code and models are available in https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

  23. arXiv:2303.01126

    cs.SD cs.CR eess.AS

    Speaker-Aware Anti-Spoofing

    Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

    Abstract: We address speaker-aware anti-spoofing, where prior knowledge of the target speaker is incorporated into a voice spoofing countermeasure (CM). In contrast to the frequently used speaker-independent solutions, we train the CM in a speaker-conditioned way. As a proof of concept, we consider a speaker-aware extension to the state-of-the-art AASIST (audio anti-spoofing using integrated spectro-temporal…

    Submitted 8 June, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

  24. arXiv:2303.01125

    cs.SD cs.LG eess.AS

    Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Even though deep speaker models have demonstrated impressive accuracy in speaker verification tasks, this often comes at the expense of increased model size and computation time, presenting challenges for deployment in resource-constrained environments. Our research focuses on addressing this limitation through the development of small footprint deep speaker embedding extraction using knowledge di…

    Submitted 19 December, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Submitted to Data & Knowledge Engineering at Dec. 2023. Copyright may be transferred without notice

  25. arXiv:2302.10014

    eess.AS

    Learnable Frontends that do not Learn: Quantifying Sensitivity to Filterbank Initialisation

    Authors: Mark Anderson, Tomi Kinnunen, Naomi Harte

    Abstract: While much of modern speech and audio processing relies on deep neural networks trained using fixed audio representations, recent studies suggest great potential in acoustic frontends learnt jointly with a backend. In this study, we focus specifically on learnable filterbanks. Prior studies have reported that in frontends using learnable filterbanks initialised to a mel scale, the learned filters…

    Submitted 20 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023, 5 pages, 2 figures, 2 tables

  26. arXiv:2211.01091

    eess.AS cs.AI cs.SD

    I4U System Description for NIST SRE'20 CTS Challenge

    Authors: Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Delgado, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, et al. (1 additional author not shown)

    Abstract: This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U submission resulted from active collaboration among researchers across eight research teams - I$^2$R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (C…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021

  27. arXiv:2210.02437

    cs.SD cs.CR cs.MM eess.AS

    ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

    Authors: Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, Kong Aik Lee

    Abstract: Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This ar…

    Submitted 22 June, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  28. arXiv:2209.10479

    eess.AS cs.SD eess.SP

    An Initial Study on Birdsong Re-synthesis Using Neural Vocoders

    Authors: Rhythm Bhatia, Tomi H. Kinnunen

    Abstract: Modern speech synthesis uses neural vocoders to model raw waveform samples directly. This increased versatility has expanded the scope of vocoders from speech to other domains, such as music. We address another interesting domain of bio-acoustics. We provide initial comparative analysis-resynthesis experiments of birdsong using traditional (WORLD) and two neural (WaveNet autoencoder, parallel Wave…

    Submitted 21 September, 2022; originally announced September 2022.

    Comments: To appear in 24th International Conference on Speech and Computer (SPECOM), GURUGRAM, INDIA

  29. arXiv:2205.04923

    cs.SD eess.AS

    Gamified Speaker Comparison by Listening

    Authors: Sandip Ghimire, Tomi Kinnunen, Rosa Gonzalez Hautamäki

    Abstract: We address speaker comparison by listening in a game-like environment, hypothesized to make the task more motivating for naive listeners. We present the same 30 trials selected with the help of an x-vector speaker recognition system from VoxCeleb to a total of 150 crowdworkers recruited through Amazon's Mechanical Turk. They are divided into cohorts of 50, each using one of three alternative inter…

    Submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted to Odyssey 2022 The Speaker and Language Recognition Workshop

  30. arXiv:2205.00288

    eess.AS cs.SD

    Baselines and Protocols for Household Speaker Recognition

    Authors: Alexey Sholokhov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Speaker recognition on household devices, such as smart speakers, features several challenges: (i) robustness across a vast number of heterogeneous domains (households), (ii) short utterances, (iii) possibly absent speaker labels of the enrollment data (passive enrollment), and (iv) presence of unknown persons (guests). While many commercial products exist, there is less published research and no…

    Submitted 5 May, 2022; v1 submitted 30 April, 2022; originally announced May 2022.

    Comments: Accepted to Odyssey 2022

  31. arXiv:2204.09976

    cs.SD eess.AS

    Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

    Authors: Hye-jin Shim, Hemlata Tak, Xuechen Liu, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung, Soo-Whan Chung, Ha-Jin Yu, Bong-Jin Lee, Massimiliano Todisco, Héctor Delgado, Kong Aik Lee, Md Sahidullah, Tomi Kinnunen, Nicholas Evans

    Abstract: Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained f…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: 8 pages, accepted by Odyssey 2022

  32. Improving speaker de-identification with functional data analysis of f0 trajectories

    Authors: Lauri Tavi, Tomi Kinnunen, Rosa González Hautamäki

    Abstract: Due to a constantly increasing amount of speech data that is stored in different types of databases, voice privacy has become a major concern. To respond to such concern, speech researchers have developed various methods for speaker de-identification. The state-of-the-art solutions utilize deep learning solutions which can be effective but might be unavailable or impractical to apply for, for exam…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted to Speech Communication. March 2022

  33. arXiv:2203.14732

    eess.AS

    SASV 2022: The First Spoofing-Aware Speaker Verification Challenge

    Authors: Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Ha-Jin Yu, Nicholas Evans, Tomi Kinnunen

    Abstract: The first spoofing-aware speaker verification (SASV) challenge aims to integrate research efforts in speaker verification and anti-spoofing. We extend the speaker verification scenario by introducing spoofed trials to the usual set of target and impostor trials. In contrast to the established ASVspoof challenge where the focus is upon separate, independently optimised spoofing detection and speake…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, 2 tables, submitted to Interspeech 2022 as a conference paper

  34. arXiv:2203.10992

    cs.SD cs.AI eess.AS

    Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: In this paper, we initiate the concern of enhancing the spoofing robustness of the automatic speaker verification (ASV) system, without the primary presence of a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier based on probabilistic linear discriminant analysis. We employ three unsupervised…

    Submitted 26 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted by Speaker Odyssey 2022

  35. arXiv:2202.05236

    cs.SD cs.AI eess.AS

    Learnable Nonlinear Compression for Robust Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: In this study, we focus on nonlinear compression methods in spectral features for speaker verification based on deep neural networks. We consider different kinds of channel-dependent (CD) nonlinear compression methods optimized in a data-driven manner. Our methods are based on power nonlinearities and dynamic range compression (DRC). We also propose a multi-regime (MR) design on the nonlinearities, a…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP2022

  36. arXiv:2201.10283

    cs.SD cs.CR eess.AS

    SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

    Authors: Jee-weon Jung, Hemlata Tak, Hye-jin Shim, Hee-Soo Heo, Bong-Jin Lee, Soo-Whan Chung, Hong-Goo Kang, Ha-Jin Yu, Nicholas Evans, Tomi Kinnunen

    Abstract: ASV (automatic speaker verification) systems are intrinsically required to reject both non-target (e.g., voice uttered by a different speaker) and spoofed (e.g., synthesised or converted) inputs. However, there is little consideration for how ASV systems themselves should be adapted when they are expected to encounter spoofing attacks, nor when they operate in tandem with CMs (spoofing countermeasur…

    Submitted 2 March, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: Evaluation plan of the SASV Challenge 2022. See this webpage for more information: https://sasv-challenge.github.io

  37. arXiv:2201.09709

    cs.SD cs.CR cs.LG eess.AS

    Optimizing Tandem Speaker Verification and Anti-Spoofing Systems

    Authors: Anssi Kanervisto, Ville Hautamäki, Tomi Kinnunen, Junichi Yamagishi

    Abstract: As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Published version available at: https://ieeexplore.ieee.org/document/9664367

    Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 477-488, 2022

  38. arXiv:2110.10983

    cs.SD cs.AI eess.AS

    Optimizing Multi-Taper Features for Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Even if past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with d…

    Submitted 21 October, 2021; originally announced October 2021.

    Comments: To appear in IEEE Signal Processing Letters

  39. arXiv:2109.13510  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    VoxCeleb Enrichment for Age and Gender Recognition

    Authors: Khaled Hechmi, Trung Ngo Trong, Ville Hautamaki, Tomi Kinnunen

    Abstract: VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes. First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive ag…

    Submitted 20 December, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Accepted for presentation at ASRU 2021; repository: https://github.com/hechmik/voxceleb_enrichment_age_gender

  40. arXiv:2109.12058  [pdf, other]

    cs.SD cs.AI eess.AS

    Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted for other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated to environmental compensation may be redundant. Further, they might sup…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at ASRU 2021

  41. arXiv:2109.12056  [pdf, other]

    cs.SD cs.AI eess.AS

    Parameterized Channel Normalization for Far-field Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: We address far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor, where mismatch between enrollment and test data often comes from convolutive effects (e.g. room reverberation) and noise. To mitigate these effects, we focus on two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN).…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at ASRU 2021

  42. arXiv:2109.00537  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

    Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

    Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task in…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Accepted to the ASVspoof 2021 Workshop

  43. arXiv:2109.00535  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

    Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi

    Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: http://www.asvspoof.org

  44. arXiv:2109.00281  [pdf, other]

    cs.CR cs.SD eess.AS

    Benchmarking and challenges in security and privacy for voice biometrics

    Authors: Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah, Brij Mohan Lal Srivastava, Massimiliano Todisco, Natalia Tomashenko, Emmanuel Vincent, Xin Wang, Junichi Yamagishi

    Abstract: For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omnipresent. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls for greater, multidisciplinary collaboration with s…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC) special interest group

  45. arXiv:2106.06362  [pdf, other]

    cs.SD cs.LG eess.AS stat.AP

    Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing

    Authors: Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

    Abstract: Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity. We propose a simple method to derive a 2D representation from detection scores produced by an arbitrary set of binary classifiers in response to a common dataset. Based upon rank cor…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. Example code available at https://github.com/asvspoof-challenge/classifier-adjacency

  46. arXiv:2103.14602  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Data Quality as Predictor of Voice Anti-Spoofing Generalization

    Authors: Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

    Abstract: Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across domains (corpora) -- and we do not know why. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing perfor…

    Submitted 21 June, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: INTERSPEECH 2021

  47. arXiv:2102.10322  [pdf, other]

    cs.SD cs.LG eess.AS

    Learnable MFCCs for Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted to data flexibly. In practice, we formulate data-driven versions of the four linear transforms of a standard MFCC extracto…

    Submitted 20 February, 2021; originally announced February 2021.

    Comments: Accepted to ISCAS 2021

  48. arXiv:2102.05889  [pdf, other]

    eess.AS cs.CR cs.SD

    ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

    Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee

    Abstract: The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of bi-annual challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which o…

    Submitted 11 February, 2021; originally announced February 2021.

    Journal ref: IEEE Transactions on Biometrics, Behavior, and Identity Science 2021

  49. arXiv:2009.03554  [pdf, other]

    eess.AS cs.SD

    Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions

    Authors: Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, Tomoki Toda

    Abstract: The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary perf…

    Submitted 8 September, 2020; originally announced September 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  50. arXiv:2008.12527  [pdf, other]

    eess.AS cs.SD

    Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

    Authors: Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, Tomoki Toda

    Abstract: The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, includ…

    Submitted 28 August, 2020; originally announced August 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020