Search | arXiv e-print repository

arXiv:2004.01546 [pdf, other]

doi 10.1109/TASLP.2020.2982297

Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection

Authors: Tharindu Fernando, Sridha Sridharan, Mitchell McLaren, Darshana Priyasad, Simon Denman, Clinton Fookes

Abstract: This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/ non-speech classifications together with the… ▽ More This paper presents a novel framework for Speech Activity Detection (SAD). Inspired by the recent success of multi-task learning approaches in the speech processing domain, we propose a novel joint learning framework for SAD. We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/ non-speech classifications together with the next audio segment. In order to exploit the temporal relationships within the input signal, we propose a temporal discriminator which aims to ensure that the predicted signal is temporally consistent. We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT' 17, AMI Meeting and HAVIC, where we demonstrate its capability to outperform state-of-the-art SAD approaches. Furthermore, our cross-database evaluations demonstrate the robustness of the proposed approach across different languages, accents, and acoustic environments. △ Less

Submitted 1 April, 2020; originally announced April 2020.

Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020

arXiv:2002.03802 [pdf, other]

A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: In a recent work, we presented a discriminative backend for speaker verification that achieved good out-of-the-box calibration performance on most tested conditions containing varying levels of mismatch to the training conditions. This backend mimics the standard PLDA-based backend process used in most current speaker verification systems, including the calibration stage. All parameters of the bac… ▽ More In a recent work, we presented a discriminative backend for speaker verification that achieved good out-of-the-box calibration performance on most tested conditions containing varying levels of mismatch to the training conditions. This backend mimics the standard PLDA-based backend process used in most current speaker verification systems, including the calibration stage. All parameters of the backend are jointly trained to optimize the binary cross-entropy for the speaker verification task. Calibration robustness is achieved by making the parameters of the calibration stage a function of vectors representing the conditions of the signal, which are extracted using a model trained to predict condition labels. In this work, we propose a simplified version of this backend where the vectors used to compute the calibration parameters are estimated within the backend, without the need for a condition prediction model. We show that this simplified method provides similar performance to the previously proposed method while being simpler to implement, and having less requirements on the training data. Further, we provide an analysis of different aspects of the method including the effect of initialization, the nature of the vectors used to compute the calibration parameters, and the effect that the random seed and the number of training epochs has on performance. We also compare the proposed method with the trial-based calibration (TBC) method that, to our knowledge, was the state-of-the-art for achieving good calibration across varying conditions. We show that the proposed method outperforms TBC while also being several orders of magnitude faster to run, comparable to the standard PLDA baseline. △ Less

Submitted 5 February, 2020; originally announced February 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1911.11622

arXiv:1912.02522 [pdf, other]

VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

Authors: Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Inte… ▽ More The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions. △ Less

Submitted 5 December, 2019; originally announced December 2019.

Comments: ISCA Archive

arXiv:1911.11622 [pdf, other]

A discriminative condition-aware backend for speaker verification

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: We present a scoring approach for speaker verification that mimics the standard PLDA-based backend process used in most current speaker verification systems. However, unlike the standard backends, all parameters of the model are jointly trained to optimize the binary cross-entropy for the speaker verification task. We further integrate the calibration stage inside the model, making the parameters… ▽ More We present a scoring approach for speaker verification that mimics the standard PLDA-based backend process used in most current speaker verification systems. However, unlike the standard backends, all parameters of the model are jointly trained to optimize the binary cross-entropy for the speaker verification task. We further integrate the calibration stage inside the model, making the parameters of this stage depend on metadata vectors that represent the conditions of the signals. We show that the proposed backend has excellent out-of-the-box calibration performance on most of our test sets, making it an ideal approach for cases in which the test conditions are not known and development data is not available for training a domain-specific calibration model. △ Less

Submitted 26 November, 2019; originally announced November 2019.

Journal ref: Proceedings of ICASSP 2020

arXiv:1910.12597 [pdf]

Extending Deep Knowledge Tracing: Inferring Interpretable Knowledge and Predicting Post-System Performance

Authors: Richard Scruggs, Ryan S. Baker, Bruce M. McLaren

Abstract: Recent student knowledge modeling algorithms such as Deep Knowledge Tracing (DKT) and Dynamic Key-Value Memory Networks (DKVMN) have been shown to produce accurate predictions of problem correctness within the same learning system. However, these algorithms do not attempt to directly infer student knowledge. In this paper we present an extension to these algorithms to also infer knowledge. We appl… ▽ More Recent student knowledge modeling algorithms such as Deep Knowledge Tracing (DKT) and Dynamic Key-Value Memory Networks (DKVMN) have been shown to produce accurate predictions of problem correctness within the same learning system. However, these algorithms do not attempt to directly infer student knowledge. In this paper we present an extension to these algorithms to also infer knowledge. We apply this extension to DKT and DKVMN, resulting in knowledge estimates that correlate better with a posttest than knowledge estimates from Bayesian Knowledge Tracing (BKT), an algorithm designed to infer knowledge, and another classic algorithm, Performance Factors Analysis (PFA). We also apply our extension to correctness predictions from BKT and PFA, finding that knowledge estimates produced with it correlate better with the posttest than BKT and PFA's standard knowledge estimates. These findings are significant since the primary aim of education is to prepare students for later experiences outside of the immediate learning activity. △ Less

Submitted 31 August, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

Comments: 8 pages, accepted to the International Conference on Computers in Education

arXiv:1803.10554 [pdf, other]

Joint PLDA for Simultaneous Modeling of Two Factors

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: Probabilistic linear discriminant analysis (PLDA) is a method used for biometric problems like speaker or face recognition that models the variability of the samples using two latent variables, one that depends on the class of the sample and another one that is assumed independent across samples and models the within-class variability. In this work, we propose a generalization of PLDA that enables… ▽ More Probabilistic linear discriminant analysis (PLDA) is a method used for biometric problems like speaker or face recognition that models the variability of the samples using two latent variables, one that depends on the class of the sample and another one that is assumed independent across samples and models the within-class variability. In this work, we propose a generalization of PLDA that enables joint modeling of two sample-dependent factors: the class of interest and a nuisance condition. The approach does not change the basic form of PLDA but rather modifies the training procedure to consider the dependency across samples of the latent variable that models within-class variability. While the identity of the nuisance condition is needed during training, it is not needed during testing since we propose a scoring procedure that marginalizes over the corresponding latent variable. We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. We show that the proposed joint PLDA approach leads to significant performance gains in this task for two different datasets, in particular when the training data contains mostly or only monolingual speakers. △ Less

Submitted 28 March, 2018; originally announced March 2018.

Comments: Submitted to Journal of Machine Learning Research

Journal ref: Journal of Machine Learning Research, January, 2019

Showing 1–6 of 6 results for author: McLaren, M