Showing 1–25 of 25 results for author: Renals, S

Searching in archive eess.
  1. arXiv:2406.00898  [pdf, other]

    cs.SD cs.CL eess.AS

    Phonetic Error Analysis of Raw Waveform Acoustic Models with Parametric and Non-Parametric CNNs

    Authors: Erfan Loweimi, Andrea Carmantini, Peter Bell, Steve Renals, Zoran Cvetkovic

    Abstract: In this paper, we analyse the error patterns of the raw waveform acoustic models in TIMIT's phone recognition task. Our analysis goes beyond the conventional phone error rate (PER) metric. We categorise the phones into three groups: {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel+, silence}, and {voiced, unvoiced, silence}, and compute the PER for e…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 5 pages, 6 figures, 3 tables

  2. arXiv:2110.08634  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Towards Robust Waveform-Based Acoustic Models

    Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

    Abstract: We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, wh…

    Submitted 29 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  3. arXiv:2105.15162  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.IV

    Automatic audiovisual synchronisation for ultrasound tongue imaging

    Authors: Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised. Synchronisation is achieved using specialis…

    Submitted 31 May, 2021; originally announced May 2021.

    Comments: 18 pages, 10 figures. Manuscript accepted at Speech Communication

  4. arXiv:2103.00333  [pdf, other]

    eess.AS cs.CL cs.SD q-bio.QM

    Silent versus modal multi-speaker speech recognition from ultrasound and video

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode misma…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 5 pages, 5 figures, Submitted to Interspeech 2021

  5. arXiv:2103.00324  [pdf, ps, other]

    eess.AS cs.CL cs.SD q-bio.NC

    Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors

    Authors: Manuel Sam Ribeiro, Joanne Cleland, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Speech sound disorders are a common communication impairment in childhood. Because speech disorders can negatively affect the lives and the development of children, clinical intervention is often recommended. To help with diagnosis and treatment, clinicians use instrumented methods such as spectrograms or ultrasound tongue imaging to analyse speech articulations. Analysis with these methods can be…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 15 pages, 9 figures, 6 tables

    Journal ref: Speech Communication, Volume 128, April 2021, Pages 24-34

  6. arXiv:2102.04697  [pdf, other]

    eess.AS cs.AI cs.SD

    Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

    Authors: Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and retraining the classifier (the upper layers) on the same dataset leads to worse performance. In this paper, for the first time, we show that the frozen…

    Submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  7. arXiv:2011.09804  [pdf, other]

    eess.AS cs.CL cs.CV cs.SD eess.IV

    TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

    Authors: Manuel Sam Ribeiro, Jennifer Sanger, Jing-Xuan Zhang, Aciel Eshky, Alan Wrench, Korin Richmond, Steve Renals

    Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of…

    Submitted 19 November, 2020; originally announced November 2020.

    Comments: 8 pages, 4 figures, Accepted to SLT2021, IEEE Spoken Language Technology Workshop

  8. arXiv:2011.04906  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a q…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:2005.13895

  9. arXiv:2011.04004  [pdf, other]

    cs.CL cs.SD eess.AS

    Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some archite…

    Submitted 6 April, 2021; v1 submitted 8 November, 2020; originally announced November 2020.

  10. arXiv:2010.14269  [pdf, other]

    cs.SD cs.LG eess.AS

    Leveraging speaker attribute information using multi task learning for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variations between all possible speakers, encoding the multiple acoustic aspects that make up a speaker's identity, whilst being robust to non-speaker acoustic variation. Deep speaker embeddings are normally trained discriminatively, pred…

    Submitted 23 April, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  11. arXiv:2008.06580  [pdf, other]

    eess.AS cs.CL cs.SD

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    Authors: Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

    Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data au…

    Submitted 28 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

    Comments: Total of 31 pages, 27 figures. Associated repository: https://github.com/pswietojanski/ojsp_adaptation_review_2020

    Journal ref: IEEE Open Journal of Signal Processing, vol. 2, pp. 33-66, 2021

  12. arXiv:2008.03403  [pdf, other]

    eess.AS cs.CL cs.SD

    Word Error Rate Estimation Without ASR Output: e-WER2

    Authors: Ahmed Ali, Steve Renals

    Abstract: Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimate the WER uses a multistream end-to-end architecture. We report re…

    Submitted 7 August, 2020; originally announced August 2020.

  13. arXiv:2005.13895  [pdf, other]

    eess.AS cs.CL cs.SD

    When Can Self-Attention Be Replaced by Feed Forward Layers?

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability to capture temporal relationships without being limited by the distance between two related events. However, we note that the range of the learned context prog…

    Submitted 28 May, 2020; originally announced May 2020.

  14. arXiv:2002.00453  [pdf, other]

    cs.SD cs.LG eess.AS

    DropClass and DropAdapt: Dropping classes for deep speaker representation learning

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Many recent works on deep speaker embeddings train their feature extraction networks on large classification tasks, distinguishing between all speakers in a training set. Empirically, this has been shown to produce speaker-discriminative embeddings, even for unseen speakers. However, it is not clear that this is the optimal means of training embeddings that generalize well. This work proposes two…

    Submitted 2 February, 2020; originally announced February 2020.

    Comments: Submitted to Speaker Odyssey 2020

  15. arXiv:1910.14443  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multi-scale Octave Convolutions for Robust Speech Recognition

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: We propose a multi-scale octave convolution layer to learn robust speech representations efficiently. Octave convolutions were introduced by Chen et al. [1] in the computer vision field to reduce the spatial redundancy of the feature maps by decomposing the output of a convolutional layer into feature maps at two different spatial resolutions, one octave apart. This approach improved the efficiency…

    Submitted 31 October, 2019; originally announced October 2019.

    Comments: submitted to ICASSP2020

  16. Channel adversarial training for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Previous work has encouraged domain-invariance in deep speaker embedding by adversarially classifying the dataset or labelled environment to which the generated features belong. We propose a training strategy which aims to produce features that are invariant at the granularity of the recording or channel, a finer grained objective than dataset- or environment-invariance. By training an adversary t…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: Submitted to IEEE ICASSP 2020

  17. arXiv:1910.10605  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Speaker Adaptive Training using Model Agnostic Meta-Learning

    Authors: Ondřej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Speaker adaptive training (SAT) of neural network acoustic models learns models in a way that makes them more suitable for adaptation to test conditions. Conventionally, model-based speaker adaptive training is performed by having a set of speaker dependent parameters that are jointly optimised with speaker independent parameters in order to remove speaker variation. However, this does not scale w…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE ASRU 2019

  18. arXiv:1910.02168  [pdf, other]

    eess.AS

    Cross lingual transfer learning for zero-resource domain adaptation

    Authors: Alberto Abad, Peter Bell, Andrea Carmantini, Steve Renals

    Abstract: We propose a method for zero-resource domain adaptation of DNN acoustic models, for use in low-resource situations where the only in-language training data available may be poorly matched to the intended target domain. Our method uses a multi-lingual model in which several DNN layers are shared between languages. This architecture enables domain adaptation transforms learned for one well-resourced…

    Submitted 29 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020. Main updates wrt previous versions: same network config in all experiments, added Babel/Material LR target language experiments, added comparison with alternative/similar methods of cross-lingual adaptation

  19. arXiv:1909.13759  [pdf, other]

    eess.AS cs.CL cs.SD

    Acoustic Model Adaptation from Raw Waveforms with SincNet

    Authors: Joachim Fainberg, Ondřej Klejch, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling, by restricting the filter functions, rather than having to learn every tap of e…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted to IEEE ASRU 2019

  20. arXiv:1909.13537  [pdf, other]

    cs.CL cs.SD eess.AS

    Embeddings for DNN speaker adaptive training

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT) focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations, and find that with a goo…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

  21. arXiv:1907.01413  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD eess.IV

    Speaker-independent classification of phonetic segments from raw ultrasound in child speech

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 4 figures, published in ICASSP2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019)

  22. arXiv:1907.00818  [pdf, other]

    eess.AS cs.CL cs.SD eess.IV

    Ultrasound tongue imaging for diarization and alignment of child speech therapy sessions

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate the automatic processing of child speech therapy sessions using ultrasound visual biofeedback, with a specific focus on complementing acoustic features with ultrasound images of the tongue for the tasks of speaker diarization and time-alignment of target words. For speaker diarization, we propose an ultrasound-based time-domain signal which we call estimated tongue activity. For wor…

    Submitted 15 August, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 3 figures, Accepted for publication at Interspeech 2019

  23. arXiv:1907.00758  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS eess.IV

    Synchronising audio and ultrasound by learning cross-modal embeddings

    Authors: Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

    Abstract: Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the s…

    Submitted 27 November, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure, 4 tables; Interspeech 2019 with the following edits: 1) Loss and accuracy upon convergence were accidentally reported from an older model. Now updated with model described throughout the paper. All other results remain unchanged. 2) Max true offset in the training data corrected from 179ms to 1789ms. 3) Detectability "boundary/range" renamed to detectability "thresholds"

  24. arXiv:1906.11521  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

    Authors: Ondrej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective func…

    Submitted 27 June, 2019; originally announced June 2019.

  25. arXiv:1905.13150  [pdf, other]

    cs.CL cs.SD eess.AS

    Lattice-based lightly-supervised acoustic model training

    Authors: Joachim Fainberg, Ondřej Klejch, Steve Renals, Peter Bell

    Abstract: In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcrip…

    Submitted 13 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Proc. INTERSPEECH 2019