Showing 1–50 of 78 results for author: Kamper, H

  1. arXiv:2506.04037  [pdf, ps, other]

    cs.CL eess.AS

    The mutual exclusivity bias of bilingual visually grounded speech models

    Authors: Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper

    Abstract: Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explor…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: Interspeech 2025

  2. arXiv:2506.01510  [pdf, ps, other]

    eess.AS cs.CL

    LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

    Authors: Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau

    Abstract: We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-qual…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2505.23494  [pdf, ps, other]

    cs.CL eess.AS

    Spoken Language Modeling with Duration-Penalized Self-Supervised Units

    Authors: Nicol Visser, Herman Kamper

    Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  4. arXiv:2501.06478  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives

    Authors: Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper

    Abstract: We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speec…

    Submitted 11 January, 2025; originally announced January 2025.

    Comments: Accepted to ICASSP 2025

  5. arXiv:2501.05787  [pdf, other]

    eess.AS cs.CL

    MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model

    Authors: Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper

    Abstract: Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new spe…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: 5 pages, 2 figures, 1 table. Accepted at ICASSP 2025

  6. arXiv:2409.14486  [pdf, other]

    eess.AS cs.CL cs.SD

    Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

    Authors: Simon Malan, Benjamin van Niekerk, Herman Kamper

    Abstract: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predi…

    Submitted 12 January, 2025; v1 submitted 22 September, 2024; originally announced September 2024.

    Comments: Accepted at ICASSP 2025

  7. arXiv:2409.06013  [pdf, other]

    cs.CL cs.CV eess.AS

    Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

    Authors: Leanne Nortje, Dan Oneata, Herman Kamper

    Abstract: Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experime…

    Submitted 9 September, 2024; originally announced September 2024.

  8. arXiv:2406.07133  [pdf, other]

    eess.AS cs.CL cs.SD

    Translating speech with just images

    Authors: Dan Oneata, Herman Kamper

    Abstract: Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  9. arXiv:2403.13922  [pdf, other]

    cs.CL eess.AS

    Visually Grounded Speech Models have a Mutual Exclusivity Bias

    Authors: Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper

    Abstract: When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech model…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted to TACL, pre-MIT Press publication version

  10. arXiv:2401.17902  [pdf, other]

    eess.AS cs.CL cs.SD

    Revisiting speech segmentation and lexicon learning with better features

    Authors: Herman Kamper, Benjamin van Niekerk

    Abstract: We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 2 pages

  11. arXiv:2310.08104  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion aims to convert source speech into a target voice using recordings of the target speaker as a reference. Newer models are producing increasingly realistic output. But what happens when models are fed with non-standard data, such as speech from a user with a speech impairment? We investigate how a recent voice conversion model performs on non-standard downstream voice conversion ta…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 11 pages, 1 figure, 5 tables. Accepted at SACAIR 2023

  12. arXiv:2307.06040  [pdf, other]

    eess.AS cs.LG cs.SD

    Rhythm Modeling for Voice Conversion

    Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

    Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representatio…

    Submitted 12 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, 4 tables, submitted to IEEE Signal Processing Letters

  13. arXiv:2307.02083  [pdf, other]

    eess.AS cs.CL

    Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

    Authors: Christiaan Jacobs, Herman Kamper

    Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Submitted to IEEE SPL

  14. arXiv:2307.01673  [pdf, other]

    eess.AS cs.CL cs.SD

    Disentanglement in a GAN for Unconditional Speech Synthesis

    Authors: Matthew Baas, Herman Kamper

    Abstract: Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech syn…

    Submitted 25 January, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: 12 pages, 5 tables, 4 figures. Accepted to IEEE TASLP. arXiv admin note: substantial text overlap with arXiv:2210.05271

  15. arXiv:2306.11371  [pdf, other]

    eess.AS cs.CL

    Visually grounded few-shot word learning in low-resource settings

    Authors: Leanne Nortje, Dan Oneata, Herman Kamper

    Abstract: We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples…

    Submitted 18 April, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to TASLP. arXiv admin note: substantial text overlap with arXiv:2305.15937

  16. arXiv:2306.00410  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili

    Authors: Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper

    Abstract: We consider hate speech detection through keyword spotting on radio broadcasts. One approach is to build an automatic speech recognition (ASR) system for the target low-resource language. We compare this to using acoustic word embedding (AWE) models that map speech segments to a space where matching words have similar vectors. We specifically use a multilingual AWE model trained on labelled data f…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  17. arXiv:2305.18975  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion With Just Nearest Neighbors

    Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper

    Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity -- making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effecti…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 5 pages, 1 table, 2 figures. Accepted at Interspeech 2023

  18. arXiv:2305.15937  [pdf, other]

    cs.CL cs.AI eess.AS

    Visually grounded few-shot word acquisition with fewer shots

    Authors: Leanne Nortje, Benjamin van Niekerk, Herman Kamper

    Abstract: We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We p…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  19. arXiv:2305.13080  [pdf, other]

    cs.CL cs.AI eess.AS

    Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning

    Authors: Ruan van der Merwe, Herman Kamper

    Abstract: We consider the problem of few-shot spoken word classification in a setting where a model is incrementally introduced to new word classes. This would occur in a user-defined keyword system where new words can be added as the system is used. In such a continual learning scenario, a model might start to misclassify earlier words as newer classes are added, i.e. catastrophic forgetting. To address th…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures, Accepted to Interspeech 2023

    ACM Class: I.2.7; I.2.6

  20. arXiv:2210.07677  [pdf, other]

    eess.AS cs.AI cs.SD

    TransFusion: Transcribing Speech with Multinomial Diffusion

    Authors: Matthew Baas, Kevin Eloff, Herman Kamper

    Abstract: Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or written sentence). In this work we aim to see whether the benefits…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: 12 pages, 4 figures, 1 table. Accepted at SACAIR 2022

  21. arXiv:2210.06229  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards visually prompted keyword localisation for zero-resource spoken languages

    Authors: Leanne Nortje, Herman Kamper

    Abstract: Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  22. arXiv:2210.05271  [pdf, other]

    cs.SD cs.AI eess.AS

    GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

    Authors: Matthew Baas, Herman Kamper

    Abstract: We propose AudioStyleGAN (ASGAN), a new generative adversarial network (GAN) for unconditional speech synthesis. As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques,…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 6 pages, 2 figures, 2 tables. Accepted at IEEE SLT 2022

  23. arXiv:2210.04600  [pdf, other]

    cs.CL eess.AS

    YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

    Authors: Kayode Olaleye, Dan Oneata, Herman Kamper

    Abstract: Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a ne…

    Submitted 12 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  24. arXiv:2206.11706  [pdf, other]

    eess.AS cs.CL cs.LG stat.ML

    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery

    Authors: Werner van der Merwe, Herman Kamper, Johan du Preez

    Abstract: Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery fro…

    Submitted 29 June, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

  25. arXiv:2202.11929  [pdf, other]

    cs.CL cs.SD eess.AS

    Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

    Authors: Herman Kamper

    Abstract: Recent work on unsupervised speech segmentation has used self-supervised models with phone and word segmentation modules that are trained jointly. This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). To do this…

    Submitted 9 January, 2023; v1 submitted 24 February, 2022; originally announced February 2022.

    Comments: 11 pages, 5 figures, 5 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 31 (2023) 684-694

  26. arXiv:2202.01107  [pdf, other]

    cs.CL cs.SD eess.AS

    Keyword localisation in untranscribed speech using visually grounded speech models

    Authors: Kayode Olaleye, Dan Oneata, Herman Kamper

    Abstract: Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised -- trained without any explicit textual label or location information. To…

    Submitted 2 February, 2022; originally announced February 2022.

    Comments: 10 figures, 5 tables

  27. arXiv:2111.02827  [pdf, other]

    cs.CL cs.LG

    Towards Learning to Speak and Hear Through Multi-Agent Communication over a Continuous Acoustic Channel

    Authors: Kevin Eloff, Okko Räsänen, Herman A. Engelbrecht, Arnu Pretorius, Herman Kamper

    Abstract: Multi-agent reinforcement learning has been used as an effective means to study emergent communication between agents, yet little focus has been given to continuous acoustic communication. This would be more akin to human language acquisition; human infants acquire language in large part through continuous signalling with their caregivers. We therefore ask: Are we able to observe emergent language…

    Submitted 2 May, 2023; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: 10 pages, 3 figures, 6 tables

  28. arXiv:2111.02674  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion Can Improve ASR in Very Low-Resource Settings

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion (VC) could be used to improve speech recognition systems in low-resource languages by using it to augment limited training data. However, VC has not been widely used for this purpose because of practical issues such as compute speed and limitations when converting to and from unseen speakers. Moreover, it is still unclear whether a VC model trained on one well-resourced language c…

    Submitted 21 June, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: 5 pages, 4 tables, 2 figures. Accepted at Interspeech 2022

  29. A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

    Authors: Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

    Abstract: The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content - leading to…

    Submitted 8 June, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: 5 pages, 2 figures, 2 tables. Accepted at ICASSP 2022

  30. arXiv:2108.00917  [pdf, other]

    eess.AS cs.SD

    Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing

    Authors: Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper

    Abstract: Contrastive predictive coding (CPC) aims to learn representations of speech by distinguishing future observations from a set of negative examples. Previous work has shown that linear classifiers trained on CPC features can accurately predict speaker and phone labels. However, it is unclear how the features actually capture speaker and phonetic information, and whether it is possible to normalize o…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: Accepted to Interspeech 2021

  31. arXiv:2106.12834  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language

    Authors: Christiaan Jacobs, Herman Kamper

    Abstract: Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelle…

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  32. arXiv:2106.08859  [pdf, other]

    cs.CL cs.SD eess.AS

    Attention-Based Keyword Localisation in Speech using Visual Grounding

    Authors: Kayode Olaleye, Herman Kamper

    Abstract: Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do key…

    Submitted 23 June, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  33. arXiv:2106.00043  [pdf, other]

    eess.AS cs.CL cs.SD

    StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion is the task of converting a spoken utterance from a source speaker so that it appears to be said by a different target speaker while retaining the linguistic content of the utterance. Recent advances have led to major improvements in the quality of voice conversion systems. However, to be useful in a wider range of contexts, voice conversion systems would need to be (i) trainable…

    Submitted 31 May, 2021; originally announced June 2021.

    Comments: 16 pages, 3 figures. Published in Springer Communications in Computer and Information Science, Artificial Intelligence Research (SACAIR 2021), vol. 1342, pp. 69-84, 2020

    Journal ref: In: Springer Communications in Computer and Information Science, Artificial Intelligence Research (SACAIR 2021), vol. 1342, pp. 69-84, 2020

  34. arXiv:2103.10731  [pdf, other]

    cs.CL eess.AS

    Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation

    Authors: Christiaan Jacobs, Yevgen Matusevych, Herman Kamper

    Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an un…

    Submitted 19 March, 2021; originally announced March 2021.

    Comments: Accepted to SLT 2021

  35. arXiv:2101.11332  [pdf, other]

    cs.CL

    A phonetic model of non-native spoken word processing

    Authors: Yevgen Matusevych, Herman Kamper, Thomas Schatz, Naomi H. Feldman, Sharon Goldwater

    Abstract: Non-native speakers show difficulties with spoken word processing. Many studies attribute these difficulties to imprecise phonological encoding of words in the lexical memory. We test an alternative hypothesis: that some of these difficulties can arise from the non-native speakers' phonetic perception. We train a computational model of phonetic learning, which has no access to phonology, on either…

    Submitted 11 March, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

    Comments: Accepted for publication in Proceedings of EACL-2021. 11 pages, 5 figures, 2 tables

  36. arXiv:2012.07551  [pdf, other]

    cs.CL eess.AS

    Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

    Authors: Herman Kamper, Benjamin van Niekerk

    Abstract: We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the…

    Submitted 11 June, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  37. arXiv:2012.07396  [pdf, other]

    cs.CL eess.AS

    Towards localisation of keywords in speech using weak supervision

    Authors: Kayode Olaleye, Benjamin van Niekerk, Herman Kamper

    Abstract: Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) l…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to NeurIPS-SAS

  38. arXiv:2012.07387  [pdf, other]

    cs.CL eess.AS

    A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

    Authors: Lisa van Staden, Herman Kamper

    Abstract: Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to SLT 2021

  39. arXiv:2012.05680  [pdf, other]

    cs.CL cs.SD eess.AS

    Direct multimodal few-shot learning of speech and images

    Authors: Leanne Nortje, Herman Kamper

    Abstract: We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. pen, book and eraser. After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous w…

    Submitted 29 July, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  40. arXiv:2012.02221  [pdf, other]

    eess.AS cs.CL cs.SD

    A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

    Authors: Puyuan Peng, Herman Kamper, Karen Livescu

    Abstract: We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages. Our model, which we refer to as a maximal sampling correspondence variational autoencoder (MCVAE), is a recurrent neural network (RNN) trai…

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: 10 pages, 6 figures, NeurIPS 2020 Workshop Self-Supervised Learning for Speech and Audio Processing

  41. arXiv:2010.02353  [pdf, other]

    cs.CL cs.AI cs.LG

    Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

    Authors: Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, et al. (23 additional authors not shown)

    Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communicat…

    Submitted 6 November, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020; updated benchmarks

  42. arXiv:2008.06258  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

    Authors: Leanne Nortje, Herman Kamper

    Abstract: We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learnin…

    Submitted 14 August, 2020; originally announced August 2020.

    Comments: Accepted at Interspeech 2020

  43. arXiv:2008.02888  [pdf, other]

    cs.CL cs.SD eess.AS

    Evaluating computational models of infant phonetic learning across languages

    Authors: Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi H. Feldman, Sharon Goldwater

    Abstract: In the first year of life, infants' speech perception becomes attuned to the sounds of their native language. Many accounts of this early phonetic learning exist, but computational models predicting the attunement patterns observed in infants from the speech input they hear have been lacking. A recent study presented the first such model, drawing on algorithms proposed for unsupervised learning fr…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 7 pages, 1 figure

    Journal ref: 2020. In S. Denison, M. Mack, Y. Xu, and B. Armstrong (Eds.), Proceedings of the 42nd Annual Conference of the Cognitive Science Society (pp. 571-577). Austin, TX: Cognitive Science Society

  44. arXiv:2006.02295  [pdf, other]

    cs.CL cs.SD eess.AS

    Improved acoustic word embeddings for zero-resource languages using multilingual transfer

    Authors: Herman Kamper, Yevgen Matusevych, Sharon Goldwater

    Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we…

    Submitted 5 February, 2021; v1 submitted 2 June, 2020; originally announced June 2020.

    Comments: 11 pages, 7 figures, 8 tables. arXiv admin note: text overlap with arXiv:2002.02109. Submitted to the IEEE Transactions on Audio, Speech and Language Processing

  45. arXiv:2005.09409  [pdf, other]

    eess.AS cs.CL

    Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

    Authors: Benjamin van Niekerk, Leanne Nortje, Herman Kamper

    Abstract: In this paper, we explore vector quantization for acoustic unit discovery. Leveraging unlabelled data, we aim to learn discrete representations of speech that separate phonetic content from speaker-specific details. We propose two neural models to tackle this challenge - both use vector quantization to map continuous features to a finite set of codes. The first model is a type of vector-quantized…

    Submitted 19 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 3 figures, 2 tables, accepted to Interspeech 2020

  46. arXiv:2004.01647  [pdf, other]

    cs.CL

    Analyzing autoencoder-based acoustic word embeddings

    Authors: Yevgen Matusevych, Herman Kamper, Sharon Goldwater

    Abstract: Recent studies have introduced methods for learning acoustic word embeddings (AWEs)---fixed-size vector representations of words which encode their acoustic features. Despite the widespread use of AWEs in speech processing research, they have only been evaluated quantitatively in their ability to discriminate between whole word tokens. To better understand the applications of AWEs in various downs…

    Submitted 3 April, 2020; originally announced April 2020.

    Comments: 6 pages, 7 figures, accepted to BAICS workshop (ICLR2020)

  47. Unsupervised feature learning for speech using correspondence and Siamese networks

    Authors: Petri-Johan Last, Herman A. Engelbrecht, Herman Kamper

    Abstract: In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature f…

    Submitted 28 March, 2020; originally announced March 2020.

    Comments: 5 pages, 3 figures, 2 tables; accepted to the IEEE Signal Processing Letters, (c) 2020 IEEE

    Journal ref: IEEE Signal Processing Letters 27 (2020) 421-425

  48. arXiv:2003.11529  [pdf, other]

    cs.CL

    Masakhane -- Machine Translation For Africa

    Authors: Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, Abdallah Bashir

    Abstract: Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To…

    Submitted 13 March, 2020; originally announced March 2020.

    Comments: Accepted for the AfricaNLP Workshop, ICLR 2020

  49. arXiv:2002.02109  [pdf, other]

    cs.CL eess.AS

    Multilingual acoustic word embedding models for processing zero-resource languages

    Authors: Herman Kamper, Yevgen Matusevych, Sharon Goldwater

    Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to u…

    Submitted 21 February, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

    Comments: 5 pages, 4 figures, 1 table; accepted to ICASSP 2020. arXiv admin note: text overlap with arXiv:1811.00403

  50. arXiv:1912.05193  [pdf, other]

    eess.IV cs.CV

    Deep motion estimation for parallel inter-frame prediction in video compression

    Authors: André Nortje, Herman A. Engelbrecht, Herman Kamper

    Abstract: Standard video codecs rely on optical flow to guide inter-frame prediction: pixels from reference frames are moved via motion vectors to predict target video frames. We propose to learn binary motion codes that are encoded based on an input video sequence. These codes are not limited to 2D translations, but can capture complex motion (warping, rotation and occlusion). Our motion codes are learned…

    Submitted 11 December, 2019; originally announced December 2019.

    Comments: 25 pages, 11 figures, 5 tables