Showing 1–7 of 7 results for author: Havtorn, J D

  1. arXiv:2205.10643  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-Supervised Speech Representation Learning: A Review

    Authors: Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

    Abstract: Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a…

    Submitted 27 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

  2. arXiv:2203.01829  [pdf, other]

    eess.AS cs.LG cs.SD

    A Brief Overview of Unsupervised Neural Speech Representation Learning

    Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Joakim Edin, Lars Maaløe, Christian Igel

    Abstract: Unsupervised representation learning for speech processing has matured greatly in the last few years. Work in computer vision and natural language processing has paved the way, but speech data offers unique challenges. As a result, methods from other domains rarely translate directly. We review the development of unsupervised representation learning for speech over the last decade. We identify two…

    Submitted 1 March, 2022; originally announced March 2022.

    Comments: The 2nd Workshop on Self-supervised Learning for Audio and Speech Processing (SAS) at AAAI

  3. arXiv:2202.12707  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD stat.ML

    Benchmarking Generative Latent Variable Models for Speech

    Authors: Jakob D. Havtorn, Lasse Borgholt, Søren Hauberg, Jes Frellsen, Lars Maaløe

    Abstract: Stochastic latent variable models (LVMs) achieve state-of-the-art performance on natural image generation but are still inferior to deterministic models on speech. In this paper, we develop a speech benchmark of popular temporal LVMs and compare them against state-of-the-art deterministic models. We report the likelihood, which is a much used metric in the image domain, but rarely, or incomparably…

    Submitted 5 April, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

    Comments: Accepted at the 2022 ICLR workshop on Deep Generative Models for Highly Structured Data (https://deep-gen-struct.github.io)

  4. arXiv:2111.14842  [pdf, other]

    eess.AS cs.CL cs.LG

    Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?

    Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Mostafa Abdou, Joakim Edin, Lars Maaløe, Anders Søgaard, Christian Igel

    Abstract: Spoken language understanding (SLU) tasks are usually solved by first transcribing an utterance with automatic speech recognition (ASR) and then feeding the output to a text-based model. Recent advances in self-supervised representation learning for speech data have focused on improving the ASR component. We investigate whether representation learning for speech has matured enough to replace ASR i…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: Under review as a conference paper at ICASSP 2022

  5. arXiv:2102.09928  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Do End-to-End Speech Recognition Models Care About Context?

    Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel

    Abstract: The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual…

    Submitted 17 February, 2021; originally announced February 2021.

    Comments: Published in the proceedings of INTERSPEECH 2020, pp. 4352-4356

  6. arXiv:2102.00850  [pdf, other]

    eess.AS cs.LG cs.SD

    On Scaling Contrastive Representations for Low-Resource Speech Recognition

    Authors: Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, Lars Maaløe, Christian Igel

    Abstract: Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning b…

    Submitted 1 February, 2021; originally announced February 2021.

    Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  7. arXiv:2005.00812  [pdf, other]

    cs.CL cs.SD eess.AS

    MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech

    Authors: Jakob D. Havtorn, Jan Latko, Joakim Edin, Lasse Borgholt, Lars Maaløe, Lorenzo Belgrano, Nicolai F. Jacobsen, Regitze Sdun, Željko Agić

    Abstract: We address a challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, which embeds within a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalitie…

    Submitted 12 May, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted at ACL 2020