Do End-to-End Speech Recognition Models Care About Context?
Authors: Lasse Borgholt, Jakob Drachmann Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel
Abstract:
The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.
Submitted 17 February, 2021;
originally announced February 2021.
MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
Authors: Jakob D. Havtorn, Jan Latko, Joakim Edin, Lasse Borgholt, Lars Maaløe, Lorenzo Belgrano, Nicolai F. Jacobsen, Regitze Sdun, Željko Agić
Abstract:
We address the challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, a task embedded in a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalities or views, as it jointly learns from streamed audio and its noisy transcription into text via automatic speech recognition. Our results show significant gains from jointly learning from the two modalities compared to text or audio only, under adverse noise and a limited volume of training data. The results generalize to medical symptom detection, where we observe a similar pattern of improvements with multimodal learning.
Submitted 12 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.