Super-Human Performance in Online Low-latency Recognition of Conversational Speech

Nguyen, Thai-Son; Stueker, Sebastian; Waibel, Alex

arXiv:2010.03449 (cs)

[Submitted on 7 Oct 2020 (v1), last revised 26 Jul 2021 (this version, v5)]

Abstract: Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult to recognize than read speech, because hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing and require robust, joint handling of acoustic, lexical and language context. Early attempts with statistical models could only reach error rates over 50%, far from human performance (a WER of around 5.5%). Neural hybrid models and recent attention-based encoder-decoder models have considerably improved performance, as such contexts can now be learned in an integral fashion. However, processing such contexts requires that the entire utterance be presented, and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that achieves super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a word-based latency of only 1 second behind a speaker's speech. The system uses multiple attention-based encoder-decoder networks integrated within a novel low-latency incremental inference approach.

Comments:	To appear in Interspeech 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2010.03449 [cs.CV]
	(or arXiv:2010.03449v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2010.03449

Submission history

From: Thai Son Nguyen
[v1] Wed, 7 Oct 2020 14:41:32 UTC (98 KB)
[v2] Thu, 22 Oct 2020 15:10:57 UTC (94 KB)
[v3] Wed, 10 Feb 2021 20:16:58 UTC (95 KB)
[v4] Tue, 8 Jun 2021 14:47:11 UTC (112 KB)
[v5] Mon, 26 Jul 2021 20:56:49 UTC (111 KB)
