Korean Tokenization for Beam Search Rescoring in Speech Recognition

Shim, Kyuhong; Bae, Hyewon; Sung, Wonyong

arXiv:2203.03583 (cs)

[Submitted on 22 Feb 2022 (v1), last revised 28 Mar 2022 (this version, v2)]

Abstract: The performance of automatic speech recognition (ASR) models can be greatly improved by proper beam-search decoding with an external language model (LM). Interest in Korean speech recognition has been increasing, but few studies have focused on the decoding procedure. In this paper, we propose a Korean tokenization method for the neural network-based LM used in Korean ASR. Although the common approach is to use the same tokenization method for the external LM as for the ASR model, we show that this may not be the best choice for Korean. We propose a new tokenization method that inserts a special token, SkipTC, when a Korean syllable has no trailing consonant. With the SkipTC token, the input sequence for the LM follows a very regular pattern, so the LM can better learn the linguistic characteristics. Our experiments show that the proposed approach achieves a lower word error rate than the same LM without SkipTC. In addition, we are the first to report ASR performance on the recently introduced large-scale 7,600h Korean speech dataset.
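The SkipTC idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the LM operates on jamo-level units (leading consonant, vowel, optional trailing consonant), relies on the standard Unicode arithmetic for decomposing precomposed Hangul syllables, and uses `<SkipTC>` as a hypothetical spelling of the special token. The function name `tokenize` and its `use_skip_tc` flag are likewise illustrative.

```python
# Hangul syllable block constants defined by the Unicode standard.
HANGUL_BASE = 0xAC00   # first precomposed syllable
HANGUL_LAST = 0xD7A3   # last precomposed syllable
NUM_VOWELS = 21
NUM_TAILS = 28         # 27 trailing consonants plus "no trailing consonant"

# Conjoining jamo tables, indexed by position within the syllable formula.
LEADS = [chr(0x1100 + i) for i in range(19)]
VOWELS = [chr(0x1161 + i) for i in range(NUM_VOWELS)]
TAILS = [None] + [chr(0x11A8 + i) for i in range(NUM_TAILS - 1)]

SKIP_TC = "<SkipTC>"   # assumption: the token's actual spelling is not given


def tokenize(text, use_skip_tc=True):
    """Split text into jamo tokens, optionally padding open syllables."""
    tokens = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            lead, rest = divmod(index, NUM_VOWELS * NUM_TAILS)
            vowel, tail = divmod(rest, NUM_TAILS)
            tokens.append(LEADS[lead])
            tokens.append(VOWELS[vowel])
            if tail:
                tokens.append(TAILS[tail])
            elif use_skip_tc:
                # No trailing consonant: emit the placeholder instead.
                tokens.append(SKIP_TC)
        else:
            # Non-Hangul characters (spaces, digits, Latin) pass through.
            tokens.append(ch)
    return tokens


if __name__ == "__main__":
    print(tokenize("한국어"))
    print(tokenize("한국어", use_skip_tc=False))
```

With the placeholder enabled, every Hangul syllable maps to exactly three tokens, which is one way to read the abstract's claim that the LM input becomes "very regularly patterned"; without it, syllables yield two or three tokens depending on whether a trailing consonant is present.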

Comments:	Submitted to INTERSPEECH 2022
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2203.03583 [cs.CL]
	(or arXiv:2203.03583v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.03583

Submission history

From: Kyuhong Shim [view email]
[v1] Tue, 22 Feb 2022 11:25:01 UTC (188 KB)
[v2] Mon, 28 Mar 2022 07:58:52 UTC (188 KB)
