-
DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition
Authors:
Yui Sudo,
Yosuke Fukumoto,
Muhammad Shakeel,
Yifan Peng,
Chyi-Jiunn Lin,
Shinji Watanabe
Abstract:
Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but with slow inference speed. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. Conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
Submitted 31 May, 2025;
originally announced June 2025.
-
Joint Beam Search Integrating CTC, Attention, and Transducer Decoders
Authors:
Yui Sudo,
Muhammad Shakeel,
Yosuke Fukumoto,
Brian Yan,
Jiatong Shi,
Yifan Peng,
Shinji Watanabe
Abstract:
End-to-end automatic speech recognition (E2E-ASR) can be classified by its decoder architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and Mask-CTC models. Each decoder architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and Mask-CTC) share the same encoder -- we refer to this as 4D modeling. The 4D model is trained jointly, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes the joint training. In addition, we propose three novel joint beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed joint beam search algorithm outperforms the previously proposed CTC/attention decoding.
Submitted 14 January, 2025; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Authors:
Yui Sudo,
Yosuke Fukumoto,
Muhammad Shakeel,
Yifan Peng,
Shinji Watanabe
Abstract:
Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, including the external language model shallow fusion or rescoring. However, these additional modules increase the workload. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token rather than as a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. This method is easily applied to various architectures because it only expands the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 -- 4.9 points compared with the conventional DB method.
Submitted 30 August, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
Authors:
Yui Sudo,
Muhammad Shakeel,
Yosuke Fukumoto,
Yifan Peng,
Shinji Watanabe
Abstract:
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve the contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on the Librispeech-960 (English) and our in-house (Japanese) datasets, respectively.
Submitted 18 January, 2024;
originally announced January 2024.
-
Security Camera Movie and ERP Data Matching System to Prevent Theft
Authors:
Yoji Yamato,
Yoshifumi Fukumoto,
Hiroki Kumazaki
Abstract:
"(c) 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works." In this paper, we propose a SaaS service that prevents shoplifting using image analysis and ERP. In Japan, the total damage from shoplifting reaches 450 billion yen. Based on cloud and data analysis technology, we propose a shoplifting prevention service for small shops with image analysis of security camera movies and an ERP data check. We evaluated the movie analysis. Y. Yamato, Y. Fukumoto and H. Kumazaki, "Security Camera Movie and ERP Data Matching System to Prevent Theft," IEEE Consumer Communications and Networking Conference (CCNC 2017), pp.1021-1022, DOI: 10.1109/CCNC.2017.7983275, Jan. 2017.
Submitted 21 September, 2024; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Realtime Predictive Maintenance with Lambda Architecture
Authors:
Yoji Yamato,
Hiroki Kumazaki,
Yoshifumi Fukumoto
Abstract:
Recently, IoT technologies have progressed, and applications in the maintenance area are expected. However, IoT maintenance applications are not yet widespread in Japan because of insufficient analysis of real-time situations and the high cost of collecting sensing data and configuring failure detection rules. In this paper, using the lambda architecture concept, we propose a maintenance platform in which edge nodes analyze sensing data, detect anomalies, and extract new detection rules in real time, while a cloud orders maintenance automatically, analyzes in detail the whole data collected by batch processing, and updates the learning models of the edge nodes to improve analysis accuracy.
Submitted 8 December, 2016;
originally announced December 2016.
-
Study of shoplifting prevention using image analysis and ERP check
Authors:
Yoji Yamato,
Yoshifumi Fukumoto,
Hiroki Kumazaki
Abstract:
In this paper, we propose a SaaS service that prevents shoplifting using image analysis and ERP. In Japan, the total damage from shoplifting reaches 450 billion yen, and more than 1000 small shops have given up their businesses because of shoplifting. Based on recent cloud and data analysis technologies, we propose a shoplifting prevention service for small shops with image analysis of security camera movies and an ERP data check. We evaluated stream analysis of security camera movies using the online machine learning framework Jubatus.
Submitted 5 December, 2016;
originally announced December 2016.
-
Proposal of Real Time Predictive Maintenance Platform with 3D Printer for Business Vehicles
Authors:
Yoji Yamato,
Yoshifumi Fukumoto,
Hiroki Kumazaki
Abstract:
This paper proposes a maintenance platform for business vehicles that detects signs of failure from IoT data while the vehicle is on the move, orders repair parts to be created by 3D printers, and has them delivered to the destination. Recently, IoT and 3D printer technologies have progressed, and their applications to manufacturing and maintenance have increased. In the aviation industry in particular, various sensing data are collected during flight by IoT technologies, and parts are created by 3D printers. IoT platforms that improve the development and operation of IoT applications have also appeared. However, existing IoT platforms mainly aim to visualize the statuses of "things" by batch processing of collected sensing data, and they fall short on three factors needed to accelerate businesses: real-time analysis, automatic ordering of repair parts, and parts stock cost. This paper targets the maintenance of business vehicles such as airplanes or high-speed buses. We propose a maintenance platform with real-time analysis, automatic ordering of repair parts, and minimal parts stock cost. The proposed platform collects data via a closed VPN, analyzes stream data and predicts failures in real time using the online machine learning framework Jubatus, coordinates with ERP or SCM via an in-memory DB to order repair parts, and distributes repair parts data to 3D printers to create repair parts near the destination.
Submitted 30 January, 2023; v1 submitted 29 November, 2016;
originally announced November 2016.