-
Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
Authors:
Aleksandr Laptev,
Andrei Andrusenko,
Ivan Podluzhny,
Anton Mitrofanov,
Ivan Medennikov,
Yuri Matveev
Abstract:
With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. Researchers and industry prefer to use end-to-end ASR systems for on-device speech recognition tasks. This is because end-to-end systems can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, bu…
▽ More
With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. Researchers and industry prefer to use end-to-end ASR systems for on-device speech recognition tasks. This is because end-to-end systems can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Another challenging task associated with speech assistants is personalization, which mainly lies in handling out-of-vocabulary (OOV) words. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. To address the aforementioned problems, we propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique. It non-deterministically tokenizes utterances to extend the token's contexts and to regularize their distribution for the model's recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative WER and 25% relative F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER), which is close to the best published multilingual system.
△ Less
Submitted 12 March, 2021;
originally announced March 2021.
-
Secure and User-Friendly Over-the-Air Firmware Distribution in a Portable Faraday Cage
Authors:
Martin Striegel,
Florian Jakobsmeier,
Yacov Matveev,
Johann Heyszl,
Georg Sigl
Abstract:
Setting up a large-scale wireless sensor network is challenging, as firmware must be distributed and trust between sensor nodes and a backend needs to be established. To perform this task efficiently, we propose an approach named Box, which utilizes an intelligent Faraday cage (FC). The FC acquires firmware images and secret keys from a backend, patches the firmware with the keys and deploys those…
▽ More
Setting up a large-scale wireless sensor network is challenging, as firmware must be distributed and trust between sensor nodes and a backend needs to be established. To perform this task efficiently, we propose an approach named Box, which utilizes an intelligent Faraday cage (FC). The FC acquires firmware images and secret keys from a backend, patches the firmware with the keys and deploys those customized images over the air to sensor nodes placed in the FC. Electromagnetic shielding protects this exchange against passive attackers. We place few demands on the sensor node, not requiring additional hardware components or firmware customized by the manufacturer. We describe this novel workflow, implement the Box and a backend system and demonstrate the feasibility of our approach by batch-deploying firmware to multiple commercial off-the-shelf sensor nodes. We conduct a user-study with 31 participants with diverse backgrounds and find, that our approach is both faster and more user-friendly than firmware distribution over a wired connection.
△ Less
Submitted 25 May, 2020;
originally announced May 2020.
-
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances
Authors:
Aleksei Gusev,
Vladimir Volokhov,
Tseren Andzhukaev,
Sergey Novoselov,
Galina Lavrentyeva,
Marina Volkova,
Alice Gazizullina,
Andrey Shulipa,
Artem Gorlanov,
Anastasia Avdeeva,
Artem Ivanov,
Alexander Kozlov,
Timur Pekhovsky,
Yuri Matveev
Abstract:
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the practical point of view, taking into account the increased interest in virtual assistants (such as Amazon Alexa, Google Home, AppleSiri, etc.), speaker verification on sho…
▽ More
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the practical point of view, taking into account the increased interest in virtual assistants (such as Amazon Alexa, Google Home, AppleSiri, etc.), speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system qualitydegradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (TimeDelay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. Obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of speech activity detector, different scoring models, adaptation and score normalization techniques. The experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
△ Less
Submitted 14 February, 2020;
originally announced February 2020.