Large scale weakly and semi-supervised learning for low-resource video ASR
Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Abstract:
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian using 27,000 and 58,000 hours of unlabeled audio, respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.
Training ASR models by Generation of Contextual Information
Authors: Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
Abstract:
Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000-hour supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder acoustic representations and the decoder language generation abilities.
Submitted 14 February, 2020; v1 submitted 27 October, 2019; originally announced October 2019.