Scaling Speech Technology to 1,000+ Languages
Authors:
Vineel Pratap,
Andros Tjandra,
Bowen Shi,
Paden Tomasello,
Arun Babu,
Sayani Kundu,
Ali Elkahky,
Zhaoheng Ni,
Apoorv Vyas,
Maryam Fazel-Zarandi,
Alexei Baevski,
Yossi Adi,
Xiaohui Zhang,
Wei-Ning Hsu,
Alexis Conneau,
Michael Auli
Abstract:
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Submitted 22 May, 2023;
originally announced May 2023.
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
Authors:
Maryam Fazel-Zarandi,
Wei-Ning Hsu
Abstract:
Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
Submitted 20 March, 2023;
originally announced March 2023.