Open-Source Conversational AI with SpeechBrain 1.0
Authors:
Mirco Ravanelli,
Titouan Parcollet,
Adel Moumen,
Sylvain de Langen,
Cem Subakan,
Peter Plantinga,
Yingzhi Wang,
Pooneh Mousavi,
Luca Della Libera,
Artem Ploujnikov,
Francesco Paissan,
Davide Borra,
Salah Zaiem,
Zeyu Zhao,
Shucong Zhang,
Georgios Karakasidis,
Sung-Lin Yeh,
Pierre Champion,
Aku Rouhe,
Rudolf Braun,
Florian Mai,
Juan Zuluaga-Gomez,
Seyed Mahed Mousavi,
Andreas Nautsch,
Ha Nguyen, et al. (8 additional authors not shown)
Abstract:
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
Submitted 16 October, 2024; v1 submitted 29 June, 2024;
originally announced July 2024.
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
Authors:
Florian Mai,
Juan Zuluaga-Gomez,
Titouan Parcollet,
Petr Motlicek
Abstract:
State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with fewer than 8M neural parameters and a peak training memory of 5.7GB, making it trainable on accessible hardware. Its encoder is between 38% (on mid-length speech) and 56% (on long speech) faster than an equivalent Conformer. (The HyperConformer recipe is publicly available at: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
Submitted 29 May, 2023;
originally announced May 2023.