-
Compact Neural TTS Voices for Accessibility
Authors:
Kunal Jain,
Eoin Murphy,
Deepanshu Gupta,
Jonathan Dyke,
Saumya Shah,
Vasilieios Tsiaras,
Petko Petkov,
Alistair Conkie
Abstract:
Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly bet…
▽ More
Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering these impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, latency remains higher than SPSS and USEL, while disk footprint prohibits pre-installation for multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.
△ Less
Submitted 28 January, 2025;
originally announced January 2025.
-
A fully recurrent feature extraction for single channel speech enhancement
Authors:
Muhammed PV Shifas,
Santelli Claudio,
Vassilis Tsiaras,
Yannis Stylianou
Abstract:
Convolutional neural network (CNN) modules are widely being used to build high-end speech enhancement neural models. However, the feature extraction power of vanilla CNN modules has been limited by the dimensionality constraint of the convolution kernels that are integrated - thereby, they have limitations to adequately model the noise context information at the feature extraction stage. To this e…
▽ More
Convolutional neural network (CNN) modules are widely being used to build high-end speech enhancement neural models. However, the feature extraction power of vanilla CNN modules has been limited by the dimensionality constraint of the convolution kernels that are integrated - thereby, they have limitations to adequately model the noise context information at the feature extraction stage. To this end, adding recurrency factor into the feature extracting CNN layers, we introduce a robust context-aware feature extraction strategy for single-channel speech enhancement. As shown, adding recurrency results in capturing the local statistics of noise attributes at the extracted features level and thus, the suggested model is effective in differentiating speech cues even at very noisy conditions. When evaluated against enhancement models using vanilla CNN modules, in unseen noise conditions, the suggested model with recurrency in the feature extraction layers has produced a segmental SNR (SSNR) gain of up to 1.5 dB, an improvement of 0.4 in subjective quality in the Mean Opinion Score scale, while the parameters to be optimized are reduced by 25%.
△ Less
Submitted 3 June, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
A non-causal FFTNet architecture for speech enhancement
Authors:
Muhammed PV Shifas,
Nagaraj Adiga,
Vassilis Tsiaras,
Yannis Stylianou
Abstract:
In this paper, we suggest a new parallel, non-causal and shallow waveform domain architecture for speech enhancement based on FFTNet, a neural network for generating high quality audio waveform. In contrast to other waveform based approaches like WaveNet, FFTNet uses an initial wide dilation pattern. Such an architecture better represents the long term correlated structure of speech in the time do…
▽ More
In this paper, we suggest a new parallel, non-causal and shallow waveform domain architecture for speech enhancement based on FFTNet, a neural network for generating high quality audio waveform. In contrast to other waveform based approaches like WaveNet, FFTNet uses an initial wide dilation pattern. Such an architecture better represents the long term correlated structure of speech in the time domain, where noise is usually highly non-correlated, and therefore it is suitable for waveform domain based speech enhancement. To further strengthen this feature of FFTNet, we suggest a non-causal FFTNet architecture, where the present sample in each layer is estimated from the past and future samples of the previous layer. By suggesting a shallow network and applying non-causality within certain limits, the suggested FFTNet for speech enhancement (SE-FFTNet) uses much fewer parameters compared to other neural network based approaches for speech enhancement like WaveNet and SEGAN. Specifically, the suggested network has considerably reduced model parameters: 32% fewer compared to WaveNet and 87% fewer compared to SEGAN. Finally, based on subjective and objective metrics, SE-FFTNet outperforms WaveNet in terms of enhanced signal quality, while it provides equally good performance as SEGAN. A Tensorflow implementation of the architecture is provided at 1 .
△ Less
Submitted 8 June, 2020;
originally announced June 2020.
-
Speaker-independent raw waveform model for glottal excitation
Authors:
Lauri Juvela,
Vassilis Tsiaras,
Bajibabu Bollepalli,
Manu Airaksinen,
Junichi Yamagishi,
Paavo Alku
Abstract:
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing t…
▽ More
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to more effectively train a speaker-independent waveform generator with limited resources. We present a multi-speaker 'GlotNet' vocoder, which utilizes a WaveNet to generate glottal excitation waveforms, which are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model performs favourably to a direct WaveNet vocoder trained with the same model architecture and data.
△ Less
Submitted 25 April, 2018;
originally announced April 2018.