-
Semi-supervised Time Domain Target Speaker Extraction with Attention
Authors:
Zhepei Wang,
Ritwik Giri,
Shrikant Venkataramani,
Umut Isik,
Jean-Marc Valin,
Paris Smaragdis,
Mike Goodwin,
Arvindh Krishnaswamy
Abstract:
In this work, we propose Exformer, a time-domain architecture for target speaker extraction. It consists of a pre-trained speaker embedder network and a separator network based on transformer encoder blocks. We study multiple methods to combine speaker information with the input mixture, and the resulting Exformer architecture obtains superior extraction performance compared to prior time-domain networks. Furthermore, we investigate a two-stage procedure to train the model using mixtures without reference signals upon a pre-trained supervised model. Experimental results show that the proposed semi-supervised learning procedure improves the performance of the supervised baselines.
Submitted 17 June, 2022;
originally announced June 2022.
-
To Dereverb Or Not to Dereverb? Perceptual Studies On Real-Time Dereverberation Targets
Authors:
Jean-Marc Valin,
Ritwik Giri,
Shrikant Venkataramani,
Umut Isik,
Arvindh Krishnaswamy
Abstract:
In real life, room effect, also known as room reverberation, and the present background noise degrade the quality of speech. Recently, deep learning-based speech enhancement approaches have shown a lot of promise and surpassed traditional denoising and dereverberation methods. It is also well established that these state-of-the-art denoising algorithms significantly improve the quality of speech as perceived by human listeners. However, the role of dereverberation in subjective (perceived) speech quality, and whether the additional artifacts introduced by dereverberation cause more harm than good, are still unclear. In this paper, we attempt to answer these questions by evaluating a state-of-the-art speech enhancement system in a comprehensive subjective evaluation study for different choices of dereverberation targets.
Submitted 16 June, 2022;
originally announced June 2022.
-
Improved singing voice separation with chromagram-based pitch-aware remixing
Authors:
Siyuan Yuan,
Zhepei Wang,
Umut Isik,
Ritwik Giri,
Jean-Marc Valin,
Michael M. Goodwin,
Arvindh Krishnaswamy
Abstract:
Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).
Submitted 28 March, 2022;
originally announced March 2022.
-
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation
Authors:
Krishna Subramani,
Jean-Marc Valin,
Umut Isik,
Paris Smaragdis,
Arvindh Krishnaswamy
Abstract:
Neural vocoders have recently demonstrated high quality speech synthesis, but typically require a high computational complexity. LPCNet was proposed as a way to reduce the complexity of neural synthesis by using linear prediction (LP) to assist an autoregressive model. At inference time, LPCNet relies on the LP coefficients being explicitly computed from the input acoustic features. That makes the design of LPCNet-based systems more complicated, while adding the constraint that the input features must represent a clean speech spectrum. We propose an end-to-end version of LPCNet that lifts these limitations by learning to infer the LP coefficients from the input features in the frame rate network. Results show that the proposed end-to-end approach equals or exceeds the quality of the original LPCNet model, but without explicit LP analysis. Our open-source end-to-end model still benefits from LPCNet's low complexity, while allowing for any type of conditioning features.
Submitted 29 March, 2022; v1 submitted 22 February, 2022;
originally announced February 2022.
-
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet
Authors:
Jean-Marc Valin,
Umut Isik,
Paris Smaragdis,
Arvindh Krishnaswamy
Abstract:
Neural speech synthesis models can synthesize high quality speech but typically require a high computational complexity to do so. In previous work, we introduced LPCNet, which uses linear prediction to significantly reduce the complexity of neural synthesis. In this work, we further improve the efficiency of LPCNet -- targeting both algorithmic and computational improvements -- to make it usable on a wide variety of devices. We demonstrate an improvement in synthesis quality while operating 2.5x faster. The resulting open-source LPCNet algorithm can perform real-time neural synthesis on most existing phones and is even usable in some embedded devices.
Submitted 22 February, 2022;
originally announced February 2022.
-
Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
Authors:
Jonah Casebeer,
Vinjai Vale,
Umut Isik,
Jean-Marc Valin,
Ritwik Giri,
Arvindh Krishnaswamy
Abstract:
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable quality speech output. However, these models are tightly coupled with speech content, and produce unintended outputs in noisy conditions. Based on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
Submitted 12 February, 2021;
originally announced February 2021.
-
Generating Music with a Self-Correcting Non-Chronological Autoregressive Model
Authors:
Wayne Chi,
Prachi Kumar,
Suri Yaddanapudi,
Rahul Suresh,
Umut Isik
Abstract:
We describe a novel approach for generating music using a self-correcting, non-chronological, autoregressive model. We represent music as a sequence of edit events, each of which denotes either the addition or removal of a note, even a note previously generated by the model. During inference, we generate one edit event at a time using direct ancestral sampling. Our approach allows the model to fix previous mistakes, such as incorrectly sampled notes, and prevents the accumulation of errors to which autoregressive models are prone. Another benefit is finer, note-by-note control during collaborative composition between humans and AI. We show through quantitative metrics and human survey evaluation that our approach generates better results than orderless NADE and Gibbs sampling approaches.
Submitted 18 August, 2020;
originally announced August 2020.
-
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Authors:
Umut Isik,
Ritwik Giri,
Neerad Phansalkar,
Jean-Marc Valin,
Karim Helwani,
Arvindh Krishnaswamy
Abstract:
Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.
Submitted 10 August, 2020;
originally announced August 2020.
-
Efficient Trainable Front-Ends for Neural Speech Enhancement
Authors:
Jonah Casebeer,
Umut Isik,
Shrikant Venkataramani,
Arvindh Krishnaswamy
Abstract:
Many neural speech enhancement and source separation systems operate in the time-frequency domain. Such models often benefit from making their Short-Time Fourier Transform (STFT) front-ends trainable. In current literature, these are implemented as large Discrete Fourier Transform matrices, which are prohibitively inefficient for low-compute systems. We present an efficient, trainable front-end based on the butterfly mechanism to compute the Fast Fourier Transform, and show its accuracy and efficiency benefits for low-compute neural speech enhancement models. We also explore the effects of making the STFT window trainable.
Submitted 19 February, 2020;
originally announced February 2020.
-
Channel-Attention Dense U-Net for Multichannel Speech Enhancement
Authors:
Bahareh Tolooshams,
Ritwik Giri,
Andrew H. Song,
Umut Isik,
Arvindh Krishnaswamy
Abstract:
Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.
Submitted 30 January, 2020;
originally announced January 2020.
-
From Speech-to-Speech Translation to Automatic Dubbing
Authors:
Marcello Federico,
Robert Enyedi,
Roberto Barra-Chicote,
Ritwik Giri,
Umut Isik,
Arvindh Krishnaswamy,
Hassan Sawaf
Abstract:
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine tuning of the duration of each utterance, and, finally, audio rendering that enriches text-to-speech output with background noise and reverberation extracted from the original audio. We report on a subjective evaluation of automatic dubbing of excerpts of TED Talks from English into Italian, which measures the perceived naturalness of automatic dubbing and the relative importance of each proposed enhancement.
Submitted 2 February, 2020; v1 submitted 19 January, 2020;
originally announced January 2020.
-
Categorical Complexity
Authors:
Saugata Basu,
M. Umut Isik
Abstract:
We introduce a notion of complexity of diagrams (and in particular of objects and morphisms) in an arbitrary category, as well as a notion of complexity of functors between categories equipped with complexity functions. We discuss several examples of this new definition in categories of wide common interest, such as finite sets, Boolean functions, topological spaces, vector spaces, semi-linear and semi-algebraic sets, graded algebras, affine and projective varieties and schemes, and modules over polynomial rings. We show that on one hand categorical complexity recovers in several settings classical notions of non-uniform computational complexity (such as circuit complexity), while on the other hand it has features which make it mathematically more natural. We also postulate that studying functor complexity is the categorical analog of classical questions in complexity theory about separating different complexity classes.
Submitted 10 December, 2019; v1 submitted 25 October, 2016;
originally announced October 2016.
-
Complexity Classes and Completeness in Algebraic Geometry
Authors:
M. Umut Isik
Abstract:
We study the computational complexity of sequences of projective varieties. We define analogues of the complexity classes P and NP for these and prove the NP-completeness of a sequence called the universal circuit resultant. This is the first family of compact spaces shown to be NP-complete in a geometric setting.
Submitted 8 September, 2016;
originally announced September 2016.