-
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Authors:
Aren Jansen,
Daniel P. W. Ellis,
Shawn Hershey,
R. Channing Moore,
Manoj Plakal,
Ashok C. Popat,
Rif A. Saurous
Abstract:
Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
Submitted 13 November, 2019;
originally announced November 2019.
-
Differentiable Consistency Constraints for Improved Deep Speech Enhancement
Authors:
Scott Wisdom,
John R. Hershey,
Kevin Wilson,
Jeremy Thorpe,
Michael Chinen,
Brian Patton,
Rif A. Saurous
Abstract:
In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
Submitted 20 November, 2018;
originally announced November 2018.
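The mixture consistency constraint described in this abstract can be illustrated with a minimal sketch. It assumes the simplest, unweighted form of projection, in which the residual between the mixture and the summed estimates is split equally among the sources; the function name `project_mixture_consistency` and the array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def project_mixture_consistency(estimates, mixture):
    """Shift source estimates so that they sum exactly to the mixture.

    estimates: array of shape (num_sources, ...), real or complex.
    mixture:   array with the trailing shape of one source.
    Assumption: the residual is shared equally by all sources.
    """
    residual = mixture - estimates.sum(axis=0)
    return estimates + residual / estimates.shape[0]


# Toy complex "STFTs": 2 sources, 4 frames, 5 frequency bins (hypothetical sizes).
rng = np.random.default_rng(0)
shape = (2, 4, 5)
estimates = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = rng.standard_normal(shape[1:]) + 1j * rng.standard_normal(shape[1:])

projected = project_mixture_consistency(estimates, mixture)
```

Because the operation is only a sum, a subtraction, and a scaling, it is differentiable and works unchanged for real or complex-valued inputs, which is the property the abstract highlights.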
-
Exploring Tradeoffs in Models for Low-latency Speech Enhancement
Authors:
Kevin Wilson,
Michael Chinen,
Jeremy Thorpe,
Brian Patton,
John Hershey,
Rif A. Saurous,
Jan Skoglund,
Richard F. Lyon
Abstract:
We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further, we find that 200 milliseconds of look-ahead is sufficient to achieve equivalent performance to our best bidirectional model.
Submitted 16 November, 2018;
originally announced November 2018.
-
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Authors:
Quan Wang,
Hannah Muckenhirn,
Kevin Wilson,
Prashant Sridhar,
Zelin Wu,
John Hershey,
Rif A. Saurous,
Ron J. Weiss,
Ye Jia,
Ignacio Lopez Moreno
Abstract:
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; and (2) a spectrogram masking network that takes both a noisy spectrogram and a speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
Submitted 19 June, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Authors:
RJ Skerry-Ryan,
Eric Battenberg,
Ying Xiao,
Yuxuan Wang,
Daisy Stanton,
Joel Shor,
Ron J. Weiss,
Rob Clark,
Rif A. Saurous
Abstract:
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Submitted 23 March, 2018;
originally announced March 2018.
-
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
Authors:
Yuxuan Wang,
Daisy Stanton,
Yu Zhang,
RJ Skerry-Ryan,
Eric Battenberg,
Joel Shor,
Ying Xiao,
Fei Ren,
Ye Jia,
Rif A. Saurous
Abstract:
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Submitted 23 March, 2018;
originally announced March 2018.
-
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Authors:
Jan Chorowski,
Ron J. Weiss,
Rif A. Saurous,
Samy Bengio
Abstract:
Inspired by recent work on neural network image generation that relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching of neuron activation statistics between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.
Submitted 8 March, 2018; v1 submitted 22 December, 2017;
originally announced December 2017.
-
Unsupervised Learning of Semantic Audio Representations
Authors:
Aren Jansen,
Manoj Plakal,
Ratheet Pandya,
Daniel P. W. Ellis,
Shawn Hershey,
Jiayang Liu,
R. Channing Moore,
Rif A. Saurous
Abstract:
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
Submitted 6 November, 2017;
originally announced November 2017.
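The "geometric inequalities" mentioned in this abstract can be made concrete with a minimal sketch of a standard hinge-style triplet loss on embedding vectors. The squared Euclidean distance, the margin value, and the function name `triplet_hinge_loss` are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def triplet_hinge_loss(anchor, positive, negative, margin=0.1):
    """Penalize triplets where the positive is not closer than the negative.

    The loss is zero when d(anchor, positive) + margin <= d(anchor, negative),
    with d taken as squared Euclidean distance (an assumption).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: the positive stands in for, e.g., a noisy or
# time-shifted copy of the anchor; the negative for an unrelated clip.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negative = np.array([1.0, 0.0])

loss_satisfied = triplet_hinge_loss(anchor, positive, negative)
loss_violated = triplet_hinge_loss(anchor, negative, positive)
```

In this toy case the first triplet already satisfies the inequality and contributes no loss, while swapping the positive and negative violates it and yields a positive penalty.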