Showing 1–8 of 8 results for author: Saurous, R A

  1. arXiv:1911.05894

    cs.SD eess.AS stat.ML

    Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

    Authors: Aren Jansen, Daniel P. W. Ellis, Shawn Hershey, R. Channing Moore, Manoj Plakal, Ashok C. Popat, Rif A. Saurous

    Abstract: Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and…

    Submitted 13 November, 2019; originally announced November 2019.

    Comments: This extended version of an ICASSP 2020 submission under the same title has an added figure and additional discussion for easier consumption

  2. arXiv:1811.08521

    cs.SD eess.AS

    Differentiable Consistency Constraints for Improved Deep Speech Enhancement

    Authors: Scott Wisdom, John R. Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, Rif A. Saurous

    Abstract: In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglec…

    Submitted 20 November, 2018; originally announced November 2018.

  3. arXiv:1811.07030

    cs.SD eess.AS

    Exploring Tradeoffs in Models for Low-latency Speech Enhancement

    Authors: Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A. Saurous, Jan Skoglund, Richard F. Lyon

    Abstract: We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and…

    Submitted 16 November, 2018; originally announced November 2018.

  4. arXiv:1810.04826

    eess.AS cs.LG eess.SP stat.ML

    VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

    Authors: Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno

    Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embe…

    Submitted 19 June, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

    Comments: To appear in Interspeech 2019

  5. arXiv:1803.09047

    cs.CL cs.LG cs.SD eess.AS

    Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

    Authors: RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synth…

    Submitted 23 March, 2018; originally announced March 2018.

  6. arXiv:1803.09017

    cs.CL cs.LG cs.SD eess.AS

    Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

    Authors: Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

    Abstract: In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to contr…

    Submitted 23 March, 2018; originally announced March 2018.

  7. arXiv:1712.08363

    cs.SD eess.AS stat.ML

    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Authors: Jan Chorowski, Ron J. Weiss, Rif A. Saurous, Samy Bengio

    Abstract: Inspired by recent work on neural network image generation which relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and t…

    Submitted 8 March, 2018; v1 submitted 22 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  8. arXiv:1711.02209

    cs.SD eess.AS stat.ML

    Unsupervised Learning of Semantic Audio Representations

    Authors: Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

    Abstract: Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the ca…

    Submitted 6 November, 2017; originally announced November 2017.

    Comments: Submitted to ICASSP 2018