
Showing 51–55 of 55 results for author: Harwath, D

  1. arXiv:1804.03052  [pdf, other]

    cs.CL cs.SD eess.AS

    Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

    Authors: David Harwath, Galen Chuang, James Glass

    Abstract: In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms without the use of linguistic transcriptions or conventional speech recognition technology. While prior work has investigated this setting in the monolingual case using English speech data, this…

    Submitted 9 April, 2018; originally announced April 2018.

    Comments: to appear at ICASSP 2018

  2. arXiv:1804.01452  [pdf, other]

    cs.CV cs.CL cs.SD

    Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

    Authors: David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

    Abstract: In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly…

    Submitted 4 April, 2018; originally announced April 2018.

  3. arXiv:1712.03897  [pdf, other]

    cs.LG cs.CL cs.CV

    Learning Modality-Invariant Representations for Speech and Images

    Authors: Kenneth Leidal, David Harwath, James Glass

    Abstract: In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITS and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to th…

    Submitted 11 December, 2017; originally announced December 2017.

  4. arXiv:1701.07481  [pdf, other]

    cs.CL cs.CV

    Learning Word-Like Units from Joint Audio-Visual Analysis

    Authors: David Harwath, James R. Glass

    Abstract: Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word 'lighthouse' within an utterance and associate them with image regions containing lighthouses. We do not use any form of c…

    Submitted 24 May, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

  5. arXiv:1511.03690  [pdf, other]

    cs.CV cs.AI cs.CL

    Deep Multimodal Semantic Embeddings for Speech and Images

    Authors: David Harwath, James Glass

    Abstract: In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We…

    Submitted 11 November, 2015; originally announced November 2015.