
Showing 1–44 of 44 results for author: Harwath, D

Searching in archive eess.
  1. arXiv:2505.24248  [pdf, ps, other]

    eess.AS cs.SD

    Probing the Robustness Properties of Neural Speech Codecs

    Authors: Wei-Cheng Tseng, David Harwath

    Abstract: Neural speech codecs have revolutionized speech coding, achieving higher compression while preserving audio fidelity. Beyond compression, they have emerged as tokenization strategies, enabling language modeling on speech and driving paradigm shifts across various speech processing tasks. Despite these advancements, their robustness in noisy environments remains underexplored, raising concerns abou…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Interspeech 2025

  2. arXiv:2505.19462  [pdf, ps, other]

    eess.AS cs.SD

    VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation

    Authors: Puyuan Peng, Shang-Wen Li, Abdelrahman Mohamed, David Harwath

    Abstract: We present VoiceStar, the first zero-shot TTS model that achieves both output duration control and extrapolation. VoiceStar is an autoregressive encoder-decoder neural codec language model, that leverages a novel Progress-Monitoring Rotary Position Embedding (PM-RoPE) and is trained with Continuation-Prompt Mixed (CPM) training. PM-RoPE enables the model to better align text and speech tokens, ind…

    Submitted 31 May, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

  3. arXiv:2504.02386  [pdf, other]

    cs.CV eess.AS

    VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

    Authors: Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath

    Abstract: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video featur…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: https://voicecraft-dub.github.io/

  4. arXiv:2503.04713  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Scaling Rich Style-Prompted Text-to-Speech Datasets

    Authors: Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi

    Abstract: We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, class…

    Submitted 6 March, 2025; originally announced March 2025.

  5. arXiv:2411.18217  [pdf, other]

    cs.SD cs.CL eess.AS

    How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

    Authors: Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee

    Abstract: The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors…

    Submitted 5 January, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

  6. arXiv:2411.05361  [pdf, ps, other]

    cs.CL eess.AS

    Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

    Authors: Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang, Andy T. Liu, Chen-An Li, Yu-Xiang Lin, Wei-Cheng Tseng, Anuj Diwan, Yi-Jen Shih, Jiatong Shi, William Chen, Chih-Kai Yang, Wenze Ren, Xuanjun Chen, Chi-Yuan Hsiao, Puyuan Peng, Shih-Heng Wang, Chun-Yi Kuan, Ke-Han Lu, Kai-Wei Chang, Fabian Ritter-Gutierrez, Kuan-Po Huang, Siddhant Arora, You-Kuan Lin, Ming To Chuang, et al. (55 additional authors not shown)

    Abstract: Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati…

    Submitted 9 June, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: ICLR 2025

  7. arXiv:2410.04029  [pdf, other]

    cs.CL cs.AI eess.AS

    SyllableLM: Learning Coarse Semantic Units for Speech Language Models

    Authors: Alan Baade, Puyuan Peng, David Harwath

    Abstract: Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant cha…

    Submitted 5 October, 2024; originally announced October 2024.

    Comments: 10 pages, 2 figures

  8. arXiv:2409.12306  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Measuring Sound Symbolism in Audio-visual Models

    Authors: Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney

    Abstract: Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations–known as sound symbolism–which is also observed in humans. We developed a speci…

    Submitted 11 November, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: SLT 2024

  9. arXiv:2409.10704  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Self-supervised Speech Models for Word-Level Stuttered Speech Detection

    Authors: Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath

    Abstract: Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate fo…

    Submitted 16 September, 2024; originally announced September 2024.

    Comments: Accepted by IEEE SLT 2024

  10. arXiv:2406.12209  [pdf, other]

    cs.SD cs.CL eess.AS

    Interface Design for Self-Supervised Speech Models

    Authors: Yi-Jen Shih, David Harwath

    Abstract: Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech2024

  11. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations…

    Submitted 25 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound. ECCV 2024 camera-ready version

  12. arXiv:2406.06438  [pdf, other]

    cs.CL cs.CV cs.HC cs.LG cs.SD eess.AS

    Multimodal Contextualized Semantic Parsing from Speech

    Authors: Jordan Voas, Raymond Mooney, David Harwath

    Abstract: We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication.…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 10 Pages, 3 figures, ACL 2024 Main

  13. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  14. arXiv:2403.16973  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

    Authors: Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath

    Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an…

    Submitted 13 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: ACL 2024. Data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft

  15. arXiv:2402.06959  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

    Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

    Abstract: The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propos…

    Submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  16. arXiv:2402.05819  [pdf, other]

    eess.AS cs.CL cs.LG

    Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

    Authors: Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-wo…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)

  17. arXiv:2402.01591  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    BAT: Learning to Reason about Spatial Sounds with Large Language Models

    Authors: Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

    Abstract: Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing da…

    Submitted 17 May, 2025; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024. Our demo, dataset, code and model weights are available at: https://zhishengzheng.com/bat

  18. arXiv:2310.07654  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve…

    Submitted 11 October, 2023; originally announced October 2023.

  19. arXiv:2309.10787  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a…

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  20. arXiv:2306.08667  [pdf, other]

    cs.CL cs.SD eess.AS

    When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

    Authors: Anuj Diwan, Eunsol Choi, David Harwath

    Abstract: We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision. We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models, using a variety of efficiency metrics (latency, throughput, and memory). To conduct this analysis for speech, we introduce L-HuBERT,…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: 10 pages, 6 pages. Accepted to ACL 2023

  21. arXiv:2305.15405  [pdf, other]

    cs.CL eess.AS

    Textless Speech-to-Speech Translation With Limited Parallel Data

    Authors: Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

    Abstract: Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data.…

    Submitted 6 November, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2024 Findings

  22. arXiv:2305.12606  [pdf, other]

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  23. arXiv:2305.11435  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

    Authors: Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

    Abstract: In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of thi…

    Submitted 23 July, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023. Code & Model: https://github.com/jasonppy/syllable-discovery

  24. arXiv:2305.11095  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

    Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

    Abstract: We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts, by either leveraging another large-scale model, or sim…

    Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  25. arXiv:2212.01661  [pdf, other]

    eess.AS cs.CL cs.SD

    Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

    Authors: Reem Gody, David Harwath

    Abstract: Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates different unsupervised data selection techniq…

    Submitted 3 December, 2022; originally announced December 2022.

  26. arXiv:2212.01393  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Continual Learning for On-Device Speech Recognition using Disentangled Conformers

    Authors: Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed

    Abstract: Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with dat…

    Submitted 13 December, 2022; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: 8 pages, 2 figures. Submitted to ICASSP 2023

  27. arXiv:2211.01461  [pdf, other]

    eess.AS cs.CL cs.SD

    Phoneme Segmentation Using Self-Supervised Speech Models

    Authors: Luke Strgar, David Harwath

    Abstract: We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised set…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to SLT 2022

  28. arXiv:2211.01180  [pdf, other]

    cs.CL cs.SD eess.AS

    M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

    Authors: Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath

    Abstract: This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differenc…

    Submitted 10 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  29. arXiv:2210.07839  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments…

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  30. arXiv:2210.00705  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

    Authors: Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions…

    Submitted 25 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  31. arXiv:2203.16691  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

    Authors: Alan Baade, Puyuan Peng, David Harwath

    Abstract: In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH. 5 pages, 2 figures, 5 tables

  32. arXiv:2203.15081  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Word Discovery in Visually Grounded, Self-Supervised Speech Models

    Authors: Puyuan Peng, David Harwath

    Abstract: We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 mod…

    Submitted 19 June, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022 Oral. Update Table 5

  33. arXiv:2202.03543  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

    Authors: Puyuan Peng, David Harwath

    Abstract: In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaS…

    Submitted 2 March, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

    Comments: SAS workshop at AAAI2022, code and model weights available at https://github.com/jasonppy/FaST-VGS-Family

  34. arXiv:2112.04446  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  35. arXiv:2111.04823  [pdf, other]

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  36. arXiv:2109.08186  [pdf, other]

    eess.AS cs.CL cs.IR

    Fast-Slow Transformer for Visually Grounding Speech

    Authors: Puyuan Peng, David Harwath

    Abstract: We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-…

    Submitted 2 March, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

    Comments: ICASSP 2022, code and model weights are available at https://github.com/jasonppy/FaST-VGS-Family

  37. arXiv:2106.07732  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Learning Audio-Visual Dereverberation

    Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman

    Abstract: Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry…

    Submitted 13 March, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at ICASSP 2023. This is the longer version of the five-page camera-ready paper. Project page: https://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation

  38. arXiv:2105.04489  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

    Authors: Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

    Abstract: When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people gener…

    Submitted 10 May, 2021; originally announced May 2021.

    Comments: To appear at CVPR 2021

  39. arXiv:2012.15454  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

    Authors: Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

    Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised vi…

    Submitted 31 December, 2020; originally announced December 2020.

  40. arXiv:2006.09199  [pdf, other]

    cs.CV cs.CL cs.MM cs.SD eess.AS

    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

    Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the nee…

    Submitted 29 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: A version of this work has been accepted to Interspeech 2021

  41. arXiv:1911.09602  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

    Authors: David Harwath, Wei-Ning Hsu, James Glass

    Abstract: In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather…

    Submitted 14 February, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: Camera-ready version for ICLR

  42. arXiv:1907.04355  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Authors: Wei-Ning Hsu, David Harwath, James Glass

    Abstract: Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts.…

    Submitted 9 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. 4 pages, 2 figures

  43. arXiv:1902.08213  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards Visually Grounded Sub-Word Speech Unit Discovery

    Authors: David Harwath, James Glass

    Abstract: In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging t…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: Accepted to ICASSP 2019

  44. arXiv:1804.03052  [pdf, other]

    cs.CL cs.SD eess.AS

    Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

    Authors: David Harwath, Galen Chuang, James Glass

    Abstract: In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms without the use of linguistic transcriptions or conventional speech recognition technology. While prior work has investigated this setting in the monolingual case using English speech data, this…

    Submitted 9 April, 2018; originally announced April 2018.

    Comments: To appear at ICASSP 2018