Showing 1–50 of 73 results for author: Hasegawa-Johnson, M

Searching in archive cs.
  1. arXiv:2509.18235

    eess.AS cs.SD

    Automated Analysis of Naturalistic Recordings in Early Childhood: Applications, Challenges, and Opportunities

    Authors: Jialu Li, Marvin Lavechin, Xulin Fan, Nancy L. McElwain, Alejandrina Cristia, Paola Garcia-Perera, Mark Hasegawa-Johnson

    Abstract: Naturalistic recordings capture audio in real-world environments where participants behave naturally without interference from researchers or experimental protocols. Naturalistic long-form recordings extend this concept by capturing spontaneous and continuous interactions over extended periods, often spanning hours or even days, in participants' daily lives. Naturalistic recordings have been exten…

    Submitted 22 September, 2025; originally announced September 2025.

    Comments: Accepted to IEEE Signal Processing Magazine

  2. arXiv:2509.13395

    eess.AS cs.AI cs.CL cs.LG cs.MM

    TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

    Authors: Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

    Abstract: Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech reco…

    Submitted 16 September, 2025; originally announced September 2025.

  3. arXiv:2507.22047

    cs.AI

    The Interspeech 2025 Speech Accessibility Project Challenge

    Authors: Xiuwen Zheng, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling

    Abstract: While the last decade has witnessed significant advancements in Automatic Speech Recognition (ASR) systems, performance of these systems for individuals with speech disabilities remains inadequate, partly due to limited public training data. To bridge this gap, the 2025 Interspeech Speech Accessibility Project (SAP) Challenge was launched, utilizing over 400 hours of SAP data collected and transcr…

    Submitted 29 July, 2025; originally announced July 2025.

    Comments: To appear in Proceedings of Interspeech, 2025

  4. arXiv:2507.20091

    cs.CL eess.AS

    ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

    Authors: Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

    Abstract: Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optima…

    Submitted 7 August, 2025; v1 submitted 26 July, 2025; originally announced July 2025.

  5. arXiv:2507.06202

    cs.HC

    V(is)owel: An Interactive Vowel Chart to Understand What Makes Visual Pronunciation Effective in Second Language Learning

    Authors: Charlotte Kiesel, Dipayan Mukherjee, Mark Hasegawa-Johnson, Karrie Karahalios

    Abstract: Visual feedback speeds up learners' improvement of pronunciation in a second language. The visual combined with audio allows speakers to see sounds and differences in pronunciation that they are unable to hear. Prior studies have tested different visual methods for improving pronunciation, however, we do not have conclusive understanding of what aspects of the visualizations contributed to improve…

    Submitted 8 July, 2025; originally announced July 2025.

    ACM Class: K.3.1

  6. arXiv:2507.04976

    cs.CV cs.CL

    Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

    Authors: Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo

    Abstract: In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding ca…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: ICLR 2025

  7. arXiv:2506.16528

    cs.LG

    Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches

    Authors: Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson

    Abstract: Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: 5 pages, 2 figures, Interspeech 2025

  8. arXiv:2506.08712

    cs.CL cs.AI cs.LG

    ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization

    Authors: Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

    Abstract: We introduce ConfPO, a method for preference learning in Large Language Models (LLMs) that identifies and optimizes preference-critical tokens based solely on the training policy's confidence, without requiring any auxiliary models or compute. Unlike prior Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO), which uniformly adjust all token probabilities regardless of t…

    Submitted 12 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  9. arXiv:2503.13371

    cs.LG

    SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

    Authors: Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson

    Abstract: Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models ach…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: Accepted to WACV 2025

  10. arXiv:2501.14994

    cs.SD cs.AI cs.LG eess.AS

    Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition

    Authors: Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Mark Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

    Abstract: In this paper, we present a speaker-independent dysarthric speech recognition system, with a focus on evaluating the recently released Speech Accessibility Project (SAP-1005) dataset, which includes speech data from individuals with Parkinson's disease (PD). Despite the growing body of research in dysarthric speech recognition, many existing systems are speaker-dependent and adaptive, limiting the…

    Submitted 24 January, 2025; originally announced January 2025.

    Comments: Accepted to ICASSP 2025

  11. arXiv:2410.15851

    eess.IV cs.CV cs.HC cs.LG

    R2I-rPPG: A Robust Region of Interest Selection Method for Remote Photoplethysmography to Extract Heart Rate

    Authors: Sandeep Nagar, Mark Hasegawa-Johnson, David G. Beiser, Narendra Ahuja

    Abstract: The COVID-19 pandemic has underscored the need for low-cost, scalable approaches to measuring contactless vital signs, either during initial triage at a healthcare facility or virtual telemedicine visits. Remote photoplethysmography (rPPG) can accurately estimate heart rate (HR) when applied to close-up videos of healthy volunteers in well-lit laboratory settings. However, results from such highly…

    Submitted 25 November, 2024; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: preprint

  12. Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility

    Authors: Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson

    Abstract: This paper enhances dysarthric and dysphonic speech recognition by fine-tuning pretrained automatic speech recognition (ASR) models on the 2023-10-05 data package of the Speech Accessibility Project (SAP), which contains the speech of 253 people with Parkinson's disease. Experiments tested methods that have been effective for Cerebral Palsy, including the use of speaker clustering and severity-dep…

    Submitted 29 September, 2024; originally announced September 2024.

    Journal ref: Proceedings of Interspeech 2024

  13. arXiv:2409.04927

    cs.CL eess.AS

    Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

    Authors: Junkai Wu, Xulin Fan, Bo-Ru Lu, Xilin Jiang, Nima Mesgarani, Mark Hasegawa-Johnson, Mari Ostendorf

    Abstract: In recent years, we have observed a rapid advancement in speech language models (SpeechLLMs), catching up with humans' listening and reasoning abilities. SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao, the English listening test of the college entrance exam in China, which seemingly requires understanding both the spoken content…

    Submitted 2 October, 2024; v1 submitted 7 September, 2024; originally announced September 2024.

    Comments: Accepted to IEEE SLT 2024

  14. arXiv:2408.05769

    cs.CL cs.SD eess.AS

    LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition

    Authors: Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (ASR), which enhances model performance by leveraging output prediction entropy minimization as a self-supervision signal. However, a key limitation of this self-su…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: INTERSPEECH 2024

  15. arXiv:2407.16574

    cs.CL

    TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback

    Authors: Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

    Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tri…

    Submitted 8 December, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

    Comments: ACL2024 Findings

  16. arXiv:2406.17190

    cs.SD cs.LG eess.AS

    Sound Tagging in Infant-centric Home Soundscapes

    Authors: Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

    Abstract: Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted in IEEE/ACM CHASE 2024

  17. arXiv:2406.08380

    cs.CL cs.SD eess.AS

    Towards Unsupervised Speech Recognition Without Pronunciation Models

    Authors: Junrui Ni, Liming Wang, Yang Zhang, Kaizhi Qian, Heting Gao, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by pro…

    Submitted 8 January, 2025; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: This work has been submitted to Speech Communication for possible publication

  18. arXiv:2403.14119

    cs.CV cs.AI cs.CL cs.LG

    C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion

    Authors: Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, Chang D. Yoo

    Abstract: In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect…

    Submitted 31 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: ICLR 2024

  19. arXiv:2402.06888

    eess.AS cs.SD

    Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations

    Authors: Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

    Abstract: To understand why self-supervised learning (SSL) models have empirically achieved strong performances on several speech-processing downstream tasks, numerous studies have focused on analyzing the encoded information of the SSL layer representations in adult speech. Limited work has investigated how pre-training and fine-tuning affect SSL models encoding children's speech and vocalizations. In this…

    Submitted 6 June, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted to 2024 ICASSP Workshop of Self-supervision in Audio, Speech and Beyond (SASB)

  20. arXiv:2312.00079

    cs.CV cs.AI cs.CL cs.LG

    HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

    Authors: Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou

    Abstract: This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce…

    Submitted 29 November, 2023; originally announced December 2023.

  21. arXiv:2310.02382

    cs.CL cs.SD eess.AS

    Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching

    Authors: Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. This system harnesses the power of lower-order N-skipgrams (up to N=3) combined with positional unigram statistics gathered from a small batch of samples. Eva…

    Submitted 3 October, 2023; originally announced October 2023.

  22. arXiv:2309.07287

    eess.AS cs.SD

    Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis

    Authors: Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

    Abstract: The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio may largely save labor in coding children's behaviors, helping clinicians capture critical events and better communicate with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300-ho…

    Submitted 5 June, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted to Interspeech 2024

  23. arXiv:2308.08442

    cs.CL cs.SD eess.AS

    Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

    Authors: Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo

    Abstract: Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or parag…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023

  24. arXiv:2306.15808

    cs.MM cs.SD eess.AS eess.SP

    Classification of Infant Sleep/Wake States: Cross-Attention among Large Scale Pretrained Transformer Networks using Audio, ECG, and IMU Data

    Authors: Kai Chieh Chang, Mark Hasegawa-Johnson, Nancy L. McElwain, Bashima Islam

    Abstract: Infant sleep is critical to brain and behavioral development. Prior studies on infant sleep/wake classification have been largely limited to reliance on expensive and burdensome polysomnography (PSG) tests in the laboratory or wearable devices that collect single-modality data. To facilitate data collection and accuracy of detection, we aimed to advance this field of study by using a multi-modal w…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Preprint for APSIPA2023

  25. arXiv:2306.07926

    eess.AS cs.CL cs.LG cs.SD

    A Theory of Unsupervised Speech Recognition

    Authors: Liming Wang, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework is missing from studying their properties and addressing such issues as sensitivity to hyperparameters and training instability. In this paper, we proposed a gener…

    Submitted 9 June, 2023; originally announced June 2023.

  26. arXiv:2305.16371

    cs.CL cs.SD eess.AS

    INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition

    Authors: Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Automatic Speech Recognition (ASR) systems have attained unprecedented performance with large speech models pre-trained based on self-supervised speech representation learning. However, these pre-trained speech models suffer from representational bias as they tend to better represent those prominent accents (i.e., native (L1) English accent) in the pre-training speech corpus than less represented…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL2023

  27. Towards Robust Family-Infant Audio Analysis Based on Unsupervised Pretraining of Wav2vec 2.0 on Large-Scale Unlabeled Family Audio

    Authors: Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

    Abstract: To perform automatic family audio analysis, past studies have collected recordings using phone, video, or audio-only recording devices like LENA, investigated supervised learning methods, and used or fine-tuned general-purpose embeddings learned from large pretrained models. In this study, we advance the audio component of a new infant wearable multi-modal device called LittleBeats (LB) by learnin…

    Submitted 8 December, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Proceedings of Interspeech 2023; v4 version updates: correction of W2V2-base pretrained on 960-hour of LibriSpeech and number of families participated for LENA home recordings

  28. arXiv:2212.07072

    cs.CL cs.LG

    SMSMix: Sense-Maintained Sentence Mixup for Word Sense Disambiguation

    Authors: Hee Suk Yoon, Eunseop Yoon, John Harvill, Sunjae Yoon, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: Word Sense Disambiguation (WSD) is an NLP task aimed at determining the correct sense of a word in a sentence from discrete sense choices. Although current systems have attained unprecedented performances for such tasks, the nonuniform distribution of word senses during training generally results in systems performing poorly on rare senses. To this end, we consider data augmentation to increase th…

    Submitted 21 December, 2022; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: EMNLP2022

  29. arXiv:2207.04213

    cs.MM cs.CV cs.LG cs.SD eess.AS

    Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction

    Authors: Zhongweiyang Xu, Xulin Fan, Mark Hasegawa-Johnson

    Abstract: Audio-visual target speech extraction, which aims to extract a certain speaker's speech from the noisy mixture by looking at lip movements, has made significant progress combining time-domain speech separation models and visual feature extractors (CNN). One problem of fusing audio and video information is that they have different time resolutions. Most current research upsamples the visual feature…

    Submitted 3 March, 2023; v1 submitted 9 July, 2022; originally announced July 2022.

    Comments: Paper Accepted by ICASSP2023

  30. End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions

    Authors: Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy

    Abstract: Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end…

    Submitted 22 May, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: INTERSPEECH 2023

  31. arXiv:2204.09224

    cs.SD cs.AI eess.AS

    ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

    Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

    Abstract: Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted va…

    Submitted 23 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  32. arXiv:2204.03640

    cs.LG cs.CV

    Equivariance Discovery by Learned Parameter-Sharing

    Authors: Raymond A. Yeh, Yuan-Ting Hu, Mark Hasegawa-Johnson, Alexander G. Schwing

    Abstract: Designing equivariance as an inductive bias into deep-nets has been a prominent approach to build effective models, e.g., a convolutional neural network incorporates translation equivariance. However, incorporating these inductive biases requires knowledge about the equivariance properties of the data, which may not be available, e.g., when encountering a new domain. To address this, we study how…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: AISTATS 2022

  33. arXiv:2203.15863

    eess.AS cs.AI cs.CL

    WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

    Authors: Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

    Abstract: Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning lik…

    Submitted 13 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: submitted to INTERSPEECH 2022

  34. arXiv:2203.15796

    eess.AS cs.AI

    Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

    Authors: Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

    Abstract: An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech techno…

    Submitted 15 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: INTERSPEECH 2022

  35. arXiv:2203.15183

    eess.AS cs.CL cs.SD

    Visualizations of Complex Sequences of Family-Infant Vocalizations Using Bag-of-Audio-Words Approach Based on Wav2vec 2.0 Features

    Authors: Jialu Li, Mark Hasegawa-Johnson, Nancy L. McElwain

    Abstract: In the U.S., approximately 15-17% of children 2-8 years of age are estimated to have at least one diagnosed mental, behavioral or developmental disorder. However, such disorders often go undiagnosed, and the ability to evaluate and treat disorders in the first years of life is limited. To analyze infant developmental changes, previous studies have shown advanced ML models excel at classifying infa…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  36. arXiv:2203.14156

    eess.AS cs.AI cs.SD

    SpeechSplit 2.0: Unsupervised speech disentanglement for voice conversion Without tuning autoencoder Bottlenecks

    Authors: Chak Ho Chan, Kaizhi Qian, Yang Zhang, Mark Hasegawa-Johnson

    Abstract: SpeechSplit can perform aspect-specific voice conversion by disentangling speech into content, rhythm, pitch, and timbre using multiple autoencoders in an unsupervised manner. However, SpeechSplit requires careful tuning of the autoencoder bottlenecks, which can be time-consuming and less robust. This paper proposes SpeechSplit 2.0, which constrains the information flow of the speech component to…

    Submitted 26 March, 2022; originally announced March 2022.

  37. arXiv:2201.11207

    cs.SD cs.CL eess.AS

    Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

    Authors: Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

    Abstract: The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. Wh…

    Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: Accepted for publication in Computer Speech and Language

  38. arXiv:2109.11196

    stat.ML cs.CY cs.LG

    Fast and Efficient MMD-based Fair PCA via Optimization over Stiefel Manifold

    Authors: Junghyun Lee, Gwangsu Kim, Matt Olfat, Mark Hasegawa-Johnson, Chang D. Yoo

    Abstract: This paper defines fair principal component analysis (PCA) as minimizing the maximum mean discrepancy (MMD) between dimensionality-reduced conditional distributions of different protected classes. The incorporation of MMD naturally leads to an exact and tractable mathematical formulation of fairness with good statistical properties. We formulate the problem of fair PCA subject to MMD constraints a…

    Submitted 25 January, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

    Comments: 24 pages, 18 figures. Accepted to the 36th AAAI Conference on Artificial Intelligence (AAAI 2022)

  39. arXiv:2106.08519

    eess.AS cs.LG cs.SD

    Global Rhythm Style Transfer Without Text Transcriptions

    Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

    Abstract: Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony betwe…

    Submitted 15 June, 2021; originally announced June 2021.

  40. arXiv:2012.15484

    cs.CL cs.LG

    Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings

    Authors: Kiran Ramnath, Mark Hasegawa-Johnson

    Abstract: Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirem…

    Submitted 18 June, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: 17 pages

  41. arXiv:2011.12022

    cs.SD cs.LG eess.AS

    Multi-Decoder DPRNN: High Accuracy Source Counting and Separation

    Authors: Junzhe Zhu, Raymond Yeh, Mark Hasegawa-Johnson

    Abstract: We propose an end-to-end trainable approach to single-channel speech separation with unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric on how to evaluate source separation with variable…

    Submitted 30 November, 2020; v1 submitted 24 November, 2020; originally announced November 2020.

    Comments: Project page: https://junzhejosephzhu.github.io/Multi-Decoder-DPRNN/ ; submitted to ICASSP 2021

  42. arXiv:2011.11603  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Interpretable Visual Reasoning via Induced Symbolic Space

    Authors: Zhonghao Wang, Kai Wang, Mo Yu, Jinjun Xiong, Wen-mei Hwu, Mark Hasegawa-Johnson, Humphrey Shi

    Abstract: We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images; and achieve an interpretable model via working on the induced symbolic concept space. To this end, we first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning…

    Submitted 24 August, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: ICCV 2021

  43. arXiv:2011.02698  [pdf, other]

    eess.AS cs.SD

    A Comparison Study on Infant-Parent Voice Diarization

    Authors: Junzhe Zhu, Mark Hasegawa-Johnson, Nancy McElwain

    Abstract: We design a framework for studying prelinguistic child voice from 3 to 24 months based on state-of-the-art algorithms in diarization. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing loss function, to find the best performance. We also p…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: ICASSP 2021

  44. arXiv:2010.12267  [pdf, other]

    cs.CV cs.CL

    Show and Speak: Directly Synthesize Spoken Description of Images

    Authors: Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

    Abstract: This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtai…

    Submitted 17 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

  45. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu…

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  46. arXiv:2009.08064  [pdf, other]

    eess.AS cs.SD

    Utterance-level Intent Recognition from Keywords

    Authors: Wenda Chen, Jonathan Huang, Mark Hasegawa-Johnson

    Abstract: This paper focuses on wake on intent (WOI) techniques for platforms with limited compute and memory. Our approach of utterance-level intent classification is based on a sequence of keywords in the utterance instead of a single fixed key phrase. The keyword sequence is transformed into four types of input features, namely acoustics, phones, word2vec and speech2vec for individual intent learning and…

    Submitted 17 September, 2020; originally announced September 2020.

  47. arXiv:2008.03425  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Deep F-measure Maximization for End-to-End Speech Understanding

    Authors: Leda Sarı, Mark Hasegawa-Johnson

    Abstract: Spoken language understanding (SLU) datasets, like many other machine learning datasets, usually suffer from the label imbalance problem. Label imbalance usually causes the learned model to replicate similar biases at the output which raises the issue of unfairness to the minority classes in the dataset. In this work, we approach the fairness problem by maximizing the F-measure instead of accuracy…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Interspeech 2020 submission (Accepted)

  48. arXiv:2007.15916  [pdf]

    cs.CL cs.CV

    Evaluating Automatically Generated Phoneme Captions for Images

    Authors: Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg

    Abstract: Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions wer…

    Submitted 31 July, 2020; originally announced July 2020.

    Comments: Accepted at Interspeech 2020

  49. Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

    Authors: Jialu Li, Mark Hasegawa-Johnson

    Abstract: Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020

  50. arXiv:2005.11408  [pdf, other]

    eess.AS cs.LG

    Identify Speakers in Cocktail Parties with End-to-End Attention

    Authors: Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

    Abstract: In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn s…

    Submitted 9 August, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Accepted by Interspeech 2020 for presentation; https://github.com/JunzheJosephZhu/Identify-Speakers-in-Cocktail-Parties-with-E2E-Attention