
Showing 1–7 of 7 results for author: Hirschberg, J

  1. arXiv:2506.22972

    eess.AS

    Adaptable Non-parametric Approach for Speech-based Symptom Assessment: Isolating Private Medical Data in a Retrieval Datastore

    Authors: Yu-Wen Chen, Julia Hirschberg

    Abstract: The automatic assessment of health-related acoustic cues has the potential to improve healthcare accessibility and affordability. Although parametric models are promising, they face challenges in privacy and adaptability. To address these, we propose a NoN-Parametric framework for Speech-based symptom Assessment (NoNPSA). By isolating medical data in a retrieval datastore, NoNPSA avoids encoding p…

    Submitted 28 June, 2025; originally announced June 2025.

    Comments: IEEE MLSP 2025

  2. arXiv:2505.17326

    cs.IR cs.SD eess.AS

    VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering

    Authors: Zackary Rackauckas, Julia Hirschberg

    Abstract: We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted to ACL 2025 Workshop MAGMaR

  3. arXiv:2505.17320

    cs.CL cs.SD eess.AS

    Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

    Authors: Zackary Rackauckas, Julia Hirschberg

    Abstract: Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper benchmarks two open-source text-to-speech models--VITS and Style-BERT-VITS2 JP Extra (SBV2JE)--on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean…

    Submitted 22 May, 2025; originally announced May 2025.

  4. arXiv:2410.00316

    cs.CL cs.AI cs.HC cs.SD eess.AS

    EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

    Authors: Haozhe Chen, Run Chen, Julia Hirschberg

    Abstract: While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible…

    Submitted 30 September, 2024; originally announced October 2024.

    Comments: EMNLP 2024 Main

  5. arXiv:2311.07703

    cs.CL cs.SD eess.AS

    Measuring Entrainment in Spontaneous Code-switched Speech

    Authors: Debasmita Bhattacharya, Siying Ding, Alayna Nguyen, Julia Hirschberg

    Abstract: It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such s…

    Submitted 26 March, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Edits: camera-ready manuscript for NAACL 2024

  6. arXiv:2309.01164

    eess.AS cs.LG cs.SD

    Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement

    Authors: Yu-Wen Chen, Julia Hirschberg, Yu Tsao

    Abstract: Speech emotion recognition (SER) often experiences reduced performance due to background noise. In addition, making a prediction on signals with only background noise could undermine user trust in the system. In this study, we propose a Noise Robust Speech Emotion Recognition system, NRSER. NRSER employs speech enhancement (SE) to effectively reduce the noise in input signals. Then, the signal-to-…

    Submitted 3 September, 2023; originally announced September 2023.

  7. arXiv:2308.12490

    cs.CL cs.SD eess.AS

    MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios

    Authors: Yu-Wen Chen, Zhou Yu, Julia Hirschberg

    Abstract: Pronunciation assessment models designed for open response scenarios enable users to practice language skills in a manner similar to real-life communication. However, previous open-response pronunciation assessment models have predominantly focused on a single pronunciation task, such as sentence-level accuracy, rather than offering a comprehensive assessment in various aspects. We propose MultiPA…

    Submitted 4 June, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2024