Showing 1–5 of 5 results for author: Hetz, G

  1. arXiv:2506.09874  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

    Authors: Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet

    Abstract: Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces di…

    Submitted 10 July, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

    Comments: ICML Workshop on Machine Learning for Audio 2025

  2. arXiv:2505.14465  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    FlowTSE: Target Speaker Extraction with Flow Matching

    Authors: Aviv Navon, Aviv Shamsian, Yael Segal-Feldman, Neta Glazer, Gil Hetz, Joseph Keshet

    Abstract: Target speaker extraction (TSE) aims to isolate a specific speaker's speech from a mixture using speaker enrollment as a reference. While most existing approaches are discriminative, recent generative methods for TSE achieve strong results. However, generative methods for TSE remain underexplored, with most existing approaches relying on complex pipelines and pretrained components, leading to comp…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: InterSpeech 2025

  3. arXiv:2409.15869  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

    Authors: Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet

    Abstract: Large transformer-based models have significant potential for speech transcription and translation. Their self-attention mechanisms and parallel processing enable them to capture complex patterns and dependencies in audio sequences. However, this potential comes with challenges, as these large and computationally intensive models lead to slow inference speeds. Various optimization strategies have…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Under Review

  4. arXiv:2406.02649  [pdf, other]

    eess.AS cs.LG cs.SD

    Keyword-Guided Adaptation of Automatic Speech Recognition

    Authors: Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

    Abstract: Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model t…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted to InterSpeech 2024

  5. arXiv:2309.08561  [pdf, other]

    eess.AS cs.LG cs.SD

    Open-vocabulary Keyword-spotting with Adaptive Instance Normalization

    Authors: Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, Joseph Keshet

    Abstract: Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and keyword into a joint embedding space to obtain some affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encod…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Under Review