Showing 1–12 of 12 results for author: Hoory, R

  1. arXiv:2505.23308  [pdf, ps, other]

    eess.AS cs.AI eess.IV

    Spoken question answering for visual queries

    Authors: Nimrod Shabtay, Zvi Kons, Avihu Dekel, Hagai Aronowitz, Ron Hoory, Assaf Arbelle

    Abstract: Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively. This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of sp…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted for Interspeech 2025 (with additional results)

  2. arXiv:2403.11209  [pdf, other]

    cs.CL cs.HC

    Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations

    Authors: Claudio Pinhanez, Raul Fernandez, Marcelo Grave, Julio Nogima, Ron Hoory

    Abstract: Representations of AI agents in user interfaces and robotics are predominantly White, not only in terms of facial and skin features, but also in the synthetic voices they use. In this paper we explore some unexpected challenges in the representation of race we found in the process of developing a U.S. English Text-to-Speech (TTS) system aimed to sound like an educated, professional, regional acce…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: Full version including appendixes

  3. arXiv:2309.11210  [pdf, other]

    eess.AS cs.CL cs.SD

    Speak While You Think: Streaming Speech Synthesis During Text Generation

    Authors: Avihu Dekel, Slava Shechtman, Raul Fernandez, David Haws, Zvi Kons, Ron Hoory

    Abstract: Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using Text-To-Speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. We propose LLM2Speech, an architecture to synthesize speech while text is being generated by an LLM which yields significant l…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Under review for ICASSP 2024

  4. arXiv:2203.00613  [pdf]

    cs.CL cs.LG cs.SD eess.AS

    Towards a Common Speech Analysis Engine

    Authors: Hagai Aronowitz, Itai Gat, Edmilson Morais, Weizhong Zhu, Ron Hoory

    Abstract: Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an eng…

    Submitted 1 March, 2022; originally announced March 2022.

    Comments: ICASSP 2022

  5. arXiv:2202.10137  [pdf, other]

    cs.CL eess.AS

    A new data augmentation method for intent classification enhancement and its application on spoken conversation datasets

    Authors: Zvi Kons, Aharon Satt, Hong-Kwang Kuo, Samuel Thomas, Boaz Carmeli, Ron Hoory, Brian Kingsbury

    Abstract: Intent classifiers are vital to the successful operation of virtual agent systems. This is especially so in voice activated systems where the data can be noisy with many ambiguous directions for user intents. Before operation begins, these classifiers are generally lacking in real-world training data. Active learning is a common approach used to help label large amounts of collected user input. Ho…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

  6. arXiv:2202.03896  [pdf]

    cs.SD cs.AI cs.LG eess.AS

    Speech Emotion Recognition using Self-Supervised Features

    Authors: Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno, Hagai Aronowitz

    Abstract: Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration o…

    Submitted 6 February, 2022; originally announced February 2022.

    Comments: 5 pages, 4 figures, 2 tables, ICASSP 2022

  7. arXiv:2202.01252  [pdf, other]

    cs.LG

    Speaker Normalization for Self-supervised Speech Emotion Recognition

    Authors: Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory

    Abstract: Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion r…

    Submitted 6 November, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

    Comments: ICASSP 22

  8. arXiv:2104.03842  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    RNN Transducer Models For Spoken Language Understanding

    Authors: Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory

    Abstract: We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: To appear in the proceedings of ICASSP 2021

  9. arXiv:2010.04284  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

    Authors: Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny

    Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: 5 pages, published in ICASSP 2020

    ACM Class: I.2.7

  10. arXiv:2009.14386  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Spoken Language Understanding Without Full Transcripts

    Authors: Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, Luis Lastras

    Abstract: An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without word-f…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: 5 pages, to be published in Interspeech 2020

    ACM Class: I.2.7

  11. arXiv:2007.14146  [pdf]

    eess.AS cs.LG cs.SD

    Siamese x-vector reconstruction for domain adapted speaker recognition

    Authors: Shai Rozenberg, Hagai Aronowitz, Ron Hoory

    Abstract: With the rise of voice-activated applications, the need for speaker recognition is rapidly increasing. The x-vector, an embedding approach based on a deep neural network (DNN), is considered the state-of-the-art when proper end-to-end training is not feasible. However, the accuracy significantly decreases when recording conditions (noise, sample rate, etc.) are mismatched, either between the x-vec…

    Submitted 28 July, 2020; originally announced July 2020.

  12. arXiv:1905.00590  [pdf]

    eess.AS cs.SD

    High quality, lightweight and adaptable TTS using LPCNet

    Authors: Zvi Kons, Slava Shechtman, Alex Sorin, Carmel Rabinovitz, Ron Hoory

    Abstract: We present a lightweight adaptable neural TTS system with high quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic feature prediction and Linear Prediction Coding Net as a neural vocoder. This system can synthesize speech with close to natural quality while running 3 times faster than real-time on a standard CPU. The modular setup of the sys…

    Submitted 26 June, 2019; v1 submitted 2 May, 2019; originally announced May 2019.

    Comments: Accepted to Interspeech 2019