
Showing 1–17 of 17 results for author: Aldeneh, Z

Searching in archive eess.
  1. arXiv:2506.14767  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    A Variational Framework for Improving Naturalness in Generative Spoken Language Models

    Authors: Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky

    Abstract: The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained o…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: International Conference on Machine Learning (ICML) 2025

  2. arXiv:2411.17690  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

    Authors: Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

    Abstract: The rapid progress of foundation models and large language models (LLMs) has fueled significant improvement in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's abi…

    Submitted 29 May, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

  3. arXiv:2409.11369  [pdf, other]

    cs.SD cs.LG eess.AS

    Learning Spatially-Aware Language and Audio Embeddings

    Authors: Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia

    Abstract: Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute) and how these pieces of ling…

    Submitted 26 November, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

    Comments: 26 pages, 7 figures, accepted at NeurIPS 2024

  4. arXiv:2409.10791  [pdf, other]

    eess.AS cs.SD

    Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

    Abstract: Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.…

    Submitted 17 January, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  5. arXiv:2409.10788  [pdf, other]

    eess.AS cs.SD

    Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models

    Authors: Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh

    Abstract: Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework impacts their performance on downstr…

    Submitted 17 January, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  6. arXiv:2409.10787  [pdf, other]

    eess.AS cs.SD

    Towards Automatic Assessment of Self-Supervised Speech Models using Rank

    Authors: Zakaria Aldeneh, Vimal Thilak, Takuya Higuchi, Barry-John Theobald, Tatiana Likhomanenko

    Abstract: This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the performance of these encoders is resource-intensive and requires labeled data from the downstream tasks. Inspired by the vision domain, where embedding rank has shown promise for evaluating image encoders without…

    Submitted 17 January, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: ICASSP 2025

  7. arXiv:2407.15835  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    dMel: Speech Tokenization made Simple

    Authors: Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly

    Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressor introduces a…

    Submitted 21 May, 2025; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: preprint

  8. arXiv:2402.00340  [pdf, other]

    cs.SD eess.AS

    Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

    Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised spe…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  9. arXiv:2401.17230  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also…

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  10. arXiv:2308.09514  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

    Authors: Miguel Sarabia, Elena Menyaylenko, Alessandro Toso, Skyler Seto, Zakaria Aldeneh, Shadi Pirhosseinloo, Luca Zappella, Barry-John Theobald, Nicholas Apostoloff, Jonathan Sheaffer

    Abstract: We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulate…

    Submitted 18 August, 2023; originally announced August 2023.

    Journal ref: Proceedings of INTERSPEECH (2023), pp. 3724-3728

  11. arXiv:2210.14800  [pdf, other]

    eess.AS cs.HC cs.SD

    Naturalistic Head Motion Generation from Speech

    Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald

    Abstract: Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing it against a single ground truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the varia…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  12. arXiv:2203.10117  [pdf, other]

    cs.SD cs.CV cs.GR eess.AS

    On the role of Lip Articulation in Visual Speech Perception

    Authors: Zakaria Aldeneh, Masha Fedzechkina, Skyler Seto, Katherine Metcalf, Miguel Sarabia, Nicholas Apostoloff, Barry-John Theobald

    Abstract: Generating realistic lip motion from audio to simulate speech production is critical for driving natural character animation. Previous research has shown that traditional metrics used to optimize and assess models for generating lip motion from speech are not a good indicator of subjective opinion of animation quality. Devising metrics that align with subjective opinion first requires understandin…

    Submitted 10 November, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Submitted to ICASSP 2023

  13. Aphasic Speech Recognition using a Mixture of Speech Intelligibility Experts

    Authors: Matthew Perez, Zakaria Aldeneh, Emily Mower Provost

    Abstract: Robust speech recognition is a key prerequisite for semantic feature extraction in automatic aphasic speech analysis. However, standard one-size-fits-all automatic speech recognition models perform poorly when applied to aphasic speech. One reason for this is the wide range of speech intelligibility due to different levels of severity (i.e., higher severity lends itself to less intelligible speech…

    Submitted 24 August, 2020; originally announced August 2020.

    Comments: 4 pages

  14. arXiv:2004.12031  [pdf, ps, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    On the Role of Visual Cues in Audiovisual Speech Enhancement

    Authors: Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz

    Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of…

    Submitted 25 February, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: ICASSP 2021

  15. arXiv:1910.05115  [pdf, ps, other]

    eess.AS cs.SD q-bio.NC

    Identifying Mood Episodes Using Dialogue Features from Clinical Interviews

    Authors: Zakaria Aldeneh, Mimansa Jaiswal, Michael Picheny, Melvin McInnis, Emily Mower Provost

    Abstract: Bipolar disorder, a severe chronic mental illness characterized by pathological mood swings from depression to mania, requires ongoing symptom severity tracking to both guide and measure treatments that are critical for maintaining long-term health. Mental health professionals assess symptom severity through semi-structured clinical interviews. During these interviews, they observe their patients'…

    Submitted 24 March, 2022; v1 submitted 28 September, 2019; originally announced October 2019.

  16. arXiv:1908.08979  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning

    Authors: Mimansa Jaiswal, Zakaria Aldeneh, Emily Mower Provost

    Abstract: Various psychological factors affect how individuals express emotions. Yet, when we collect data intended for use in building emotion recognition systems, we often try to do so by creating paradigms that are designed just with a focus on eliciting emotional behavior. Algorithms trained with these types of data are unlikely to function outside of controlled environments because our emotions natural…

    Submitted 23 August, 2019; originally announced August 2019.

    Comments: 10 pages, ICMI 2019

  17. arXiv:1903.11672  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    MuSE-ing on the Impact of Utterance Ordering On Crowdsourced Emotion Annotations

    Authors: Mimansa Jaiswal, Zakaria Aldeneh, Cristian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea, Emily Mower Provost

    Abstract: Emotion recognition algorithms rely on data annotated with high quality labels. However, emotion expression and perception are inherently subjective. There is generally not a single annotation that can be unambiguously declared "correct". As a result, annotations are colored by the manner in which they were collected. In this paper, we conduct crowdsourcing experiments to investigate this impact o…

    Submitted 27 March, 2019; originally announced March 2019.

    Comments: 5 pages, ICASSP 2019