Showing 1–12 of 12 results for author: La Quatra, M

Searching in archive eess.

  1. arXiv:2505.20176

    cs.CL cs.LG eess.AS

    "KAN you hear me?" Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding

    Authors: Alkis Koudounas, Moreno La Quatra, Eliana Pastor, Sabato Marco Siniscalchi, Elena Baralis

    Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains underexplored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within th…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025

  2. arXiv:2505.20163

    cs.CL eess.AS

    Exploring Generative Error Correction for Dysarthric Speech Recognition

    Authors: Moreno La Quatra, Alkis Koudounas, Valerio Mario Salerno, Sabato Marco Siniscalchi

    Abstract: Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different c…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted at INTERSPEECH 2025

  3. arXiv:2505.20050

    eess.AS cs.CL

    MVP: Multi-source Voice Pathology detection

    Authors: Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis

    Abstract: Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strate…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  4. arXiv:2505.19978

    cs.CL cs.SD eess.AS

    DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset

    Authors: Alkis Koudounas, Moreno La Quatra, Elena Baralis

    Abstract: Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems acr…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Currently under review. See the official website: https://salt-research.github.io/DeepDialogue

  5. Bilingual Dual-Head Deep Model for Parkinson's Disease Detection from Speech

    Authors: Moreno La Quatra, Juan Rafael Orozco-Arroyave, Marco Sabato Siniscalchi

    Abstract: This work aims to tackle the Parkinson's disease (PD) detection problem from the speech signal in a bilingual setting by proposing an ad-hoc dual-head deep neural architecture for type-based binary classification. One head is specialized for diadochokinetic patterns. The other head looks for natural speech patterns present in continuous spoken utterances. Only one of the two heads is operative acc…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Accepted at ICASSP 2025 - Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

  6. arXiv:2502.16298

    eess.AS cs.SD

    voc2vec: A Foundation Model for Non-Verbal Vocalization

    Authors: Alkis Koudounas, Moreno La Quatra, Marco Sabato Siniscalchi, Elena Baralis

    Abstract: Speech foundation models have demonstrated exceptional capabilities in speech-related tasks. Nevertheless, these models often struggle with non-verbal audio data, such as vocalizations, baby crying, etc., which are critical for various real-world applications. Audio foundation models handle non-speech data well but also fail to capture the nuanced features of non-verbal human sounds. In this work,…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: Accepted at ICASSP 2025

  7. arXiv:2501.12979

    cs.CL cs.AI cs.SD eess.AS

    FlanEC: Exploring Flan-T5 for Post-ASR Error Correction

    Authors: Moreno La Quatra, Valerio Mario Salerno, Yu Tsao, Sabato Marco Siniscalchi

    Abstract: In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. We explore its application within the GenSEC framework to enhance ASR outputs by mapping n-best hypotheses into a single output sentence. By utilizing n-best lists from ASR models, we aim to improve the linguist…

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: Accepted at the 2024 IEEE Workshop on Spoken Language Technology (SLT) - GenSEC Challenge

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), Macao, 2024, pp. 608-615

  8. arXiv:2408.04773

    cs.SD eess.AS

    Exploiting Consistency-Preserving Loss and Perceptual Contrast Stretching to Boost SSL-based Speech Enhancement

    Authors: Muhammad Salman Khan, Moreno La Quatra, Kuo-Hsuan Hung, Szu-Wei Fu, Sabato Marco Siniscalchi, Yu Tsao

    Abstract: Self-supervised representation learning (SSL) has attained SOTA results on several downstream speech tasks, but SSL-based speech enhancement (SE) solutions still lag behind. To address this issue, we exploit three main ideas: (i) Transformer-based masking generation, (ii) consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). In detail, conformer layers, leveraging an attenti…

    Submitted 8 August, 2024; originally announced August 2024.

  9. Exploiting Foundation Models and Speech Enhancement for Parkinson's Disease Detection from Speech in Real-World Operative Conditions

    Authors: Moreno La Quatra, Maria Francesca Turco, Torbjørn Svendsen, Giampiero Salvi, Juan Rafael Orozco-Arroyave, Sabato Marco Siniscalchi

    Abstract: This work is concerned with devising a robust Parkinson's disease (PD) detector from speech in real-world operating conditions using (i) foundational models, and (ii) speech enhancement (SE) methods. To this end, we first fine-tune several foundational-based models on the standard PC-GITA (s-PC-GITA) clean data. Our results demonstrate superior performance to previously proposed models. Second, we…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  10. arXiv:2405.06573

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  11. Benchmarking Representations for Speech, Music, and Acoustic Events

    Authors: Moreno La Quatra, Alkis Koudounas, Lorenzo Vaiani, Elena Baralis, Luca Cagliero, Paolo Garza, Sabato Marco Siniscalchi

    Abstract: Limited diversity in standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-traine…

    Submitted 1 May, 2024; originally announced May 2024.

  12. ITALIC: An Italian Intent Classification Dataset

    Authors: Alkis Koudounas, Moreno La Quatra, Lorenzo Vaiani, Luca Colomba, Giuseppe Attanasio, Eliana Pastor, Luca Cagliero, Elena Baralis

    Abstract: Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Itali…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023. Data and code at https://github.com/RiTA-nlp/ITALIC