Showing 1–12 of 12 results for author: Gaido, M

  1. arXiv:2505.13404  [pdf, other]

    cs.CL eess.AS

    Granary: Speech Recognition and Translation Dataset in 25 European Languages

    Authors: Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, Boris Ginsburg

    Abstract: Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance d…

    Submitted 21 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025. v2: Added links

  2. arXiv:2501.02370  [pdf, other]

    cs.CL cs.SD eess.AS

    Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison

    Authors: Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow

    Abstract: Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech -- the most common form of communication. The most widespread approach to integrating speech into LLMs is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training…

    Submitted 7 February, 2025; v1 submitted 4 January, 2025; originally announced January 2025.

    Comments: Accepted at NAACL 2025

  3. arXiv:2412.11978  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

    Authors: Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri

    Abstract: While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off i…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted at COLING 2025 main conference

  4. arXiv:2411.01710  [pdf, other]

    cs.CL cs.SD eess.AS

    SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation

    Authors: Dennis Fucci, Marco Gaido, Beatrice Savoldi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli

    Abstract: Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques fa…

    Submitted 14 March, 2025; v1 submitted 3 November, 2024; originally announced November 2024.

  5. arXiv:2410.01036  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

    Authors: Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri

    Abstract: The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this…

    Submitted 1 October, 2024; originally announced October 2024.

    Comments: Accepted at EMNLP 2024 Main Conference

  6. arXiv:2408.03900  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

    Authors: Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier

    Abstract: We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: Accepted at INTERSPEECH 2024. This version includes the same content but with additional appendices

  7. arXiv:2406.14177  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation

    Authors: Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli

    Abstract: This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the…

    Submitted 20 June, 2024; originally announced June 2024.

  8. arXiv:2406.06097  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

    Authors: Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli

    Abstract: Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL 2024 main conference

  9. arXiv:2309.15554  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Direct Models for Simultaneous Translation and Automatic Subtitling: FBK@IWSLT2023

    Authors: Sara Papi, Marco Gaido, Matteo Negri

    Abstract: This paper describes the FBK's participation in the Simultaneous Translation and Automatic Subtitling tracks of the IWSLT 2023 Evaluation Campaign. Our submission focused on the use of direct architectures to perform both tasks: for the simultaneous one, we leveraged the knowledge already acquired by offline-trained models and directly applied a policy to obtain the real-time inference; for the su…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Published at IWSLT 2023

    Journal ref: Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

  10. arXiv:2106.12607  [pdf, other]

    cs.CL cs.SD eess.AS

    Dealing with training and test segmentation mismatch: FBK@IWSLT2021

    Authors: Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi

    Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning st…

    Submitted 28 June, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: Accepted at IWSLT2021

    Journal ref: Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

  11. arXiv:2104.11710  [pdf, other]

    cs.SD cs.CL eess.AS

    Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation

    Authors: Marco Gaido, Matteo Negri, Mauro Cettolo, Marco Turchi

    Abstract: The audio segmentation mismatch between training data and those seen at run-time is a major problem in direct speech translation. Indeed, while systems are usually trained on manually segmented corpora, in real use cases they are often presented with continuous audio requiring automatic (and sub-optimal) segmentation. After comparing existing techniques (VAD-based, fixed-length and hybrid segmenta…

    Submitted 14 October, 2021; v1 submitted 23 April, 2021; originally announced April 2021.

    Comments: Accepted to ICNLSP 2021

  12. arXiv:2006.02965  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020

    Authors: Marco Gaido, Mattia Antonino Di Gangi, Matteo Negri, Marco Turchi

    Abstract: This paper describes FBK's participation in the IWSLT 2020 offline speech translation (ST) task. The task evaluates systems' ability to translate English TED talks audio into German texts. The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation. Participants can decide whether to work on custom…

    Submitted 4 June, 2020; originally announced June 2020.

    Comments: Accepted at IWSLT2020