Skip to main content

Showing 1–50 of 52 results for author: Pino, J

Searching in archive cs. Search in all archives.
.
  1. arXiv:2506.22554  [pdf, ps, other

    cs.CV cs.AI

    Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

    Authors: Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, Denise Hernandez , et al. (59 additional authors not shown)

    Abstract: Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours… ▽ More

    Submitted 30 June, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

  2. arXiv:2506.22355  [pdf, ps, other

    cs.AI

    Embodied AI Agents: Modeling the World

    Authors: Pascale Fung, Yoram Bachrach, Asli Celikyilmaz, Kamalika Chaudhuri, Delong Chen, Willy Chung, Emmanuel Dupoux, Hongyu Gong, Hervé Jégou, Alessandro Lazaric, Arjun Majumdar, Andrea Madotto, Franziska Meier, Florian Metze, Louis-Philippe Morency, Théo Moutakanni, Juan Pino, Basile Terver, Joseph Tighe, Paden Tomasello, Jitendra Malik

    Abstract: This paper describes our research on AI agents embodied in visual, virtual or physical forms, enabling them to interact with both users and their environments. These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn and act within their surroundings, which makes them more similar to how humans learn and interact with the environments as compared t… ▽ More

    Submitted 7 July, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

  3. arXiv:2410.00215  [pdf, other

    cs.LG

    Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

    Authors: Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez , et al. (5 additional authors not shown)

    Abstract: Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only its applications have broadened to various sectors but also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To susta… ▽ More

    Submitted 9 May, 2025; v1 submitted 30 September, 2024; originally announced October 2024.

    Comments: 13 pages including references. 8 Figures. Under review to HPCA 2025 Industry Track

  4. Residual-based Attention Physics-informed Neural Networks for Spatio-Temporal Ageing Assessment of Transformers Operated in Renewable Power Plants

    Authors: Ibai Ramirez, Joel Pino, David Pardo, Mikel Sanz, Luis del Rio, Alvaro Ortiz, Kateryna Morozovska, Jose I. Aizpurua

    Abstract: Transformers are crucial for reliable and efficient power system operations, particularly in supporting the integration of renewable energy. Effective monitoring of transformer health is critical to maintain grid stability and performance. Thermal insulation ageing is a key transformer failure mode, which is generally tracked by monitoring the hotspot temperature (HST). However, HST measurement is… ▽ More

    Submitted 3 October, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

    Comments: 23 pages, 18 figures

  5. arXiv:2403.14402  [pdf, other

    cs.SD cs.CL eess.AS

    XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

    Authors: HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

    Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-v… ▽ More

    Submitted 12 August, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: ACL2024

  6. arXiv:2402.05755  [pdf, other

    cs.CL cs.SD eess.AS

    Spirit LM: Interleaved Spoken and Written Language Model

    Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

    Abstract: We introduce Spirit LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-c… ▽ More

    Submitted 18 October, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  7. arXiv:2312.05187  [pdf, other

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek , et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4… ▽ More

    Submitted 8 December, 2023; originally announced December 2023.

  8. arXiv:2308.11596  [pdf, other

    cs.CL

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim , et al. (43 additional authors not shown)

    Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s… ▽ More

    Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    ACM Class: I.2.7

  9. arXiv:2307.08655  [pdf, other

    cs.CL cs.SD eess.AS

    Multilingual Speech-to-Speech Translation into Multiple Target Languages

    Authors: Hongyu Gong, Ning Dong, Sravya Popuri, Vedanuj Goswami, Ann Lee, Juan Pino

    Abstract: Speech-to-speech translation (S2ST) enables spoken communication between people talking in different languages. Despite a few studies on multilingual S2ST, their focus is the multilinguality on the source side, i.e., the translation from multiple source languages to one target language. We present the first work on multilingual S2ST supporting multiple target languages. Leveraging recent advance i… ▽ More

    Submitted 17 July, 2023; originally announced July 2023.

  10. arXiv:2306.01084  [pdf, other

    cs.SD eess.AS

    Exploration on HuBERT with Multiple Resolutions

    Authors: Jiatong Shi, Yun Tang, Hirofumi Inaguma, Hongyu GOng, Juan Pino, Shinji Watanabe

    Abstract: Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20ms resolution for hidden representations would not be optimal for various speech-processing tasks since their attributes (e.g., speaker characteristics and semantics) are based on different time scales. To address this limitation, we propose utilizing HuBERT repr… ▽ More

    Submitted 22 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech2023

  11. arXiv:2305.03101  [pdf, other

    cs.CL cs.SD eess.AS

    Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

    Authors: Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello, Juan Pino

    Abstract: Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes and each has its own benefits and drawbacks for speech-to-text tasks. In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new… ▽ More

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: ACL 2023 main conference

  12. arXiv:2304.04618  [pdf, other

    cs.SD cs.CL eess.AS

    Enhancing Speech-to-Speech Translation with Multiple TTS Targets

    Authors: Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

    Abstract: It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech. Therefore to train a direct S2ST system, previous works usually utilize text-to-speech (TTS) systems to generate samples in the target language by augmenting the data from speech-to-text translatio… ▽ More

    Submitted 10 April, 2023; originally announced April 2023.

  13. arXiv:2304.04596  [pdf, other

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-… ▽ More

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration

  14. arXiv:2303.00628  [pdf, ps, other

    cs.CL eess.AS

    MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

    Authors: Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang

    Abstract: We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translati… ▽ More

    Submitted 7 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

  15. arXiv:2301.11716  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Pre-training for Speech Translation: CTC Meets Optimal Transport

    Authors: Phuong-Hang Le, Hongyu Gong, Changhan Wang, Juan Pino, Benjamin Lecouteux, Didier Schwab

    Abstract: The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC)… ▽ More

    Submitted 5 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: ICML 2023 (oral presentation). This version fixed URLs, updated affiliations & acknowledgements, and improved formatting

  16. arXiv:2212.08055  [pdf, other

    cs.CL cs.SD eess.AS

    UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

    Authors: Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino

    Abstract: Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword predictio… ▽ More

    Submitted 26 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: ACL 2023 (main conference)

  17. arXiv:2211.06474  [pdf, other

    cs.CL cs.SD eess.AS

    Speech-to-Speech Translation For A Real-world Unwritten Language

    Authors: Peng-Jen Chen, Kevin Tran, Yilin Yang, Jingfei Du, Justine Kao, Yu-An Chung, Paden Tomasello, Paul-Ambroise Duquenne, Holger Schwenk, Hongyu Gong, Hirofumi Inaguma, Sravya Popuri, Changhan Wang, Juan Pino, Wei-Ning Hsu, Ann Lee

    Abstract: We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating… ▽ More

    Submitted 11 November, 2022; originally announced November 2022.

  18. arXiv:2211.04508  [pdf, other

    cs.CL cs.SD eess.AS

    SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

    Authors: Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswani, Changhan Wang, Juan Pino, Benoît Sagot, Holger Schwenk

    Abstract: We present SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech. To evaluate the quality of this parallel speech, we train bilingual speech-to-speech translation models on mined data only and establish extensive basel… ▽ More

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 18 pages

  19. arXiv:2210.10191  [pdf, other

    cs.CL cs.SD eess.AS

    Simple and Effective Unsupervised Speech Translation

    Authors: Changhan Wang, Hirofumi Inaguma, Peng-Jen Chen, Ilia Kulikov, Yun Tang, Wei-Ning Hsu, Michael Auli, Juan Pino

    Abstract: The amount of labeled data to train models for speech tasks is limited for most languages, however, the data scarcity is exacerbated for speech translation which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognit… ▽ More

    Submitted 18 October, 2022; originally announced October 2022.

  20. arXiv:2204.05409  [pdf, other

    cs.CL

    Unified Speech-Text Pre-training for Speech Translation and Recognition

    Authors: Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

    Abstract: We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask leverages unlabelled speech data, and a (self-)supervised text to text subtask makes use of abundant text training data.… ▽ More

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: ACL 2022 main conference

  21. arXiv:2204.02967  [pdf, other

    cs.CL cs.SD eess.AS

    Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

    Authors: Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

    Abstract: Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and… ▽ More

    Submitted 13 September, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to be published in the Proceedings of Interspeech 2022

  22. arXiv:2112.08352  [pdf, other

    cs.CL cs.AI cs.LG eess.AS

    Textless Speech-to-Speech Translation on Real Data

    Authors: Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu

    Abstract: We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based… ▽ More

    Submitted 4 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: Accepted to NAACL 2022 (long paper)

  23. arXiv:2111.09296  [pdf, other

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b… ▽ More

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  24. arXiv:2110.08250  [pdf, other

    cs.CL cs.SD eess.AS

    Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

    Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Phillip Koehn, Juan Pino

    Abstract: We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model, Furthermore, the generation of translation is independent from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations, instead of continuous spectrogram features, learned in an unsupervised m… ▽ More

    Submitted 12 January, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  25. arXiv:2110.08214  [pdf, other

    cs.CL cs.SD eess.AS

    From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

    Authors: Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino

    Abstract: Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge of delivering S2ST in real time is the accumulated delay between the translation and speech synthesis modules. While recently incremental text-to-speech (iTTS) models have shown large quality improvements, they typically require additional future text inputs to reach optimal performance. In this wo… ▽ More

    Submitted 15 July, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted by Interspeech 2022

  26. arXiv:2109.06912  [pdf, other

    eess.AS cs.CL cs.SD

    fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

    Authors: Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino

    Abstract: This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis,… ▽ More

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021 Demo

  27. arXiv:2107.06959  [pdf, ps, other

    cs.CL cs.SD eess.AS

    FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

    Authors: Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal

    Abstract: In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We furth… ▽ More

    Submitted 14 August, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

    Comments: Accepted by IWSLT 2021 as a system paper

  28. arXiv:2107.05782  [pdf, other

    cs.CL cs.SD eess.AS

    Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task

    Authors: Yun Tang, Juan Pino, Xian Li, Changhan Wang, Dmitriy Genzel

    Abstract: Pretraining and multitask learning are widely used to improve the speech to text translation performance. In this study, we are interested in training a speech to text translation model along with an auxiliary text to text translation task. We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework. Our analysis confirm… ▽ More

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: Accepted by ACL 2021

  29. arXiv:2107.05604  [pdf, other

    cs.CL cs.LG eess.AS

    Direct speech-to-speech translation with discrete units

    Authors: Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu

    Abstract: We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representa… ▽ More

    Submitted 21 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

    Comments: Accepted to ACL 2022 (long paper)

  30. arXiv:2106.10840  [pdf, other

    cs.CL cs.AI

    Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

    Authors: Hongyu Gong, Yun Tang, Juan Pino, Xian Li

    Abstract: Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we fin… ▽ More

    Submitted 21 June, 2021; originally announced June 2021.

  31. arXiv:2106.01463  [pdf, other

    cs.CL

    Lightweight Adapter Tuning for Multilingual Speech Translation

    Authors: Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

    Abstract: Adapter modules were recently introduced as an efficient alternative to fine-tuning in NLP. Adapter tuning consists in freezing pretrained parameters of a model and injecting lightweight modules between layers, resulting in the addition of only a small number of task-specific trainable parameters. While adapter tuning was investigated for multilingual neural machine translation, this paper propose… ▽ More

    Submitted 12 July, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

    Comments: Accepted at ACL-IJCNLP 2021

  32. arXiv:2104.06678  [pdf, ps, other

    cs.CL

    Large-Scale Self- and Semi-Supervised Learning for Speech Translation

    Authors: Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau

    Abstract: In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four… ▽ More

    Submitted 14 April, 2021; originally announced April 2021.

  33. arXiv:2101.00390  [pdf, other

    cs.CL eess.AS

    VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

    Authors: Changhan Wang, Morgane Rivière, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux

    Abstract: We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We pro… ▽ More

    Submitted 27 July, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: Accepted to ACL 2021 (long paper)

  34. arXiv:2011.02048  [pdf, other

    cs.CL

    SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation

    Authors: Xutai Ma, Juan Pino, Philipp Koehn

    Abstract: Simultaneous text translation and end-to-end speech translation have recently made great progress but little work has combined these tasks together. We investigate how to adapt simultaneous text translation methods such as wait-k and monotonic multihead attention to end-to-end simultaneous speech translation by introducing a pre-decision module. A detailed analysis is provided on the latency-quali… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

  35. arXiv:2011.00747  [pdf, other

    cs.CL cs.SD eess.AS

    Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

    Authors: Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

    Abstract: We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one… ▽ More

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted at COLING 2020 (Oral)

    Journal ref: The 28th International Conference on Computational Linguistics (COLING 2020)

  36. arXiv:2011.00033  [pdf, other

    cs.CL

    Streaming Simultaneous Speech Translation with Augmented Memory Transformer

    Authors: Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

    Abstract: Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation… ▽ More

    Submitted 30 October, 2020; originally announced November 2020.

  37. arXiv:2010.12829  [pdf, other

    cs.CL

    Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

    Authors: Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: We present a simple yet effective approach to build multilingual speech-to-text (ST) translation by efficient transfer learning from pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by only finetuning less than 10% of the pretrained parameters. This enab… ▽ More

    Submitted 2 January, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

  38. arXiv:2010.11338  [pdf, other

    cs.CL

    A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks

    Authors: Yun Tang, Juan Pino, Changhan Wang, Xutai Ma, Dmitriy Genzel

    Abstract: Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success heavily relies on the availability of large amounts of training data. This presents a challenge for speech applications where labelled speech data is very expensive to obtain, such as automatic speech recognition (ASR) and speech… ▽ More

    Submitted 11 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Accepted for ICASSP2021

  39. arXiv:2010.05171  [pdf, other

    cs.CL eess.AS

    fairseq S2T: Fast Speech-to-Text Modeling with fairseq

    Authors: Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino

    Abstract: We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation. It follows fairseq's careful design for scalability and extensibility. We provide end-to-end workflows from data pre-processing, model training to offline (online) inference. We implement state-of-the-art RNN-based, Transformer-based as well as… ▽ More

    Submitted 14 June, 2022; v1 submitted 11 October, 2020; originally announced October 2020.

    Comments: Post-conference updates (accepted to AACL 2020 Demo)

  40. arXiv:2007.16193  [pdf, other

    cs.CL

    SimulEval: An Evaluation Toolkit for Simultaneous Translation

    Authors: Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, Juan Pino

    Abstract: Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario where the model starts translating before reading the complete source input. Evaluating simultaneous translation models is more complex than offline models because the latency is another factor to consider in addition to translation quality. The research community, despite its growing focus on novel mo… ▽ More

    Submitted 31 July, 2020; originally announced July 2020.

  41. arXiv:2007.10310  [pdf, ps, other

    cs.CL cs.SD eess.AS

    CoVoST 2 and Massively Multilingual Speech-to-Text Translation

    Authors: Changhan Wang, Anne Wu, Juan Pino

    Abstract: Speech translation has recently become an increasingly popular topic of research, partly due to the development of benchmark datasets. Nevertheless, current datasets cover a limited number of languages. With the aim to foster research in massive multilingual speech translation and speech translation for low resource language pairs, we release CoVoST 2, a large-scale multilingual speech translation… ▽ More

    Submitted 24 October, 2020; v1 submitted 20 July, 2020; originally announced July 2020.

  42. arXiv:2006.12124  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Self-Supervised Representations Improve End-to-End Speech Translation

    Authors: Anne Wu, Changhan Wang, Juan Pino, Jiatao Gu

    Abstract: End-to-end speech-to-text translation can provide a simpler and smaller system but is facing the challenge of data scarcity. Pre-training methods can leverage unlabeled data and have been shown to be effective on data-scarce settings. In this work, we explore whether self-supervised pre-trained speech representations can benefit the speech translation task in both high- and low-resource settings,… ▽ More

    Submitted 24 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Accepted to INTERSPEECH 2020

  43. arXiv:2006.05474  [pdf, other

    eess.AS cs.CL cs.SD

    Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation

    Authors: Changhan Wang, Juan Pino, Jiatao Gu

    Abstract: Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language modeling (decoder) for the same language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation… ▽ More

    Submitted 9 October, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: Accepted to INTERSPEECH 2020

  44. arXiv:2006.02490  [pdf, ps, other

    cs.CL cs.SD eess.AS

    Self-Training for End-to-End Speech Translation

    Authors: Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang

    Abstract: One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the art performance. The effect of the quality of the p… ▽ More

    Submitted 13 October, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: INTERSPEECH 2020

  45. arXiv:2002.12231  [pdf, other

    eess.AS cs.CL cs.SD

    SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

    Authors: Arya D. McCarthy, Liezl Puzon, Juan Pino

    Abstract: We propose autoencoding speaker conversion for training data augmentation in automatic speech translation. This technique directly transforms an audio sequence, resulting in audio synthesized to resemble another speaker's voice. Our method compares favorably to SpecAugment on English$\to$French and English$\to$Romanian automatic speech translation (AST) tasks as well as on a low-resource English a… ▽ More

    Submitted 27 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  46. arXiv:2002.01320  [pdf, other

    cs.CL

    CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

    Authors: Changhan Wang, Juan Pino, Anne Wu, Jiatao Gu

    Abstract: Spoken language translation has recently witnessed a resurgence in popularity, thanks to the development of end-to-end models and the creation of new corpora, such as Augmented LibriSpeech and MuST-C. Existing datasets involve language pairs with English as a source language, involve very specific domains or are low resource. We introduce CoVoST, a multilingual speech-to-text translation corpus fr… ▽ More

    Submitted 9 June, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

    Comments: LREC 2020 camera-ready

  47. arXiv:1909.12406  [pdf, other

    cs.CL

    Monotonic Multihead Attention

    Authors: Xutai Ma, Juan Pino, James Cross, Liezl Puzon, Jiatao Gu

    Abstract: Simultaneous machine translation models start generating a target sequence before they have encoded or read the source sequence. Recent approaches for this task either apply a fixed policy on a state-of-the art Transformer model, or a learnable monotonic attention on a weaker recurrent neural network-based structure. In this paper, we propose a new attention mechanism, Monotonic Multihead Attentio… ▽ More

    Submitted 26 September, 2019; originally announced September 2019.

  48. arXiv:1909.06515  [pdf, other

    cs.CL cs.SD eess.AS

    Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade

    Authors: Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, Deepak Gopinath

    Abstract: For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and… ▽ More

    Submitted 22 October, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: IWSLT 2019

  49. arXiv:1906.11943  [pdf, other

    cs.CL

    Findings of the First Shared Task on Machine Translation Robustness

    Authors: Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, Hassan Sajjad

    Abstract: We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models; robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evalua… ▽ More

    Submitted 3 July, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

  50. arXiv:1903.06620  [pdf, other

    cs.CL

    On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

    Authors: Paul Michel, Xian Li, Graham Neubig, Juan Miguel Pino

    Abstract: Adversarial examples --- perturbations to the input of a model that elicit large changes in the output --- have been shown to be an effective way of assessing the robustness of sequence-to-sequence (seq2seq) models. However, these perturbations only indicate weaknesses in the model if they do not change the input so significantly that it legitimately results in changes in the expected output. This… ▽ More

    Submitted 18 March, 2019; v1 submitted 15 March, 2019; originally announced March 2019.

    Comments: NAACL-HLT 2019 long paper