Skip to main content

Showing 1–11 of 11 results for author: Conneau, A

Searching in archive eess. Search in all archives.
.
  1. arXiv:2410.21276  [pdf, other

    cs.CL cs.AI cs.CV cs.CY cs.LG cs.SD eess.AS

    GPT-4o System Card

    Authors: OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander MÄ…dry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis , et al. (395 additional authors not shown)

    Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil… ▽ More

    Submitted 25 October, 2024; originally announced October 2024.

  2. arXiv:2310.08715  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co… ▽ More

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  3. arXiv:2305.13516  [pdf, other

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on… ▽ More

    Submitted 22 May, 2023; originally announced May 2023.

  4. arXiv:2305.13009  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Textually Pretrained Speech Language Models

    Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

    Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de… ▽ More

    Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  5. arXiv:2205.12446  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

    Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

    Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Languag… ▽ More

    Submitted 24 May, 2022; originally announced May 2022.

  6. arXiv:2203.13339  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

    Authors: Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobuyuki Morioka

    Abstract: End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of pai… ▽ More

    Submitted 27 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022

  7. arXiv:2111.09296  [pdf, other

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b… ▽ More

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  8. arXiv:2107.04082  [pdf, other

    cs.CL cs.SD eess.AS

    Improved Language Identification Through Cross-Lingual Self-Supervised Learning

    Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

    Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrain… ▽ More

    Submitted 17 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

  9. arXiv:2105.11084  [pdf, other

    cs.CL cs.SD eess.AS

    Unsupervised Speech Recognition

    Authors: Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment… ▽ More

    Submitted 2 May, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

  10. arXiv:2010.11430  [pdf, other

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

  11. arXiv:2006.13979  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Cross-lingual Representation Learning for Speech Recognition

    Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

    Abstract: This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and… ▽ More

    Submitted 15 December, 2020; v1 submitted 24 June, 2020; originally announced June 2020.