Showing 1–50 of 64 results for author: Jaitly, N

Searching in archive cs.
  1. arXiv:2504.16431  [pdf, other]

    cs.LG

    Target Concrete Score Matching: A Holistic Framework for Discrete Diffusion

    Authors: Ruixiang Zhang, Shuangfei Zhai, Yizhe Zhang, James Thornton, Zijing Ou, Joshua Susskind, Navdeep Jaitly

    Abstract: Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing disc…

    Submitted 23 April, 2025; originally announced April 2025.

  2. arXiv:2503.03040  [pdf, other]

    cs.CL cs.AI

    SAGE: Steering and Refining Dialog Generation with State-Action Augmentation

    Authors: Yizhe Zhang, Navdeep Jaitly

    Abstract: Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Acti…

    Submitted 4 March, 2025; originally announced March 2025.

  3. arXiv:2502.18435  [pdf, other]

    cs.CL cs.IT cs.LG

    Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions

    Authors: Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly

    Abstract: Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed…

    Submitted 19 March, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  4. arXiv:2501.08248  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

    Authors: Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han

    Abstract: Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LC…

    Submitted 28 February, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

  5. arXiv:2412.21139  [pdf, other]

    cs.SE cs.CL

    Training Software Engineering Agents and Verifiers with SWE-Gym

    Authors: Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

    Abstract: We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popul…

    Submitted 30 December, 2024; originally announced December 2024.

    Comments: Code at https://github.com/SWE-Gym/SWE-Gym

  6. arXiv:2412.06329  [pdf, other]

    cs.CV cs.LG

    Normalizing Flows are Capable Generative Models

    Authors: Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind

    Abstract: Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly perfor…

    Submitted 9 December, 2024; v1 submitted 9 December, 2024; originally announced December 2024.

  7. arXiv:2411.17690  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

    Authors: Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

    Abstract: In this paper, we propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the ta…

    Submitted 26 November, 2024; originally announced November 2024.

  8. arXiv:2411.02437  [pdf, other]

    cs.CV cs.AI

    TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models

    Authors: Georgia Gabriela Sampaio, Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Josh Susskind, Navdeep Jaitly, Yizhe Zhang

    Abstract: Evaluating text-to-image generative models remains a challenge, despite the remarkable progress being made in their overall performances. While existing metrics like CLIPScore work for coarse evaluations, they lack the sensitivity to distinguish finer differences as model performance rapidly improves. In this work, we focus on the text rendering aspect of these models, which provides a lens for ev…

    Submitted 2 November, 2024; originally announced November 2024.

  9. arXiv:2410.23698  [pdf, other]

    cs.LG cs.CV

    Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

    Authors: Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly, Josh Susskind

    Abstract: Large pretrained vision-language models like CLIP have shown promising generalization capability, but may struggle in specialized domains (e.g., satellite imagery) or fine-grained classification (e.g., car models) where the visual concepts are unseen or under-represented during pretraining. Prompt learning offers a parameter-efficient finetuning framework that can adapt CLIP to downstream tasks ev…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024

  10. arXiv:2410.08159  [pdf, other]

    cs.CV cs.LG

    DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

    Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai

    Abstract: Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies…

    Submitted 23 January, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: Accepted by ICLR2025

  11. arXiv:2408.03906  [pdf, other]

    cs.RO

    Achieving Human Level Competitive Robot Table Tennis

    Authors: David B. D'Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Heni Ben Amor, Alex Bewley, Barney J. Reed, Krista Reymann, Leila Takayama, Yuval Tassa, Krzysztof Choromanski, Erwin Coumans, Deepali Jain, Navdeep Jaitly, Natasha Jaques, Satoshi Kataoka, Yuheng Kuang, Nevena Lazic, Reza Mahjourian, Sherry Moore, Kenneth Oslund, Anish Shankar, Vikas Sindhwani, Vincent Vanhoucke, Grace Vesom, et al. (2 additional authors not shown)

    Abstract: Achieving human-level speed and performance on real world tasks is a north star for the robotics research community. This work takes a step towards that goal and presents the first learned robot agent that reaches amateur human-level performance in competitive table tennis. Table tennis is a physically demanding sport which requires human players to undergo years of training to achieve an advanced…

    Submitted 1 May, 2025; v1 submitted 7 August, 2024; originally announced August 2024.

  12. arXiv:2407.15835  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    dMel: Speech Tokenization made Simple

    Authors: He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly

    Abstract: Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated complicated speech tokenization methods to discretize continuous speech signals so that language modeling techniques can be applied to speech data. However, existing approaches either model semantic (content) t…

    Submitted 2 October, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: under review

  13. arXiv:2406.00633  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Improving GFlowNets for Text-to-Image Diffusion Alignment

    Authors: Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Josh Susskind, Navdeep Jaitly, Shuangfei Zhai

    Abstract: Diffusion models have become the de-facto approach for generating visual data, which are trained to match the distribution of the training dataset. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal throu…

    Submitted 25 December, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

  14. arXiv:2405.21048  [pdf, other]

    cs.CV

    Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

    Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind

    Abstract: Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregr…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 22 pages, 14 figures

  15. arXiv:2405.15216  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

    Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

    Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they showed little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: under review

  16. arXiv:2403.04732  [pdf, other]

    cs.AI cs.CL cs.CV

    How Far Are We from Intelligent Visual Deductive Reasoning?

    Authors: Yizhe Zhang, He Bai, Ruixiang Zhang, Jiatao Gu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly

    Abstract: Vision-Language Models (VLMs) have recently demonstrated incredible strides on diverse vision language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deduct…

    Submitted 1 October, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: COLM 2024. https://github.com/apple/ml-rpm-bench

  17. arXiv:2402.15000  [pdf, other]

    cs.CL cs.LG

    Divide-or-Conquer? Which Part Should You Distill Your LLM?

    Authors: Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang

    Abstract: Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothes…

    Submitted 19 November, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Findings of the Association for Computational Linguistics: EMNLP 2024

    Journal ref: 2024.findings-emnlp.145

  18. arXiv:2401.16380  [pdf, other]

    cs.CL

    Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

    Authors: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly

    Abstract: Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending s…

    Submitted 29 January, 2024; originally announced January 2024.

  19. arXiv:2312.11539  [pdf, other]

    cs.AI cs.CL cs.LG

    KGLens: Towards Efficient and Effective Knowledge Probing of Large Language Models with Knowledge Graphs

    Authors: Shangshang Zheng, He Bai, Yizhe Zhang, Yi Su, Xiaochuan Niu, Navdeep Jaitly

    Abstract: Large Language Models (LLMs) might hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially with domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe the factualness and identify the knowledge blind spots of LLMs. However, verifying the LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Th…

    Submitted 31 July, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: ACL 2024 Workshop Towards Knowledgeable Language Models

  20. arXiv:2311.17932  [pdf, other]

    physics.chem-ph cs.LG

    Swallowing the Bitter Pill: Simplified Scalable Conformer Generation

    Authors: Yuyang Wang, Ahmed A. Elhag, Navdeep Jaitly, Joshua M. Susskind, Miguel Angel Bautista

    Abstract: We present a novel way to predict molecular conformers through a simple formulation that sidesteps many of the heuristics of prior works and achieves state of the art results by using the advantages of scale. By training a diffusion generative model directly on 3D atomic positions without making assumptions about the explicit structure of molecules (e.g. modeling torsional angles) we are able to r…

    Submitted 10 May, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: 19 pages, 11 figures

  21. arXiv:2310.15111  [pdf, other]

    cs.CV cs.LG

    Matryoshka Diffusion Models

    Authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly

    Abstract: Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion M…

    Submitted 30 August, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted by ICLR2024

  22. arXiv:2310.01468  [pdf, other]

    cs.CL cs.AI cs.HC cs.LG

    Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

    Authors: Yizhe Zhang, Jiarui Lu, Navdeep Jaitly

    Abstract: Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking,…

    Submitted 20 February, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: 24 pages

  23. arXiv:2309.11669  [pdf, other]

    cs.CL

    Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation

    Authors: Ali Mousavi, Xin Zhan, He Bai, Peng Shi, Theo Rekatsinas, Benjamin Han, Yunyao Li, Jeff Pound, Josh Susskind, Natalie Schluter, Ihab Ilyas, Navdeep Jaitly

    Abstract: Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: 16 pages

  24. arXiv:2309.03964  [pdf, other]

    cs.LG cs.CV

    REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation

    Authors: Skyler Seto, Barry-John Theobald, Federico Danieli, Navdeep Jaitly, Dan Busbridge

    Abstract: Fully-test-time adaptation (F-TTA) can mitigate performance loss due to distribution shifts between train and test data (1) without access to the training data, and (2) without knowledge of the model training procedure. In online F-TTA, a pre-trained model is adapted using a stream of test samples by minimizing a self-supervised objective, such as entropy minimization. However, models adapted with…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: Accepted at WACV 2024, 17 pages, 7 figures, 11 tables

  25. Robotic Table Tennis: A Case Study into a High Speed Learning System

    Authors: David B. D'Ambrosio, Jonathan Abelian, Saminda Abeyruwan, Michael Ahn, Alex Bewley, Justin Boyd, Krzysztof Choromanski, Omar Cortes, Erwin Coumans, Tianli Ding, Wenbo Gao, Laura Graesser, Atil Iscen, Navdeep Jaitly, Deepali Jain, Juhana Kangaspunta, Satoshi Kataoka, Gus Kouretas, Yuheng Kuang, Nevena Lazic, Corey Lynch, Reza Mahjourian, Sherry Q. Moore, Thinh Nguyen, Ken Oslund, et al. (10 additional authors not shown)

    Abstract: We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real w…

    Submitted 19 February, 2025; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Published and presented at Robotics: Science and Systems (RSS2023)

  26. arXiv:2306.02531  [pdf, other]

    cs.CL

    PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model

    Authors: Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly

    Abstract: Autoregressive models for text sometimes generate repetitive and low-quality output because errors accumulate during the steps of generation. This issue is often attributed to exposure bias - the difference between how a model is trained, and how it is used during inference. Denoising diffusion models provide an alternative approach in which a model can revisit and revise its output. However, they…

    Submitted 22 March, 2024; v1 submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted by NeurIPS 2023, code at https://github.com/apple/ml-planner

  27. arXiv:2212.01562  [pdf, other]

    cs.LG cs.CV

    Understanding the Robustness of Multi-Exit Models under Common Corruptions

    Authors: Akshay Mehra, Skyler Seto, Navdeep Jaitly, Barry-John Theobald

    Abstract: Multi-Exit models (MEMs) use an early-exit strategy to improve the accuracy and efficiency of deep neural networks (DNNs) by allowing samples to exit the network before the last layer. However, the effectiveness of MEMs in the presence of distribution shifts remains largely unexplored. Our work examines how distribution shifts generated by common image corruptions affect the accuracy/efficiency of…

    Submitted 3 December, 2022; originally announced December 2022.

    Comments: 16 pages, 22 figures

  28. arXiv:2211.06007  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Continuous Soft Pseudo-Labeling in ASR

    Authors: Tatiana Likhomanenko, Ronan Collobert, Navdeep Jaitly, Samy Bengio

    Abstract: Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final mo…

    Submitted 30 January, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

  29. arXiv:2211.00854  [pdf, other]

    cs.LG cs.SD eess.AS

    More Speaking or More Speakers?

    Authors: Dan Berrebbi, Ronan Collobert, Navdeep Jaitly, Tatiana Likhomanenko

    Abstract: Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). In spite of these advances, to the best of our knowledge, there is no analysis of how the composition of the labelled and unlabelled datasets used in these methods affects the results. In this work we aim to analyse the effect of the number of speakers in the train…

    Submitted 2 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  30. arXiv:2210.08711  [pdf, other]

    cs.LG

    Continuous Pseudo-Labeling from the Start

    Authors: Dan Berrebbi, Ronan Collobert, Samy Bengio, Navdeep Jaitly, Tatiana Likhomanenko

    Abstract: Self-training (ST), or pseudo-labeling, has sparked significant interest in the automatic speech recognition (ASR) community recently because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform `contin…

    Submitted 7 April, 2023; v1 submitted 16 October, 2022; originally announced October 2022.

    Comments: To appear in ICLR 2023

  31. arXiv:2207.07611  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    Position Prediction as an Effective Pretraining Strategy

    Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind

    Abstract: Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Tr…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted to ICML 2022

  32. arXiv:2207.01844  [pdf, other]

    cs.LG cs.CV

    Efficient Representation Learning via Adaptive Context Pooling

    Authors: Chen Huang, Walter Talbott, Navdeep Jaitly, Josh Susskind

    Abstract: Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention g…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: ICML 2022

  33. arXiv:2005.03271  [pdf, other]

    eess.AS cs.CL

    RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

    Authors: Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

    Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perfo…

    Submitted 23 December, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: SLT camera-ready version

  34. arXiv:2003.14398  [pdf, other]

    cs.LG cs.RO stat.ML

    Robotic Table Tennis with Model-Free Reinforcement Learning

    Authors: Wenbo Gao, Laura Graesser, Krzysztof Choromanski, Xingyou Song, Nevena Lazic, Pannag Sanketi, Vikas Sindhwani, Navdeep Jaitly

    Abstract: We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned…

    Submitted 27 May, 2020; v1 submitted 31 March, 2020; originally announced March 2020.

    Comments: V2: new URL of supplementary video. 8 pages, 4 figures

    ACM Class: I.2.6; I.2.9

  35. arXiv:2002.08926  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Imputer: Sequence Modelling via Imputation and Dynamic Programming

    Authors: William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, Navdeep Jaitly

    Abstract: This paper presents the Imputer, a neural sequence model that generates output sequences iteratively via imputations. The Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens. The Imputer can be trained to approximately marginalize over all possible alignments between the input and output sequences, and a…

    Submitted 22 April, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

  36. arXiv:1912.06640  [pdf, other]

    cs.CV cs.LG

    SPIN: A High Speed, High Resolution Vision Dataset for Tracking and Action Recognition in Ping Pong

    Authors: Steven Schwarcz, Peng Xu, David D'Ambrosio, Juhana Kangaspunta, Anelia Angelova, Huong Phan, Navdeep Jaitly

    Abstract: We introduce a new high resolution, high frame rate stereo video dataset, which we call SPIN, for tracking and action recognition in the game of ping pong. The corpus consists of ping pong play with three main annotation streams that can be used to learn tracking and action recognition models -- tracking of the ping pong ball and poses of humans in the videos and the spin of the ball being hit by…

    Submitted 13 December, 2019; originally announced December 2019.

  37. arXiv:1902.08295  [pdf, other]

    cs.LG stat.ML

    Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

    Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, et al. (66 additional authors not shown)

    Abstract: Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…

    Submitted 21 February, 2019; originally announced February 2019.

  38. arXiv:1811.12927  [pdf, other]

    cs.RO

    Hierarchical Policy Design for Sample-Efficient Learning of Robot Table Tennis Through Self-Play

    Authors: Reza Mahjourian, Risto Miikkulainen, Nevena Lazic, Sergey Levine, Navdeep Jaitly

    Abstract: Training robots with physical bodies requires developing new methods and action representations that allow the learning agents to explore the space of policies efficiently. This work studies sample-efficient learning of complex policies in the context of robot table tennis. It incorporates learning into a hierarchical control framework using a model-free strategy layer (which requires complex reas…

    Submitted 17 February, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

  39. arXiv:1712.05884  [pdf, other]

    cs.CL

    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

    Authors: Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

    Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion s…

    Submitted 15 February, 2018; v1 submitted 15 December, 2017; originally announced December 2017.

    Comments: Accepted to ICASSP 2018

  40. arXiv:1712.01769  [pdf, other]

    cs.CL cs.SD eess.AS stat.ML

    State-of-the-art Speech Recognition With Sequence-to-Sequence Models

    Authors: Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani

    Abstract: Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such archite…

    Submitted 23 February, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

    Comments: ICASSP camera-ready version

  41. arXiv:1711.07274  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    Speech recognition for medical conversations

    Authors: Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, Chris Co, Navdeep Jaitly, Diana Jaunzeikare, Anjuli Kannan, Patrick Nguyen, Hasim Sak, Ananth Sankar, Justin Tansuwan, Nathan Wan, Yonghui Wu, Xuedong Zhang

    Abstract: In this work we explored building automatic speech recognition models for transcribing doctor-patient conversations. We collected a large-scale dataset of clinical conversations (14,000 hr), designed the task to represent the real-world scenario, and explored several alignment approaches to iteratively improve data quality. We explored both CTC and LAS systems for building speech recognition model…

    Submitted 20 June, 2018; v1 submitted 20 November, 2017; originally announced November 2017.

    Comments: Interspeech 2018 camera ready

  42. arXiv:1706.06428  [pdf, other]

    cs.CL cs.LG stat.ML

    An online sequence-to-sequence model for noisy speech recognition

    Authors: Chung-Cheng Chiu, Dieterich Lawson, Yuping Luo, George Tucker, Kevin Swersky, Ilya Sutskever, Navdeep Jaitly

    Abstract: Generative models have long been the dominant approach for speech recognition. The success of these models, however, relies on the use of sophisticated recipes and complicated machinery that is not easily accessible to non-practitioners. Recent innovations in Deep Learning have given rise to an alternative: discriminative models called Sequence-to-Sequence models, which can almost match the accuracy…

    Submitted 16 June, 2017; originally announced June 2017.

    Comments: arXiv admin note: substantial text overlap with arXiv:1608.01281

  43. arXiv:1705.05524  [pdf, other]

    cs.AI cs.LG stat.ML

    Learning Hard Alignments with Variational Inference

    Authors: Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, Navdeep Jaitly

    Abstract: There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work used REINFORCE and Q-learning to ap…

    Submitted 1 November, 2017; v1 submitted 16 May, 2017; originally announced May 2017.

  44. arXiv:1705.05035  [pdf, other]

    cs.LG cs.AI stat.ML

    Discrete Sequential Prediction of Continuous Actions for Deep RL

    Authors: Luke Metz, Julian Ibarz, Navdeep Jaitly, James Davidson

    Abstract: It has long been assumed that high-dimensional continuous control problems cannot be solved effectively by discretizing individual dimensions of the action space due to the exponentially large number of bins over which policies would have to be learned. In this paper, we draw inspiration from the recent success of sequence-to-sequence models for structured prediction problems to develop policies o…

    Submitted 7 June, 2019; v1 submitted 14 May, 2017; originally announced May 2017.

  45. arXiv:1703.10135  [pdf, other]

    cs.CL cs.LG cs.SD

    Tacotron: Towards End-to-End Speech Synthesis

    Authors: Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous

    Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Give…

    Submitted 6 April, 2017; v1 submitted 29 March, 2017; originally announced March 2017.

    Comments: Submitted to Interspeech 2017. v2 changed paper title to be consistent with our conference submission (no content change other than typo fixes)

  46. arXiv:1703.08581  [pdf, other]

    cs.CL cs.LG stat.ML

    Sequence-to-Sequence Models Can Directly Translate Foreign Speech

    Authors: Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

    Abstract: We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another. The model does not explicitly transcribe the speech into text in the source language, nor does it require supervision from the ground truth source language transcription during training. We apply a slightly modified sequence-to-sequence with attention archit…

    Submitted 12 June, 2017; v1 submitted 24 March, 2017; originally announced March 2017.

    Comments: 5 pages, 1 figure. Interspeech 2017

  47. arXiv:1702.03865  [pdf, other]

    cs.LG q-bio.BM

    Next-Step Conditioned Deep Convolutional Neural Networks Improve Protein Secondary Structure Prediction

    Authors: Akosua Busia, Navdeep Jaitly

    Abstract: Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we show how to adapt some of these techniques to create a novel chained convolutional architecture with next-step conditioning for improving performance on protein sequence prediction problems. We explore its value by demonstrating its ability to impro…

    Submitted 13 February, 2017; originally announced February 2017.

    Comments: 11 pages, 3 figures, 4 tables, submitted to ISMB/ECCB 2017. arXiv admin note: text overlap with arXiv:1611.01503

  48. arXiv:1612.02695  [pdf, other]

    cs.NE cs.CL cs.LG stat.ML

    Towards better decoding and language model integration in sequence to sequence models

    Authors: Jan Chorowski, Navdeep Jaitly

    Abstract: The recently proposed Sequence-to-Sequence (seq2seq) framework advocates replacing complex data processing pipelines, such as an entire automatic speech recognition system, with a single neural network trained in an end-to-end fashion. In this contribution, we analyse an attention-based seq2seq speech recognition system that directly transcribes recordings into characters. We observe two shortcomi…

    Submitted 8 December, 2016; originally announced December 2016.

  49. arXiv:1611.01503  [pdf, other]

    cs.LG q-bio.BM

    Protein Secondary Structure Prediction Using Deep Multi-scale Convolutional Neural Networks and Next-Step Conditioning

    Authors: Akosua Busia, Jasmine Collins, Navdeep Jaitly

    Abstract: Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we adapt some of these techniques for protein secondary structure prediction. We first train a series of deep neural networks to predict eight-class secondary structure labels given a protein's amino acid sequence information and find that using recent…

    Submitted 4 November, 2016; originally announced November 2016.

    Comments: 10 pages, 2 figures, submitted to RECOMB 2017

  50. arXiv:1611.00068  [pdf]

    cs.CL

    RNN Approaches to Text Normalization: A Challenge

    Authors: Richard Sproat, Navdeep Jaitly

    Abstract: This paper presents a challenge to the community: given a large corpus of written text aligned to its normalized spoken form, train an RNN to learn the correct normalization function. We present a data set of general text where the normalizations were generated using an existing text normalization component of a text-to-speech system. This data set will be released open-source in the near future.…

    Submitted 24 January, 2017; v1 submitted 31 October, 2016; originally announced November 2016.

    Comments: 17 pages, 13 tables, 3 figures