Showing 1–16 of 16 results for author: Salazar, J

Searching in archive eess.

  1. arXiv:2507.23292  [pdf, ps, other]

    cs.LG cs.CL cs.PL cs.SE eess.AS

    SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

    Authors: RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby

    Abstract: We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state)…

    Submitted 31 July, 2025; originally announced July 2025.

  2. arXiv:2504.06500  [pdf, other]

    eess.SY cs.LG cs.RO

    Data-driven Fuzzy Control for Time-Optimal Aggressive Trajectory Following

    Authors: August Phelps, Juan Augusto Paredes Salazar, Ankit Goel

    Abstract: Optimal trajectories that minimize a user-defined cost function in dynamic systems require the solution of a two-point boundary value problem. The optimization process yields an optimal control sequence that depends on the initial conditions and system parameters. However, the optimal sequence may result in undesirable behavior if the system's initial conditions and parameters are erroneous. This…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: 6 pages, 10 figures, submitted to MECC 2025

  3. arXiv:2501.04275  [pdf, other]

    eess.SY

    Adaptive Numerical Differentiation for Extremum Seeking with Sensor Noise

    Authors: Shashank Verma, Juan Augusto Paredes Salazar, Jhon Manuel Portella Delgado, Ankit Goel, Dennis S. Bernstein

    Abstract: Extremum-seeking control (ESC) is widely used to optimize performance when the system dynamics are uncertain. However, sensitivity to sensor noise is an important issue in ESC implementation due to the use of high-pass filters or gradient estimators. To reduce the sensitivity of ESC to noise, this paper investigates the use of adaptive input and state estimation (AISE) for numerical differentiatio…

    Submitted 7 January, 2025; originally announced January 2025.

    Comments: 8 pages, 13 figures. Submitted to ACC 2025

  4. arXiv:2412.18603  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Long-Form Speech Generation with Spoken Language Models

    Authors: Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

    Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, an…

    Submitted 10 July, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

    Comments: Accepted to ICML 2025 (oral)

  5. arXiv:2412.08356  [pdf, other]

    cs.SD cs.LG eess.AS

    Zero-Shot Mono-to-Binaural Speech Synthesis

    Authors: Alon Levkovitch, Julian Salazar, Soroosh Mariooryad, RJ Skerry-Ryan, Nadav Bar, Bastiaan Kleijn, Eliya Nachmani

    Abstract: We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get…

    Submitted 28 May, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

  6. arXiv:2410.22179  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

    Authors: Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao

    Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that addres…

    Submitted 11 March, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: Accepted to NAACL 2025

  7. arXiv:2410.06556  [pdf, ps, other]

    eess.SY

    MPC-guided, Data-driven Fuzzy Controller Synthesis

    Authors: Juan Augusto Paredes Salazar, Ankit Goel

    Abstract: Model predictive control (MPC) is a powerful control technique for online optimization using system model-based predictions over a finite time horizon. However, the computational cost MPC requires can be prohibitive in resource-constrained computer systems. This paper presents a fuzzy controller synthesis framework guided by MPC. In the proposed framework, training data is obtained from MPC closed…

    Submitted 2 December, 2024; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: 9 pages, 8 figures, shorter version submitted to the American Control Conference 2025

  8. arXiv:2305.15255  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

    Authors: Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

    Abstract: We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key…

    Submitted 30 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 camera-ready

  9. arXiv:2305.12793  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

    Authors: Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai

    Abstract: End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot by pseudolabeling all speech-text transcripts with a na…

    Submitted 2 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 18 pages, 7 figures

  10. arXiv:2202.01969  [pdf, other]

    cs.RO eess.SY math.DG

    A Novel Assistive Controller Based on Differential Geometry for Users of the Differential-Drive Wheeled Mobile Robots

    Authors: Seyed Amir Tafrishi, Ankit A. Ravankar, Jose Salazar, Yasuhisa Hirata

    Abstract: Certain wheeled mobile robots e.g., electric wheelchairs, can operate through indirect joystick controls from users. Correct steering angle becomes essential when the user should determine the vehicle direction and velocity, in particular for differential wheeled vehicles since the vehicle velocity and direction are controlled with only two actuating wheels. This problem gets more challenging when…

    Submitted 4 February, 2022; originally announced February 2022.

    Comments: 10 pages, 12 figures, paper is accepted to 2022 International Conference on Robotics and Automation (ICRA 2022). This is the extended version

  11. arXiv:2010.14233  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

    Authors: Ethan A. Chi, Julian Salazar, Katrin Kirchhoff

    Abstract: Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather t…

    Submitted 24 October, 2020; originally announced October 2020.

    ACM Class: I.2.7

  12. arXiv:2002.05150  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances

    Authors: Phillip Keung, Wei Niu, Yichao Lu, Julian Salazar, Vikas Bhardwaj

    Abstract: We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We ob…

    Submitted 12 February, 2020; originally announced February 2020.

    Comments: Artifacts like our filtered Audio BNC dataset can be found at https://github.com/aws-samples/seq2seq-asr-misbehaves

  13. arXiv:1912.01679  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition

    Authors: Shaoshi Ling, Yuzong Liu, Julian Salazar, Katrin Kirchhoff

    Abstract: We propose a novel approach to semi-supervised automatic speech recognition (ASR). We first exploit a large amount of unlabeled audio data via representation learning, where we reconstruct a temporal slice of filterbank features from past and future context frames. The resulting deep contextualized acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end ASR system using a s…

    Submitted 9 April, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2020 (oral)

  14. arXiv:1910.14659  [pdf, other]

    cs.CL cs.LG eess.AS stat.ML

    Masked Language Model Scoring

    Authors: Julian Salazar, Davis Liang, Toan Q. Nguyen, Katrin Kirchhoff

    Abstract: Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeec…

    Submitted 31 December, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

    Comments: ACL 2020 camera-ready (presented July 2020)

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020), 2699-2712

  15. arXiv:1907.00457  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition

    Authors: Shaoshi Ling, Julian Salazar, Yuzong Liu, Katrin Kirchhoff

    Abstract: We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for…

    Submitted 29 December, 2021; v1 submitted 30 June, 2019; originally announced July 2019.

    Comments: Odyssey 2020 camera-ready (presented Nov. 2020)

    Journal ref: Proc. the Speaker and Language Recognition Workshop (Odyssey 2020), 9-16

  16. arXiv:1901.10055  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition

    Authors: Julian Salazar, Katrin Kirchhoff, Zhiheng Huang

    Abstract: The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional net…

    Submitted 19 February, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

    Comments: Accepted to ICASSP 2019