Showing 1–9 of 9 results for author: Jukic, A

Searching in archive cs.
  1. arXiv:2505.21198  [pdf, ps, other]

    cs.SD eess.AS

    Universal Speech Enhancement with Regression and Generative Mamba

    Authors: Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukić, Szu-Wei Fu, Yu Tsao

    Abstract: The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-f…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  2. Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement

    Authors: Rauf Nasretdinov, Roman Korostik, Ante Jukić

    Abstract: In this work, we investigate application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently-proposed speech enhancement model based on Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and different sampling methods on the ASR perform…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 5 pages. Published in ICASSP 2025

    Journal ref: ICASSP 2025: IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, April 2025

  3. arXiv:2501.11311  [pdf, other]

    cs.SD cs.LG eess.AS

    A2SB: Audio-to-Audio Schrodinger Bridges

    Authors: Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

    Abstract: Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB…

    Submitted 20 January, 2025; originally announced January 2025.

  4. arXiv:2409.16117  [pdf, ps, other]

    eess.AS cs.SD

    Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration

    Authors: Pin-Jui Ku, Alexander H. Liu, Roman Korostik, Sung-Feng Huang, Szu-Wei Fu, Ante Jukić

    Abstract: This paper proposes a generative pretraining foundation model for high-quality speech restoration tasks. By directly operating on complex-valued short-time Fourier transform coefficients, our model does not rely on any vocoders for time-domain signal reconstruction. As a result, our model simplifies the synthesis process and removes the quality upper-bound introduced by any mel-spectrogram vocoder…

    Submitted 24 September, 2024; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: 5 pages, submitted to ICASSP 2025. The implementation and configuration can be found at https://github.com/NVIDIA/NeMo/blob/main/examples/audio/conf/flow_matching_generative_ssl_pretraining.yaml and the audio demo page can be found at https://kuray107.github.io/ssl_gen25-examples/index.html

  5. arXiv:2409.12117  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference

    Authors: Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee

    Abstract: Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rat…

    Submitted 18 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  6. Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

    Authors: Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

    Abstract: Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis on building ASR systems with discrete codes. We investigate different method…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted at Interspeech 2024

    Journal ref: Proceedings of Interspeech 2024

  7. arXiv:2405.06473  [pdf, other]

    cs.RO cs.CV

    Autonomous Driving with a Deep Dual-Model Solution for Steering and Braking Control

    Authors: Ana Petra Jukić, Ana Šelek, Marija Seder, Ivana Podnar Žarko

    Abstract: The technology of autonomous driving is currently attracting a great deal of interest in both research and industry. In this paper, we present a deep learning dual-model solution that uses two deep neural networks for combined braking and steering in autonomous vehicles. Steering control is achieved by applying the NVIDIA's PilotNet model to predict the steering wheel angle, while braking control…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 6 pages, 2 figures, accepted for publication in Proceedings of International Conference on Smart and Sustainable Technologies (SpliTech 2024)

  8. arXiv:2310.12378  [pdf, other]

    eess.AS cs.SD

    The CHiME-7 Challenge: System Description and Performance of NeMo Team's DASR System

    Authors: Tae Jin Park, He Huang, Ante Jukic, Kunal Dhawan, Krishna C. Puvvada, Nithin Koluguri, Nikolay Karpov, Aleksandr Laptev, Jagadeesh Balam, Boris Ginsburg

    Abstract: We present the NVIDIA NeMo team's multi-channel speech recognition system for the 7th CHiME Challenge Distant Automatic Speech Recognition (DASR) Task, focusing on the development of a multi-channel, multi-speaker speech recognition system tailored to transcribe speech from distributed microphones and microphone arrays. The system predominantly comprises of the following integral modules: the Spea…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023

  9. arXiv:2310.12371  [pdf, other]

    eess.AS cs.SD

    Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation

    Authors: Tae Jin Park, He Huang, Coleman Hooper, Nithin Koluguri, Kunal Dhawan, Ante Jukic, Jagadeesh Balam, Boris Ginsburg

    Abstract: We introduce a sophisticated multi-speaker speech data simulator, specifically engineered to generate multi-speaker speech recordings. A notable feature of this simulator is its capacity to modulate the distribution of silence and overlap via the adjustment of statistical parameters. This capability offers a tailored training environment for developing neural models suited for speaker diarization…

    Submitted 18 October, 2023; originally announced October 2023.

    Journal ref: CHiME-7 Workshop 2023