Showing 1–11 of 11 results for author: Petermann, D

  1. arXiv:2506.10274  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    Discrete Audio Tokens: More Than a Survey!

    Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

    Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs).…

    Submitted 16 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  2. arXiv:2501.05413  [pdf, other]

    cs.SD cs.CV cs.GR eess.AS

    Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

    Authors: Darius Petermann, Mahdi M. Kalayeh

    Abstract: Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence is not only unnecessary, but als…

    Submitted 9 January, 2025; originally announced January 2025.

  3. arXiv:2412.17667  [pdf, other]

    cs.SD cs.MM eess.AS

    VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

    Authors: Jiatong Shi, Hye-jin Shim, Jinchuan Tian, Siddhant Arora, Haibin Wu, Darius Petermann, Jia Qi Yip, You Zhang, Yuxun Tang, Wangyou Zhang, Dareen Safar Alharthi, Yichen Huang, Koichi Saito, Jionghao Han, Yiwen Zhao, Chris Donahue, Shinji Watanabe

    Abstract: In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompas…

    Submitted 26 March, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

  4. arXiv:2401.03567  [pdf, other]

    eess.AS cs.SD

    Hyperbolic Distance-Based Speech Separation

    Authors: Darius Petermann, Minje Kim

    Abstract: In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold. Based on the recent advent of audio-related tasks performed in non-Euclidean spaces, we propose to make use of the Poincaré ball to effectively unveil the inherent hierarchical structure found in complex speaker mixtures. We design two sets of experiments in which the distance-based…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: To be published at ICASSP2024, 14th of April 2024, Seoul, South Korea. Copyright (c) 2023 IEEE. 5 pages, 2 figures, 3 tables

  5. arXiv:2303.08005  [pdf, other]

    eess.AS cs.SD

    Native Multi-Band Audio Coding within Hyper-Autoencoded Reconstruction Propagation Networks

    Authors: Darius Petermann, Inseon Jang, Minje Kim

    Abstract: Spectral sub-bands do not portray the same perceptual relevance. In audio coding, it is therefore desirable to have independent control over each of the constituent bands so that bitrate assignment and signal reconstruction can be achieved efficiently. In this work, we present a novel neural audio coding network that natively supports a multi-band coding paradigm. Our model extends the idea of com…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023. For resources and examples, see https://saige.sice.indiana.edu/research-projects/HARP-Net/

  6. arXiv:2212.07327  [pdf, other]

    eess.AS cs.SD

    Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

    Authors: Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux

    Abstract: Emulating the human ability to solve the cocktail party problem, i.e., focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem,…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: Submitted to IEEE TASLP (In review), 13 pages, 6 figures

  7. arXiv:2212.05008  [pdf, other]

    eess.AS cs.SD

    Hyperbolic Audio Source Separation

    Authors: Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux

    Abstract: We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. Inspired by recent successes modeling hierarchical relationships in text and images with hyperbolic embeddings, our algorithm obtains a hyperbolic embedding for each time-frequency bin of a mixture s…

    Submitted 9 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023, Demo page: https://darius522.github.io/hyperbolic-audio-sep/

  8. arXiv:2202.07523  [pdf, other]

    eess.AS cs.SD

    SpaIn-Net: Spatially-Informed Stereophonic Music Source Separation

    Authors: Darius Petermann, Minje Kim

    Abstract: With the recent advancements of data-driven approaches using deep neural networks, music source separation has been formulated as an instrument-specific supervised problem. While existing deep learning models implicitly absorb the spatial information conveyed by the multi-channel input signals, we argue that a more explicit and active use of spatial information could not only improve the separatio…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: To Appear in Proc. ICASSP2022

  9. arXiv:2110.09958  [pdf, other]

    eess.AS cs.SD

    The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks

    Authors: Darius Petermann, Gordon Wichern, Zhong-Qiu Wang, Jonathan Le Roux

    Abstract: The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad ca…

    Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP2022. For resources and examples, see https://cocktail-fork.github.io

  10. arXiv:2107.10843  [pdf, other]

    eess.AS cs.AI cs.SD

    HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

    Authors: Darius Petermann, Seungkwon Beack, Minje Kim

    Abstract: An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, a process that hinders information flow between the encoder and decoder parts. To circumvent this issue, we employ additional skip connections between the corresponding pair of encoder-decoder layers. The assumption is that, in a mirrored autoencoder topology, a decoder layer reconstructs the i…

    Submitted 23 July, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

    Comments: Accepted to the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021, Mohonk Mountain House, New Paltz, NY

  11. arXiv:2008.07645  [pdf, other]

    eess.AS cs.LG cs.SD

    Deep Learning Based Source Separation Applied To Choir Ensembles

    Authors: Darius Petermann, Pritish Chandna, Helena Cuesta, Jordi Bonada, Emilia Gomez

    Abstract: Choral singing is a widely practiced form of ensemble singing wherein a group of people sing simultaneously in polyphonic harmony. The most commonly practiced setting for choir ensembles consists of four parts: Soprano, Alto, Tenor and Bass (SATB), each with its own range of fundamental frequencies (F$_0$s). The task of source separation for this choral setting entails separating the SATB mixture i…

    Submitted 17 August, 2020; originally announced August 2020.

    Comments: To appear at the 21st International Society for Music Information Retrieval Conference, Montréal, Canada, 2020. Audio examples available at: https://darius522.github.io/satb-source-separation-results/