Showing 1–20 of 20 results for author: Uhlich, S

Searching in archive eess.
  1. arXiv:2411.01135  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Music Foundation Model as Generic Booster for Music Downstream Tasks

    Authors: WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji

    Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across var…

    Submitted 27 May, 2025; v1 submitted 2 November, 2024; originally announced November 2024.

    Comments: 41 pages with 14 figures

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2025

  2. arXiv:2409.06096  [pdf, ps, other]

    cs.SD cs.AI cs.IR eess.AS

    Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

    Authors: Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Eloi Moliner, Chieh-Hsin Lai, Stefan Uhlich, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Yuki Mitsufuji

    Abstract: Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a…

    Submitted 7 January, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

  3. arXiv:2408.03204  [pdf, other]

    cs.SD eess.AS

    GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

    Authors: Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

    Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters in GPU. Then, we show its example use under a music mixing scenario, where parameters of every differentiable processor in a large graph a…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted to DAFx 2024 demo

  4. arXiv:2308.06981  [pdf, other]

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Cinematic Demixing Track

    Authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

    Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most succes…

    Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted for Transactions of the International Society for Music Information Retrieval

  5. arXiv:2308.06979  [pdf, other]

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Music Demixing Track

    Authors: Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, et al. (2 additional authors not shown)

    Abstract: This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce t…

    Submitted 19 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Published in Transactions of the International Society for Music Information Retrieval (https://transactions.ismir.net/articles/10.5334/tismir.171)

    Journal ref: Transactions of the International Society for Music Information Retrieval, 7(1), pp.63-84, 2024

  6. The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

    Authors: Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the…

    Submitted 5 August, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: Accepted by EURASIP Journal on Audio, Speech, and Music Processing (under CC BY)

    Journal ref: EURASIP Journal on Audio, Speech, and Music Processing (JASM), 39 (2024)

  7. arXiv:2303.03717  [pdf, other]

    cs.SD eess.AS

    Improving Self-Supervised Learning for Audio Representations by Feature Diversity and Decorrelation

    Authors: Bac Nguyen, Stefan Uhlich, Fabien Cardinaux

    Abstract: Self-supervised learning (SSL) has recently shown remarkable results in closing the gap between supervised and unsupervised learning. The idea is to learn robust features that are invariant to distortions of the input data. Despite its success, this idea can suffer from a collapsing issue where the network produces a constant representation. To this end, we introduce SELFIE, a novel Self-supervise…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  8. arXiv:2211.02247  [pdf, other]

    eess.AS cs.LG cs.SD

    Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

    Authors: Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Kyogu Lee, Yuki Mitsufuji

    Abstract: We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a reference music recording. All our models are trained in a self-supervised manner from an already-processed wet multitrack dat…

    Submitted 11 April, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

  9. arXiv:2208.11428  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Automatic music mixing with deep learning and out-of-domain data

    Authors: Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, Yuki Mitsufuji

    Abstract: Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., a mixing engineer). The automation of music production tasks has become an emerging field in recent years, where rule-based methods and machine learning approaches have been explored. Nevertheless, the lack of dry o…

    Submitted 29 August, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: 23rd International Society for Music Information Retrieval Conference (ISMIR), December, 2022. Source code, demo and audio examples: https://marco-martinez-sony.github.io/FxNorm-automix/ - added acknowledgements

  10. arXiv:2203.11049  [pdf, other]

    cs.SD cs.LG eess.AS

    AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

    Authors: Bac Nguyen, Fabien Cardinaux, Stefan Uhlich

    Abstract: Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a s…

    Submitted 7 March, 2023; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: ICASSP 2023

  11. arXiv:2202.01664  [pdf, other]

    eess.AS cs.LG cs.SD

    Distortion Audio Effects: Learning How to Recover the Clean Signal

    Authors: Johannes Imort, Giorgio Fabbro, Marco A. Martínez Ramírez, Stefan Uhlich, Yuichiro Koyama, Yuki Mitsufuji

    Abstract: Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved by neural networks designed for source separation and audio effect…

    Submitted 13 September, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

    Comments: Audio examples available at https://joimort.github.io/distortionremoval/

  12. arXiv:2110.06494  [pdf, other]

    cs.SD eess.AS

    Music Source Separation with Deep Equilibrium Models

    Authors: Yuichiro Koyama, Naoki Murata, Stefan Uhlich, Giorgio Fabbro, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: While deep neural network-based music source separation (MSS) is very effective and achieves high performance, its model size is often a problem for practical deployment. Deep implicit architectures such as deep equilibrium models (DEQ) were recently proposed, which can achieve higher performance than their explicit counterparts with limited depth while keeping the number of parameters small. This…

    Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2022

  13. arXiv:2110.04047  [pdf, other]

    eess.AS cs.SD eess.SP

    TRUNet: Transformer-Recurrent-U Network for Multi-channel Reverberant Sound Source Separation

    Authors: Ali Aroudi, Stefan Uhlich, Marc Ferras Font

    Abstract: In recent years, many deep learning techniques for single-channel sound source separation have been proposed using recurrent, convolutional and transformer networks. When multiple microphones are available, spatial diversity between speakers and background noise in addition to spectro-temporal diversity can be exploited by using multi-channel filters for sound source separation. Aiming at end-to-e…

    Submitted 22 August, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

  14. Music Demixing Challenge 2021

    Authors: Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, Kin-Wai Cheuk

    Abstract: Music source separation has been intensively studied in the last decade and tremendous progress with the advent of deep learning could be observed. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models and corresponding papers, which can help researchers integrate the best practices into their models. In recent years, the widely used MUSDB18 dataset played an important role…

    Submitted 23 May, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

    Journal ref: Frontiers in Signal Processing, 28 January 2022

  15. arXiv:2105.12315  [pdf, other]

    eess.AS cs.LG cs.SD

    Training Speech Enhancement Systems with Noisy Speech Datasets

    Authors: Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji

    Abstract: Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data - ideally, in large quantity with a variety of acoustic conditions, many different speaker characteristics and for a given sampling rate (e.g., 48kHz for fullband SE). However, obtaining such clean speech data is not straightforward -…

    Submitted 25 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, submitted to WASPAA2021

  16. arXiv:2010.04228  [pdf, ps, other]

    eess.AS cs.SD

    All for One and One for All: Improving Music Separation by Bridging Networks

    Authors: Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network archi…

    Submitted 11 May, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

    Comments: Both implementations of our code, i.e., NNabla and PyTorch, are available in this latest paper

  17. arXiv:2005.11611  [pdf, other]

    eess.AS cs.SD

    Exploring the Best Loss Function for DNN-Based Low-latency Speech Enhancement with Temporal Convolutional Networks

    Authors: Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, Bhiksha Raj

    Abstract: Recently, deep neural networks (DNNs) have been successfully used for speech enhancement, and DNN-based speech enhancement is becoming an attractive research area. While time-frequency masking based on the short-time Fourier transform (STFT) has been widely used for DNN-based speech enhancement over the last years, time domain methods such as the time-domain audio separation network (TasNet) have…

    Submitted 20 August, 2020; v1 submitted 23 May, 2020; originally announced May 2020.

  18. arXiv:2005.07810  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency

    Authors: Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich, Marc Ferras, Asja Fischer

    Abstract: In recent years generative adversarial network (GAN) based models have been successfully applied for unsupervised speech-to-speech conversion. The rich compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models with audio data. To reconstruct the speech signal first a magnitude spectrogram is generated by the neural network, which is then utilized b…

    Submitted 18 May, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  19. arXiv:1911.02091  [pdf, other]

    eess.AS cs.SD

    Closing the Training/Inference Gap for Deep Attractor Networks

    Authors: Cyril Cadoux, Stefan Uhlich, Marc Ferras, Yuki Mitsufuji

    Abstract: This paper improves the deep attractor network (DANet) approach by closing its gap between training and inference. During training, DANet relies on attractors, which are computed from the ground truth separations. As this information is not available at inference time, the attractors have to be estimated, which is typically done by k-means. This results in two mismatches: The first mismatch stems…

    Submitted 5 November, 2019; originally announced November 2019.

  20. arXiv:1807.02710  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving DNN-based Music Source Separation using Phase Features

    Authors: Joachim Muth, Stefan Uhlich, Nathanael Perraudin, Thomas Kemp, Fabien Cardinaux, Yuki Mitsufuji

    Abstract: Music source separation with deep neural networks typically relies only on amplitude features. In this paper we show that additional phase features can improve the separation performance. Using the theoretical relationship between STFT phase and amplitude, we conjecture that derivatives of the phase are a good feature representation opposed to the raw phase. We verify this conjecture experimentall…

    Submitted 16 July, 2018; v1 submitted 7 July, 2018; originally announced July 2018.

    Comments: 7 pages, 9 figures, Joint Workshop on Machine Learning for Music at ICML, IJCAI/ECAI and AAMAS, 2018