Skip to main content

Showing 1–50 of 80 results for author: Mitsufuji, Y

Searching in archive eess. Search in all archives.
.
  1. arXiv:2505.20770  [pdf, ps, other

    cs.SD cs.MM eess.AS

    Can Large Language Models Predict Audio Effects Parameters from Natural Language?

    Authors: Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam, Yuki Mitsufuji

    Abstract: In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual descriptions without requiring task-specific training or fine-tuning. Our approach address the text-to-effect parameter predictio… ▽ More

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Submitted to WASPAA 2025

  2. arXiv:2505.19663  [pdf, ps, other

    cs.SD cs.AI cs.CR cs.LG eess.AS

    A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

    Authors: Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with various distortions such as compression, background noise, and reverberation, along with a diverse test dataset including speech… ▽ More

    Submitted 28 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 5 pages; 5 tables; accepted at INTERSPEECH 2025

  3. arXiv:2505.16195  [pdf, other

    cs.SD cs.AI cs.LG eess.AS eess.IV

    SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

    Authors: Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synth… ▽ More

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 4 pages, 2 figures, 2 tables. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/

  4. arXiv:2505.11315  [pdf, other

    cs.SD cs.LG eess.AS

    Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

    Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

    Abstract: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Submitted to WASPAA 2025

  5. arXiv:2504.14735  [pdf, other

    cs.SD eess.AS

    DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions

    Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas, Yuki Mitsufuji

    Abstract: This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for ``Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets, compris… ▽ More

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Submitted to DAFx 2025

  6. arXiv:2504.10826  [pdf, other

    cs.SD cs.MM eess.AS

    SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

    Authors: Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji

    Abstract: Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models by involving forward-backward diffusion processes for editing. However, these methods often struggle to maintain the music content consistency. Additionally, text instructions alone usua… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

  7. arXiv:2503.16669  [pdf, other

    cs.SD cs.AI eess.AS

    Aligning Text-to-Music Evaluation with Human Preferences

    Authors: Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

    Abstract: Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to pa… ▽ More

    Submitted 20 March, 2025; originally announced March 2025.

  8. arXiv:2503.11190  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Cross-Modal Learning for Music-to-Music-Video Description Generation

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri… ▽ More

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

  9. arXiv:2502.16936  [pdf, other

    cs.SD cs.AI cs.LG eess.AS stat.ML

    Supervised contrastive learning from weakly-labeled audio segments for musical version matching

    Authors: Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

    Abstract: Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require to match them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet los… ▽ More

    Submitted 16 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 17 pages, 6 figures, 8 tables (includes Appendix); accepted at ICML25

  10. arXiv:2502.12623  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance… ▽ More

    Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  11. arXiv:2501.11837  [pdf, ps, other

    eess.AS cs.SD eess.SP

    30+ Years of Source Separation Research: Achievements and Future Challenges

    Authors: Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, Yuki Mitsufuji

    Abstract: Source separation (SS) of acoustic signals is a research field that emerged in the mid-1990s and has flourished ever since. On the occasion of ICASSP's 50th anniversary, we review the major contributions and advancements in the past three decades in the speech, audio, and music SS research field. We will cover both single- and multi-channel SS approaches. We will also look back on key efforts to f… ▽ More

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: Accepted by IEEE ICASSP 2025

  12. arXiv:2501.02786  [pdf, other

    cs.SD cs.CV eess.AS

    CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

    Authors: Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation lay… ▽ More

    Submitted 6 January, 2025; originally announced January 2025.

  13. arXiv:2412.15322  [pdf, other

    cs.CV cs.LG cs.SD eess.AS

    MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

    Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

    Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Addit… ▽ More

    Submitted 7 April, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to CVPR 2025. Project page: https://hkchengrex.github.io/MMAudio

  14. arXiv:2412.13462  [pdf, other

    cs.SD cs.MM eess.AS

    SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

    Authors: Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in… ▽ More

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: 5 pages, 3 figures

  15. arXiv:2411.01135  [pdf, other

    cs.SD cs.IR cs.LG eess.AS

    Music Foundation Model as Generic Booster for Music Downstream Tasks

    Authors: WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji

    Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across var… ▽ More

    Submitted 27 May, 2025; v1 submitted 2 November, 2024; originally announced November 2024.

    Comments: 41 pages with 14 figures

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2025

  16. arXiv:2410.15573  [pdf, other

    cs.SD cs.AI cs.CL cs.MM eess.AS

    OpenMU: Your Swiss Army Knife for Music Understanding

    Authors: Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our mus… ▽ More

    Submitted 27 November, 2024; v1 submitted 20 October, 2024; originally announced October 2024.

    Comments: Resources: https://github.com/sony/openmu

  17. arXiv:2410.06016  [pdf, other

    cs.SD cs.LG eess.AS

    Variable Bitrate Residual Vector Quantization for Audio Coding

    Authors: Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for… ▽ More

    Submitted 27 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ICASSP 2025 camera ready version

  18. arXiv:2409.17550  [pdf, other

    cs.LG cs.MM cs.SD eess.AS

    A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

    Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which p… ▽ More

    Submitted 8 April, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: IJCNN 2025. The source code is available: https://github.com/SonyResearch/SVG_baseline

  19. arXiv:2409.06096  [pdf, ps, other

    cs.SD cs.AI cs.IR eess.AS

    Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

    Authors: Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Eloi Moliner, Chieh-Hsin Lai, Stefan Uhlich, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Yuki Mitsufuji

    Abstract: Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a… ▽ More

    Submitted 7 January, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

  20. arXiv:2408.10807  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

    Authors: Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

    Abstract: Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a se… ▽ More

    Submitted 20 August, 2024; originally announced August 2024.

  21. arXiv:2408.03204  [pdf, other

    cs.SD eess.AS

    GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

    Authors: Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

    Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters in GPU. Then, we show its example use under a music mixing scenario, where parameters of every differentiable processor in a large graph a… ▽ More

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted to DAFx 2024 demo

  22. arXiv:2407.14364  [pdf, other

    cs.SD cs.AI cs.MM eess.AS

    Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio

    Authors: Roser Batlle-Roca, Wei-Hisang Liao, Xavier Serra, Yuki Mitsufuji, Emilia Gómez

    Abstract: Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models and impacts related to intellectual property management. A relevant discussion and related technical challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectua… ▽ More

    Submitted 1 August, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: Accepted at ISMIR 2024

  23. arXiv:2406.17672  [pdf, other

    cs.SD eess.AS

    SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

    Authors: Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of mod… ▽ More

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 6 pages, 8 figures, 8 tables. Audio samples: https://zzaudio.github.io/SpecMaskGIT/index.html

  24. arXiv:2406.15751  [pdf, other

    cs.SD eess.AS

    Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data

    Authors: Yu-Hua Chen, Woosung Choi, Wei-Hsiang Liao, Marco Martínez-Ramírez, Kin Wai Cheuk, Yuki Mitsufuji, Jyh-Shing Roger Jang, Yi-Hsuan Yang

    Abstract: Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: Accepted to DAFx 2024

  25. SilentCipher: Deep Audio Watermarking

    Authors: Mayank Kumar Singh, Naoya Takahashi, Weihsiang Liao, Yuki Mitsufuji

    Abstract: In the realm of audio watermarking, it is challenging to simultaneously encode imperceptible messages while enhancing the message capacity and robustness. Although recent advancements in deep learning-based methods bolster the message capacity and robustness over traditional methods, the encoded messages introduce audible artefacts that restricts their usage in professional settings. In this study… ▽ More

    Submitted 21 September, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  26. arXiv:2405.18503  [pdf, other

    cs.SD cs.LG eess.AS

    SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

    Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

    Abstract: Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these mod… ▽ More

    Submitted 10 March, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Audio samples: https://anonymus-soundctm.github.io/soundctm_iclr/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

  27. arXiv:2405.18386  [pdf, other

    cs.SD cs.AI cs.LG cs.MM eess.AS

    Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

    Authors: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

    Abstract: Recent advances in text-to-music editing, which employ text queries to modify music (e.g.\ by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; o… ▽ More

    Submitted 29 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Code and demo are available at: https://github.com/ldzhangyx/instruct-musicgen

  28. arXiv:2405.17842  [pdf, other

    cs.CV cs.LG cs.MM cs.SD eess.AS

    MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

    Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint… ▽ More

    Submitted 25 February, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: ICLR 2025

  29. arXiv:2405.14598  [pdf, other

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

    Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation method… ▽ More

    Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 10 pages

  30. arXiv:2403.10024  [pdf, other

    cs.SD cs.AI cs.LG cs.MM eess.AS

    MR-MT3: Memory Retaining Multi-Track Music Transcription to Mitigate Instrument Leakage

    Authors: Hao Hao Tan, Kin Wai Cheuk, Taemin Cho, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, a… ▽ More

    Submitted 15 March, 2024; originally announced March 2024.

  31. arXiv:2402.06178  [pdf, other

    cs.SD cs.AI cs.MM eess.AS

    MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

    Authors: Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

    Abstract: Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and inst… ▽ More

    Submitted 28 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to IJCAI 2024

  32. arXiv:2310.13267  [pdf, other

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding… ▽ More

    Submitted 20 October, 2023; originally announced October 2023.

  33. arXiv:2309.15717  [pdf, other

    eess.AS cs.LG cs.SD

    Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

    Authors: Frank Cwitkowitz, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but th… ▽ More

    Submitted 24 January, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  34. arXiv:2309.09223  [pdf, other

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few… ▽ More

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  35. arXiv:2309.06934  [pdf, other

    eess.AS cs.SD

    VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

    Authors: Carlos Hernandez-Olivan, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramirez, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify that there are potential iss… ▽ More

    Submitted 13 September, 2023; originally announced September 2023.

  36. arXiv:2309.02836  [pdf, other

    cs.SD cs.LG eess.AS

    BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

    Authors: Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

    Abstract: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an… ▽ More

    Submitted 24 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it

  37. arXiv:2309.02478  [pdf, other

    cs.LG cs.AI eess.SP

    Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

    Authors: Eleonora Grassucci, Yuki Mitsufuji, Ping Zhang, Danilo Comminiello

    Abstract: Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems. Its challenge of extracting semantic information from the original complex content and regenerating semantically consistent data at the receiver, possibly being robust to channel corruptions, can be addressed with deep generative models. This ICASSP special session overview p… ▽ More

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP

  38. arXiv:2308.06981  [pdf, other

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 $\unicode{x2013}$ Cinematic Demixing Track

    Authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

    Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most succes… ▽ More

    Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted for Transactions of the International Society for Music Information Retrieval

  39. arXiv:2308.06979  [pdf, other

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 $\unicode{x2013}$ Music Demixing Track

    Authors: Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang , et al. (2 additional authors not shown)

    Abstract: This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce t… ▽ More

    Submitted 19 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Published in Transactions of the International Society for Music Information Retrieval (https://transactions.ismir.net/articles/10.5334/tismir.171)

    Journal ref: Transactions of the International Society for Music Information Retrieval, 7(1), pp.63-84, 2024

  40. arXiv:2307.04305  [pdf, other

    cs.SD cs.LG eess.AS

    Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

    Authors: Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capability of self-attention mechanism in Transformers to capture these long-term dependencies in the frequency and time axes. In this… ▽ More

    Submitted 9 July, 2023; originally announced July 2023.

    Comments: 8 pages, 6 figures, to be published in ISMIR2023

  41. arXiv:2306.09126  [pdf, other

    cs.SD cs.CV cs.MM eess.AS eess.IV

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

    Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information… ▽ More

    Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

  42. arXiv:2305.10734  [pdf, other

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us… ▽ More

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  43. The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

    Authors: Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables the taking advantage of the… ▽ More

    Submitted 5 August, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

    Comments: Acceptedt by EURASIP Journal on Audio, Speech, and Music Processing (under CC BY)

    Journal ref: EURASIP Journal on Audio, Speech, and Music Processing (JASM), 39 (2024)

  44. arXiv:2305.06701  [pdf, ps, other

    cs.SD eess.AS

    Extending Audio Masked Autoencoders Toward Audio Restoration

    Authors: Zhi Zhong, Hao Shi, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., s… ▽ More

    Submitted 17 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: WASPAA 2023.Copyright 2023 IEEE.Personal use of this material is permitted.Permission from IEEE must be obtained for all other uses,in any current or future media,including reprinting/republishing this material for advertising or promotional purposes, creating new collective works,for resale or redistribution to servers or lists,or reuse of any copyrighted component of this work in other works

  45. arXiv:2305.05857  [pdf, other

    eess.AS cs.SD

    Diffusion-based Signal Refiner for Speech Separation

    Authors: Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: We have developed a diffusion-based speech refiner that improves the reference-free perceptual quality of the audio predicted by preceding single-channel speech separation models. Although modern deep neural network-based speech separation models have show high performance in reference-based metrics, they often produce perceptually unnatural artifacts. The recent advancements made to diffusion mod… ▽ More

    Submitted 12 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: Under review

  46. arXiv:2302.13838  [pdf, other

    cs.CV cs.SD eess.AS

    Cross-modal Face- and Voice-style Transfer

    Authors: Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

    Abstract: Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the content-creation process in many applications. However, as they are limited to the conversion within each modality, matching the impression of the generated face an… ▽ More

    Submitted 1 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

  47. arXiv:2302.08136  [pdf, ps, other

    cs.SD eess.AS

    An Attention-based Approach to Hierarchical Multi-label Music Instrument Classification

    Authors: Zhi Zhong, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Although music is typically multi-label, many works have studied hierarchical music tagging with simplified settings such as single-label data. Moreover, there lacks a framework to describe various joint training methods under the multi-label setting. In order to discuss the above topics, we introduce hierarchical multi-label music instrument classification task. The task provides a realistic sett… ▽ More

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: To appear at ICASSP 2023

  48. arXiv:2301.12686  [pdf, other

    cs.LG cs.AI cs.CV cs.SD eess.AS

    GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

    Authors: Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

    Abstract: Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measureme… ▽ More

    Submitted 27 June, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  49. arXiv:2212.07065  [pdf, other

    cs.SD cs.CV cs.LG cs.MM eess.AS

    CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

    Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick

    Abstract: Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds.… ▽ More

    Submitted 3 March, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: Accepted by ICLR 2023. Audio samples can be found at https://sony.github.io/CLIPSep/

  50. arXiv:2211.04124  [pdf, other

    eess.AS cs.LG cs.SD

    Unsupervised vocal dereverberation with diffusion-based generative models

    Authors: Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji

    Abstract: Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation of music contains two categories, natural reverb, and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they r… ▽ More

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 6 pages, 2 figures, submitted to ICASSP 2023