Showing 1–50 of 131 results for author: Mitsufuji, Y

  1. arXiv:2507.06613  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    Denoising Multi-Beta VAE: Representation Learning for Disentanglement and Generation

    Authors: Anshuk Uppal, Yuhta Takida, Chieh-Hsin Lai, Yuki Mitsufuji

    Abstract: Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The $β$-VAE framework introduces a hyperparameter $β$ to balance disentanglement and reconstruction quality, where setting $β > 1$ introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a no…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: 24 pages, 8 figures and 7 tables

  2. arXiv:2507.06547  [pdf, ps, other]

    cs.CV cs.LG

    Concept-TRAK: Understanding how diffusion models learn concepts through concept-level attribution

    Authors: Yonghyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Woosung Choi, Kin Wai Cheuk, Junghyun Koo, Yuki Mitsufuji

    Abstract: While diffusion models excel at image generation, their growing adoption raises critical concerns around copyright issues and model transparency. Existing attribution methods identify training examples influencing an entire image, but fall short in isolating contributions to specific elements, such as styles or objects, that matter most to stakeholders. To bridge this gap, we introduce \emph{conce…

    Submitted 9 July, 2025; originally announced July 2025.

    Comments: Preprint

  3. arXiv:2507.02273  [pdf, ps, other]

    cs.SD eess.AS

    Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures

    Authors: Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji

    Abstract: General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects u…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ISMIR 2025

  4. arXiv:2506.22661  [pdf, ps, other]

    cs.SD eess.AS

    Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

    Authors: R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

    Abstract: Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealis…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: Accepted to ISMIR2025

  5. arXiv:2506.20995  [pdf, ps, other]

    cs.CV cs.LG cs.SD eess.AS

    Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

    Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a ta…

    Submitted 27 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

  6. arXiv:2506.18312  [pdf, ps, other]

    cs.SD eess.AS

    Large-Scale Training Data Attribution for Music Generative Models via Unlearning

    Authors: Woosung Choi, Junghyun Koo, Kin Wai Cheuk, Joan Serrà, Marco A. Martínez-Ramírez, Yukara Ikemiya, Naoki Murata, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: This paper explores the use of unlearning methods for training data attribution (TDA) in music generative models trained on large-scale datasets. TDA aims to identify which specific training data points contributed to the generation of a particular output from a specific model. This is crucial in the context of AI-generated music, where proper recognition and credit for original artists are genera…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: accepted at ICML 2025 Workshop on Machine Learning for Audio

  7. arXiv:2506.16889  [pdf, ps, other]

    cs.SD eess.AS

    ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

    Authors: Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Michele Mancusi, Yuki Mitsufuji

    Abstract: Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-base…

    Submitted 2 July, 2025; v1 submitted 20 June, 2025; originally announced June 2025.

    Comments: ISMIR 2025

  8. arXiv:2506.13697  [pdf, ps, other]

    cs.CV

    Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry

    Authors: Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji

    Abstract: We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synt…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Our project page can be found at https://cvlab-kaist.github.io/Vid-CamEdit/

  9. arXiv:2506.01493  [pdf, ps, other]

    cs.CV cs.LG

    Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

    Authors: Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significant…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted at IJCNN 2025

  10. arXiv:2505.20770  [pdf, ps, other]

    cs.SD cs.MM eess.AS

    Can Large Language Models Predict Audio Effects Parameters from Natural Language?

    Authors: Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam, Yuki Mitsufuji

    Abstract: In music production, manipulating audio effects (Fx) parameters through natural language has the potential to reduce technical barriers for non-experts. We present LLM2Fx, a framework leveraging Large Language Models (LLMs) to predict Fx parameters directly from textual descriptions without requiring task-specific training or fine-tuning. Our approach addresses the text-to-effect parameter predictio…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Submitted to WASPAA 2025

  11. arXiv:2505.19663  [pdf, ps, other]

    cs.SD cs.AI cs.CR cs.LG eess.AS

    A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

    Authors: Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with various distortions such as compression, background noise, and reverberation, along with a diverse test dataset including speech…

    Submitted 28 May, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

    Comments: 5 pages; 5 tables; accepted at INTERSPEECH 2025

  12. arXiv:2505.16195  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS eess.IV

    SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

    Authors: Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synth…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 4 pages, 2 figures, 2 tables. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/

  13. arXiv:2505.11315  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

    Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas

    Abstract: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which…

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: Submitted to WASPAA 2025

  14. arXiv:2505.09827  [pdf, ps, other]

    cs.CV

    Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

    Authors: Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji

    Abstract: Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this pa…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 HuMoGen Workshop

  15. arXiv:2504.20111  [pdf, ps, other]

    cs.CV

    Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

    Authors: Anubhav Jain, Yuya Kobayashi, Naoki Murata, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji, Niv Cohen, Nasir Memon, Julian Togelius

    Abstract: Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diff…

    Submitted 27 April, 2025; originally announced April 2025.

  16. arXiv:2504.14735  [pdf, other]

    cs.SD eess.AS

    DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions

    Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Ben Hayes, Wei-Hsiang Liao, György Fazekas, Yuki Mitsufuji

    Abstract: This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets, compris…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Submitted to DAFx 2025

  17. arXiv:2504.10826  [pdf, other]

    cs.SD cs.MM eess.AS

    SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

    Authors: Xinlei Niu, Kin Wai Cheuk, Jing Zhang, Naoki Murata, Chieh-Hsin Lai, Michele Mancusi, Woosung Choi, Giorgio Fabbro, Wei-Hsiang Liao, Charles Patrick Martin, Yuki Mitsufuji

    Abstract: Music editing is an important step in music production, which has broad applications, including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models by involving forward-backward diffusion processes for editing. However, these methods often struggle to maintain the music content consistency. Additionally, text instructions alone usua…

    Submitted 14 April, 2025; originally announced April 2025.

  18. arXiv:2504.06264  [pdf, other]

    cs.CV

    D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes

    Authors: Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

    Abstract: We address the task of 3D reconstruction in dynamic scenes, where object motions degrade the quality of previous 3D pointmap regression methods, such as DUSt3R, originally designed for static 3D scene reconstruction. Although these methods provide an elegant and powerful solution in static settings, they struggle in the presence of dynamic motions that disrupt alignment based solely on camera pose…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: project page: https://cvlab-kaist.github.io/DDUSt3R/

  19. arXiv:2504.05154  [pdf, ps, other]

    cs.CL

    CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness

    Authors: Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu

    Abstract: Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce CARE, a mu…

    Submitted 18 June, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

    Comments: 28 pages

  20. arXiv:2503.20871  [pdf, other]

    cs.CV cs.AI cs.CL

    VinaBench: Benchmark for Faithful and Consistent Visual Narratives

    Authors: Silin Gao, Sheryl Mathew, Li Mi, Sepideh Mamooler, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Syrielle Montariol, Antoine Bosselut

    Abstract: Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to addres…

    Submitted 3 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

  21. arXiv:2503.16669  [pdf, other]

    cs.SD cs.AI eess.AS

    Aligning Text-to-Music Evaluation with Human Preferences

    Authors: Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

    Abstract: Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to pa…

    Submitted 20 March, 2025; originally announced March 2025.

  22. arXiv:2503.11190  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Cross-Modal Learning for Music-to-Music-Video Description Generation

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri…

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

  23. arXiv:2502.18197  [pdf, ps, other]

    cs.LG cs.CV

    VCT: Training Consistency Models with Variational Noise Coupling

    Authors: Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji

    Abstract: Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow m…

    Submitted 4 June, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

    Comments: 23 pages, 11 figures

  24. arXiv:2502.16936  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS stat.ML

    Supervised contrastive learning from weakly-labeled audio segments for musical version matching

    Authors: Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, Yuki Mitsufuji

    Abstract: Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet los…

    Submitted 16 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 17 pages, 6 figures, 8 tables (includes Appendix); accepted at ICML25

  25. arXiv:2502.12623  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

    Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance…

    Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

  26. arXiv:2502.12080  [pdf, ps, other]

    cs.CV

    HumanGif: Single-View Human Diffusion with Generative Prior

    Authors: Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by t…

    Submitted 29 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Project page: https://skhu101.github.io/HumanGif/

  27. arXiv:2501.11837  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    30+ Years of Source Separation Research: Achievements and Future Challenges

    Authors: Shoko Araki, Nobutaka Ito, Reinhold Haeb-Umbach, Gordon Wichern, Zhong-Qiu Wang, Yuki Mitsufuji

    Abstract: Source separation (SS) of acoustic signals is a research field that emerged in the mid-1990s and has flourished ever since. On the occasion of ICASSP's 50th anniversary, we review the major contributions and advancements in the past three decades in the speech, audio, and music SS research field. We will cover both single- and multi-channel SS approaches. We will also look back on key efforts to f…

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: Accepted by IEEE ICASSP 2025

  28. arXiv:2501.08727  [pdf, other]

    cs.LG

    Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

    Authors: Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji

    Abstract: Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assum…

    Submitted 15 January, 2025; originally announced January 2025.

  29. arXiv:2501.02786  [pdf, other]

    cs.SD cs.CV eess.AS

    CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

    Authors: Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation lay…

    Submitted 6 January, 2025; originally announced January 2025.

  30. arXiv:2412.15322  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

    Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

    Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Addit…

    Submitted 7 April, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to CVPR 2025. Project page: https://hkchengrex.github.io/MMAudio

  31. arXiv:2412.13462  [pdf, other]

    cs.SD cs.MM eess.AS

    SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

    Authors: Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: 5 pages, 3 figures

  32. arXiv:2412.07658  [pdf, other]

    cs.CV cs.AI cs.LG

    TraSCE: Trajectory Steering for Concept Erasure

    Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

    Abstract: Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing…

    Submitted 17 March, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

  33. arXiv:2412.00557  [pdf, other]

    cs.CV cs.AI cs.LG

    Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

    Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov

    Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-i…

    Submitted 30 November, 2024; originally announced December 2024.

  34. arXiv:2411.16738  [pdf, other]

    cs.CV cs.AI cs.LG

    Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

    Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

    Abstract: Diffusion models are prone to reproducing images from the training data exactly. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs b…

    Submitted 17 March, 2025; v1 submitted 23 November, 2024; originally announced November 2024.

    Comments: CVPR 2025

  35. arXiv:2411.01135  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Music Foundation Model as Generic Booster for Music Downstream Tasks

    Authors: WeiHsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kinwai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji

    Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across var…

    Submitted 27 May, 2025; v1 submitted 2 November, 2024; originally announced November 2024.

    Comments: 41 pages with 14 figures

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2025

  36. arXiv:2410.15573  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    OpenMU: Your Swiss Army Knife for Music Understanding

    Authors: Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our mus…

    Submitted 27 November, 2024; v1 submitted 20 October, 2024; originally announced October 2024.

    Comments: Resources: https://github.com/sony/openmu

  37. arXiv:2410.14758  [pdf, other]

    cs.LG

    Improving Vector-Quantized Image Modeling with Latent Consistency-Matching Diffusion

    Authors: Bac Nguyen, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

    Abstract: By embedding discrete representations into a continuous latent space, we can leverage continuous-space latent diffusion models to handle generative modeling of discrete data. However, despite their initial success, most latent diffusion methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via reconstruc…

    Submitted 1 April, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

  38. arXiv:2410.14710  [pdf, other]

    cs.CV cs.AI cs.LG

    G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving

    Authors: Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

    Abstract: Recent literature has effectively utilized diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete compressed representations, such as image and motion generation. However, their discrete and non-differentiable nature has limit…

    Submitted 9 October, 2024; originally announced October 2024.

  39. arXiv:2410.08709  [pdf, other]

    cs.LG math.NA stat.ML

    Distillation of Discrete Diffusion through Dimensional Correlations

    Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in image, sequential dependencies in la…

    Submitted 8 May, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: 39 pages, ICML 2025 accepted

  40. arXiv:2410.07761  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    $\textit{Jump Your Steps}$: Optimizing Sampling Schedule of Discrete Diffusion Models

    Authors: Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji

    Abstract: Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like $τ$-leaping accelerate this process, they introduce $\textit{Compounding Decoding Error}$ (CDE), where discrepancies arise between the t…

    Submitted 10 October, 2024; originally announced October 2024.

  41. arXiv:2410.06154  [pdf, other]

    cs.CV

    GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

    Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass

    Abstract: In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task.…

    Submitted 5 February, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: Code: https://github.com/jmiemirza/GLOV

  42. arXiv:2410.06016  [pdf, other]

    cs.SD cs.LG eess.AS

    Variable Bitrate Residual Vector Quantization for Audio Coding

    Authors: Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for…

    Submitted 27 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ICASSP 2025 camera ready version

  43. arXiv:2410.05116  [pdf, other]

    cs.LG cs.AI cs.CV cs.HC

    HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

    Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji

    Abstract: Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult.…

    Submitted 13 March, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: Published in International Conference on Learning Representations (ICLR) 2025

  44. arXiv:2410.01796  [pdf, other]

    cs.LG

    Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space

    Authors: Yangming Li, Chieh-Hsin Lai, Carola-Bibiane Schönlieb, Yuki Mitsufuji, Stefano Ermon

    Abstract: Deep Generative Models (DGMs), including Energy-Based Models (EBMs) and Score-based Generative Models (SGMs), have advanced high-fidelity data generation and complex continuous distribution approximation. However, their application in Markov Decision Processes (MDPs), particularly in distributional Reinforcement Learning (RL), remains underexplored, with conventional histogram-based methods domina…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: Paper under review

  45. arXiv:2410.00700  [pdf, other]

    cs.CV cs.AI

    Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

    Authors: Saurav Jha, Shiqi Yang, Masato Ishii, Mengjie Zhao, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learnin…

    Submitted 9 February, 2025; v1 submitted 1 October, 2024; originally announced October 2024.

    Comments: Accepted to ICLR 2025

  46. arXiv:2410.00083  [pdf, ps, other]

    cs.LG cs.AI cs.CV

    A Survey on Diffusion Models for Inverse Problems

    Authors: Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Peyman Milanfar, Alexandros G. Dimakis, Mauricio Delbracio

    Abstract: Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples. This has unlocked exciting new possibilities for solving inverse problems, especially in image restoration and reconstruction, by treating diffusion models as unsupervised priors. This survey provides a comprehensive overview of methods that utilize pre-trained diffusion…

    Submitted 30 September, 2024; originally announced October 2024.

    Comments: Work in progress. 38 pages

  47. arXiv:2409.17550  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

    Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which p…

    Submitted 8 April, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: IJCNN 2025. The source code is available: https://github.com/SonyResearch/SVG_baseline

  48. arXiv:2409.07743  [pdf, other]

    cs.CR

    LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking

    Authors: Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: This paper presents a novel approach to deter unauthorized deepfakes and enable user tracking in generative models, even when the user has full access to the model parameters, by integrating key-based model authentication with watermarking techniques. Our method involves providing users with model parameters accompanied by a unique, user-specific key. During inference, the model is conditioned upo…

    Submitted 21 September, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Authenticating deep generative models, 5 pages, 5 figures, 2 tables

  49. arXiv:2409.06096  [pdf, ps, other]

    cs.SD cs.AI cs.IR eess.AS

    Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

    Authors: Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Eloi Moliner, Chieh-Hsin Lai, Stefan Uhlich, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Yuki Mitsufuji

    Abstract: Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a…

    Submitted 7 January, 2025; v1 submitted 9 September, 2024; originally announced September 2024.

  50. arXiv:2408.10807  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

    Authors: Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

    Abstract: Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a se…

    Submitted 20 August, 2024; originally announced August 2024.