Showing 1–44 of 44 results for author: Shibuya, T

Searching in archive cs.
  1. arXiv:2506.20995  [pdf, ps, other]

    cs.CV cs.LG cs.SD eess.AS

    Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

    Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a ta…

    Submitted 27 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

  2. arXiv:2506.20234  [pdf, ps, other]

    cs.CR

    Communication-Efficient Publication of Sparse Vectors under Differential Privacy

    Authors: Quentin Hillebrand, Vorapong Suppakitpaisarn, Tetsuo Shibuya

    Abstract: In this work, we propose a differentially private algorithm for publishing matrices aggregated from sparse vectors. These matrices include social network adjacency matrices, user-item interaction matrices in recommendation systems, and single nucleotide polymorphisms (SNPs) in DNA data. Traditionally, differential privacy in vector collection relies on randomized response, but this approach incurs…

    Submitted 25 June, 2025; originally announced June 2025.

  3. arXiv:2506.13697  [pdf, ps, other]

    cs.CV

    Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry

    Authors: Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji

    Abstract: We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synt…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: Our project page can be found at https://cvlab-kaist.github.io/Vid-CamEdit/

  4. arXiv:2506.01493  [pdf, ps, other]

    cs.CV cs.LG

    Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

    Authors: Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significant…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted at IJCNN 2025

  5. arXiv:2505.09827  [pdf, ps, other]

    cs.CV

    Dyadic Mamba: Long-term Dyadic Human Motion Synthesis

    Authors: Julian Tanke, Takashi Shibuya, Kengo Uchida, Koichi Saito, Yuki Mitsufuji

    Abstract: Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this pa…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: CVPR 2025 HuMoGen Workshop

  6. arXiv:2504.20111  [pdf, ps, other]

    cs.CV

    Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

    Authors: Anubhav Jain, Yuya Kobayashi, Naoki Murata, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji, Niv Cohen, Nasir Memon, Julian Togelius

    Abstract: Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diff…

    Submitted 27 April, 2025; originally announced April 2025.

  7. arXiv:2502.12080  [pdf, ps, other]

    cs.CV

    HumanGif: Single-View Human Diffusion with Generative Prior

    Authors: Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by t…

    Submitted 29 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Project page: https://skhu101.github.io/HumanGif/

  8. arXiv:2501.02786  [pdf, other]

    cs.SD cs.CV eess.AS

    CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation

    Authors: Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji

    Abstract: Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation lay…

    Submitted 6 January, 2025; originally announced January 2025.

  9. arXiv:2412.15322  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

    Authors: Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

    Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Addit…

    Submitted 7 April, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to CVPR 2025. Project page: https://hkchengrex.github.io/MMAudio

  10. arXiv:2412.13462  [pdf, other]

    cs.SD cs.MM eess.AS

    SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

    Authors: Kazuki Shimada, Christian Simon, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: 5 pages, 3 figures

  11. arXiv:2412.07658  [pdf, other]

    cs.CV cs.AI cs.LG

    TraSCE: Trajectory Steering for Concept Erasure

    Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

    Abstract: Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing…

    Submitted 17 March, 2025; v1 submitted 10 December, 2024; originally announced December 2024.

  12. arXiv:2411.16738  [pdf, other]

    cs.CV cs.AI cs.LG

    Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

    Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji

    Abstract: Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs b…

    Submitted 17 March, 2025; v1 submitted 23 November, 2024; originally announced November 2024.

    Comments: CVPR 2025

  13. arXiv:2410.10187  [pdf, ps, other]

    cs.DS

    Differentially Private Selection using Smooth Sensitivity

    Authors: Akito Yamamoto, Tetsuo Shibuya

    Abstract: With the growing volume of data in society, the need for privacy protection in data analysis also rises. In particular, private selection tasks, wherein the most important information is retrieved under differential privacy are emphasized in a wide range of contexts, including machine learning and medical statistical analysis. However, existing mechanisms use global sensitivity, which may add larg…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Preprint of an article accepted at IEEE IPCCC 2024

  14. arXiv:2410.05116  [pdf, other]

    cs.LG cs.AI cs.CV cs.HC

    HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

    Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji

    Abstract: Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult.…

    Submitted 13 March, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: Published in International Conference on Learning Representations (ICLR) 2025

  15. arXiv:2410.02441  [pdf, other]

    cs.CL

    Embedded Topic Models Enhanced by Wikification

    Authors: Takashi Shibuya, Takehito Utsuro

    Abstract: Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take into consideration the homography of words. In this study, we incorporate the Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets, 1) news articles of \textit{…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Accepted at EMNLP 2024 Workshop NLP for Wikipedia

  16. arXiv:2409.17550  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

    Authors: Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which p…

    Submitted 8 April, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

    Comments: IJCNN 2025. The source code is available: https://github.com/SonyResearch/SVG_baseline

  17. arXiv:2409.16688  [pdf, other]

    cs.CR cs.DS

    Cycle Counting under Local Differential Privacy for Degeneracy-bounded Graphs

    Authors: Quentin Hillebrand, Vorapong Suppakitpaisarn, Tetsuo Shibuya

    Abstract: We propose an algorithm for counting the number of cycles under local differential privacy for degeneracy-bounded input graphs. Numerous studies have focused on counting the number of triangles under the privacy notion, demonstrating that the expected $\ell_2$-error of these algorithms is $Ω(n^{1.5})$, where $n$ is the number of nodes in the graph. When parameterized by the number of cycles of len…

    Submitted 26 September, 2024; v1 submitted 25 September, 2024; originally announced September 2024.

  18. arXiv:2406.17672  [pdf, other]

    cs.SD eess.AS

    SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

    Authors: Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of mod…

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 6 pages, 8 figures, 8 tables. Audio samples: https://zzaudio.github.io/SpecMaskGIT/index.html

  19. arXiv:2406.01867  [pdf, other]

    cs.CV

    MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

    Authors: Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian Tanke, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provi…

    Submitted 14 April, 2025; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: CVPR 2025 HuMoGen Workshop

  20. arXiv:2405.18503  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

    Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

    Abstract: Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these mod…

    Submitted 10 March, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Audio samples: https://anonymus-soundctm.github.io/soundctm_iclr/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

  21. arXiv:2405.17842  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

    Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint…

    Submitted 25 February, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: ICLR 2025

  22. arXiv:2405.17251  [pdf, other]

    cs.CV

    GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

    Authors: Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

    Abstract: Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to…

    Submitted 26 September, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to NeurIPS 2024 / Project page: https://GenWarp-NVS.github.io

  23. arXiv:2405.14598  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

    Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation method…

    Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 10 pages

  24. arXiv:2402.07584  [pdf, ps, other]

    cs.CR

    Privacy-Optimized Randomized Response for Sharing Multi-Attribute Data

    Authors: Akito Yamamoto, Tetsuo Shibuya

    Abstract: With the increasing amount of data in society, privacy concerns in data sharing have become widely recognized. Particularly, protecting personal attribute information is essential for a wide range of aims from crowdsourcing to realizing personalized medicine. Although various differentially private methods based on randomized response have been proposed for single attribute information or specific…

    Submitted 12 February, 2024; originally announced February 2024.

  25. arXiv:2401.00365  [pdf, other]

    cs.LG cs.AI cs.CV

    HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

    Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the co…

    Submitted 28 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: 34 pages with 17 figures, accepted for TMLR

  26. arXiv:2312.07055  [pdf, other]

    cs.CR cs.AI

    Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions

    Authors: Quentin Hillebrand, Vorapong Suppakitpaisarn, Tetsuo Shibuya

    Abstract: We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics, including the count of subgraphs, under the edge local differential privacy, many suffer with high communication costs, making them less efficient for large graphs. Though data compression is a typical…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: 13 pages, 3 figures

  27. arXiv:2310.13267  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding…

    Submitted 20 October, 2023; originally announced October 2023.

  28. arXiv:2309.09223  [pdf, other]

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  29. arXiv:2309.02836  [pdf, other]

    cs.SD cs.LG eess.AS

    BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

    Authors: Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

    Abstract: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an…

    Submitted 24 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024. Equation (5) in the previous version was wrong and has been corrected

  30. arXiv:2305.10734  [pdf, other]

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us…

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  31. arXiv:2305.06701  [pdf, ps, other]

    cs.SD eess.AS

    Extending Audio Masked Autoencoders Toward Audio Restoration

    Authors: Zhi Zhong, Hao Shi, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., s…

    Submitted 17 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: WASPAA 2023. Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  32. arXiv:2301.12811  [pdf, other]

    cs.LG

    SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

    Authors: Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

    Abstract: Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to…

    Submitted 10 April, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: 34 pages with 17 figures, accepted for publication in ICLR 2024

  33. arXiv:2212.10352  [pdf, other]

    cs.NE cs.LG

    Fixed-Weight Difference Target Propagation

    Authors: Tatsukichi Shibuya, Nakamasa Inoue, Rei Kawakami, Ikuro Sato

    Abstract: Target Propagation (TP) is a biologically more plausible algorithm than the error backpropagation (BP) to train deep networks, and improving practicality of TP is an open issue. TP methods require the feedforward and feedback networks to form layer-wise autoencoders for propagating the target values generated at the output layer. However, this causes certain drawbacks; e.g., careful hyperparameter…

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted at the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23). 9 pages and 3 figures in main manuscript; 11 pages and 5 figures in supplementary material

  34. Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

    Authors: Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a datase…

    Submitted 30 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  35. arXiv:2210.16978  [pdf, other]

    cs.CL

    XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models

    Authors: Dong-Ho Lee, Akshen Kadakia, Brihi Joshi, Aaron Chan, Ziyi Liu, Kiran Narahari, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, Xiang Ren

    Abstract: NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, then using the feedback to update the model. While existing model debugging methods have…

    Submitted 30 October, 2022; originally announced October 2022.

    Comments: 6 pages, 7 figures. Project page: https://inklab.usc.edu/xmd/

  36. arXiv:2205.07547  [pdf, other]

    cs.LG cs.CV

    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

    Authors: Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

    Abstract: One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standa…

    Submitted 9 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 25 pages with 10 figures, accepted for publication in ICML 2022 (Our code is available at https://github.com/sony/sqvae)

  37. arXiv:2110.08454  [pdf, other]

    cs.CL

    Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER

    Authors: Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, Xiang Ren

    Abstract: Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enume…

    Submitted 30 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted to ACL 2022 main conference. 14 pages, 8 figures, 9 tables

  38. arXiv:2011.00425  [pdf, other]

    cs.CL cs.LG

    Analyzing the Effect of Multi-task Learning for Biomedical Named Entity Recognition

    Authors: Arda Akdemir, Tetsuo Shibuya

    Abstract: Developing high-performing systems for detecting biomedical named entities has major implications. State-of-the-art deep-learning based solutions for entity recognition often require large annotated datasets, which is not available in the biomedical domain. Transfer learning and multi-task learning have been shown to improve performance for low-resource domains. However, the applications of these…

    Submitted 1 November, 2020; originally announced November 2020.

  39. arXiv:2004.12247  [pdf, other]

    cs.CL cs.IR cs.LG

    Hierarchical Multi Task Learning with Subword Contextual Embeddings for Languages with Rich Morphology

    Authors: Arda Akdemir, Tetsuo Shibuya, Tunga Güngör

    Abstract: Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP). Yet, existing approaches rely heavily on manual annotations or external software to capture this information. In this study, we propose using subword contextual embeddings to capture the morphological information for languages with rich morphology. In addition, we incorporate these embeddi…

    Submitted 25 April, 2020; originally announced April 2020.

  40. arXiv:1909.02250  [pdf, other]

    cs.CL

    Nested Named Entity Recognition via Second-best Sequence Learning and Decoding

    Authors: Takashi Shibuya, Eduard Hovy

    Abstract: When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent…

    Submitted 10 July, 2020; v1 submitted 5 September, 2019; originally announced September 2019.

    Comments: Accepted to TACL

  41. arXiv:1804.08285  [pdf, other]

    cs.DS

    Succinct Oblivious RAM

    Authors: Taku Onodera, Tetsuo Shibuya

    Abstract: Reducing the database space overhead is critical in big-data processing. In this paper, we revisit oblivious RAM (ORAM) using big-data standard for the database space overhead. ORAM is a cryptographic primitive that enables users to perform arbitrary database accesses without revealing the access pattern to the server. It is particularly important today since cloud services become increasingly c…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: 21 pages. A preliminary version of this paper appeared in STACS'18

  42. arXiv:1308.1611  [pdf]

    cs.CY

    Pattern Language for Good Old Future From Japanese Culture

    Authors: Megumi Kadotani, Aya Matsumoto, Takafumi Shibuya, Younjae Lee, Saori Watanabe, Takashi Iba

    Abstract: Having developed greatly over millennium under its culture, the ancient buildings and old town atmospheres maintain a quality of comfort. However, people only appreciate the "good old" quality and do not think further about the rational reasons why they feel comfort in it. This keeps them from creating their own things and models with good old quality, relying on the imported western thinking and…

    Submitted 6 August, 2013; originally announced August 2013.

    Comments: Presented at COINs13 Conference, Chile, 2013 (arxiv:1308.1028)

    Report number: coins13/2013/09

  43. arXiv:1307.7925  [pdf, ps, other]

    cs.DS cs.CE cs.DM q-bio.QM

    Detecting Superbubbles in Assembly Graphs

    Authors: Taku Onodera, Kunihiko Sadakane, Tetsuo Shibuya

    Abstract: We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting it. Most assembly algorithms utilize assembly graphs like the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly…

    Submitted 30 July, 2013; originally announced July 2013.

    Comments: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

  44. arXiv:0801.0931  [pdf, ps, other]

    cs.IT

    The Asymptotic Bit Error Probability of LDPC Codes for the Binary Erasure Channel with Finite Iteration Number

    Authors: Ryuhei Mori, Kenta Kasai, Tomoharu Shibuya, Kohichi Sakaniwa

    Abstract: We consider communication over the binary erasure channel (BEC) using low-density parity-check (LDPC) code and belief propagation (BP) decoding. The bit error probability for infinite block length is known by density evolution and it is well known that a difference between the bit error probability at finite iteration number for finite block length $n$ and for infinite block length is asymptotic…

    Submitted 23 January, 2008; v1 submitted 7 January, 2008; originally announced January 2008.

    Comments: 5 pages, 6 figures, correcting errors in Theorem 1 and poor English