-
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
Authors:
Akio Hayakawa,
Masato Ishii,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a ta…
▽ More
We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
△ Less
Submitted 27 June, 2025; v1 submitted 26 June, 2025;
originally announced June 2025.
-
Communication-Efficient Publication of Sparse Vectors under Differential Privacy
Authors:
Quentin Hillebrand,
Vorapong Suppakitpaisarn,
Tetsuo Shibuya
Abstract:
In this work, we propose a differentially private algorithm for publishing matrices aggregated from sparse vectors. These matrices include social network adjacency matrices, user-item interaction matrices in recommendation systems, and single nucleotide polymorphisms (SNPs) in DNA data. Traditionally, differential privacy in vector collection relies on randomized response, but this approach incurs…
▽ More
In this work, we propose a differentially private algorithm for publishing matrices aggregated from sparse vectors. These matrices include social network adjacency matrices, user-item interaction matrices in recommendation systems, and single nucleotide polymorphisms (SNPs) in DNA data. Traditionally, differential privacy in vector collection relies on randomized response, but this approach incurs high communication costs. Specifically, for a matrix with $N$ users, $n$ columns, and $m$ nonzero elements, conventional methods require $Ω(n \times N)$ communication, making them impractical for large-scale data. Our algorithm significantly reduces this cost to $O(\varepsilon m)$, where $\varepsilon$ is the privacy budget. Notably, this is even lower than the non-private case, which requires $Ω(m \log n)$ communication. Moreover, as the privacy budget decreases, communication cost further reduces, enabling better privacy with improved efficiency. We theoretically prove that our method yields results identical to those of randomized response, and experimental evaluations confirm its effectiveness in terms of accuracy, communication efficiency, and computational complexity.
△ Less
Submitted 25 June, 2025;
originally announced June 2025.
-
Vid-CamEdit: Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry
Authors:
Junyoung Seo,
Jisang Han,
Jaewoo Jung,
Siyoon Jin,
Joungbin Lee,
Takuya Narihira,
Kazumi Fukuda,
Takashi Shibuya,
Donghoon Ahn,
Shoukang Hu,
Seungryong Kim,
Yuki Mitsufuji
Abstract:
We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synt…
▽ More
We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synthesis cannot handle in-the-wild videos. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry. By integrating geometric priors, the generative model focuses on synthesizing realistic details where the estimated geometry is uncertain. We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data. Our method outperforms baselines in producing plausible videos from novel camera trajectories, especially in extreme extrapolation scenarios on real-world footage.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity
Authors:
Yuya Kobayashi,
Yuhta Takida,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significant…
▽ More
Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduced the training cost compared with the other large-scale GANs, but we found the model loses the diversity of generation for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose to use a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
Dyadic Mamba: Long-term Dyadic Human Motion Synthesis
Authors:
Julian Tanke,
Takashi Shibuya,
Kengo Uchida,
Koichi Saito,
Yuki Mitsufuji
Abstract:
Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this pa…
▽ More
Generating realistic dyadic human motion from text descriptions presents significant challenges, particularly for extended interactions that exceed typical training sequence lengths. While recent transformer-based approaches have shown promising results for short-term dyadic motion synthesis, they struggle with longer sequences due to inherent limitations in positional encoding schemes. In this paper, we introduce Dyadic Mamba, a novel approach that leverages State-Space Models (SSMs) to generate high-quality dyadic human motion of arbitrary length. Our method employs a simple yet effective architecture that facilitates information flow between individual motion sequences through concatenation, eliminating the need for complex cross-attention mechanisms. We demonstrate that Dyadic Mamba achieves competitive performance on standard short-term benchmarks while significantly outperforming transformer-based approaches on longer sequences. Additionally, we propose a new benchmark for evaluating long-term motion synthesis quality, providing a standardized framework for future research. Our results demonstrate that SSM-based architectures offer a promising direction for addressing the challenging task of long-term dyadic human motion synthesis from text descriptions.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image
Authors:
Anubhav Jain,
Yuya Kobayashi,
Naoki Murata,
Yuhta Takida,
Takashi Shibuya,
Yuki Mitsufuji,
Niv Cohen,
Nasir Memon,
Julian Togelius
Abstract:
Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diff…
▽ More
Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diffusion model weights. Our attack uses only a single watermarked example and is based on a simple observation: there is a many-to-one mapping between images and initial noises. There are regions in the clean image latent space pertaining to each watermark that get mapped to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images. We show that we can also apply a similar approach for watermark removal by learning perturbations to exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in the watermarking methods, motivating future research on improving them.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
HumanGif: Single-View Human Diffusion with Generative Prior
Authors:
Shoukang Hu,
Takuya Narihira,
Kazumi Fukuda,
Ryosuke Sawata,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by t…
▽ More
Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative prior. Specifically, we formulate the single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models to complement the missing information. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between latent and image spaces in diffusion models. Extensive experiments on RenderPeople, DNA-Rendering, THuman 2.1, and TikTok datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
△ Less
Submitted 29 June, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Authors:
Yuanhong Chen,
Kazuki Shimada,
Christian Simon,
Yukara Ikemiya,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation lay…
▽ More
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
△ Less
Submitted 6 January, 2025;
originally announced January 2025.
-
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Authors:
Ho Kei Cheng,
Masato Ishii,
Akio Hayakawa,
Takashi Shibuya,
Alexander Schwing,
Yuki Mitsufuji
Abstract:
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Addit…
▽ More
We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio
△ Less
Submitted 7 April, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation
Authors:
Kazuki Shimada,
Christian Simon,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in…
▽ More
This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking Spatially Aligned Audio-Video Generation (SAVG). We propose three key components for the benchmark: dataset, baseline, and metrics. We introduce a spatially aligned audio-visual dataset, derived from an audio-visual dataset consisting of multichannel audio, video, and spatiotemporal annotations of sound events. We propose a baseline audio-visual diffusion model focused on stereo audio-visual joint learning to accommodate spatial sound. Finally, we present metrics to evaluate video and spatial audio quality, including a new spatial audio-visual alignment metric. Our experimental result demonstrates that gaps exist between the baseline model and ground truth in terms of video and audio quality, and spatial alignment between both modalities.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
TraSCE: Trajectory Steering for Concept Erasure
Authors:
Anubhav Jain,
Yuya Kobayashi,
Takashi Shibuya,
Yuhta Takida,
Nasir Memon,
Julian Togelius,
Yuki Mitsufuji
Abstract:
Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing…
▽ More
Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content, including ones proposed by red teams, and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (either image or prompt), making it easier for model owners to erase new concepts.
△ Less
Submitted 17 March, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
-
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
Authors:
Anubhav Jain,
Yuya Kobayashi,
Takashi Shibuya,
Yuhta Takida,
Nasir Memon,
Julian Togelius,
Yuki Mitsufuji
Abstract:
Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs b…
▽ More
Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.
△ Less
Submitted 17 March, 2025; v1 submitted 23 November, 2024;
originally announced November 2024.
-
Differentially Private Selection using Smooth Sensitivity
Authors:
Akito Yamamoto,
Tetsuo Shibuya
Abstract:
With the growing volume of data in society, the need for privacy protection in data analysis also rises. In particular, private selection tasks, wherein the most important information is retrieved under differential privacy are emphasized in a wide range of contexts, including machine learning and medical statistical analysis. However, existing mechanisms use global sensitivity, which may add larg…
▽ More
With the growing volume of data in society, the need for privacy protection in data analysis also rises. In particular, private selection tasks, wherein the most important information is retrieved under differential privacy are emphasized in a wide range of contexts, including machine learning and medical statistical analysis. However, existing mechanisms use global sensitivity, which may add larger amount of perturbation than is necessary. Therefore, this study proposes a novel mechanism for differentially private selection using the concept of smooth sensitivity and presents theoretical proofs of strict privacy guarantees. Simultaneously, given that the current state-of-the-art algorithm using smooth sensitivity is still of limited use, and that the theoretical analysis of the basic properties of the noise distributions are not yet rigorous, we present fundamental theorems to improve upon them. Furthermore, new theorems are proposed for efficient noise generation. Experiments demonstrate that the proposed mechanism can provide higher accuracy than the existing global sensitivity-based methods. Finally, we show key directions for further theoretical development. Overall, this study can be an important foundational work for expanding the potential of smooth sensitivity in privacy-preserving data analysis. The Python implementation of our experiments and supplemental results are available at https://github.com/ay0408/Smooth-Private-Selection.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
Authors:
Ayano Hiranaka,
Shang-Fu Chen,
Chieh-Hsin Lai,
Dongjun Kim,
Naoki Murata,
Takashi Shibuya,
Wei-Hsiang Liao,
Shao-Hua Sun,
Yuki Mitsufuji
Abstract:
Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult.…
▽ More
Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD's refined initialization samples, enabling faster convergence towards the evaluator's intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback. The code and project page are available at https://hero-dm.github.io/.
△ Less
Submitted 13 March, 2025; v1 submitted 7 October, 2024;
originally announced October 2024.
-
Embedded Topic Models Enhanced by Wikification
Authors:
Takashi Shibuya,
Takehito Utsuro
Abstract:
Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take into consideration the homography of words. In this study, we incorporate the Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets, 1) news articles of \textit{…
▽ More
Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take into consideration the homography of words. In this study, we incorporate the Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets, 1) news articles of \textit{New York Times} and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the performance of neural topic models in generalizability. Moreover, we analyze frequent terms in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation
Authors:
Masato Ishii,
Akio Hayakawa,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which p…
▽ More
In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which provides different timestep information to each base model. It is designed to align how samples are generated along with timesteps across modalities. The second one is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represents temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods.
△ Less
Submitted 8 April, 2025; v1 submitted 26 September, 2024;
originally announced September 2024.
-
Cycle Counting under Local Differential Privacy for Degeneracy-bounded Graphs
Authors:
Quentin Hillebrand,
Vorapong Suppakitpaisarn,
Tetsuo Shibuya
Abstract:
We propose an algorithm for counting the number of cycles under local differential privacy for degeneracy-bounded input graphs. Numerous studies have focused on counting the number of triangles under the privacy notion, demonstrating that the expected $\ell_2$-error of these algorithms is $Ω(n^{1.5})$, where $n$ is the number of nodes in the graph. When parameterized by the number of cycles of len…
▽ More
We propose an algorithm for counting the number of cycles under local differential privacy for degeneracy-bounded input graphs. Numerous studies have focused on counting the number of triangles under the privacy notion, demonstrating that the expected $\ell_2$-error of these algorithms is $Ω(n^{1.5})$, where $n$ is the number of nodes in the graph. When parameterized by the number of cycles of length four ($C_4$), the best existing triangle counting algorithm has an error of $O(n^{1.5} + \sqrt{C_4}) = O(n^2)$. In this paper, we introduce an algorithm with an expected $\ell_2$-error of $O(δ^{1.5} n^{0.5} + δ^{0.5} d_{\max}^{0.5} n^{0.5})$, where $δ$ is the degeneracy and $d_{\max}$ is the maximum degree of the graph. For degeneracy-bounded graphs ($δ\in Θ(1)$) commonly found in practical social networks, our algorithm achieves an expected $\ell_2$-error of $O(d_{\max}^{0.5} n^{0.5}) = O(n)$. Our algorithm's core idea is a precise count of triangles following a preprocessing step that approximately sorts the degree of all nodes. This approach can be extended to approximate the number of cycles of length $k$, maintaining a similar $\ell_2$-error, namely $O(δ^{(k-2)/2} d_{\max}^{0.5} n^{(k-2)/2} + δ^{k/2} n^{(k-2)/2})$ or $O(d_{\max}^{0.5} n^{(k-2)/2}) = O(n^{(k-1)/2})$ for degeneracy-bounded graphs.
△ Less
Submitted 26 September, 2024; v1 submitted 25 September, 2024;
originally announced September 2024.
-
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
Authors:
Marco Comunità,
Zhi Zhong,
Akira Takahashi,
Shiqi Yang,
Mengjie Zhao,
Koichi Saito,
Yukara Ikemiya,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of mod…
▽ More
Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of model parameters. To address the challenges, we propose SpecMaskGIT, a light-weighted, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10s audio clip by less than 16 iterations, an order-of-magnitude less than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark, while being real-time with only 4 CPU cores or even 30x faster with a GPU. Next, built upon a latent space of Mel-spectrogram, SpecMaskGIT has a wider range of applications (e.g., the zero-shot bandwidth extension) than similar methods built on the latent wave domain. Moreover, we interpret SpecMaskGIT as a generative extension to previous discriminative audio masked Transformers, and shed light on its audio representation learning potential. We hope our work inspires the exploration of masked audio modeling toward further diverse scenarios.
△ Less
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training
Authors:
Kengo Uchida,
Takashi Shibuya,
Yuhta Takida,
Naoki Murata,
Julian Tanke,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provi…
▽ More
In text-to-motion generation, controllability as well as generation quality and speed has become increasingly critical. The controllability challenges include generating a motion of a length that matches the given textual description and editing the generated motions according to control signals, such as the start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and can also deal with multiple editing tasks in a single framework. Our approach revisits the motion representation used as inputs and outputs in the model, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.
△ Less
Submitted 14 April, 2025; v1 submitted 3 June, 2024;
originally announced June 2024.
-
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
Authors:
Koichi Saito,
Dongjun Kim,
Takashi Shibuya,
Chieh-Hsin Lai,
Zhi Zhong,
Yuhta Takida,
Yuki Mitsufuji
Abstract:
Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these mod…
▽ More
Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through 1-step generation, the sample quality of $1$-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality itself, the semantic content changes due to their lack of deterministic sampling capabilities. To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality $1$-step sound generation and superior sound quality through multi-step deterministic sampling. This allows creators to efficiently conduct trial-and-error with 1-step generation to semantically align samples with their intention, and subsequently refine sample quality with preserving semantic content through deterministic multi-step sampling. To develop SoundCTM, we reframe the CTM training framework, originally proposed in computer vision, and introduce a novel feature distance using the teacher network for a distillation loss. For production-level generation, we scale up our model to 1B trainable parameters, making SoundCTM-DiT-1B the first large-scale distillation model in the sound community to achieve both promising high-quality 1-step and multi-step full-band (44.1kHz) generation.
△ Less
Submitted 10 March, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
Authors:
Akio Hayakawa,
Masato Ishii,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint…
▽ More
This study aims to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightweight joint guidance module to adjust scores separately estimated by the base models to match the score of joint distribution over audio and video. We show that this guidance can be computed using the gradient of the optimal discriminator, which distinguishes real audio-video pairs from fake ones independently generated by the base models. Based on this analysis, we construct a joint guidance module by training this discriminator. Additionally, we adopt a loss function to stabilize the discriminator's gradient and make it work as a noise estimator, as in standard diffusion models. Empirical evaluations on several benchmark datasets demonstrate that our method improves both single-modal fidelity and multimodal alignment with relatively few parameters. The code is available at: https://github.com/SonyResearch/MMDisCo.
△ Less
Submitted 25 February, 2025; v1 submitted 28 May, 2024;
originally announced May 2024.
-
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
Authors:
Junyoung Seo,
Kazumi Fukuda,
Takashi Shibuya,
Takuya Narihira,
Naoki Murata,
Shoukang Hu,
Chieh-Hsin Lai,
Seungryong Kim,
Yuki Mitsufuji
Abstract:
Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to…
▽ More
Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/.
△ Less
Submitted 26 September, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Authors:
Shiqi Yang,
Zhi Zhong,
Mengjie Zhao,
Shusuke Takahashi,
Masato Ishii,
Takashi Shibuya,
Yuki Mitsufuji
Abstract:
In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation method…
▽ More
In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/
△ Less
Submitted 24 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Privacy-Optimized Randomized Response for Sharing Multi-Attribute Data
Authors:
Akito Yamamoto,
Tetsuo Shibuya
Abstract:
With the increasing amount of data in society, privacy concerns in data sharing have become widely recognized. Particularly, protecting personal attribute information is essential for a wide range of aims from crowdsourcing to realizing personalized medicine. Although various differentially private methods based on randomized response have been proposed for single attribute information or specific…
▽ More
With the increasing amount of data in society, privacy concerns in data sharing have become widely recognized. Particularly, protecting personal attribute information is essential for a wide range of aims from crowdsourcing to realizing personalized medicine. Although various differentially private methods based on randomized response have been proposed for single attribute information or specific analysis purposes such as frequency estimation, there is a lack of studies on the mechanism for sharing individuals' multiple categorical information itself. The existing randomized response for sharing multi-attribute data uses the Kronecker product to perturb each attribute information in turn according to the respective privacy level but achieves only a weak privacy level for the entire dataset. Therefore, in this study, we propose a privacy-optimized randomized response that guarantees the strongest privacy in sharing multi-attribute data. Furthermore, we present an efficient heuristic algorithm for constructing a near-optimal mechanism. The time complexity of our algorithm is O(k^2), where k is the number of attributes, and it can be performed in about 1 second even for large datasets with k = 1,000. The experimental results demonstrate that both of our methods provide significantly stronger privacy guarantees for the entire dataset than the existing method. In addition, we show an analysis example using genome statistics to confirm that our methods can achieve less than half the output error compared with that of the existing method. Overall, this study is an important step toward trustworthy sharing and analysis of multi-attribute data. The Python implementation of our experiments and supplemental results are available at https://github.com/ay0408/Optimized-RR.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes
Authors:
Yuhta Takida,
Yukara Ikemiya,
Takashi Shibuya,
Kazuki Shimada,
Woosung Choi,
Chieh-Hsin Lai,
Naoki Murata,
Toshimitsu Uesaka,
Kengo Uchida,
Wei-Hsiang Liao,
Yuki Mitsufuji
Abstract:
Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the co…
▽ More
Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.
△ Less
Submitted 28 March, 2024; v1 submitted 30 December, 2023;
originally announced January 2024.
-
Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions
Authors:
Quentin Hillebrand,
Vorapong Suppakitpaisarn,
Tetsuo Shibuya
Abstract:
We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics, including the count of subgraphs, under the edge local differential privacy, many suffer with high communication costs, making them less efficient for large graphs. Though data compression is a typical…
▽ More
We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics, including the count of subgraphs, under the edge local differential privacy, many suffer with high communication costs, making them less efficient for large graphs. Though data compression is a typical approach in differential privacy, its application in local differential privacy requires a form of compression that every node can reproduce. In our study, we introduce linear congruence hashing. With a sampling rate of $s$, our method can cut communication costs by a factor of $s^2$, albeit at the cost of increasing variance in the published graph statistic by a factor of $s$. The experimental results indicate that, when matched for communication costs, our method achieves a reduction in the $\ell_2$-error for triangle counts by up to 1000 times compared to the performance of leading algorithms.
△ Less
Submitted 12 December, 2023;
originally announced December 2023.
-
On the Language Encoder of Contrastive Cross-modal Models
Authors:
Mengjie Zhao,
Junya Ono,
Zhi Zhong,
Chieh-Hsin Lai,
Yuhta Takida,
Naoki Murata,
Wei-Hsiang Liao,
Takashi Shibuya,
Hiromi Wakaki,
Yuki Mitsufuji
Abstract:
Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding…
▽ More
Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.
△ Less
Submitted 20 October, 2023;
originally announced October 2023.
-
Zero- and Few-shot Sound Event Localization and Detection
Authors:
Kazuki Shimada,
Kengo Uchida,
Yuichiro Koyama,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji,
Tatsuya Kawahara
Abstract:
Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…
▽ More
Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete train data in an evaluation dataset.
△ Less
Submitted 17 January, 2024; v1 submitted 17 September, 2023;
originally announced September 2023.
-
BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
Authors:
Takashi Shibuya,
Yuhta Takida,
Yuki Mitsufuji
Abstract:
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an…
▽ More
Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
△ Less
Submitted 24 March, 2024; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders
Authors:
Hao Shi,
Kazuki Shimada,
Masato Hirano,
Takashi Shibuya,
Yuichiro Koyama,
Zhi Zhong,
Shusuke Takahashi,
Tatsuya Kawahara,
Yuki Mitsufuji
Abstract:
Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us…
▽ More
Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us to use the further complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that use jointly generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two decoded features by generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).
△ Less
Submitted 28 February, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Extending Audio Masked Autoencoders Toward Audio Restoration
Authors:
Zhi Zhong,
Hao Shi,
Masato Hirano,
Kazuki Shimada,
Kazuya Tateishi,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., s…
▽ More
Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., speech enhancement (SE). Previous works have shown that the features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized encoder-only models usually require extra decoders to become compatible with SE, and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE) whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signal via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for a better SE performance, where the mel-to-mel variations yield high scores in non-intrusive metrics and the STFT-oriented variation is effective at intrusive metrics such as PESQ. Different variations can be used in accordance with the scenarios. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and help the ViT-AE to better generalize to out-of-domain distortions. We further found that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.
△ Less
Submitted 17 August, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
-
SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer
Authors:
Yuhta Takida,
Masaaki Imaizumi,
Takashi Shibuya,
Chieh-Hsin Lai,
Toshimitsu Uesaka,
Naoki Murata,
Yuki Mitsufuji
Abstract:
Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to…
▽ More
Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to serve as the distance between the distributions by connecting the GAN formulation with the concept of sliced optimal transport. Furthermore, by leveraging these theoretical results, we propose a novel GAN training scheme, called slicing adversarial network (SAN). With only simple modifications, a broad class of existing GANs can be converted to SANs. Experiments on synthetic and image datasets support our theoretical results and the SAN's effectiveness as compared to usual GANs. Furthermore, we also apply SAN to StyleGAN-XL, which leads to state-of-the-art FID score amongst GANs for class conditional generation on ImageNet 256$\times$256. Our implementation is available on https://ytakida.github.io/san.
△ Less
Submitted 10 April, 2024; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Fixed-Weight Difference Target Propagation
Authors:
Tatsukichi Shibuya,
Nakamasa Inoue,
Rei Kawakami,
Ikuro Sato
Abstract:
Target Propagation (TP) is a biologically more plausible algorithm than the error backpropagation (BP) to train deep networks, and improving practicality of TP is an open issue. TP methods require the feedforward and feedback networks to form layer-wise autoencoders for propagating the target values generated at the output layer. However, this causes certain drawbacks; e.g., careful hyperparameter…
▽ More
Target Propagation (TP) is a biologically more plausible algorithm than the error backpropagation (BP) to train deep networks, and improving practicality of TP is an open issue. TP methods require the feedforward and feedback networks to form layer-wise autoencoders for propagating the target values generated at the output layer. However, this causes certain drawbacks; e.g., careful hyperparameter tuning is required to synchronize the feedforward and feedback training, and frequent updates of the feedback path are usually required than that of the feedforward path. Learning of the feedforward and feedback networks is sufficient to make TP methods capable of training, but is having these layer-wise autoencoders a necessary condition for TP to work? We answer this question by presenting Fixed-Weight Difference Target Propagation (FW-DTP) that keeps the feedback weights constant during training. We confirmed that this simple method, which naturally resolves the abovementioned problems of TP, can still deliver informative target values to hidden layers for a given task; indeed, FW-DTP consistently achieves higher test performance than a baseline, the Difference Target Propagation (DTP), on four classification datasets. We also present a novel propagation architecture that explains the exact form of the feedback function of DTP to analyze FW-DTP.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement
Authors:
Ryosuke Sawata,
Naoki Murata,
Yuhta Takida,
Toshimitsu Uesaka,
Takashi Shibuya,
Shusuke Takahashi,
Yuki Mitsufuji
Abstract:
Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a datase…
▽ More
Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a dataset consisting of clean speech only. Then, our refiner effectively mixes clean parts newly generated via denoising diffusion restoration into the degraded and distorted parts caused by a preceding SE method, resulting in refined speech. Once our refiner is trained on a set of clean speech, it can be applied to various SE methods without additional training specialized for each SE module. Therefore, our refiner can be a versatile post-processing module w.r.t. SE methods and has high potential in terms of modularity. Experimental results show that our method improved perceptual speech quality regardless of the preceding SE methods used.
△ Less
Submitted 30 August, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models
Authors:
Dong-Ho Lee,
Akshen Kadakia,
Brihi Joshi,
Aaron Chan,
Ziyi Liu,
Kiran Narahari,
Takashi Shibuya,
Ryosuke Mitani,
Toshiyuki Sekiya,
Jay Pujara,
Xiang Ren
Abstract:
NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, then using the feedback to update the model. While existing model debugging methods have…
▽ More
NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, then using the feedback to update the model. While existing model debugging methods have shown promise, their prototype-level implementations provide limited practical utility. Thus, we propose XMD: the first open-source, end-to-end framework for explanation-based model debugging. Given task- or instance-level explanations, users can flexibly provide various forms of feedback via an intuitive, web-based UI. After receiving user feedback, XMD automatically updates the model in real time, by regularizing the model so that its explanations align with the user feedback. The new model can then be easily deployed into real-world applications via Hugging Face. Using XMD, we can improve the model's OOD performance on text classification tasks by up to 18%.
△ Less
Submitted 30 October, 2022;
originally announced October 2022.
-
SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization
Authors:
Yuhta Takida,
Takashi Shibuya,
WeiHsiang Liao,
Chieh-Hsin Lai,
Junki Ohmura,
Toshimitsu Uesaka,
Naoki Murata,
Shusuke Takahashi,
Toshiyuki Kumakura,
Yuki Mitsufuji
Abstract:
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standa…
▽ More
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.
△ Less
Submitted 9 June, 2022; v1 submitted 16 May, 2022;
originally announced May 2022.
-
Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER
Authors:
Dong-Ho Lee,
Akshen Kadakia,
Kangmin Tan,
Mahak Agarwal,
Xinyu Feng,
Takashi Shibuya,
Ryosuke Mitani,
Toshiyuki Sekiya,
Jay Pujara,
Xiang Ren
Abstract:
Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enume…
▽ More
Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enumeration of all possible text spans, and omission of inter-dependencies among token labels in a sentence. Here we present a simple demonstration-based learning method for NER, which lets the input be prefaced by task demonstrations for in-context learning. We perform a systematic study on demonstration strategy regarding what to include (entity examples, with or without surrounding context), how to select the examples, and what templates to use. Results on in-domain learning and domain adaptation show that the model's performance in low-resource settings can be largely improved with a suitable demonstration strategy (e.g., a 4-17% improvement on 25 train instances). We also find that good demonstration can save many labeled examples and consistency in demonstration contributes to better performance.
△ Less
Submitted 30 March, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Analyzing the Effect of Multi-task Learning for Biomedical Named Entity Recognition
Authors:
Arda Akdemir,
Tetsuo Shibuya
Abstract:
Developing high-performing systems for detecting biomedical named entities has major implications. State-of-the-art deep-learning based solutions for entity recognition often require large annotated datasets, which is not available in the biomedical domain. Transfer learning and multi-task learning have been shown to improve performance for low-resource domains. However, the applications of these…
▽ More
Developing high-performing systems for detecting biomedical named entities has major implications. State-of-the-art deep-learning based solutions for entity recognition often require large annotated datasets, which is not available in the biomedical domain. Transfer learning and multi-task learning have been shown to improve performance for low-resource domains. However, the applications of these methods are relatively scarce in the biomedical domain, and a theoretical understanding of why these methods improve the performance is lacking. In this study, we performed an extensive analysis to understand the transferability between different biomedical entity datasets. We found useful measures to predict transferability between these datasets. Besides, we propose combining transfer learning and multi-task learning to improve the performance of biomedical named entity recognition systems, which is not applied before to the best of our knowledge.
△ Less
Submitted 1 November, 2020;
originally announced November 2020.
-
Hierarchical Multi Task Learning with Subword Contextual Embeddings for Languages with Rich Morphology
Authors:
Arda Akdemir,
Tetsuo Shibuya,
Tunga Güngör
Abstract:
Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP). Yet, existing approaches rely heavily on manual annotations or external software to capture this information. In this study, we propose using subword contextual embeddings to capture the morphological information for languages with rich morphology. In addition, we incorporate these embeddi…
▽ More
Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP). Yet, existing approaches rely heavily on manual annotations or external software to capture this information. In this study, we propose using subword contextual embeddings to capture the morphological information for languages with rich morphology. In addition, we incorporate these embeddings in a hierarchical multi-task setting which is not employed before, to the best of our knowledge. Evaluated on Dependency Parsing (DEP) and Named Entity Recognition (NER) tasks, which are shown to benefit greatly from morphological information, our final model outperforms previous state-of-the-art models on both tasks for the Turkish language. Besides, we show a net improvement of 18.86% and 4.61% F-1 over the previously proposed multi-task learner in the same setting for the DEP and the NER tasks, respectively. Empirical results for five different MTL settings show that incorporating subword contextual embeddings brings significant improvements for both tasks. In addition, we observed that multi-task learning consistently improves the performance of the DEP component.
△ Less
Submitted 25 April, 2020;
originally announced April 2020.
-
Nested Named Entity Recognition via Second-best Sequence Learning and Decoding
Authors:
Takashi Shibuya,
Eduard Hovy
Abstract:
When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent…
▽ More
When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent entity. In addition, we provide the decoding method for inference that extracts entities iteratively from outermost ones to inner ones in an outside-to-inside way. Our method has no additional hyperparameters to the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method performs better than or at least as well as existing methods capable of handling nested entities, achieving the F1-scores of 85.82%, 84.34%, and 77.36% on ACE-2004, ACE-2005, and GENIA datasets, respectively.
△ Less
Submitted 10 July, 2020; v1 submitted 5 September, 2019;
originally announced September 2019.
-
Succinct Oblivious RAM
Authors:
Taku Onodera,
Tetsuo Shibuya
Abstract:
Reducing the database space overhead is critical in big-data processing. In this paper, we revisit oblivious RAM (ORAM) using big-data standard for the database space overhead.
ORAM is a cryptographic primitive that enables users to perform arbitrary database accesses without revealing the access pattern to the server. It is particularly important today since cloud services become increasingly c…
▽ More
Reducing the database space overhead is critical in big-data processing. In this paper, we revisit oblivious RAM (ORAM) using big-data standard for the database space overhead.
ORAM is a cryptographic primitive that enables users to perform arbitrary database accesses without revealing the access pattern to the server. It is particularly important today since cloud services become increasingly common making it necessary to protect users' private information from database access pattern analyses. Previous ORAM studies focused mostly on reducing the access overhead. Consequently, the access overhead of the state-of-the-art ORAM constructions is almost at practical levels in certain application scenarios such as secure processors. On the other hand, most existing ORAM constructions require $(1+Θ(1))n$ (say, $10n$) bits of server space where $n$ is the database size. Though such space complexity is often considered to be "optimal", overhead such as $10 \times$ is prohibitive for big-data applications in practice.
We propose ORAM constructions that take only $(1+o(1))n$ bits of server space while maintaining state-of-the-art performance in terms of the access overhead and the user space. We also give non-asymptotic analyses and simulation results which indicate that the proposed ORAM constructions are practically effective.
△ Less
Submitted 23 April, 2018;
originally announced April 2018.
-
Pattern Language for Good Old Future From Japanese Culture
Authors:
Megumi Kadotani,
Aya Matsumoto,
Takafumi Shibuya,
Younjae Lee,
Saori Watanabe,
Takashi Iba
Abstract:
Having developed greatly over millennium under its culture, the ancient buildings and old town atmospheres maintain a quality of comfort. However, people only appreciate the "good old" quality and do not think further about the rational reasons why they feel comfort in it. This keeps them from creating their own things and models with good old quality, relying on the imported western thinking and…
▽ More
Having developed greatly over millennium under its culture, the ancient buildings and old town atmospheres maintain a quality of comfort. However, people only appreciate the "good old" quality and do not think further about the rational reasons why they feel comfort in it. This keeps them from creating their own things and models with good old quality, relying on the imported western thinking and methods as a result of modernization. Since people are now struggling under the imbalanced and complex society, we believe that support is needed for generating things and frameworks with good old quality in modern time situations and scenes.
△ Less
Submitted 6 August, 2013;
originally announced August 2013.
-
Detecting Superbubbles in Assembly Graphs
Authors:
Taku Onodera,
Kunihiko Sadakane,
Tetsuo Shibuya
Abstract:
We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting it. Most assembly algorithms utilize assembly graphs like the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly…
▽ More
We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting it. Most assembly algorithms utilize assembly graphs like the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly to find sequencing errors. These motifs are easy to detect, but they are sometimes too simple to deal with more complex errors. The superbubble is an extension of the bubble, which is also important for analyzing assembly graphs. Though superbubbles are much more complex than ordinary bubbles, we show that they can be efficiently enumerated. We propose an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices and m edges) for graphs with a reasonable model, though the worst-case time complexity of our algorithm is quadratic (i.e., O(n(n+m))). Moreover, the algorithm is practically very fast: Our experiments show that our algorithm runs in reasonable time with a single CPU core even against a very large graph of a whole human genome.
△ Less
Submitted 30 July, 2013;
originally announced July 2013.
-
The Asymptotic Bit Error Probability of LDPC Codes for the Binary Erasure Channel with Finite Iteration Number
Authors:
Ryuhei Mori,
Kenta Kasai,
Tomoharu Shibuya,
Kohichi Sakaniwa
Abstract:
We consider communication over the binary erasure channel (BEC) using low-density parity-check (LDPC) code and belief propagation (BP) decoding. The bit error probability for infinite block length is known by density evolution and it is well known that a difference between the bit error probability at finite iteration number for finite block length $n$ and for infinite block length is asymptotic…
▽ More
We consider communication over the binary erasure channel (BEC) using low-density parity-check (LDPC) code and belief propagation (BP) decoding. The bit error probability for infinite block length is known by density evolution and it is well known that a difference between the bit error probability at finite iteration number for finite block length $n$ and for infinite block length is asymptotically $α/n$, where $α$ is a specific constant depending on the degree distribution, the iteration number and the erasure probability. Our main result is to derive an efficient algorithm for calculating $α$ for regular ensembles. The approximation using $α$ is accurate for $(2,r)$-regular ensembles even in small block length.
△ Less
Submitted 23 January, 2008; v1 submitted 7 January, 2008;
originally announced January 2008.