Search | arXiv e-print repository

Physics-guided and fabrication-aware inverse design of photonic devices using diffusion models

Authors: Dongjin Seo, Soobin Um, Sangbin Lee, Jong Chul Ye, Haejun Chung

Abstract: Designing free-form photonic devices is fundamentally challenging due to the vast number of possible geometries and the complex requirements of fabrication constraints. Traditional inverse-design approaches--whether driven by human intuition, global optimization, or adjoint-based gradient methods--often involve intricate binarization and filtering steps, while recent deep learning strategies deman… ▽ More Designing free-form photonic devices is fundamentally challenging due to the vast number of possible geometries and the complex requirements of fabrication constraints. Traditional inverse-design approaches--whether driven by human intuition, global optimization, or adjoint-based gradient methods--often involve intricate binarization and filtering steps, while recent deep learning strategies demand prohibitively large numbers of simulations (10^5 to 10^6). To overcome these limitations, we present AdjointDiffusion, a physics-guided framework that integrates adjoint sensitivity gradients into the sampling process of diffusion models. AdjointDiffusion begins by training a diffusion network on a synthetic, fabrication-aware dataset of binary masks. During inference, we compute the adjoint gradient of a candidate structure and inject this physics-based guidance at each denoising step, steering the generative process toward high figure-of-merit (FoM) solutions without additional post-processing. We demonstrate our method on two canonical photonic design problems--a bent waveguide and a CMOS image sensor color router--and show that our method consistently outperforms state-of-the-art nonlinear optimizers (such as MMA and SLSQP) in both efficiency and manufacturability, while using orders of magnitude fewer simulations (approximately 2 x 10^2) than pure deep learning approaches (approximately 10^5 to 10^6). By eliminating complex binarization schedules and minimizing simulation overhead, AdjointDiffusion offers a streamlined, simulation-efficient, and fabrication-aware pipeline for next-generation photonic device design. Our open-source implementation is available at https://github.com/dongjin-seo2020/AdjointDiffusion. △ Less

Submitted 23 April, 2025; originally announced April 2025.

Comments: 25 pages, 7 Figures

arXiv:2504.01861 [pdf, other]

Corner-Grasp: Multi-Action Grasp Detection and Active Gripper Adaptation for Grasping in Cluttered Environments

Authors: Yeong Gwang Son, Seunghwan Um, Juyong Hong, Tat Hieu Bui, Hyouk Ryeol Choi

Abstract: Robotic grasping is an essential capability, playing a critical role in enabling robots to physically interact with their surroundings. Despite extensive research, challenges remain due to the diverse shapes and properties of target objects, inaccuracies in sensing, and potential collisions with the environment. In this work, we propose a method for effectively grasping in cluttered bin-picking en… ▽ More Robotic grasping is an essential capability, playing a critical role in enabling robots to physically interact with their surroundings. Despite extensive research, challenges remain due to the diverse shapes and properties of target objects, inaccuracies in sensing, and potential collisions with the environment. In this work, we propose a method for effectively grasping in cluttered bin-picking environments where these challenges intersect. We utilize a multi-functional gripper that combines both suction and finger grasping to handle a wide range of objects. We also present an active gripper adaptation strategy to minimize collisions between the gripper hardware and the surrounding environment by actively leveraging the reciprocating suction cup and reconfigurable finger motion. To fully utilize the gripper's capabilities, we built a neural network that detects suction and finger grasp points from a single input RGB-D image. This network is trained using a larger-scale synthetic dataset generated from simulation. In addition to this, we propose an efficient approach to constructing a real-world dataset that facilitates grasp point detection on various objects with diverse characteristics. Experiment results show that the proposed method can grasp objects in cluttered bin-picking scenarios and prevent collisions with environmental constraints such as a corner of the bin. Our proposed method demonstrated its effectiveness in the 9th Robotic Grasping and Manipulation Competition (RGMC) held at ICRA 2024. △ Less

Submitted 2 April, 2025; originally announced April 2025.

Comments: 11 pages, 14 figures

arXiv:2502.06516 [pdf, other]

Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation

Authors: Soobin Um, Beomsu Kim, Jong Chul Ye

Abstract: Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple y… ▽ More Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly-trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations. △ Less

Submitted 10 February, 2025; originally announced February 2025.

Comments: 29 pages, 11 figures

arXiv:2501.02504 [pdf, other]

Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection

Authors: Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim

Abstract: The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant.… ▽ More The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR △ Less

Submitted 5 January, 2025; originally announced January 2025.

Comments: Accepted at AAAI 2025

arXiv:2410.07838 [pdf, other]

Minority-Focused Text-to-Image Generation via Prompt Optimization

Authors: Soobin Um, Jong Chul Ye

Abstract: We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretr… ▽ More We investigate the generation of minority samples using pretrained text-to-image (T2I) latent diffusion models. Minority instances, in the context of T2I generation, can be defined as ones living on low-density regions of text-conditional data distributions. They are valuable for various applications of modern T2I generators, such as data augmentation and creative AI. Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for high-quality generation. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. Specifically, we first develop an online prompt optimization framework that encourages emergence of desired properties during inference while preserving semantic contents of user-provided prompts. We subsequently tailor this generic prompt optimizer into a specialized solver that promotes generation of minority features by incorporating a carefully-crafted likelihood objective. Extensive experiments conducted across various types of T2I models demonstrate that our approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers. Code is available at https://github.com/soobin-um/MinorityPrompt. △ Less

Submitted 4 April, 2025; v1 submitted 10 October, 2024; originally announced October 2024.

Comments: CVPR 2025 (Oral), 21 pages, 10 figures

arXiv:2409.17285 [pdf, other]

SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with diffe… ▽ More This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with different levels of noise to be trained. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements for TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a suitable form for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present the baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb. △ Less

Submitted 15 April, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

Comments: IEEE OJSP. Official document lives at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10839331

arXiv:2409.08711 [pdf, other]

Text-To-Speech Synthesis In The Wild

Authors: Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe

Abstract: Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now, there are no common… ▽ More Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now, there are no common datasets. We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, in this case, applied to the VoxCeleb1 dataset commonly used for speaker recognition. We further propose two training sets. TITW-Hard is derived from the transcription, segmentation, and selection of VoxCeleb1 source data. TITW-Easy is derived from the additional application of enhancement and additional data selection based on DNSMOS. We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard. Both the dataset and protocols are publicly available and support the benchmarking of TTS systems trained using TITW data. △ Less

Submitted 13 September, 2024; originally announced September 2024.

Comments: 5 pages, submitted to ICASSP 2025 as a conference paper

arXiv:2407.11555 [pdf, other]

Self-Guided Generation of Minority Samples Using Diffusion Models

Authors: Soobin Um, Jong Chul Ye

Abstract: We present a novel approach for generating minority samples that live on low-density regions of a data manifold. Our framework is built upon diffusion models, leveraging the principle of guided sampling that incorporates an arbitrary energy-based guidance during inference time. The key defining feature of our sampler lies in its \emph{self-contained} nature, \ie, implementable solely with a pretra… ▽ More We present a novel approach for generating minority samples that live on low-density regions of a data manifold. Our framework is built upon diffusion models, leveraging the principle of guided sampling that incorporates an arbitrary energy-based guidance during inference time. The key defining feature of our sampler lies in its \emph{self-contained} nature, \ie, implementable solely with a pretrained model. This distinguishes our sampler from existing techniques that require expensive additional components (like external classifiers) for minority generation. Specifically, we first estimate the likelihood of features within an intermediate latent sample by evaluating a reconstruction loss w.r.t. its posterior mean. The generation then proceeds with the minimization of the estimated likelihood, thereby encouraging the emergence of minority features in the latent samples of subsequent timesteps. To further improve the performance of our sampler, we provide several time-scheduling techniques that properly manage the influence of guidance over inference steps. Experiments on benchmark real datasets demonstrate that our approach can greatly improve the capability of creating realistic low-likelihood minority instances over the existing techniques without the reliance on costly additional elements. Code is available at \url{https://github.com/soobin-um/sg-minority}. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2403.17420 [pdf, other]

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localizati… ▽ More The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024

arXiv:2308.06087 [pdf, other]

Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

Authors: Sung Jin Um, Dongjin Kim, Jung Uk Kim

Abstract: The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spa… ▽ More The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches only use audio as an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic human behavior of iterative focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose audio-visual pair matching loss and spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL △ Less

Submitted 17 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: Camera-Ready, ACM MM 2023

arXiv:2301.12334 [pdf, other]

Don't Play Favorites: Minority Guidance for Diffusion Models

Authors: Soobin Um, Suhyeon Lee, Jong Chul Ye

Abstract: We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (t… ▽ More We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating a sufficient number of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves ineffective and time-consuming for the minority generating task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first highlight that Tweedie's denoising formula yields favorable results for majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating high-quality minority samples over existing generative samplers. We showcase that the performance benefit of our framework persists even in demanding real-world scenarios such as medical imaging, further underscoring the practical significance of our work. Code is available at https://github.com/soobin-um/minority-guidance. △ Less

Submitted 26 February, 2024; v1 submitted 28 January, 2023; originally announced January 2023.

Comments: ICLR 2024

arXiv:2205.04104 [pdf, other]

ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence

Authors: Sangshin Oh, Seyun Um, Hong-Goo Kang

Abstract: The Gumbel-softmax distribution, or Concrete distribution, is often used to relax the discrete characteristics of a categorical distribution and enable back-propagation through differentiable reparameterization. Although it reliably yields low variance gradients, it still relies on a stochastic sampling process for optimization. In this work, we present a relaxed categorical analytic bound (ReCAB)… ▽ More The Gumbel-softmax distribution, or Concrete distribution, is often used to relax the discrete characteristics of a categorical distribution and enable back-propagation through differentiable reparameterization. Although it reliably yields low variance gradients, it still relies on a stochastic sampling process for optimization. In this work, we present a relaxed categorical analytic bound (ReCAB), a novel divergence-like metric which corresponds to the upper bound of the Kullback-Leibler divergence (KLD) of a relaxed categorical distribution. The proposed metric is easy to implement because it has a closed form solution, and empirical results show that it is close to the actual KLD. Along with this new metric, we propose a relaxed categorical analytic bound variational autoencoder (ReCAB-VAE) that successfully models both continuous and relaxed discrete latent representations. We implement an emotional text-to-speech synthesis system based on the proposed framework, and show that the proposed system flexibly and stably controls emotion expressions with better speech quality compared to baselines that use stochastic estimation or categorical distribution approximation. △ Less

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2204.02172 [pdf, other]

doi 10.21437/Interspeech.2023-1571

Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Authors: Hyungchan Yoon, Seyun Um, Changwhan Kim, Hong-Goo Kang

Abstract: To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations… ▽ More To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding \emph{prosody embeddings} to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms, and during inference, we estimate these embeddings from text using generative adversarial networks (GANs). Using GANs, we reliably estimate the prosody embeddings in a fast way, which have complex distributions due to the dynamic nature of speech. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. Our proposed model surpasses several publicly available models with less parameters and computational complexity in comparative experiments. △ Less

Submitted 28 August, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: INTERSPEECH 2023

MSC Class: 68T07 (Primary) 68T50; 68T99 (Secondary) ACM Class: I.2.7; I.2.6

arXiv:2204.01387 [pdf, other]

doi 10.21437/Interspeech.2022-10200

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Authors: Youngsik Eom, Yeonghyeon Lee, Ji Sub Um, Hoirin Kim

Abstract: Recent advances in sophisticated synthetic speech generated from text-to-speech (TTS) or voice conversion (VC) systems cause threats to the existing automatic speaker verification (ASV) systems. Since such synthetic speech is generated from diverse algorithms, generalization ability with using limited training data is indispensable for a robust anti-spoofing system. In this work, we propose a tran… ▽ More Recent advances in sophisticated synthetic speech generated from text-to-speech (TTS) or voice conversion (VC) systems cause threats to the existing automatic speaker verification (ASV) systems. Since such synthetic speech is generated from diverse algorithms, generalization ability with using limited training data is indispensable for a robust anti-spoofing system. In this work, we propose a transfer learning scheme based on the wav2vec 2.0 pretrained model with variational information bottleneck (VIB) for speech anti-spoofing task. Evaluation on the ASVspoof 2019 logical access (LA) database shows that our method improves the performance of distinguishing unseen spoofed and genuine speech, outperforming current state-of-the-art anti-spoofing systems. Furthermore, we show that the proposed system improves performance in low-resource and cross-dataset settings of anti-spoofing task significantly, demonstrating that our system is also robust in terms of data size and data distribution. △ Less

Submitted 14 December, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

Comments: Accepted to Interspeech 2022

arXiv:2202.03601 [pdf]

Two-Step Spike Encoding Scheme and Architecture for Highly Sparse Spiking-Neural-Network

Authors: Sangyeob Kim, Sangjin Kim, Soyeon Um, Soyeon Kim, Hoi-Jun Yoo

Abstract: This paper proposes a two-step spike encoding scheme, which consists of the source encoding and the process encoding for a high energy-efficient spiking-neural-network (SNN) acceleration. The eigen-train generation and its superposition generate spike trains which show high accuracy with low spike ratio. Sparsity boosting (SB) and spike generation skipping (SGS) reduce the amount of operations for… ▽ More This paper proposes a two-step spike encoding scheme, which consists of the source encoding and the process encoding for a high energy-efficient spiking-neural-network (SNN) acceleration. The eigen-train generation and its superposition generate spike trains which show high accuracy with low spike ratio. Sparsity boosting (SB) and spike generation skipping (SGS) reduce the amount of operations for SNN. Time shrinking multi-level encoding (TS-MLE) compresses the number of spikes in a train along time axis, and spike-level clock skipping (SLCS) decreases the processing time. Eigen-train generation achieves 90.3% accuracy, the same accuracy of CNN, under the condition of 4.18% spike ratio for CIFAR-10 classification. SB reduces spike ratio by 0.49x with only 0.1% accuracy loss, and the SGS reduces the spike ratio by 20.9% with 0.5% accuracy loss. TS-MLE and SLCS increases the throughput of SNN by 2.8x while decreasing the hardware resource for spike generator by 75% compared with previous generators. △ Less

Submitted 7 February, 2022; originally announced February 2022.

Comments: 5 pages, 10 figures

arXiv:2107.12003 [pdf, other]

Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Authors: Se-Yun Um, Jihyun Kim, Jihyun Lee, Hong-Goo Kang

Abstract: In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from li… ▽ More In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performances of linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the aturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA).The demo samples of the proposed and other models are available at https://sam-0927.github.io/ △ Less

Submitted 15 March, 2023; v1 submitted 26 July, 2021; originally announced July 2021.

Comments: 5 pages (including references), 1 figure

arXiv:1911.01635 [pdf, other]

Emotional speech synthesis with rich and granularized control

Authors: Se-Yun Um, Sangshin Oh, Kyungguen Byun, Inseon Jang, Chunghyun Ahn, Hong-Goo Kang

Abstract: This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. To flexibly control the distinct characteristic of a target emotion category, it is essential to determine embedding vectors representing the TTS input. We introduce an inter-to-intra emotional distance ratio algorithm to the embedding vectors that can minimize the distance to the target emotion… ▽ More This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. To flexibly control the distinct characteristic of a target emotion category, it is essential to determine embedding vectors representing the TTS input. We introduce an inter-to-intra emotional distance ratio algorithm to the embedding vectors that can minimize the distance to the target emotion category while maximizing its distance to the other emotion categories. To further enhance the expressiveness of a target speech, we also introduce an effective interpolation technique that enables the intensity of a target emotion to be gradually changed to that of neutral speech. Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm to the conventional methods. △ Less

Submitted 5 November, 2019; v1 submitted 5 November, 2019; originally announced November 2019.

Comments: Submitted to ICASSP 2020

arXiv:1704.05231 [pdf, other]

Fast 2-D Complex Gabor Filter with Kernel Decomposition

Authors: Suhyuk Um, Jaeyoon Kim, Dongbo Min

Abstract: 2-D complex Gabor filtering has found numerous applications in the fields of computer vision and image processing. Especially, in some applications, it is often needed to compute 2-D complex Gabor filter bank consisting of the 2-D complex Gabor filtering outputs at multiple orientations and frequencies. Although several approaches for fast 2-D complex Gabor filtering have been proposed, they prima… ▽ More 2-D complex Gabor filtering has found numerous applications in the fields of computer vision and image processing. Especially, in some applications, it is often needed to compute 2-D complex Gabor filter bank consisting of the 2-D complex Gabor filtering outputs at multiple orientations and frequencies. Although several approaches for fast 2-D complex Gabor filtering have been proposed, they primarily focus on reducing the runtime of performing the 2-D complex Gabor filtering once at specific orientation and frequency. To obtain the 2-D complex Gabor filter bank output, existing methods are repeatedly applied with respect to multiple orientations and frequencies. In this paper, we propose a novel approach that efficiently computes the 2-D complex Gabor filter bank by reducing the computational redundancy that arises when performing the Gabor filtering at multiple orientations and frequencies. The proposed method first decomposes the Gabor basis kernels to allow a fast convolution with the Gaussian kernel in a separable manner. This enables reducing the runtime of the 2-D complex Gabor filter bank by reusing intermediate results of the 2-D complex Gabor filtering computed at a specific orientation. Furthermore, we extend this idea into 2-D localized sliding discrete Fourier transform (SDFT) using the Gaussian kernel in the DFT computation, which lends a spatial localization ability as in the 2-D complex Gabor filter. Experimental results demonstrate that our method runs faster than state-of-the-arts methods for fast 2-D complex Gabor filtering, while maintaining similar filtering quality. △ Less

Submitted 18 April, 2017; originally announced April 2017.

Showing 1–18 of 18 results for author: Um, S