
Showing 1–50 of 305 results for author: Lee, K

Searching in archive eess.
  1. arXiv:2507.08422  [pdf, ps, other]

    cs.CV eess.IV

    Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers

    Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

    Abstract: Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension such as reusing cached features across diffusion timesteps. Here, we propose Regio…

    Submitted 11 July, 2025; originally announced July 2025.

  2. arXiv:2507.03468  [pdf, ps, other]

    cs.SD eess.AS

    Robust Localization of Partially Fake Speech: Metrics, Models, and Out-of-Domain Evaluation

    Authors: Hieu-Thi Luong, Inbal Rimons, Haim Permuter, Kong Aik Lee, Eng Siong Chng

    Abstract: Partial audio deepfake localization poses unique challenges and remains underexplored compared to full-utterance spoofing detection. While recent methods report strong in-domain performance, their real-world utility remains unclear. In this analysis, we critically examine the limitations of current evaluation practices, particularly the widespread use of Equal Error Rate (EER), which often obscures…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: Submitted to APSIPA 2025

  3. Enhancing Multi-Exposure High Dynamic Range Imaging with Overlapped Codebook for Improved Representation Learning

    Authors: Keuntek Lee, Jaehyun Park, Nam Ik Cho

    Abstract: High dynamic range (HDR) imaging techniques aim to create realistic HDR images from low dynamic range (LDR) inputs. Specifically, multi-exposure HDR imaging uses multiple LDR frames taken from the same scene to improve reconstruction performance. However, there are often discrepancies in motion among the frames, and different exposure settings for each capture can lead to saturated regions. In thi…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted to International Conference on Pattern Recognition. Springer, Cham, 2025 (ICPR 2024)

  4. arXiv:2507.01587  [pdf, ps, other]

    cs.CV eess.IV

    Towards Controllable Real Image Denoising with Camera Parameters

    Authors: Youngjin Oh, Junhyeong Kwon, Keuntek Lee, Nam Ik Cho

    Abstract: Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted for publication in ICIP 2025, IEEE International Conference on Image Processing

  5. arXiv:2506.20152  [pdf, ps, other]

    cs.CV cs.AI eess.IV

    Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration

    Authors: Deepak Ghimire, Kilho Lee, Seong-heum Kim

    Abstract: Structured pruning is a well-established technique for compressing neural networks, making them suitable for deployment on resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages:…

    Submitted 25 June, 2025; originally announced June 2025.

    Journal ref: Image Vision Comput. 136 (2023) 104745

  6. arXiv:2506.19446  [pdf, ps, other]

    cs.SD eess.AS

    Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation

    Authors: Jaejun Lee, Kyogu Lee

    Abstract: In this paper, we propose Vo-Ve, a novel voice-vector embedding that captures speaker identity. Unlike conventional speaker embeddings, Vo-Ve is explainable, as it contains the probabilities of explicit voice attribute classes. Through extensive analysis, we demonstrate that Vo-Ve not only evaluates speaker similarity competitively with conventional techniques but also provides an interpretable ex…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Interspeech 2025

  7. arXiv:2506.16538  [pdf, ps, other]

    cs.SD eess.AS

    Towards Bitrate-Efficient and Noise-Robust Speech Coding with Variable Bitrate RVQ

    Authors: Yunkee Chae, Kyogu Lee

    Abstract: Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  8. arXiv:2506.07536  [pdf, ps, other]

    eess.AS

    Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing

    Authors: Jin Li, Man-Wai Mak, Johan Rohdin, Kong Aik Lee, Hynek Hermansky

    Abstract: The performance of automatic speaker verification (ASV) and anti-spoofing drops seriously under real-world domain mismatch conditions. The relaxed instance frequency-wise normalization (RFN), which normalizes the frequency components based on the feature statistics along the time and channel axes, is a promising approach to reducing the domain dependence in the feature maps of a speaker embedding…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  9. arXiv:2506.06090  [pdf, ps, other]

    eess.SP

    Distribution-Level AirComp for Wireless Federated Learning under Data Scarcity and Heterogeneity

    Authors: Jun-Pyo Hong, Hyowoon Seo, Kisong Lee

    Abstract: Conventional federated learning (FL) methods face critical challenges in realistic wireless edge networks, where training data is both limited and heterogeneous, often leading to unstable training and poor generalization. To address these challenges in a principled manner, we propose a novel wireless FL framework grounded in Bayesian inference. By virtue of the Bayesian approach, our framework captures model uncer…

    Submitted 6 June, 2025; originally announced June 2025.

  10. arXiv:2506.01460  [pdf, ps, other]

    cs.SD eess.AS

    Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement

    Authors: Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee

    Abstract: Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and requir…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  11. arXiv:2506.00832  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

    Authors: Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi

    Abstract: Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-p…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted at Interspeech 2025

  12. arXiv:2505.23305  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction

    Authors: Yunkee Chae, Kyogu Lee

    Abstract: We present MGE-LDM, a unified latent diffusion framework for simultaneous music generation, source imputation, and query-driven source separation. Unlike prior approaches constrained to fixed instrument classes, MGE-LDM learns a joint distribution over full mixtures, submixtures, and individual stems within a single compact latent diffusion model. At inference, MGE-LDM enables (1) complete mixture…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 27 pages, 4 figures

  13. arXiv:2505.09661  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Introducing voice timbre attribute detection

    Authors: Jinghao He, Zhengyan Sheng, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

    Abstract: This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their intensity is compared in a designated timbre descriptor. Moreover, a framework is proposed, which is…

    Submitted 22 June, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2505.09382

  14. arXiv:2505.09382  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    The Voice Timbre Attribute Detection 2025 Challenge Evaluation Plan

    Authors: Zhengyan Sheng, Jinghao He, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

    Abstract: Voice timbre refers to the unique quality or character of a person's voice that distinguishes it from others as perceived by human hearing. The Voice Timbre Attribute Detection (VtaD) 2025 challenge focuses on explaining the voice timbre attribute in a comparative manner. In this challenge, the human impression of voice timbre is verbalized with a set of sensory descriptors, including bright, coar…

    Submitted 22 June, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

  15. arXiv:2505.00210  [pdf, other]

    cs.LG cs.CE eess.SY

    Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review

    Authors: Suk Ki Lee, Hyunwoong Ko

    Abstract: Dynamic manufacturing processes exhibit complex characteristics defined by time-varying parameters, nonlinear behaviors, and uncertainties. These characteristics require sophisticated in-situ monitoring techniques utilizing multimodal sensor data and adaptive control systems that can respond to real-time feedback while maintaining product quality. Recently, generative machine learning (ML) has eme…

    Submitted 30 April, 2025; originally announced May 2025.

    Comments: 12 pages, 1 figure, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2025

  16. arXiv:2504.18157  [pdf, other]

    eess.AS cs.SD

    DOSE : Drum One-Shot Extraction from Music Mixture

    Authors: Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee

    Abstract: Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with correspo…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Published in IEEE ICASSP 2025

  17. arXiv:2504.07053  [pdf, other]

    cs.CL cs.SD eess.AS

    TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

    Authors: Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee

    Abstract: Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly a…

    Submitted 22 May, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

    Comments: Preprint

  18. arXiv:2504.05657  [pdf, other]

    eess.AS cs.AI cs.SD

    Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

    Authors: Tianchi Liu, Duc-Tuan Truong, Rohan Kumar Das, Kong Aik Lee, Haizhou Li

    Abstract: Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhe…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: This manuscript has been submitted for peer review

  19. arXiv:2503.15498  [pdf, other]

    cs.HC cs.AI cs.MA cs.MM cs.SD eess.AS

    Revival: Collaborative Artistic Creation through Human-AI Interactions in Musical Creativity

    Authors: Keon Ju M. Lee, Philippe Pasquier, Jun Yuri

    Abstract: Revival is an innovative live audiovisual performance and music improvisation by our artist collective K-Phi-A, blending human and AI musicianship to create electronic music with audio-reactive visuals. The performance features real-time co-creative improvisation between a percussionist, an electronic music artist, and AI musical agents. Trained on works by deceased composers and the collective's…

    Submitted 19 January, 2025; originally announced March 2025.

    Comments: Keon Ju M. Lee, Philippe Pasquier and Jun Yuri. 2024. In Proceedings of the Creativity and Generative AI NIPS (Neural Information Processing Systems) Workshop

  20. arXiv:2503.07940  [pdf, other]

    cs.CV cs.RO eess.IV

    BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

    Authors: Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park

    Abstract: Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, an…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 20 pages, 14 figures

  21. arXiv:2503.04929  [pdf, other]

    cs.RO cs.LG eess.SY

    Neural Configuration-Space Barriers for Manipulation Planning and Control

    Authors: Kehan Long, Ki Myung Brian Lee, Nikola Raicevic, Niyas Attasseri, Melvin Leok, Nikolay Atanasov

    Abstract: Planning and control for high-dimensional robot manipulators in cluttered, dynamic environments require both computational efficiency and robust safety guarantees. Inspired by recent advances in learning configuration-space distance functions (CDFs) as robot body representations, we propose a unified framework for motion planning and control that formulates safety constraints as CDF barriers. A CD…

    Submitted 6 May, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

  22. arXiv:2502.17726  [pdf, other]

    cs.SD cs.AI cs.DL cs.IR eess.AS

    The GigaMIDI Dataset with Features for Expressive Music Performance Detection

    Authors: Keon Ju Maverick Lee, Jeff Ens, Sara Adkins, Pedro Sarmento, Mathieu Barthet, Philippe Pasquier

    Abstract: The Musical Instrument Digital Interface (MIDI), introduced in 1983, revolutionized music production by allowing computers and instruments to communicate efficiently. MIDI files encode musical instructions compactly, facilitating convenient music sharing. They benefit Music Information Retrieval (MIR), aiding in research on music understanding, computational musicology, and generative music. The G…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: Published at Transactions of the International Society for Music Information Retrieval (TISMIR), 8(1), 1-19

  23. Conditional Generative Adversarial Networks for Channel Estimation in RIS-Assisted ISAC Systems

    Authors: Alice Faisal, Ibrahim Al-Nahhal, Kyesan Lee, Octavia A. Dobre, Hyundong Shin

    Abstract: Integrated sensing and communication (ISAC) technology has been explored as a potential advancement for future wireless networks, striving to effectively use spectral resources for both communication and sensing. The integration of reconfigurable intelligent surfaces (RIS) with ISAC further enhances this capability by optimizing the propagation environment, thereby improving both the sensing accur…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: Accepted for publication in IEEE Transactions on Communications

  24. arXiv:2502.08857  [pdf, other]

    eess.AS

    ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech

    Authors: Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi, Myeonghun Jeong, Ge Zhu, Yongyi Zang, You Zhang, Soumi Maiti, Florian Lux, Nicolas Müller, Wangyou Zhang, Chengzhe Sun, Shuwei Hou, Siwei Lyu, Sébastien Le Maguer, et al. (4 additional authors not shown)

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges which promote the study of speech spoofing and deepfake attacks as well as the design of detection solutions. We introduce the ASVspoof 5 database which is generated in a crowdsourced fashion from data collected in diverse acoustic conditions (cf. studio-quality data for earlier ASVspoof databases) and from ~2,000 speakers (cf. ~100 earlier…

    Submitted 24 April, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

    Comments: Database link: https://zenodo.org/records/14498691, Database mirror link: https://huggingface.co/datasets/jungjee/asvspoof5, ASVspoof 5 Challenge Workshop Proceeding: https://www.isca-archive.org/asvspoof_2024/index.html

  25. arXiv:2502.08035  [pdf, other]

    eess.SP math.NA

    Global Convergence of ESPRIT with Preconditioned First-Order Methods for Spike Deconvolution

    Authors: Joseph Gabet, Meghna Kalra, Maxime Ferreira Da Costa, Kiryung Lee

    Abstract: Spike deconvolution is the problem of recovering point sources from their convolution with a known point spread function, playing a fundamental role in many sensing and imaging applications. This paper proposes a novel approach combining ESPRIT with Preconditioned Gradient Descent (PGD) to estimate the amplitudes and locations of the point sources via non-linear least squares. The preconditioning…

    Submitted 11 February, 2025; originally announced February 2025.

  26. arXiv:2502.00023  [pdf, other]

    cs.MA cs.AI cs.HC cs.SD eess.AS

    Musical Agent Systems: MACAT and MACataRT

    Authors: Keon Ju M. Lee, Philippe Pasquier

    Abstract: Our research explores the development and application of musical agents, human-in-the-loop generative AI systems designed to support music performance and improvisation within co-creative spaces. We introduce MACAT and MACataRT, two distinct musical agent systems crafted to enhance interactive music-making between human musicians and AI. MACAT is optimized for agent-led performance, employing real…

    Submitted 19 January, 2025; originally announced February 2025.

    Comments: In Proceedings of the Creativity and Generative AI NIPS (Neural Information Processing Systems) Workshop 2024

  27. arXiv:2501.02453  [pdf, other]

    cs.IT eess.SP

    Blockage-Aware UAV-Assisted Wireless Data Harvesting With Building Avoidance

    Authors: Gitae Park, Kanghyun Heo, Kisong Lee

    Abstract: Unmanned aerial vehicles (UAVs) offer dynamic trajectory control, enabling them to avoid obstacles and establish line-of-sight (LoS) wireless channels with ground nodes (GNs), unlike traditional ground-fixed base stations. This study addresses the joint optimization of scheduling and three-dimensional (3D) trajectory planning for UAV-assisted wireless data harvesting. The objective is to maximize…

    Submitted 5 January, 2025; originally announced January 2025.

  28. arXiv:2412.15191  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

    Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

    Abstract: We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self attentio…

    Submitted 10 March, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Project Page: snap-research.github.io/AVLink/

  29. arXiv:2412.09195  [pdf, other]

    cs.SD cs.LG eess.AS

    On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection

    Authors: Chenyang Guo, Liping Chen, Zhuhai Li, Kong Aik Lee, Zhen-Hua Ling, Wu Guo

    Abstract: Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbations of the input data. Recent developments in voice-privacy protection have shown positive use cases of the same technique, concealing a speaker's voice attributes with an additive perturbation signal generated by an adversarial network. This paper examines the reversibility property where an enti…

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: 6 pages, 3 figures, published to IEEE SLT Workshop 2024

    Journal ref: 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1197-1202

  30. arXiv:2412.08247  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time Scenarios with Impaired Visual Cues

    Authors: Junjie Li, Ke Zhang, Shuai Wang, Kong Aik Lee, Man-Wai Mak, Haizhou Li

    Abstract: Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of a specific target speaker from an audio mixture using time-synchronized visual cues. In real-world scenarios, visual cues are not always available due to various impairments, which undermines the stability of AV-TSE. Despite this challenge, humans can maintain attentional momentum over time, even when the target speaker…

    Submitted 31 March, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

  31. arXiv:2412.04504  [pdf, other]

    cs.CL cs.DC cs.LG eess.SY

    Multi-Bin Batching for Increasing LLM Inference Throughput

    Authors: Ozgur Guldogan, Jackson Kunde, Kangwook Lee, Ramtin Pedarsani

    Abstract: As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a critical step in scheduling the inference jobs on servers (e.g. GPUs), enabling the system to maximize throughput by allowing multiple requests to be processed in parallel. However, requests often have va…

    Submitted 2 December, 2024; originally announced December 2024.

  32. arXiv:2412.00150  [pdf, other]

    cs.CV eess.IV

    Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

    Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee

    Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that…

    Submitted 29 November, 2024; originally announced December 2024.

    Comments: Accepted at NeurIPS 2024

  33. arXiv:2411.11692  [pdf, other]

    cs.SD cs.IR eess.AS

    Do Captioning Metrics Reflect Music Semantic Alignment?

    Authors: Jinwoo Lee, Kyogu Lee

    Abstract: Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, the evaluation of music captioning relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE, which were developed for other domains, without proper justification for their use in this new field. We present cases where traditional metrics are vulnerable to syntactic ch…

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD)

  34. arXiv:2411.01575  [pdf, other]

    eess.IV cs.CV

    HC$^3$L-Diff: Hybrid conditional latent diffusion with high frequency enhancement for CBCT-to-CT synthesis

    Authors: Shi Yin, Hongqi Tan, Li Ming Chong, Haofeng Liu, Hui Liu, Kang Hao Lee, Jeffrey Kit Loong Tuan, Dean Ho, Yueming Jin

    Abstract: Background: Cone-beam computed tomography (CBCT) plays a crucial role in image-guided radiotherapy, but artifacts and noise make CBCT images unsuitable for accurate dose calculation. Artificial intelligence methods have shown promise in enhancing CBCT quality to produce synthetic CT (sCT) images. However, existing methods either produce images of suboptimal quality or incur excessive time costs, failing…

    Submitted 3 November, 2024; originally announced November 2024.

    Comments: 13 pages, 5 figures

  35. arXiv:2411.00274  [pdf, other]

    cs.CV cs.LG eess.IV

    Adaptive Residual Transformation for Enhanced Feature-Based OOD Detection in SAR Imagery

    Authors: Kyung-hwan Lee, Kyung-tae Kim

    Abstract: Recent advances in deep learning architectures have enabled efficient and accurate classification of pre-trained targets in Synthetic Aperture Radar (SAR) images. Nevertheless, the presence of unknown targets in real battlefield scenarios is unavoidable, resulting in misclassification and reducing the accuracy of the classifier. Over the past decades, various feature-based out-of-distribution (OOD…

    Submitted 31 October, 2024; originally announced November 2024.

  36. arXiv:2410.09236  [pdf]

    eess.AS cs.SD

    Enhancing Infant Crying Detection with Gradient Boosting for Improved Emotional and Mental Health Diagnostics

    Authors: Kyunghun Lee, Lauren M. Henry, Eleanor Hansen, Elizabeth Tandilashvili, Lauren S. Wakschlag, Elizabeth Norton, Daniel S. Pine, Melissa A. Brotman, Francisco Pereira

    Abstract: Infant crying can serve as a crucial indicator of various physiological and emotional states. This paper introduces a comprehensive approach to detecting infant cries within audio data. We integrate Wav2Vec with traditional audio features and employ Gradient Boosting Machines for cry classification. We validate our approach on a real-world dataset, demonstrating significant performance improvements o…

    Submitted 10 January, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

  37. arXiv:2410.06016  [pdf, other]

    cs.SD cs.LG eess.AS

    Variable Bitrate Residual Vector Quantization for Audio Coding

    Authors: Yunkee Chae, Woosung Choi, Yuhta Takida, Junghyun Koo, Yukara Ikemiya, Zhi Zhong, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Kyogu Lee, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for…

    Submitted 27 April, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ICASSP 2025 camera ready version

  38. NTU-NPU System for Voice Privacy 2024 Challenge

    Authors: Nikita Kuzmin, Hieu-Thi Luong, Jixun Yao, Lei Xie, Kong Aik Lee, Eng Siong Chng

    Abstract: In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: System description for VPC 2024

    Journal ref: 2024 Challenge. Proc. 4th Symposium on Security and Privacy in Speech Communication, 72-79

  39. arXiv:2409.14743  [pdf, other]

    eess.AS cs.SD

    LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation

    Authors: Hieu-Thi Luong, Haoyang Li, Lin Zhang, Kong Aik Lee, Eng Siong Chng

    Abstract: Previous fake speech datasets were constructed from a defender's perspective to develop countermeasure (CM) systems without considering diverse motivations of attackers. To better align with real-life scenarios, we created LlamaPartialSpoof, a 130-hour dataset that contains both fully and partially fake speech, using a large language model (LLM) and voice cloning technologies to evaluate the robus…

    Submitted 5 January, 2025; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: 5 pages, ICASSP 2025

  40. arXiv:2409.14712  [pdf, other]

    eess.AS cs.SD

    Room Impulse Responses help attackers to evade Deep Fake Detection

    Authors: Hieu-Thi Luong, Duc-Tuan Truong, Kong Aik Lee, Eng Siong Chng

    Abstract: The ASVspoof 2021 benchmark, a widely-used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 7 pages, to be presented at SLT 2024

  41. arXiv:2409.12521  [pdf, other]

    cs.RO eess.SY

    GraspSAM: When Segment Anything Model Meets Grasp Detection

    Authors: Sangjun Noh, Jongwon Kim, Dongwoo Nam, Seunghyeok Back, Raeyoung Kang, Kyoobin Lee

    Abstract: Grasp detection requires flexibility to handle objects of various shapes without relying on prior knowledge of the object, while also offering intuitive, user-guided control. This paper introduces GraspSAM, an innovative extension of the Segment Anything Model (SAM), designed for prompt-driven and category-agnostic grasp detection. Unlike previous methods, which are often limited by small-scale tr…

    Submitted 23 September, 2024; v1 submitted 19 September, 2024; originally announced September 2024.

    Comments: 6 pages (main), 1 page (references)

  42. arXiv:2409.09589  [pdf, other]

    cs.SD eess.AS

    On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

    Authors: Junjie Li, Ke Zhang, Shuai Wang, Haizhou Li, Man-Wai Mak, Kong Aik Lee

    Abstract: Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted technique. Unlike typical data augmentation applied to speech mixtures, this work thoroughly investigates the effectiveness of augmenting the enro…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted by SLT2024

  43. arXiv:2409.08346  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

    Authors: Tianchi Liu, Ivan Kukanov, Zihan Pan, Qiongqiong Wang, Hardik B. Sailor, Kong Aik Lee

    Abstract: The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on Eng…

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024

  44. arXiv:2409.04173  [pdf, other]

    eess.AS

    NPU-NTU System for Voice Privacy 2024 Challenge

    Authors: Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper,…

    Submitted 4 February, 2025; v1 submitted 6 September, 2024; originally announced September 2024.

    Comments: System description for VPC 2024

  45. arXiv:2408.09802  [pdf, other]

    cs.SD cs.CV eess.AS

    Hear Your Face: Face-based voice conversion with F0 estimation

    Authors: Jaejun Lee, Yoori Oh, Injune Hwang, Kyogu Lee

    Abstract: This paper delves into the emerging field of face-based voice conversion, leveraging the unique relationship between an individual's facial features and their vocal characteristics. We present a novel face-based voice conversion framework that particularly utilizes the average fundamental frequency of the target speaker, derived solely from their facial images. Through extensive analysis, our fram…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Interspeech 2024

  46. arXiv:2408.09300  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model

    Authors: Massimiliano Todisco, Michele Panariello, Xin Wang, Héctor Delgado, Kong Aik Lee, Nicholas Evans

    Abstract: We present Malacopula, a neural-based generalised Hammerstein model designed to introduce adversarial perturbations to spoofed speech utterances so that they better deceive automatic speaker verification (ASV) systems. Using non-linear processes to modify speech utterances, Malacopula enhances the effectiveness of spoofing attacks. The model comprises parallel branches of polynomial functions foll…

    Submitted 17 August, 2024; originally announced August 2024.

    Comments: Accepted at ASVspoof Workshop 2024

  47. arXiv:2408.08739  [pdf, other]

    eess.AS cs.AI cs.SD

    ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

    Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogat…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  48. arXiv:2408.08616  [pdf, other]

    eess.IV cs.CV

    Reference-free Axial Super-resolution of 3D Microscopy Images using Implicit Neural Representation with a 2D Diffusion Prior

    Authors: Kyungryun Lee, Won-Ki Jeong

    Abstract: Analysis and visualization of 3D microscopy images pose challenges due to anisotropic axial resolution, demanding volumetric super-resolution along the axial direction. While training a learning-based 3D super-resolution model seems to be a straightforward solution, it requires ground truth isotropic volumes and suffers from the curse of dimensionality. Therefore, existing methods utilize 2D neura…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: Accepted at MICCAI 2024

  49. arXiv:2408.03204  [pdf, other]

    cs.SD eess.AS

    GRAFX: An Open-Source Library for Audio Processing Graphs in PyTorch

    Authors: Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji

    Abstract: We present GRAFX, an open-source library designed for handling audio processing graphs in PyTorch. Along with various library functionalities, we describe technical details on the efficient parallel computation of input graphs, signals, and processor parameters in GPU. Then, we show its example use under a music mixing scenario, where parameters of every differentiable processor in a large graph a…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: Accepted to DAFx 2024 demo

  50. arXiv:2407.19900  [pdf, other]

    cs.SD cs.AI eess.AS

    Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

    Authors: Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

    Abstract: Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: 9 pages, 6 figures, 4 tables