Showing 1–50 of 381 results for author: Lee, S

Searching in archive eess.
  1. arXiv:2507.08254

    eess.IV cs.CV cs.LG

    Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models

    Authors: Ulzee An, Moonseong Jeong, Simon A. Lee, Aditya Gorla, Yuzhe Yang, Sriram Sankararaman

    Abstract: Current challenges in developing foundational models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of training state-of-the-art architectures in high dimensions and curating sufficiently large datasets of volumes. To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a train-free method for generating sema…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: 21 pages, 10 figures, accepted to ICML 2025. The first two authors contributed equally

    Journal ref: In Proc. 42nd International Conference on Machine Learning (ICML 2025 Spotlight)

  2. arXiv:2507.08128

    cs.SD cs.AI cs.CL eess.AS

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    Authors: Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro

    Abstract: We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the mode…

    Submitted 10 July, 2025; originally announced July 2025.

    Comments: Code, Datasets and Models: https://research.nvidia.com/labs/adlr/AF3/

  3. arXiv:2507.04660

    eess.IV cs.CV

    CP-Dilatation: A Copy-and-Paste Augmentation Method for Preserving the Boundary Context Information of Histopathology Images

    Authors: Sungrae Hong, Sol Lee, Mun Yong Yi

    Abstract: Medical AI diagnosis including histopathology segmentation has derived benefits from the recent development of deep learning technology. However, deep learning itself requires a large amount of training data and the medical image segmentation masking, in particular, requires an extremely high cost due to the shortage of medical specialists. To mitigate this issue, we propose a new data augmentatio…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: 5 pages, 5 figures

  4. arXiv:2506.21765

    eess.IV cs.CV

    TUS-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

    Authors: Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson, et al. (2 additional authors not shown)

    Abstract: Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequence…

    Submitted 26 June, 2025; originally announced June 2025.

  5. arXiv:2506.19451

    eess.SP cs.LG

    Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search

    Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

    Abstract: Tokens are fundamental processing units of generative AI (GenAI) and large language models (LLMs), and token communication (TC) is essential for enabling remote AI-generated content (AIGC) and wireless LLM applications. Unlike traditional bits, each of which is independently treated, the semantics of each token depends on its surrounding context tokens. This inter-token dependency makes TC vulnerab…

    Submitted 24 June, 2025; originally announced June 2025.

  6. arXiv:2506.11142

    cs.CV cs.LG eess.IV

    FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation

    Authors: Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee

    Abstract: Semi-supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo-labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding favouring dominant classes. To address these limitations, we introduce a holi…

    Submitted 23 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Submitted to Neural Networks

  7. Toward High Accuracy DME for Alternative Aircraft Positioning: SFOL Pulse Transmission in High-Power DME

    Authors: Jongmin Park, Seowoo Park, Sunghwa Lee, Jiwon Seo, Euiho Kim

    Abstract: The Stretched-FrOnt-Leg (SFOL) pulse is an advanced distance measuring equipment (DME) pulse that offers superior ranging accuracy compared to conventional Gaussian pulses. Successful SFOL pulse transmission has been recently demonstrated from a commercial Gaussian pulse-based DME in low-power mode utilizing digital predistortion (DPD) techniques for power amplifiers. These adjustments were achiev…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Submitted to IEEE TAES

  8. arXiv:2506.06537

    cs.CV cs.SD eess.AS

    Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models

    Authors: Seung-jae Lee, Paul Hongsuck Seo

    Abstract: Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific traini…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  9. arXiv:2506.06311

    eess.SP cs.LG

    A Novel Shape-Aware Topological Representation for GPR Data with DNN Integration

    Authors: Meiyan Kang, Shizuo Kaji, Sang-Yun Lee, Taegon Kim, Hee-Hwan Ryu, Suyoung Choi

    Abstract: Ground Penetrating Radar (GPR) is a widely used Non-Destructive Testing (NDT) technique for subsurface exploration, particularly in infrastructure inspection and maintenance. However, conventional interpretation methods are often limited by noise sensitivity and a lack of structural awareness. This study presents a novel framework that enhances the detection of underground utilities, especially pi…

    Submitted 10 July, 2025; v1 submitted 26 May, 2025; originally announced June 2025.

    Comments: 15 pages, 6 figures

  10. arXiv:2506.01460

    cs.SD eess.AS

    Few-step Adversarial Schrödinger Bridge for Generative Speech Enhancement

    Authors: Seungu Han, Sungho Lee, Juheon Lee, Kyogu Lee

    Abstract: Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and requir…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  11. arXiv:2505.24336

    eess.AS cs.AI cs.LG cs.SD eess.SP

    When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds

    Authors: Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho

    Abstract: Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog-sounds and 16 or 22.05kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion-roars, birdsongs) and designed voice (synthetic growls). To accommodate generation of diverse non-speech sounds and 44.…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: INTERSPEECH 2025 accepted

  12. arXiv:2505.20868

    cs.SD cs.AI eess.AS

    Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech

    Authors: Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

    Abstract: Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions hi…

    Submitted 29 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

    Comments: Proceedings of Interspeech 2025

  13. arXiv:2505.20794

    cs.SD cs.AI eess.AS

    VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion

    Authors: Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee

    Abstract: Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibE-SVC, a controllable singing voice…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: Proceedings of Interspeech 2025

  14. arXiv:2505.19693

    cs.SD cs.AI eess.AS

    EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

    Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

    Abstract: Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates t…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Proceedings of Interspeech 2025

  15. arXiv:2505.19687

    cs.SD cs.AI eess.AS

    DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

    Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

    Abstract: Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distill…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Proceedings of Interspeech 2025

  16. arXiv:2505.19384

    cs.CL cs.SD eess.AS

    GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

    Authors: Seokgi Lee, Jungjun Kim

    Abstract: We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and…

    Submitted 25 May, 2025; originally announced May 2025.

    Comments: 7 pages, 3 figures

  17. arXiv:2505.18614

    cs.CL cs.LG cs.MM cs.SD eess.AS

    MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

    Authors: Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

    Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation.…

    Submitted 5 June, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

    Comments: 28 pages, 8 figures, our codes and datasets are available at https://github.com/k1064190/MAVL

  18. arXiv:2505.18162

    eess.SP cs.LG

    Accelerating Battery Material Optimization through iterative Machine Learning

    Authors: Seon-Hwa Lee, Insoo Ye, Changhwan Lee, Jieun Kim, Geunho Choi, Sang-Cheol Nam, Inchul Park

    Abstract: The performance of battery materials is determined by their composition and the processing conditions employed during commercial-scale fabrication, where raw materials undergo complex processing steps with various additives to yield final products. As the complexity of these parameters expands with the development of industry, conventional one-factor-at-a-time (OFAT) experiment becomes old fashion…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 25 pages, 5 figures

  19. arXiv:2505.15914

    cs.SD eess.AS

    A Novel Deep Learning Framework for Efficient Multichannel Acoustic Feedback Control

    Authors: Yuan-Kuei Wu, Juan Azcarreta, Kashyap Patel, Buye Xu, Jung-Suk Lee, Sanha Lee, Ashutosh Pandey

    Abstract: This study presents a deep-learning framework for controlling multichannel acoustic feedback in audio devices. Traditional digital signal processing methods struggle with convergence when dealing with highly correlated noise such as feedback. We introduce a Convolutional Recurrent Network that efficiently combines spatial and temporal processing, significantly enhancing speech enhancement capabili…

    Submitted 29 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech 2025

  20. arXiv:2505.12991

    cs.SD eess.AS

    Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition

    Authors: Dominik Wagner, Ilja Baumann, Natalie Engert, Seanie Lee, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet

    Abstract: In this work, we present our submission to the Speech Accessibility Project challenge for dysarthric speech recognition. We integrate parameter-efficient fine-tuning with latent audio representations to improve an encoder-decoder ASR system. Synthetic training data is generated by fine-tuning Parler-TTS to mimic dysarthric speech, using LLM-generated prompts for corpus-consistent target transcript…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  21. arXiv:2505.12863

    cs.SD cs.AI cs.CV eess.AS

    Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

    Authors: Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, Dasaem Jeong

    Abstract: Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual tran…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLPRO)

  22. arXiv:2505.12089

    eess.IV cs.AI cs.CV

    NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets, Methods, and Results

    Authors: Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyung-Ju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, Qi Wu, Tianheng Qiu, Yuchun Dong, Shenglin Ding, Guanghua Pan, Weiyu Zhou, Tao Hu, Yixu Feng, Duwei Dai, Yu Cao, Peng Wu, Wei Dong, Yanning Zhang, Qingsen Yan, Simon J. Larsen, et al. (11 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effect…

    Submitted 17 May, 2025; originally announced May 2025.

  23. arXiv:2505.09978

    cs.IT eess.SP

    Low-Complexity Decoding for Low-Rate Block Codes of Short Length Based on Concatenated Coding Structure

    Authors: Mao-Chao Lin, Shih-Kai Lee, Pin Lin, Ching-Chang Lin, Chia-Chun Chen, Teng-Yuan Syu, Huang-Chang Lee

    Abstract: To decode a short linear block code, ordered statistics decoding (OSD) and/or the $A^*$ decoding are usually considered. Either OSD or the $A^*$ decoding utilizes the magnitudes of the received symbols to establish the most reliable and independent positions (MRIP) frame. A restricted searched space can be employed to achieve near-optimum decoding with reduced decoding complexity. For a low-rate code…

    Submitted 15 May, 2025; originally announced May 2025.

  24. arXiv:2505.00210

    cs.LG cs.CE eess.SY

    Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review

    Authors: Suk Ki Lee, Hyunwoong Ko

    Abstract: Dynamic manufacturing processes exhibit complex characteristics defined by time-varying parameters, nonlinear behaviors, and uncertainties. These characteristics require sophisticated in-situ monitoring techniques utilizing multimodal sensor data and adaptive control systems that can respond to real-time feedback while maintaining product quality. Recently, generative machine learning (ML) has eme…

    Submitted 30 April, 2025; originally announced May 2025.

    Comments: 12 pages, 1 figure, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2025

  25. arXiv:2504.19591

    eess.SP

    Semantic Packet Aggregation for Token Communication via Genetic Beam Search

    Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

    Abstract: Token communication (TC) is poised to play a pivotal role in emerging language-driven applications such as AI-generated content (AIGC) and wireless language models (LLMs). However, token loss caused by channel noise can severely degrade task performance. To address this, in this article, we focus on the problem of semantics-aware packetization and develop a novel algorithm, termed semantic packet…

    Submitted 28 April, 2025; originally announced April 2025.

  26. arXiv:2504.17080

    cs.RO eess.SY

    Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators

    Authors: Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz

    Abstract: In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: Submitted to Control Decision Conference (CDC) 2025

  27. An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments

    Authors: Ali Agha, Kyohei Otsu, Benjamin Morrell, David D. Fan, Sung-Kyun Kim, Muhammad Fadhil Ginting, Xianmei Lei, Jeffrey Edlund, Seyed Fakoorian, Amanda Bouman, Fernando Chavez, Taeyeon Kim, Gustavo J. Correa, Maira Saboia, Angel Santamaria-Navarro, Brett Lopez, Boseong Kim, Chanyoung Jung, Mamoru Sobue, Oriana Claudia Peltzer, Joshua Ott, Robert Trybula, Thomas Touma, Marcel Kaufmann, Tiago Stegun Vaquero, et al. (64 additional authors not shown)

    Abstract: This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithm…

    Submitted 18 April, 2025; originally announced April 2025.

    Journal ref: IEEE Transactions on Field Robotics, vol. 1, pp. 476-526, 2024

  28. arXiv:2504.05196

    eess.IV cs.AI cs.CV

    Universal Lymph Node Detection in Multiparametric MRI with Selective Augmentation

    Authors: Tejas Sudharshan Mathai, Sungwon Lee, Thomas C. Shen, Zhiyong Lu, Ronald M. Summers

    Abstract: Robust localization of lymph nodes (LNs) in multiparametric MRI (mpMRI) is critical for the assessment of lymphadenopathy. Radiologists routinely measure the size of LN to distinguish benign from malignant nodes, which would require subsequent cancer staging. Sizing is a cumbersome task compounded by the diverse appearances of LNs in mpMRI, which renders their measurement difficult. Furthermore, s…

    Submitted 7 April, 2025; originally announced April 2025.

    Comments: Published at SPIE Medical Imaging 2023

  29. arXiv:2503.23734

    eess.SP

    Semantic Packet Aggregation and Repeated Transmission for Text-to-Image Generation

    Authors: Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park

    Abstract: Text-based communication is expected to be prevalent in 6G applications such as wireless AI-generated content (AIGC). Motivated by this, this paper addresses the challenges of transmitting text prompts over erasure channels for a text-to-image AIGC task by developing the semantic segmentation and repeated transmission (SMART) algorithm. SMART groups words in text prompts into packets, prioritizing…

    Submitted 31 March, 2025; originally announced March 2025.

  30. arXiv:2503.18151

    eess.IV cs.AI cs.CV

    Efficient Deep Learning Approaches for Processing Ultra-Widefield Retinal Imaging

    Authors: Siwon Kim, Wooyung Yun, Jeongbin Oh, Soomok Lee

    Abstract: Deep learning has emerged as the predominant solution for classifying medical images. We intend to apply these developments to the ultra-widefield (UWF) retinal imaging dataset. Since UWF images can accurately diagnose various retina diseases, it is very important to classify them accurately and prevent them with early treatment. However, processing images manually is time-consuming and labor-int…

    Submitted 23 March, 2025; originally announced March 2025.

  31. arXiv:2503.13581

    eess.IV cs.CV

    Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection

    Authors: Beatrice Brown-Mulry, Rohan Satya Isaac, Sang Hyup Lee, Ambika Seth, KyungJee Min, Theo Dapamede, Frank Li, Aawez Mansuri, MinJae Woo, Christian Allison Fauria-Robinson, Bhavna Paryani, Judy Wawira Gichoya, Hari Trivedi

    Abstract: While research has established the potential of AI models for mammography to improve breast cancer screening outcomes, there have not been any detailed subgroup evaluations performed to assess the strengths and weaknesses of commercial models for digital breast tomosynthesis (DBT) imaging. This study presents a granular evaluation of the Lunit INSIGHT DBT model on a large retrospective cohort of 1…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: 14 pages, 7 figures (plus 7 figures in supplement), 3 tables (plus 1 table in supplement)

  32. arXiv:2503.09829

    cs.RO cs.LG eess.SY

    SE(3)-Equivariant Robot Learning and Control: A Tutorial Survey

    Authors: Joohwan Seo, Soochul Yoo, Junwoo Chang, Hyunseok An, Hyunwoo Ryu, Soomi Lee, Arvind Kruthiventy, Jongeun Choi, Roberto Horowitz

    Abstract: Recent advances in deep learning and Transformers have driven major breakthroughs in robotics by employing techniques such as imitation learning, reinforcement learning, and LLM-based multimodal perception and decision-making. However, conventional deep learning and Transformer models often struggle to process data with inherent symmetries and invariances, typically relying on large datasets or ex…

    Submitted 23 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: Accepted to International Journal of Control, Automation and Systems (IJCAS)

  33. arXiv:2503.07977

    cs.SD cs.LG eess.AS

    Boundary Regression for Leitmotif Detection in Music Audio

    Authors: Sihun Lee, Dasaem Jeong

    Abstract: Leitmotifs are musical phrases that are reprised in various forms throughout a piece. Due to diverse variations and instrumentation, detecting the occurrence of leitmotifs from audio recordings is a highly challenging task. Leitmotif detection may be handled as a subcategory of audio event detection, where leitmotif activity is predicted at the frame level. However, as leitmotifs embody distinct,…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 2 pages, 1 figure; presented at the 2024 ISMIR conference Late-Breaking Demo

    MSC Class: I.2.0; I.2.1

  34. arXiv:2503.00733

    eess.AS cs.CL cs.SD

    UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

    Authors: Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro

    Abstract: Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework f…

    Submitted 2 March, 2025; originally announced March 2025.

    Comments: ICLR 2025; demo page at https://alexander-h-liu.github.io/uniwav-demo.github.io/

  35. arXiv:2502.20824

    cs.CV eess.IV

    MFSR-GAN: Multi-Frame Super-Resolution with Handheld Motion Modeling

    Authors: Fadeel Sher Khan, Joshua Ebenezer, Hamid Sheikh, Seok-Jun Lee

    Abstract: Smartphone cameras have become ubiquitous imaging tools, yet their small sensors and compact optics often limit spatial resolution and introduce distortions. Combining information from multiple low-resolution (LR) frames to produce a high-resolution (HR) image has been explored to overcome the inherent limitations of smartphone cameras. Despite the promise of multi-frame super-resolution (MFSR), c…

    Submitted 1 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

    Comments: Accepted to NTIRE Workshop at CVPR 2025; 8 pages, 6 figures

  36. arXiv:2502.17528

    eess.SP

    Temperature Compensation Method of Six-Axis Force/Torque Sensor Using Gated Recurrent Unit

    Authors: Hyun-Bin Kim, Seokju Lee, Byeong-Il Ham, Kyung-Soo Kim

    Abstract: This study aims to enhance the accuracy of a six-axis force/torque sensor compared to existing approaches that utilize Multi-Layer Perceptron (MLP) and the Least Square Method. The sensor used in this study is based on a photo-coupler and operates with infrared light, making it susceptible to dark current effects, which cause drift due to temperature variations. Additionally, the sensor is compact…

    Submitted 24 February, 2025; originally announced February 2025.

    Comments: 8 pages, 9 figures

  37. arXiv:2502.16538

    cs.CV eess.IV

    Color Information-Based Automated Mask Generation for Detecting Underwater Atypical Glare Areas

    Authors: Mingyu Jeon, Yeonji Paeng, Sejin Lee

    Abstract: Underwater diving assistance and safety support robots acquire real-time diver information through onboard underwater cameras. This study introduces a breath bubble detection algorithm that utilizes unsupervised K-means clustering, thereby addressing the high accuracy demands of deep learning models as well as the challenges associated with constructing supervised datasets. The proposed method fus…

    Submitted 23 February, 2025; originally announced February 2025.

    Comments: 7 pages, 6 figures

  38. arXiv:2502.05817

    cs.RO eess.SY

    DreamFLEX: Learning Fault-Aware Quadrupedal Locomotion Controller for Anomaly Situation in Rough Terrains

    Authors: Seunghyun Lee, I Made Aswin Nahrendra, Dongkyu Lee, Byeongho Yu, Minho Oh, Hyun Myung

    Abstract: Recent advances in quadrupedal robots have demonstrated impressive agility and the ability to traverse diverse terrains. However, hardware issues, such as motor overheating or joint locking, may occur during long-distance walking or traversing through rough terrains leading to locomotion failures. Although several studies have proposed fault-tolerant control methods for quadrupedal robots, there a…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: Accepted for ICRA 2025. Project site is available at https://dreamflex.github.io/

  39. arXiv:2502.03505

    eess.IV cs.AI cs.LG

    Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning

    Authors: SiYeoul Lee, SeonHo Kim, Minkyung Seo, SeongKyu Park, Salehin Imrus, Kambaluru Ashok, DongEon Lee, Chunsu Park, SeonYeong Lee, Jiye Kim, Jae-Heung Yoo, MinWoo Kim

    Abstract: This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconst…

    Submitted 5 February, 2025; originally announced February 2025.

  40. arXiv:2501.14790  [pdf, other]

    q-bio.NC cs.AI cs.SD eess.AS

    Towards Dynamic Neural Communication and Speech Neuroprosthesis Based on Viseme Decoding

    Authors: Ji-Ha Park, Seo-Hyun Lee, Soowon Kim, Seong-Whan Lee

    Abstract: Decoding text, speech, or images from human neural signals holds promising potential both as neuroprosthesis for patients and as innovative communication tools for general users. Although neural signals contain various information on speech intentions, movements, and phonetic details, generating informative outputs from them remains challenging, with mostly focusing on decoding short intentions or…

    Submitted 8 January, 2025; originally announced January 2025.

    Comments: 5 pages, 5 figures, 1 table, Name of Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing

  41. arXiv:2501.14115  [pdf, other]

    eess.SY physics.space-ph

    Passivity-Based Robust Shape Control of a Cable-Driven Solar Sail Boom for the CABLESSail Concept

    Authors: Soojeong Lee, Ryan J. Caverly

    Abstract: Solar sails provide a means of propulsion using solar radiation pressure, which offers the possibility of exciting new spacecraft capabilities. However, solar sails have attitude control challenges because of the significant disturbance torques that they encounter due to imperfections in the sail and its supporting structure, as well as limited actuation capabilities. The Cable-Actuated Bio-inspir…

    Submitted 23 January, 2025; originally announced January 2025.

  42. arXiv:2501.11631  [pdf, other]

    cs.SD cs.AI eess.AS

    Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection

    Authors: Myeonghoon Ryu, June-Woo Kim, Minseok Oh, Suji Lee, Han Park

    Abstract: Keyword spotting is often implemented by attaching a keyword classifier to the encoder in acoustic models, enabling the classification of predefined or open vocabulary keywords. Although keyword spotting is a crucial task in various applications and can be extended to call-for-help detection in emergencies, the previous method often suffers from scalability limitations due to retraining required to i…

    Submitted 20 January, 2025; originally announced January 2025.

    Comments: Accepted to ICASSP 2025

  43. arXiv:2501.11542  [pdf, other]

    eess.SY cs.LG

    DLinear-based Prediction of Remaining Useful Life of Lithium-Ion Batteries: Feature Engineering through Explainable Artificial Intelligence

    Authors: Minsu Kim, Jaehyun Oh, Sang-Young Lee, Junghwan Kim

    Abstract: Accurate prediction of the Remaining Useful Life (RUL) of lithium-ion batteries is essential for ensuring safety, reducing maintenance costs, and optimizing usage. However, predicting RUL is challenging due to the nonlinear characteristics of the degradation caused by complex chemical reactions. Machine learning allows precise predictions by learning the latent functions of degradation relationshi…

    Submitted 20 January, 2025; originally announced January 2025.

  44. arXiv:2501.11311  [pdf, other]

    cs.SD cs.LG eess.AS

    A2SB: Audio-to-Audio Schrodinger Bridges

    Authors: Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

    Abstract: Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrodinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB…

    Submitted 20 January, 2025; originally announced January 2025.

  45. arXiv:2501.04926  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

    Authors: Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

    Abstract: Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In thi…

    Submitted 8 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  46. arXiv:2501.04904  [pdf, other]

    cs.CL cs.SD eess.AS

    JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis

    Authors: Jun-Hyeok Cha, Seung-Bin Kim, Hyung-Seok Oh, Seong-Whan Lee

    Abstract: Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA…

    Submitted 8 January, 2025; originally announced January 2025.

    Comments: Accepted by ICASSP 2025

  47. arXiv:2412.19351  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    ETTA: Elucidating the Design Space of Text-to-Audio Models

    Authors: Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

    Abstract: Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic under…

    Submitted 30 June, 2025; v1 submitted 26 December, 2024; originally announced December 2024.

    Comments: ICML 2025. Demo: https://research.nvidia.com/labs/adlr/ETTA/ Code: https://github.com/NVIDIA/elucidated-text-to-audio

  48. arXiv:2412.19110  [pdf, other]

    cs.IT eess.SP

    A Selective Secure Precoding Framework for MU-MIMO Rate-Splitting Multiple Access Networks Under Limited CSIT

    Authors: Sangmin Lee, Seokjun Park, Jeonghun Park, Jinseok Choi

    Abstract: In this paper, we propose a robust and adaptable secure precoding framework designed to encapsulate an intricate scenario where legitimate users have different information security: secure private or normal public information. Leveraging rate-splitting multiple access (RSMA), we formulate the sum secrecy spectral efficiency (SE) maximization problem in downlink multi-user multiple-input multiple-ou…

    Submitted 26 December, 2024; originally announced December 2024.

    Comments: 13 pages, 10 figures

  49. Text-Aware Adapter for Few-Shot Keyword Spotting

    Authors: Youngmoon Jung, Jinyoung Lee, Seungjin Lee, Myunghun Jung, Yong-Hyeok Lee, Hoon-Young Cho

    Abstract: Recent advances in flexible keyword spotting (KWS) with text enrollment allow users to personalize keywords without uttering them during enrollment. However, there is still room for improvement in target keyword performance. In this work, we propose a novel few-shot transfer learning method, called text-aware adapter (TA-adapter), designed to enhance a pre-trained flexible KWS model for specific k…

    Submitted 23 December, 2024; originally announced December 2024.

    Comments: 5 pages, 3 figures, Accepted by ICASSP 2025

  50. arXiv:2412.15299  [pdf, other]

    cs.CL cs.SD eess.AS

    LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

    Authors: Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang

    Abstract: Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while ma…

    Submitted 22 December, 2024; v1 submitted 19 December, 2024; originally announced December 2024.