Search | arXiv e-print repository

Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning

Authors: Ho Jae Lee, Se Hwan Jeon, Sangbae Kim

Abstract: Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms an… ▽ More Humans naturally swing their arms during locomotion to regulate whole-body dynamics, reduce angular momentum, and help maintain balance. Inspired by this principle, we present a limb-level multi-agent reinforcement learning (RL) framework that enables coordinated whole-body control of humanoid robots through emergent arm motion. Our approach employs separate actor-critic structures for the arms and legs, trained with centralized critics but decentralized actors that share only base states and centroidal angular momentum (CAM) observations, allowing each agent to specialize in task-relevant behaviors through modular reward design. The arm agent guided by CAM tracking and damping rewards promotes arm motions that reduce overall angular momentum and vertical ground reaction moments, contributing to improved balance during locomotion or under external perturbations. Comparative studies with single-agent and alternative multi-agent baselines further validate the effectiveness of our approach. Finally, we deploy the learned policy on a humanoid platform, achieving robust performance across diverse locomotion tasks, including flat-ground walking, rough terrain traversal, and stair climbing. △ Less

Submitted 5 July, 2025; originally announced July 2025.

Comments: 8 pages, 10 figures

arXiv:2507.02897 [pdf, ps, other]

Regulation Compliant AI for Fusion: Real-Time Image Analysis-Based Control of Divertor Detachment in Tokamaks

Authors: Nathaniel Chen, Cheolsik Byun, Azarakash Jalalvand, Sangkyeun Kim, Andrew Rothstein, Filippo Scotti, Steve Allen, David Eldon, Keith Erickson, Egemen Kolemen

Abstract: While artificial intelligence (AI) has been promising for fusion control, its inherent black-box nature will make compliant implementation in regulatory environments a challenge. This study implements and validates a real-time AI enabled linear and interpretable control system for successful divertor detachment control with the DIII-D lower divertor camera. Using D2 gas, we demonstrate feedback di… ▽ More While artificial intelligence (AI) has been promising for fusion control, its inherent black-box nature will make compliant implementation in regulatory environments a challenge. This study implements and validates a real-time AI enabled linear and interpretable control system for successful divertor detachment control with the DIII-D lower divertor camera. Using D2 gas, we demonstrate feedback divertor detachment control with a mean absolute difference of 2% from the target for both detachment and reattachment. This automatic training and linear processing framework can be extended to any image based diagnostic for regulatory compliant controller necessary for future fusion reactors. △ Less

Submitted 21 June, 2025; originally announced July 2025.

arXiv:2507.01038 [pdf, ps, other]

Cross-Attention Message-Passing Transformers for Code-Agnostic Decoding in 6G Networks

Authors: Seong-Joon Park, Hee-Youl Kwak, Sang-Hyo Kim, Yongjune Kim, Jong-Seon No

Abstract: Channel coding for 6G networks is expected to support a wide range of requirements arising from heterogeneous communication scenarios. These demands challenge traditional code-specific decoders, which lack the flexibility and scalability required for next-generation systems. To tackle this problem, we propose an AI-native foundation model for unified and code-agnostic decoding based on the transfo… ▽ More Channel coding for 6G networks is expected to support a wide range of requirements arising from heterogeneous communication scenarios. These demands challenge traditional code-specific decoders, which lack the flexibility and scalability required for next-generation systems. To tackle this problem, we propose an AI-native foundation model for unified and code-agnostic decoding based on the transformer architecture. We first introduce a cross-attention message-passing transformer (CrossMPT). CrossMPT employs two masked cross-attention blocks that iteratively update two distinct input representations-magnitude and syndrome vectors-allowing the model to effectively learn the decoding problem. Notably, our CrossMPT has achieved state-of-the-art decoding performance among single neural decoders. Building on this, we develop foundation CrossMPT (FCrossMPT) by making the architecture invariant to code length, rate, and class, allowing a single trained model to decode a broad range of codes without retraining. To further enhance decoding performance, particularly for short blocklength codes, we propose CrossMPT ensemble decoder (CrossED), an ensemble decoder composed of multiple parallel CrossMPT blocks employing different parity-check matrices. This architecture can also serve as a foundation model, showing strong generalization across diverse code types. Overall, the proposed AI-native code-agnostic decoder offers flexibility, scalability, and high performance, presenting a promising direction to channel coding for 6G networks. △ Less

Submitted 22 June, 2025; originally announced July 2025.

arXiv:2506.21765 [pdf, ps, other]

TUS-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

Authors: Qi Li, Shaheer U. Saeed, Yuliang Huang, Mingyuan Luo, Zhongnuo Yan, Jiongquan Chen, Xin Yang, Dong Ni, Nektarios Winter, Phuc Nguyen, Lucas Steinberger, Caelan Haney, Yuan Zhao, Mingjie Jiang, Bowen Ren, SiYeoul Lee, Seonho Kim, MinKyung Seo, MinWoo Kim, Yimeng Dou, Zhiwei Zhang, Yin Li, Tomy Varghese, Dean C. Barratt, Matthew J. Clarkson , et al. (2 additional authors not shown)

Abstract: Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequence… ▽ More Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequences, and generalisability across scanning protocols. The TUS-REC2024 Challenge was established to benchmark and accelerate progress in trackerless 3D ultrasound reconstruction by providing a publicly available dataset for the first time, along with a baseline model and evaluation framework. The Challenge attracted over 43 registered teams, of which 6 teams submitted 21 valid dockerized solutions. Submitted methods spanned a wide range of algorithmic approaches, including recurrent models, registration-driven volume refinement, attention, and physics-informed models. This paper presents an overview of the Challenge design, summarises the key characteristics of the dataset, provides a concise literature review, introduces the technical details of the underlying methodology working with tracked freehand ultrasound data, and offers a comparative analysis of submitted methods across multiple evaluation metrics. The results highlight both the progress and current limitations of state-of-the-art approaches in this domain, and inform directions for future research. The data, evaluation code, and baseline are publicly available to facilitate ongoing development and reproducibility. As a live and evolving benchmark, this Challenge is designed to be continuously developed and improved. The Challenge was held at MICCAI 2024 and will be organised again at MICCAI 2025, reflecting its growing impact and the sustained commitment to advancing this field. △ Less

Submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.20152 [pdf, ps, other]

doi 10.1016/j.imavis.2023.104745

Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration

Authors: Deepak Ghimire, Kilho Lee, Seong-heum Kim

Abstract: Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages:… ▽ More Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages: 1) training, 2) pruning, and 3) fine-tuning, whereas the proposed pruning technique adopts a pruning-while-training approach that eliminates the first stage and integrates the second and third stages into a single cycle. The automatic selection of magnitude or similarity-based filter pruning criteria from a specified pool of criteria and the specific pruning layer at each pruning iteration is guided by the network's overall loss on a small subset of the training data. To mitigate the abrupt accuracy drop due to pruning, the network is retrained briefly after each reduction of a predefined number of floating-point operations (FLOPs). The optimal pruning rates for each layer in the network are automatically determined, eliminating the need for manual allocation of fixed or variable pruning rates for each layer. Experiments on the VGGNet and ResNet models on the CIFAR-10 and ImageNet benchmark datasets demonstrate the effectiveness of the proposed method. In particular, the ResNet56 and ResNet110 models on the CIFAR-10 dataset significantly improve the top-1 accuracy compared to state-of-the-art methods while reducing the network FLOPs by 52\%. Furthermore, the ResNet50 model on the ImageNet dataset reduces FLOPs by more than 42\% with a negligible 0.33\% drop in top-5 accuracy. The source code of this paper is publicly available online - https://github.com/ghimiredhikura/laasp. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Journal ref: Image Vision Comput. 136 (2023) 104745

arXiv:2506.16741 [pdf, ps, other]

RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

Authors: Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song

Abstract: We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, Ra… ▽ More We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with a 5- and 10-fold reduction in synthesis steps than the conventional FM- and score-based approaches, respectively. △ Less

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted on Interspeech 2025

arXiv:2506.16738 [pdf, ps, other]

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

Authors: Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

Abstract: With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract sema… ▽ More With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks. △ Less

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16231 [pdf, ps, other]

EDNet: A Distortion-Agnostic Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training

Authors: Doyeop Kwak, Youngjoon Jang, Seongyu Kim, Joon Son Chung

Abstract: Speech signals in real-world environments are frequently affected by various distortions such as additive noise, reverberation, and bandwidth limitation, which may appear individually or in combination. Traditional speech enhancement methods typically rely on either masking, which focuses on suppressing non-speech components while preserving observable structure, or mapping, which seeks to recover… ▽ More Speech signals in real-world environments are frequently affected by various distortions such as additive noise, reverberation, and bandwidth limitation, which may appear individually or in combination. Traditional speech enhancement methods typically rely on either masking, which focuses on suppressing non-speech components while preserving observable structure, or mapping, which seeks to recover clean speech through direct transformation of the input. Each approach offers strengths in specific scenarios but may be less effective outside its target conditions. We propose the Erase and Draw Network (EDNet), a distortion-agnostic speech enhancement framework designed to handle a broad range of distortion types without prior assumptions about task or input characteristics. EDNet consists of two main components: (1) the Gating Mamba (GM) module, which adaptively combines masking and mapping through a learnable gating mechanism that selects between suppression (Erase) and reconstruction (Draw) based on local signal features, and (2) Phase Shift-Invariant Training (PSIT), a shift tolerant supervision strategy that improves phase estimation by enabling dynamic alignment during training while remaining compatible with standard loss functions. Experimental results on denoising, dereverberation, bandwidth extension, and multi distortion enhancement tasks show that EDNet consistently achieves strong performance across conditions, demonstrating its architectural flexibility and adaptability to diverse task settings. △ Less

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.09487 [pdf, ps, other]

BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

Authors: Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon

Abstract: This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which int… ▽ More This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2. △ Less

Submitted 11 June, 2025; originally announced June 2025.

Comments: 11 pages, 7 figures. Survey and tutorial paper. Currently under review at ICT Express as an extended version of our ICAIIC 2025 paper

ACM Class: I.2.6; H.5.5; I.5.1

arXiv:2506.01789 [pdf, ps, other]

Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury

Abstract: High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation-especially with accurate human annotations-remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about datas… ▽ More High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation-especially with accurate human annotations-remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process-particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics. △ Less

Submitted 3 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

Comments: Preprint

arXiv:2506.01382 [pdf, ps, other]

Enabling Scalable Distributed Beamforming via Networked LEO Satellites Towards 6G

Authors: Yuchen Zhang Seungnyun Kim, Tareq Y. Al-Naffouri

Abstract: In this paper, we propose scalable distributed beamforming schemes over low Earth orbit (LEO) satellite networks that rely solely on statistical channel state information for downlink orthogonal frequency division multiplexing systems. We begin by introducing the system model and presenting a pragmatic yet effective analog beamformer and user-scheduling design. We then derive a closed-form lower b… ▽ More In this paper, we propose scalable distributed beamforming schemes over low Earth orbit (LEO) satellite networks that rely solely on statistical channel state information for downlink orthogonal frequency division multiplexing systems. We begin by introducing the system model and presenting a pragmatic yet effective analog beamformer and user-scheduling design. We then derive a closed-form lower bound on the ergodic sum rate, based on the hardening bound, for the digital beamformer design. Next, we formulate a per-satellite power-constrained sum-rate maximization problem, whose centralized solution, obtained via the weighted minimum mean squared error (WMMSE) framework, establishes performance limits and motivates decentralized strategies. We subsequently introduce two decentralized optimization schemes, based on approximating the hardening bound and decentralizing the WMMSE framework, for representative inter-satellite link topologies. In the Ring scheme, satellites update beamformers locally and exchange intermediate parameters sequentially. In the Star scheme, edge satellites update beamformers locally and in parallel, achieving consensus on intermediate parameters at a central satellite using a penalty-dual decomposition framework. Extensive simulations demonstrate that our distributed designs achieve near-centralized performance with superior scalability, substantially outperforming simple closed-form beamformers and single-satellite baselines in sum rate. Additionally, the delay-overhead trade-off between the two topologies is revealed. △ Less

Submitted 2 June, 2025; originally announced June 2025.

Comments: This paper has been submitted to IEEE journal

arXiv:2505.23834 [pdf, ps, other]

Patient-Aware Feature Alignment for Robust Lung Sound Classification:Cohesion-Separation and Global Alignment Losses

Authors: Seung Gyu Jeong, Seong Eun Kim

Abstract: Lung sound classification is vital for early diagnosis of respiratory diseases. However, biomedical signals often exhibit inter-patient variability even among patients with the same symptoms, requiring a learning approach that considers individual differences. We propose a Patient-Aware Feature Alignment (PAFA) framework with two novel losses, Patient Cohesion-Separation Loss (PCSL) and Global Pat… ▽ More Lung sound classification is vital for early diagnosis of respiratory diseases. However, biomedical signals often exhibit inter-patient variability even among patients with the same symptoms, requiring a learning approach that considers individual differences. We propose a Patient-Aware Feature Alignment (PAFA) framework with two novel losses, Patient Cohesion-Separation Loss (PCSL) and Global Patient Alignment Loss (GPAL). PCSL clusters features of the same patient while separating those from other patients to capture patient variability, whereas GPAL draws each patient's centroid toward a global center, preventing feature space fragmentation. Our method achieves outstanding results on the ICBHI dataset with a score of 64.84\% for four-class and 72.08\% for two-class classification. These findings highlight PAFA's ability to capture individualized patterns and demonstrate performance gains in distinct patient clusters, offering broader applications for patient-centered healthcare. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: Accepted INTERSPEECH 2025

arXiv:2505.23132 [pdf, ps, other]

Patient Domain Supervised Contrastive Learning for Lung Sound Classification Using Mobile Phone

Authors: Seung Gyu Jeong, Seong Eun Kim

Abstract: Auscultation is crucial for diagnosing lung diseases. The COVID-19 pandemic has revealed the limitations of traditional, in-person lung sound assessments. To overcome these issues, advancements in digital stethoscopes and artificial intelligence (AI) have led to the development of new diagnostic methods. In this context, our study aims to use smartphone microphones to record and analyze lung sound… ▽ More Auscultation is crucial for diagnosing lung diseases. The COVID-19 pandemic has revealed the limitations of traditional, in-person lung sound assessments. To overcome these issues, advancements in digital stethoscopes and artificial intelligence (AI) have led to the development of new diagnostic methods. In this context, our study aims to use smartphone microphones to record and analyze lung sounds. We faced two major challenges: the difference in audio style between electronic stethoscopes and smartphone microphones, and the variability among patients. To address these challenges, we developed a method called Patient Domain Supervised Contrastive Learning (PD-SCL). By integrating this method with the Audio Spectrogram Transformer (AST) model, we significantly improved its performance by 2.4\% compared to the original AST model. This progress demonstrates that smartphones can effectively diagnose lung sounds, addressing inconsistencies in patient data and showing potential for broad use beyond traditional clinical settings. Our research contributes to making lung disease detection more accessible in the post-COVID-19 world. △ Less

Submitted 29 May, 2025; originally announced May 2025.

Comments: ITS-CSCC 2024

arXiv:2505.22489 [pdf, other]

Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from Demographics

Authors: Siyeop Yoon, Sifan Song, Pengfei Jin, Matthew Tivnan, Yujin Oh, Sekeun Kim, Dufan Wu, Xiang Li, Quanzheng Li

Abstract: We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative proce… ▽ More We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications. △ Less

Submitted 28 May, 2025; originally announced May 2025.

Comments: MICCAI2025 Submitted version

arXiv:2505.20868 [pdf, ps, other]

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech

Authors: Nam-Gyu Kim, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Abstract: Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions hi… ▽ More Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability. Our audio samples are publicly available. △ Less

Submitted 29 June, 2025; v1 submitted 27 May, 2025; originally announced May 2025.

Comments: Proceedings of Interspeech 2025

arXiv:2505.19693 [pdf, ps, other]

EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Abstract: Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates t… ▽ More Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework. △ Less

Submitted 26 May, 2025; originally announced May 2025.

Comments: Proceedings of Interspeech 2025

arXiv:2505.19687 [pdf, ps, other]

DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech

Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Abstract: Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distill… ▽ More Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method to minimize emotional information loss and preserve speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we designed a dual conditioning transformer to integrate style features better. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings. △ Less

Submitted 26 May, 2025; originally announced May 2025.

Comments: Proceedings of Interspeech 2025

arXiv:2505.16212 [pdf, ps, other]

Large Language Models based ASR Error Correction for Child Conversations

Authors: Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan

Abstract: Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in… ▽ More Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs. △ Less

Submitted 24 May, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.12686 [pdf, other]

RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations

Authors: Seungmin Kim, Sohee Park, Donghyun Kim, Jisu Lee, Daeseon Choi

Abstract: With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhance… ▽ More With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attack. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios. △ Less

Submitted 19 May, 2025; originally announced May 2025.

arXiv:2505.11788 [pdf, ps, other]

Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

Authors: Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Jinho Choi, Tony Q. S. Quek, Seong-Lyun Kim

Abstract: To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requ… ▽ More To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM's uncertainty and LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy. △ Less

Submitted 16 May, 2025; originally announced May 2025.

Comments: 14 pages, 10 figures, 2 tables; This work has been submitted to the IEEE for possible publication

arXiv:2505.05710 [pdf, ps, other]

HyperspectralMAE: The Hyperspectral Imagery Classification Model using Fourier-Encoded Dual-Branch Masked Autoencoder

Authors: Wooyoung Jeong, Hyun Jae Park, Seonghun Jeong, Jong Wook Jang, Tae Hoon Lim, Dae Seoung Kim

Abstract: Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose \textit{HyperspectralMAE}, a Transformer-based foundation model for hyperspectral data that employs a \textit{dual masking} strategy: during pre-training we randomly occlude 50\% of spatial patches and 50\% of spectral bands. This force… ▽ More Hyperspectral imagery provides rich spectral detail but poses unique challenges because of its high dimensionality in both spatial and spectral domains. We propose \textit{HyperspectralMAE}, a Transformer-based foundation model for hyperspectral data that employs a \textit{dual masking} strategy: during pre-training we randomly occlude 50\% of spatial patches and 50\% of spectral bands. This forces the model to learn representations capable of reconstructing missing information across both dimensions. To encode spectral order, we introduce learnable harmonic Fourier positional embeddings based on wavelength. The reconstruction objective combines mean-squared error (MSE) with the spectral angle mapper (SAM) to balance pixel-level accuracy and spectral-shape fidelity. The resulting model contains about $1.8\times10^{8}$ parameters and produces 768-dimensional embeddings, giving it sufficient capacity for transfer learning. We pre-trained HyperspectralMAE on two large hyperspectral corpora -- NASA EO-1 Hyperion ($\sim$1\,600 scenes, $\sim$$3\times10^{11}$ pixel spectra) and DLR EnMAP Level-0 ($\sim$1\,300 scenes, $\sim$$3\times10^{11}$ pixel spectra) -- and fine-tuned it for land-cover classification on the Indian Pines benchmark. HyperspectralMAE achieves state-of-the-art transfer-learning accuracy on Indian Pines, confirming that masked dual-dimensional pre-training yields robust spectral-spatial representations. These results demonstrate that dual masking and wavelength-aware embeddings advance hyperspectral image reconstruction and downstream analysis. △ Less

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.01617 [pdf, other]

High Speed Robotic Table Tennis Swinging Using Lightweight Hardware with Model Predictive Control

Authors: David Nguyen, Kendrick D. Cancio, Sangbae Kim

Abstract: We present a robotic table tennis platform that achieves a variety of hit styles and ball-spins with high precision, power, and consistency. This is enabled by a custom lightweight, high-torque, low rotor inertia, five degree-of-freedom arm capable of high acceleration. To generate swing trajectories, we formulate an optimal control problem (OCP) that constrains the state of the paddle at the time… ▽ More We present a robotic table tennis platform that achieves a variety of hit styles and ball-spins with high precision, power, and consistency. This is enabled by a custom lightweight, high-torque, low rotor inertia, five degree-of-freedom arm capable of high acceleration. To generate swing trajectories, we formulate an optimal control problem (OCP) that constrains the state of the paddle at the time of the strike. The terminal position is given by a predicted ball trajectory, and the terminal orientation and velocity of the paddle are chosen to match various possible styles of hits: loops (topspin), drives (flat), and chops (backspin). Finally, we construct a fixed-horizon model predictive controller (MPC) around this OCP to allow the hardware to quickly react to changes in the predicted ball trajectory. We validate on hardware that the system is capable of hitting balls with an average exit velocity of 11 m/s at an 88% success rate across the three swing types. △ Less

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2504.18539 [pdf, other]

Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

Authors: Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun

Abstract: Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this re… ▽ More Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework particularly designed to handle audio-visual joint corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, with corrupted input frames. Specifically, we suggest a unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities, by predicting clean audio targets with corrupted videos, and clean video targets with corrupted audios. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec. △ Less

Submitted 30 April, 2025; v1 submitted 23 January, 2025; originally announced April 2025.

Comments: ICLR 2025; 22 pages, 6 figures, 14 tables

arXiv:2504.15663 [pdf, other]

doi 10.1109/ICASSP49660.2025.10888053

FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning

Authors: Ju Yeon Kang, Ji Won Yoon, Semin Kim, Min Hyun Han, Nam Soo Kim

Abstract: Recently, fake audio detection has gained significant attention, as advancements in speech synthesis and voice conversion have increased the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. A key challenge in this task is generalizing models to detect unseen, out-of-distribution (OOD) attacks. Although existing approaches have shown promising results, they inheren… ▽ More Recently, fake audio detection has gained significant attention, as advancements in speech synthesis and voice conversion have increased the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. A key challenge in this task is generalizing models to detect unseen, out-of-distribution (OOD) attacks. Although existing approaches have shown promising results, they inherently suffer from overconfidence issues due to the usage of softmax for classification, which can produce unreliable predictions when encountering unpredictable spoofing attempts. To deal with this limitation, we propose a novel framework called fake audio detection with evidential learning (FADEL). By modeling class probabilities with a Dirichlet distribution, FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios. Experimental results on the ASVspoof2019 Logical Access (LA) and ASVspoof2021 LA datasets indicate that the proposed method significantly improves the performance of baseline models. Furthermore, we demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms. △ Less

Submitted 22 April, 2025; originally announced April 2025.

Comments: Accepted at ICASSP 2025

arXiv:2504.13461 [pdf, other]

doi 10.1109/TFR.2024.3430891

An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments

Authors: Ali Agha, Kyohei Otsu, Benjamin Morrell, David D. Fan, Sung-Kyun Kim, Muhammad Fadhil Ginting, Xianmei Lei, Jeffrey Edlund, Seyed Fakoorian, Amanda Bouman, Fernando Chavez, Taeyeon Kim, Gustavo J. Correa, Maira Saboia, Angel Santamaria-Navarro, Brett Lopez, Boseong Kim, Chanyoung Jung, Mamoru Sobue, Oriana Claudia Peltzer, Joshua Ott, Robert Trybula, Thomas Touma, Marcel Kaufmann, Tiago Stegun Vaquero , et al. (64 additional authors not shown)

Abstract: This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithm… ▽ More This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge. △ Less

Submitted 18 April, 2025; originally announced April 2025.

Journal ref: IEEE Transactions on Field Robotics, vol. 1, pp. 476-526, 2024

arXiv:2504.12512 [pdf, other]

Practical Insights on Grasp Strategies for Mobile Manipulation in the Wild

Authors: Isabella Huang, Richard Cheng, Sangwoon Kim, Dan Kruse, Carolyn Matl, Lukas Kaul, JC Hancock, Shanmuga Harikumar, Mark Tjersland, James Borders, Dan Helmick

Abstract: Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform… ▽ More Mobile manipulation robots are continuously advancing, with their grasping capabilities rapidly progressing. However, there are still significant gaps preventing state-of-the-art mobile manipulators from widespread real-world deployments, including their ability to reliably grasp items in unstructured environments. To help bridge this gap, we developed SHOPPER, a mobile manipulation robot platform designed to push the boundaries of reliable and generalizable grasp strategies. We develop these grasp strategies and deploy them in a real-world grocery store -- an exceptionally challenging setting chosen for its vast diversity of manipulable items, fixtures, and layouts. In this work, we present our detailed approach to designing general grasp strategies towards picking any item in a real grocery store. Additionally, we provide an in-depth analysis of our latest real-world field test, discussing key findings related to fundamental failure modes over hundreds of distinct pick attempts. Through our detailed analysis, we aim to offer valuable practical insights and identify key grasping challenges, which can guide the robotics community towards pressing open problems in the field. △ Less

Submitted 16 April, 2025; originally announced April 2025.

Comments: 8 pages, 8 figures, submitted to IROS 2025

arXiv:2504.12354 [pdf, other]

WaterFlow: Learning Fast & Robust Watermarks using Stable Diffusion

Authors: Vinay Shukla, Prachee Sharma, Ryan Rossi, Sungchul Kim, Tong Yu, Aditya Grover

Abstract: The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and is exacerbated by the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges such as the slow execution speed for practical deployments. In addition, other works trade off fast watermarking speeds but suffer great… ▽ More The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and is exacerbated by the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges such as the slow execution speed for practical deployments. In addition, other works trade off fast watermarking speeds but suffer greatly in their robustness or perceptual quality. In this work, we propose WaterFlow (WF), a fast and extremely robust approach for high fidelity visual watermarking based on a learned latent-dependent watermark. Our approach utilizes a pretrained latent diffusion model to encode an arbitrary image into a latent space and produces a learned watermark that is then planted into the Fourier Domain of the latent. The transformation is specified via invertible flow layers that enhance the expressivity of the latent space of the pre-trained model to better preserve image quality while permitting robust and tractable detection. Most notably, WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks. We validate our findings on three widely used real and generated datasets: MS-COCO, DiffusionDB, and WikiArt. △ Less

Submitted 17 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.03600 [pdf, other]

MedSAM2: Segment Anything in 3D Medical Images and Videos

Authors: Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, Bo Wang

Abstract: Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation mode… ▽ More Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user-friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high-quality segmentation in both research and healthcare environments. △ Less

Submitted 4 April, 2025; originally announced April 2025.

Comments: https://medsam2.github.io/

arXiv:2503.19228 [pdf, ps, other]

Bridging the Sim-to-real Gap: A Control Framework for Imitation Learning of Model Predictive Control

Authors: Seungtaek Kim, Jonghyup Lee, Kyoungseok Han, Seibum B. Choi

Abstract: To address the computational challenges of Model Predictive Control (MPC), recent research has studied using imitation learning to approximate the MPC to a computationally efficient Deep Neural Network (DNN). However, this introduces a common issue in learning-based control, the simulation-to-reality (sim-to-real) gap, and Domain Randomization (DR) has been widely used to mitigate this gap by intr… ▽ More To address the computational challenges of Model Predictive Control (MPC), recent research has studied using imitation learning to approximate the MPC to a computationally efficient Deep Neural Network (DNN). However, this introduces a common issue in learning-based control, the simulation-to-reality (sim-to-real) gap, and Domain Randomization (DR) has been widely used to mitigate this gap by introducing perturbations in the source domain. However, DR inevitably leads to low data collection efficiency and an overly conservative control strategy. This study proposes a new control framework that addresses this issue from a control perspective inspired by Robust Tube MPC. The framework ensures the DNN operates in the same environment as the source domain, handling the sim-to-real gap with great data collection efficiency. Moreover, a parameter governor is introduced to address the DNN's inability to adapt to variations in model parameters, enabling the system to satisfy MPC constraints more robustly under changing conditions. The proposed framework was validated through two case studies: cart-pole control and vehicle collision avoidance control, which analyzed the principles of the proposed framework in detail and demonstrated its application to a vehicle control case. △ Less

Submitted 3 July, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

arXiv:2503.18880 [pdf, other]

Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes

Authors: Hyeonggon Ryu, Seongyu Kim, Joon Son Chung, Arda Senocak

Abstract: We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing… ▽ More We present a unified model capable of simultaneously grounding both spoken language and non-speech sounds within a visual scene, addressing key limitations in current audio-visual grounding models. Existing approaches are typically limited to handling either speech or non-speech sounds independently, or at best, together but sequentially without mixing. This limitation prevents them from capturing the complexity of real-world audio sources that are often mixed. Our approach introduces a 'mix-and-separate' framework with audio-visual alignment objectives that jointly learn correspondence and disentanglement using mixed audio. Through these objectives, our model learns to produce distinct embeddings for each audio type, enabling effective disentanglement and grounding across mixed audio sources. Additionally, we created a new dataset to evaluate simultaneous grounding of mixed audio sources, demonstrating that our model outperforms prior methods. Our approach also achieves comparable or better performance in standard segmentation and cross-modal retrieval tasks, highlighting the benefits of our mix-and-separate approach. △ Less

Submitted 24 March, 2025; originally announced March 2025.

Comments: CVPR 2025

arXiv:2503.18151 [pdf, ps, other]

Efficient Deep Learning Approaches for Processing Ultra-Widefield Retinal Imaging

Authors: Siwon Kim, Wooyung Yun, Jeongbin Oh, Soomok Lee

Abstract: Deep learning has emerged as the predominant solution for classifying medical images. We intend to apply these developments to the ultra-widefield (UWF) retinal imaging dataset. Since UWF images can accurately diagnose various retina diseases, it is very important to clas sify them accurately and prevent them with early treatment. However, processing images manually is time-consuming and labor-int… ▽ More Deep learning has emerged as the predominant solution for classifying medical images. We intend to apply these developments to the ultra-widefield (UWF) retinal imaging dataset. Since UWF images can accurately diagnose various retina diseases, it is very important to clas sify them accurately and prevent them with early treatment. However, processing images manually is time-consuming and labor-intensive, and there are two challenges to automating this process. First, high perfor mance usually requires high computational resources. Artificial intelli gence medical technology is better suited for places with limited medical resources, but using high-performance processing units in such environ ments is challenging. Second, the problem of the accuracy of colour fun dus photography (CFP) methods. In general, the UWF method provides more information for retinal diagnosis than the CFP method, but most of the research has been conducted based on the CFP method. Thus, we demonstrate that these problems can be efficiently addressed in low performance units using methods such as strategic data augmentation and model ensembles, which balance performance and computational re sources while utilizing UWF images. △ Less

Submitted 23 March, 2025; originally announced March 2025.

arXiv:2503.16366 [pdf, other]

Dynamic Metasurface-Backed Luneburg Lens for Multiplexed Backscatter Communication

Authors: Samuel Kim, Tim Sleasman, Avrami Rakovsky, Ra'id Awadallah, David B. Shrekenhamer

Abstract: Backscatter communications is attractive for its low power requirements due to the lack of actively radiating components; however, commonly used devices are typically limited in range and functionality. Here, we design and demonstrate a flattened Luneburg lens combined with a spatially-tunable dynamic metasurface to create a low-power backscatter communicator. The Luneburg lens is a spherically-sy… ▽ More Backscatter communications is attractive for its low power requirements due to the lack of actively radiating components; however, commonly used devices are typically limited in range and functionality. Here, we design and demonstrate a flattened Luneburg lens combined with a spatially-tunable dynamic metasurface to create a low-power backscatter communicator. The Luneburg lens is a spherically-symmetric lens that focuses a collimated beam from any direction, enabling a wide field-of-view with no aberrations. By applying quasi-conformal transformation optics (QCTO), we design a flattened Luneburg lens to facilitate its seamless interface with the planar metasurface. The gradient index of the Luneburg lens is realized through additive manufacturing. We show that the flattened Luneburg lens with a reflective surface at the flattened focal plane is able to achieve diffraction-limited retroreflection, enabling long-range backscatter communication. When an interrogator transmits towards the metasurface-backed Luneburg lens, the device can modulate the reflected signal phase across a wide field of regard to communicate data. We experimentally show that the spatial control over the metasurface allows different bit streams to be simultaneously communicated in different directions. Additionally, we show that the device is able to prevent eavesdroppers from receiving information, thus securing communications. △ Less

Submitted 20 March, 2025; originally announced March 2025.

Comments: 13 pages, 8 figures

arXiv:2503.12806 [pdf, other]

AV-Surf: Surface-Enhanced Geometry-Aware Novel-View Acoustic Synthesis

Authors: Hadam Baek, Hannie Shin, Jiyoung Seo, Chanwoo Kim, Saerom Kim, Hyeongbok Kim, Sangpil Kim

Abstract: Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normal and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and… ▽ More Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normal and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and propagation, surface normals should be jointly modeled with structural details to achieve accurate spatial acoustics. In this paper, we propose a surface-enhanced geometry-aware approach for NVAS to improve spatial acoustic modeling. To achieve this, we exploit geometric priors, such as image, depth map, surface normals, and point clouds obtained using a 3D Gaussian Splatting (3DGS) based framework. We introduce a dual cross-attention-based transformer integrating geometrical constraints into frequency query to understand the surroundings of the emitter. Additionally, we design a ConvNeXt-based spectral features processing network called Spectral Refinement Network (SRN) to synthesize realistic binaural audio. Experimental results on the RWAVS and SoundSpace datasets highlight the necessity of our approach, as it surpasses existing methods in novel view acoustic synthesis. △ Less

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.11026 [pdf, other]

MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

Authors: Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun

Abstract: Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual moda… ▽ More Despite recent advances in text-to-speech (TTS) models, audio-visual to audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics and significantly enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements. △ Less

Submitted 13 March, 2025; originally announced March 2025.

Comments: Preliminary work

arXiv:2503.10349 [pdf, ps, other]

Autonomous Robotic Radio Source Localization via a Novel Gaussian Mixture Filtering Approach

Authors: Sukkeun Kim, Sangwoo Moon, Ivan Petrunin, Hyo-Sang Shin, Shehryar Khattak

Abstract: This study proposes a new Gaussian Mixture Filter (GMF) to improve the estimation performance for the autonomous robotic radio signal source search and localization problem in unknown environments. The proposed filter is first tested with a benchmark numerical problem to validate the performance with other state-of-the-practice approaches such as Particle Filter (PF) and Particle Gaussian Mixture… ▽ More This study proposes a new Gaussian Mixture Filter (GMF) to improve the estimation performance for the autonomous robotic radio signal source search and localization problem in unknown environments. The proposed filter is first tested with a benchmark numerical problem to validate the performance with other state-of-the-practice approaches such as Particle Filter (PF) and Particle Gaussian Mixture (PGM) filters. Then the proposed approach is tested and compared against PF and PGM filters in real-world robotic field experiments to validate its impact for real-world applications. The considered real-world scenarios have partial observability with the range-only measurement and uncertainty with the measurement model. The results show that the proposed filter can handle this partial observability effectively whilst showing improved performance compared to PF, reducing the computation requirements while demonstrating improved robustness over compared techniques. △ Less

Submitted 13 June, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

arXiv:2503.04966 [pdf, other]

Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching Model

Authors: Siyeop Yoon, Yujin Oh, Matthew Tivnan, Sifan Song, Pengfei Jin, Sekeun Kim, Hyun Jin Cho, Dufan Wu, Raul Uppot, Quanzheng Li

Abstract: This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics driven or diffusion based simulations, are computationally dema… ▽ More This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics driven or diffusion based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery. △ Less

Submitted 11 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

Comments: MICCAI 2025 submitted version (author list included)

arXiv:2503.01075 [pdf, other]

Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS

Authors: Seunghoi Kim, Henry F. J. Tregidgo, Matteo Figini, Chen Jin, Sarang Joshi, Daniel C. Alexander

Abstract: Hallucinations are spurious structures not present in the ground truth, posing a critical challenge in medical image reconstruction, especially for data-driven conditional models. We hypothesize that combining an unconditional diffusion model with data consistency, trained on a diverse dataset, can reduce these hallucinations. Based on this, we propose DynamicDPS, a diffusion-based framework that… ▽ More Hallucinations are spurious structures not present in the ground truth, posing a critical challenge in medical image reconstruction, especially for data-driven conditional models. We hypothesize that combining an unconditional diffusion model with data consistency, trained on a diverse dataset, can reduce these hallucinations. Based on this, we propose DynamicDPS, a diffusion-based framework that integrates conditional and unconditional diffusion models to enhance low-quality medical images while systematically reducing hallucinations. Our approach first generates an initial reconstruction using a conditional model, then refines it with an adaptive diffusion-based inverse problem solver. DynamicDPS skips early stage in the reverse process by selecting an optimal starting time point per sample and applies Wolfe's line search for adaptive step sizes, improving both efficiency and image fidelity. Using diffusion priors and data consistency, our method effectively reduces hallucinations from any conditional model output. We validate its effectiveness in Image Quality Transfer for low-field MRI enhancement. Extensive evaluations on synthetic and real MR scans, including a downstream task for tissue volume estimation, show that DynamicDPS reduces hallucinations, improving relative volume estimation by over 15% for critical tissues while using only 5% of the sampling steps required by baseline diffusion models. As a model-agnostic and fine-tuning-free approach, DynamicDPS offers a robust solution for hallucination reduction in medical imaging. The code will be made publicly available upon publication. △ Less

Submitted 2 March, 2025; originally announced March 2025.

arXiv:2502.13983 [pdf, other]

Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders

Authors: Seungbae Kim, Daeun Lee, Brielle Stark, Jinyoung Han

Abstract: Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention on integrating non-verbal communicatio… ▽ More Individuals with language disorders often face significant communication challenges due to their limited language processing and comprehension abilities, which also affect their interactions with voice-assisted systems that mostly rely on Automatic Speech Recognition (ASR). Despite advancements in ASR that address disfluencies, there has been little attention on integrating non-verbal communication methods, such as gestures, which individuals with language disorders substantially rely on to supplement their communication. Recognizing the need to interpret the latent meanings of visual information not captured by speech alone, we propose a gesture-aware ASR system utilizing a multimodal large language model with zero-shot learning for individuals with speech impairments. Our experiment results and analyses show that including gesture information significantly enhances semantic understanding. This study can help develop effective communication technologies, specifically designed to meet the unique needs of individuals with language impairments. △ Less

Submitted 18 February, 2025; originally announced February 2025.

arXiv:2502.11283 [pdf, other]

doi 10.33012/2025.19990

Set-Based Position Ambiguity Reduction Method for Zonotope Shadow Matching in Urban Areas Using Estimated Multipath Errors

Authors: Sanghyun Kim, Jiwon Seo

Abstract: In urban areas, the quality of global navigation satellite system (GNSS) signals deteriorates, leading to reduced positioning accuracy. To address this issue, 3D-mapping-aided (3DMA) techniques, such as shadow matching and zonotope shadow matching (ZSM), have been proposed. However, these methods can introduce a problem known as multi-modal position ambiguity, making it challenging to select the e… ▽ More In urban areas, the quality of global navigation satellite system (GNSS) signals deteriorates, leading to reduced positioning accuracy. To address this issue, 3D-mapping-aided (3DMA) techniques, such as shadow matching and zonotope shadow matching (ZSM), have been proposed. However, these methods can introduce a problem known as multi-modal position ambiguity, making it challenging to select the exact mode in which the receiver is located. Accurately selecting the correct mode is essential for improving positioning accuracy. A previous study proposed a method that uses satellite-pseudorange consistency (SPC), calculated from pseudorange measurements, to select the mode containing the receiver. This method achieved a mode selection accuracy of approximately 78%. To further enhance accuracy, the study utilized pseudorange measurements collected at multiple timesteps from a fixed location and a trained line-of-sight (LOS) classifier. However, in practice, collecting data at multiple timesteps from the same location in dynamic environments is challenging. Moreover, the performance of the trained LOS classifier heavily depends on the surrounding environment, leading to low reliability. In this study, we propose a method that estimates and corrects multipath errors based on the mode distribution obtained from the output of ZSM and extract an enhanced SPC using the corrected pseudorange measurements. This enables high mode selection accuracy using only single-timestep pseudorange measurements, without requiring a trained LOS classifier. △ Less

Submitted 16 February, 2025; originally announced February 2025.

Comments: Submitted to ION ITM 2025

arXiv:2502.10447 [pdf, other]

MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Authors: Sungnyun Kim, Kangwook Jang, Sangmin Bae, Sungwoo Cho, Se-Young Yun

Abstract: Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address th… ▽ More Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems. △ Less

Submitted 21 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

Comments: Accepted to ICML 2025

arXiv:2502.09929 [pdf, other]

Low-Complexity On-Grid Channel Estimation for Partially-Connected Hybrid XL-MIMO

Authors: Sunho Kim, Wan Choi

Abstract: This paper addresses the challenge of channel estimation in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, pivotal for the advancement of 6G communications. XL-MIMO systems, characterized by their vast antenna arrays, necessitate accurate channel state information (CSI) to leverage high spatial multiplexing and beamforming gains. However, conventional channel estimation me… ▽ More This paper addresses the challenge of channel estimation in extremely large-scale multiple-input multiple-output (XL-MIMO) systems, pivotal for the advancement of 6G communications. XL-MIMO systems, characterized by their vast antenna arrays, necessitate accurate channel state information (CSI) to leverage high spatial multiplexing and beamforming gains. However, conventional channel estimation methods for near-field XL-MIMO encounter significant computational complexity due to the exceedingly high parameter quantization levels needed for estimating the parametric near-field channel. To address this, we propose a low-complexity two-stage on-grid channel estimation algorithm designed for near-field XL-MIMO systems. The first stage focuses on estimating the LoS channel component while treating the NLoS paths as interference. This estimation is accomplished through an alternating subarray-wise array gain maximization (ASAGM) approach based on the piecewise outer product model (SOPM). In the second stage, we estimate the NLoS channel component by utilizing the sensing matrix refinement-based orthogonal matching pursuit (SMR-OMP) algorithm. This approach helps reduce the high computational complexity associated with large-dimensional joint sensing matrices. Simulation results demonstrate the effectiveness of our proposed low-complexity method, showcasing its significant superiority over existing near-field XL-MIMO channel estimation techniques, particularly in intermediate and high SNR regimes, and in practical scenarios involving arbitrary array placements. △ Less

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2502.07208 [pdf]

Towards Understanding of Frequency Dependence on Sound Event Detection

Authors: Hyeonuk Nam, Seong-Hu Kim, Deokki Min, Byeong-Yun Ko, Yong-Hwa Park

Abstract: In this work, various analysis methods are conducted on frequency-dependent methods on SED to further delve into their detailed characteristics and behaviors on SED. While SED has been rapidly advancing through the adoption of various deep learning techniques from other pattern recognition fields, these techniques are often not suitable for SED. To address this issue, two frequency-dependent SED m… ▽ More In this work, various analysis methods are conducted on frequency-dependent methods on SED to further delve into their detailed characteristics and behaviors on SED. While SED has been rapidly advancing through the adoption of various deep learning techniques from other pattern recognition fields, these techniques are often not suitable for SED. To address this issue, two frequency-dependent SED methods were previously proposed: FilterAugment, a data augmentation randomly weighting frequency bands, and frequency dynamic convolution (FDY Conv), an architecture applying frequency adaptive convolution kernels. These methods have demonstrated superior performance in SED, and we aim to further analyze their detailed effectiveness and characteristics in SED. We compare class-wise performance to find out specific pros and cons of FilterAugment and FDY Conv. We apply Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights time-frequency region that is more inferred by the model, on SED models with and without frequency masking and two types of FilterAugment to observe their detailed characteristics. We propose simpler frequency dependent convolution methods and compare them with FDY Conv to further understand which components of FDY Conv affects SED performance. Lastly, we apply PCA to show how FDY Conv adapts dynamic kernel across frequency dimensions on different sound event classes. The results and discussions demonstrate that frequency dependency plays a significant role in sound event detection and further confirms the effectiveness of frequency dependent methods on SED. △ Less

Submitted 10 February, 2025; originally announced February 2025.

arXiv:2502.05330 [pdf, other]

Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge

Authors: Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee , et al. (38 additional authors not shown)

Abstract: Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently… ▽ More Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org. △ Less

Submitted 7 February, 2025; originally announced February 2025.

arXiv:2502.03505 [pdf, other]

Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning

Authors: SiYeoul Lee, SeonHo Kim, Minkyung Seo, SeongKyu Park, Salehin Imrus, Kambaluru Ashok, DongEon Lee, Chunsu Park, SeonYeong Lee, Jiye Kim, Jae-Heung Yoo, MinWoo Kim

Abstract: This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconst… ▽ More This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits the critical regions, such as fully-developed speckle area or high-echogenic tissue area within successive ultrasound images to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D △ Less

Submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.00619 [pdf, other]

Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective

Authors: Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li

Abstract: Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mecha… ▽ More Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available. The source code is available at https://github.com/tvseg/dMoE. △ Less

Submitted 27 May, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

Comments: ICML 2025 spotlight, see https://openreview.net/forum?id=BUONdewsBa

arXiv:2501.18921 [pdf, other]

Full-scale Representation Guided Network for Retinal Vessel Segmentation

Authors: Sunyong Seo, Huisu Yoon, Semin Kim, Jongha Lee

Abstract: The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. Attention-guided filter is introdu… ▽ More The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. Attention-guided filter is introduced to the guided convolution block under the interpretation that the filter behaves like the unsharp mask filter. Passing full-scale information to the attention block allows for the generation of improved attention maps, which are then passed to the attention-guided filter, resulting in performance enhancement of the segmentation network. The structure preceding the guided convolution block can be replaced by any U-Net variant, which enhances the scalability of the proposed approach. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments also show that the proposed network demonstrates competitive results compared to current SOTA models on various public datasets. Ablation studies demonstrate that the proposed model is competitive with much smaller parameter sizes. Lastly, by applying the proposed model to facial wrinkle segmentation, we confirmed the potential for scalability to similar tasks in other domains. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch. △ Less

Submitted 31 January, 2025; originally announced January 2025.

Comments: 10 pages, 7 figures

arXiv:2501.14790 [pdf, other]

Towards Dynamic Neural Communication and Speech Neuroprosthesis Based on Viseme Decoding

Authors: Ji-Ha Park, Seo-Hyun Lee, Soowon Kim, Seong-Whan Lee

Abstract: Decoding text, speech, or images from human neural signals holds promising potential both as neuroprosthesis for patients and as innovative communication tools for general users. Although neural signals contain various information on speech intentions, movements, and phonetic details, generating informative outputs from them remains challenging, with mostly focusing on decoding short intentions or… ▽ More Decoding text, speech, or images from human neural signals holds promising potential both as neuroprosthesis for patients and as innovative communication tools for general users. Although neural signals contain various information on speech intentions, movements, and phonetic details, generating informative outputs from them remains challenging, with mostly focusing on decoding short intentions or producing fragmented outputs. In this study, we developed a diffusion model-based framework to decode visual speech intentions from speech-related non-invasive brain signals, to facilitate face-to-face neural communication. We designed an experiment to consolidate various phonemes to train visemes of each phoneme, aiming to learn the representation of corresponding lip formations from neural signals. By decoding visemes from both isolated trials and continuous sentences, we successfully reconstructed coherent lip movements, effectively bridging the gap between brain signals and dynamic visual interfaces. The results highlight the potential of viseme decoding and talking face reconstruction from human neural signals, marking a significant step toward dynamic neural communication systems and speech neuroprosthesis for patients. △ Less

Submitted 8 January, 2025; originally announced January 2025.

Comments: 5 pages, 5 figures, 1 table, Name of Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing

arXiv:2501.14171 [pdf, other]

Fully Guided Neural Schrödinger bridge for Brain MR image synthesis

Authors: Hanyeol Yang, Sunggyu Kim, Yongseon Yoo, Jong-min Lee

Abstract: Multi-modal brain MRI provides essential complementary information for clinical diagnosis. However, acquiring all modalities is often challenging due to time and cost constraints. To address this, various methods have been proposed to generate missing modalities from available ones. Traditional approaches can be broadly categorized into two main types: paired and unpaired methods. While paired met… ▽ More Multi-modal brain MRI provides essential complementary information for clinical diagnosis. However, acquiring all modalities is often challenging due to time and cost constraints. To address this, various methods have been proposed to generate missing modalities from available ones. Traditional approaches can be broadly categorized into two main types: paired and unpaired methods. While paired methods offer superior performance, obtaining large-scale paired datasets is challenging in real-world scenarios. Conversely, unpaired methods facilitate large-scale data collection but struggle to preserve critical image features, such as tumors. In this paper, we propose Fully Guided Schrödinger Bridges (FGSB), a novel framework based on Neural Schrödinger Bridges, to overcome these limitations. FGSB achieves stable, high-quality generation of missing modalities using minimal paired data. Furthermore, when provided with ground truth or a segmentation network for specific regions, FGSB can generate missing modalities while preserving these critical areas with reduced data requirements. Our proposed model consists of two consecutive phases. 1) Generation Phase: Fuses a generated image, a paired reference image, and Gaussian noise, employing iterative refinement to mitigate issues such as mode collapse and improve generation quality 2) Training Phase: Learns the mapping from the generated image to the target modality. Experiments demonstrate that FGSB achieves comparable generation performance to methods trained on large datasets, while using data from only two subjects. Moreover, the utilization of lesion information with FGSB significantly enhances its ability to preserve crucial lesion features. △ Less

Submitted 23 January, 2025; originally announced January 2025.

Comments: 9 pages,4 figures

arXiv:2501.11225 [pdf, other]

CNN-based TEM image denoising from first principles

Authors: Jinwoong Chae, Sungwook Hong, Sungkyu Kim, Sungroh Yoon, Gunn Kim

Abstract: Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to cr… ▽ More Transmission electron microscope (TEM) images are often corrupted by noise, hindering their interpretation. To address this issue, we propose a deep learning-based approach using simulated images. Using density functional theory calculations with a set of pseudo-atomic orbital basis sets, we generate highly accurate ground truth images. We introduce four types of noise into these simulations to create realistic training datasets. Each type of noise is then used to train a separate convolutional neural network (CNN) model. Our results show that these CNNs are effective in reducing noise, even when applied to images with different noise levels than those used during training. However, we observe limitations in some cases, particularly in preserving the integrity of circular shapes and avoiding visible artifacts between image patches. To overcome these challenges, we propose alternative training strategies and future research directions. This study provides a valuable framework for training deep learning models for TEM image denoising. △ Less

Submitted 19 January, 2025; originally announced January 2025.

Comments: 10 pages and 4 figures

arXiv:2501.04926 [pdf, other]

doi 10.1109/ICASSP49660.2025.10888772

FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

Authors: Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

Abstract: Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In thi… ▽ More Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL while maintaining computational efficiency with only a single-step sampling process. △ Less

Submitted 8 January, 2025; originally announced January 2025.

Comments: Accepted by ICASSP 2025

Showing 1–50 of 494 results for author: Kim, S