-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Authors:
Arushi Goel,
Sreyan Ghosh,
Jaehyeon Kim,
Sonal Kumar,
Zhifeng Kong,
Sang-gil Lee,
Chao-Han Huck Yang,
Ramani Duraiswami,
Dinesh Manocha,
Rafael Valle,
Bryan Catanzaro
Abstract:
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Submitted 10 July, 2025;
originally announced July 2025.
-
Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detection
Authors:
Jisoo Kim,
Chu-Hsuan Lin,
Alberto Ceballos-Arroyo,
Ping Liu,
Huaizu Jiang,
Shrikanth Yadav,
Qi Wan,
Lei Qin,
Geoffrey S Young
Abstract:
Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true positives (TP), 79 false negatives (FN), and 126 FPs; 3D-CNN-TR yielded 179 TPs, 39 FNs, and 182 FPs. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FPs by 70.6% (89/126) and 3D-CNN-TR FPs by 51.6% (94/182), without reducing TPs, lowering the FP/case rate from 0.88 to 0.26 for CPM-Net, and from 1.27 to 0.62 for 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.
Submitted 1 July, 2025;
originally announced July 2025.
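The false-positive figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. This is only a sketch using the counts stated in the abstract (143 held-out CTAs; 126 and 182 baseline FPs; 89 and 94 FPs removed by method 5). The helper name `fp_summary` is hypothetical and does not come from the paper.

```python
def fp_summary(fp_before, fp_removed, n_cases):
    """Return (percent FP reduction, FP/case before, FP/case after)."""
    fp_after = fp_before - fp_removed
    reduction_pct = 100.0 * fp_removed / fp_before
    return reduction_pct, fp_before / n_cases, fp_after / n_cases


N_CASES = 143  # held-out CTAs reported in the abstract

# (model name, baseline FP count, FPs removed by method 5), all from the abstract
for name, fp_before, fp_removed in [("CPM-Net", 126, 89), ("3D-CNN-TR", 182, 94)]:
    pct, before, after = fp_summary(fp_before, fp_removed, N_CASES)
    print(f"{name}: {pct:.1f}% fewer FPs, {before:.2f} -> {after:.2f} FP/case")
```

Both rows reproduce the reported percentages and FP/case rates after rounding, which suggests the FP/case rates were computed over the 143 evaluation scans.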
-
MedRegion-CT: Region-Focused Multimodal LLM for Comprehensive 3D CT Report Generation
Authors:
Sunggu Kyung,
Jinyoung Seo,
Hyunseok Lim,
Dongyeong Kim,
Hyungbin Park,
Jimin Sung,
Jihyun Kim,
Wooyoung Jo,
Yoojin Nam,
Namkug Kim
Abstract:
The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key innovations. First, we introduce Region Representative ($R^2$) Token Pooling, which utilizes a 2D-wise pretrained vision model to efficiently extract 3D CT features. This approach generates global tokens representing overall slice features and region tokens highlighting target areas, enabling the MLLM to process comprehensive information effectively. Second, a universal segmentation model generates pseudo-masks, which are then processed by a mask encoder to extract region-centric features. This allows the MLLM to focus on clinically relevant regions, using six predefined region masks. Third, we leverage segmentation results to extract patient-specific attributions, including organ size, diameter, and locations. These are converted into text prompts, enriching the MLLM's understanding of patient-specific contexts. To ensure rigorous evaluation, we conducted benchmark experiments on report generation using the RadGenome-Chest CT. MedRegion-CT achieved state-of-the-art performance, outperforming existing methods in natural language generation quality and clinical relevance while maintaining interpretability. The code for our framework is publicly available.
Submitted 29 June, 2025;
originally announced June 2025.
-
Demonstrating Interoperable Channel State Feedback Compression with Machine Learning
Authors:
Dani Korpi,
Rachel Wang,
Jerry Wang,
Abdelrahman Ibrahim,
Carl Nuzman,
Runxin Wang,
Kursat Rasim Mestav,
Dustin Zhang,
Iraj Saniee,
Shawn Winston,
Gordana Pavlovic,
Wei Ding,
William J. Hillery,
Chenxi Hao,
Ram Thirunagari,
Jung Chang,
Jeehyun Kim,
Bartek Kozicki,
Dragan Samardzija,
Taesang Yoo,
Andreas Maeder,
Tingfang Ji,
Harish Viswanathan
Abstract:
Neural network-based compression and decompression of channel state feedback has been one of the most widely studied applications of machine learning (ML) in wireless networks. Various simulation-based studies have shown that ML-based feedback compression can result in reduced overhead and more accurate channel information. However, to the best of our knowledge, there are no real-life proofs of concept demonstrating the benefits of ML-based channel feedback compression in a practical setting, where the user equipment (UE) and base station have no access to each other's ML models. In this paper, we present a novel approach for training interoperable compression and decompression ML models in a confidential manner, and demonstrate the accuracy of the ensuing models using prototype UEs and base stations. The performance of the ML-based channel feedback is measured both in terms of the accuracy of the reconstructed channel information and achieved downlink throughput gains when using the channel information for beamforming. The reported measurement results demonstrate that it is possible to develop an accurate ML-based channel feedback link without having to share ML models between device and network vendors. These results pave the way for a practical implementation of ML-based channel feedback in commercial 6G networks.
Submitted 26 June, 2025;
originally announced June 2025.
-
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
Authors:
Hyun Joon Park,
Jeongmin Liu,
Jin Sob Kim,
Jeong Yeol Yang,
Sung Won Han,
Eunwoo Song
Abstract:
We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than conventional FM- and score-based approaches, respectively.
Submitted 20 June, 2025;
originally announced June 2025.
-
Privacy-Preserving Chest X-ray Classification in Latent Space with Homomorphically Encrypted Neural Inference
Authors:
Jonghun Kim,
Gyeongdeok Jo,
Sinyoung Ra,
Hyunjin Park
Abstract:
Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance the accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze-and-excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use in medical images.
Submitted 19 June, 2025; v1 submitted 18 June, 2025;
originally announced June 2025.
-
ViSAGe: Video-to-Spatial Audio Generation
Authors:
Jaeyeon Kim,
Heeseung Yun,
Gunhee Kim
Abstract:
Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned high-quality spatial audio that adapts to viewpoint changes.
Submitted 13 June, 2025;
originally announced June 2025.
-
FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition
Authors:
Jongsuk Kim,
Jaemyung Yu,
Minchan Kwon,
Junmo Kim
Abstract:
Large-scale ASR models have achieved remarkable gains in accuracy and robustness. However, fairness issues remain largely unaddressed despite their critical importance in real-world applications. In this work, we introduce FairASR, a system that mitigates demographic bias by learning representations that are uninformative about group membership, enabling fair generalization across demographic groups. Leveraging a multi-demographic dataset, our approach employs a gradient reversal layer to suppress demographic-discriminative features while maintaining the ability to capture generalizable speech patterns through an unsupervised contrastive loss. Experimental results show that FairASR delivers competitive overall ASR performance while significantly reducing performance disparities across different demographic groups.
Submitted 12 June, 2025;
originally announced June 2025.
-
BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation
Authors:
Taesoo Park,
Mungwi Jeong,
Mingyu Park,
Narae Kim,
Junyoung Kim,
Mujung Kim,
Jisang Yoo,
Hoyun Lee,
Sanghoon Kim,
Soonchul Kwon
Abstract:
This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
Submitted 11 June, 2025;
originally announced June 2025.
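The Snake activation mentioned in the BemaGANv2 abstract above is commonly written as snake(x) = x + (1/alpha) * sin^2(alpha * x). The following scalar sketch uses that general form. The abstract does not say whether BemaGANv2 learns alpha per channel or how it is initialized, so the `alpha` default here is purely illustrative.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic term, x + sin^2(alpha * x) / alpha.

    `alpha` sets the frequency of the periodic component. The default of 1.0
    is an illustrative assumption, not a value taken from the paper.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


# The periodic term vanishes at multiples of pi/alpha and peaks halfway between them.
for value in (0.0, math.pi / 2, math.pi):
    print(f"snake({value:.4f}) = {snake(value):.4f}")
```

Because the added term is non-negative and periodic, the function stays monotonic in x while retaining a periodic inductive bias, which is the usual motivation for using it in vocoders.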
-
Automated Traffic Incident Response Plans using Generative Artificial Intelligence: Part 1 -- Building the Incident Response Benchmark
Authors:
Artur Grigorev,
Khaled Saleh,
Jiwon Kim,
Adriana-Simona Mihaita
Abstract:
Traffic incidents remain a critical public safety concern worldwide, with Australia recording 1,300 road fatalities in 2024, the highest toll in 12 years. Similarly, the United States reports approximately 6 million crashes annually, raising significant challenges for fast response times and operational management. Traditional response protocols rely on human decision-making, which introduces potential inconsistencies and delays during critical moments when every minute impacts both safety outcomes and network performance. To address this issue, we propose a novel Incident Response Benchmark that uses generative artificial intelligence to automatically generate response plans for incoming traffic incidents. Our approach aims to significantly reduce incident resolution times by suggesting context-appropriate actions such as variable message sign deployment, lane closures, and emergency resource allocation adapted to specific incident characteristics. First, the proposed methodology uses real-world incident reports from the Performance Measurement System (PeMS) as training and evaluation data. We extract historically implemented actions from these reports and compare them against AI-generated response plans that suggest specific actions, such as lane closures, variable message sign announcements, and/or dispatching appropriate emergency resources. Second, model evaluations reveal that advanced generative AI models like GPT-4o and Grok 2 achieve superior alignment with expert solutions, demonstrated by minimized Hamming distances (averaging 2.96-2.98) and low weighted differences (approximately 0.27-0.28). Conversely, while Gemini 1.5 Pro records the lowest count of missed actions, its extremely high number of unnecessary actions (1547 compared to 225 for GPT-4o) indicates an over-triggering strategy that reduces the overall plan efficiency.
Submitted 3 June, 2025;
originally announced June 2025.
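The comparison metrics named in the abstract above (Hamming distance, missed actions, unnecessary actions) can be illustrated by encoding each response plan as a binary vector over a fixed action vocabulary. The vocabulary `ACTIONS` and the helper names below are hypothetical, since the abstract does not specify the encoding. The weighted difference is not defined in the abstract, so it is not sketched here.

```python
# Hypothetical action vocabulary; the benchmark's real action set is not given in the abstract.
ACTIONS = ["lane_closure", "vms_announcement", "dispatch_tow", "dispatch_ambulance"]


def encode(plan):
    """Encode a set of action names as a 0/1 vector ordered like ACTIONS."""
    return [1 if action in plan else 0 for action in ACTIONS]


def compare_plans(expert_plan, generated_plan):
    """Compare a generated plan against the historically implemented (expert) plan."""
    expert, generated = encode(expert_plan), encode(generated_plan)
    pairs = list(zip(expert, generated))
    return {
        # positions where the two binary vectors disagree
        "hamming": sum(e != g for e, g in pairs),
        # actions the expert took that the model omitted
        "missed": sum(e == 1 and g == 0 for e, g in pairs),
        # actions the model proposed that the expert did not take
        "unnecessary": sum(e == 0 and g == 1 for e, g in pairs),
    }


expert = {"lane_closure", "dispatch_tow"}
generated = {"lane_closure", "vms_announcement", "dispatch_ambulance"}
print(compare_plans(expert, generated))
```

Under this encoding the Hamming distance is the sum of missed and unnecessary actions, so a model can score few missed actions and still be far from the expert plan if it over-triggers.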
-
InfiniteAudio: Infinite-Length Audio Generation with Consistency
Authors:
Chaeyoung Jung,
Hojoon Ki,
Ji-Hoon Kim,
Junmo Kim,
Joon Son Chung
Abstract:
This paper presents InfiniteAudio, a simple yet effective strategy for generating infinite-length audio using diffusion-based text-to-audio methods. Current approaches face memory constraints because the output size increases with input length, making long duration generation challenging. A common workaround is to concatenate short audio segments, but this often leads to inconsistencies due to the lack of shared temporal context. To address this, InfiniteAudio integrates seamlessly into existing pipelines without additional training. It introduces two key techniques: FIFO sampling, a first-in, first-out inference strategy with fixed-size inputs, and curved denoising, which selectively prioritizes key diffusion steps for efficiency. Experiments show that InfiniteAudio achieves comparable or superior performance across all metrics. Audio samples are available on our project page.
Submitted 3 June, 2025;
originally announced June 2025.
-
Astrophotography turbulence mitigation via generative models
Authors:
Joonyeoup Kim,
Yu Yuan,
Xingguang Zhang,
Xijun Wang,
Stanley Chan
Abstract:
Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are available at https://web-six-kappa-66.vercel.app/
Submitted 3 June, 2025;
originally announced June 2025.
-
Beyond the LUMIR challenge: The pathway to foundational registration models
Authors:
Junyu Chen,
Shuwen Wei,
Joel Honkamaa,
Pekka Marttinen,
Hang Zhang,
Min Liu,
Yichao Zhou,
Zuopeng Tan,
Zhuoyuan Wang,
Yi Wang,
Hongchao Zhou,
Shunbo Hu,
Yi Zhang,
Qian Tao,
Lukas Förner,
Thomas Wendler,
Bailiang Jian,
Benedikt Wiestler,
Tim Hable,
Jin Kim,
Dan Ruan,
Frederic Madesta,
Thilo Sentker,
Wiebke Heyer,
Lianrui Zuo
, et al. (11 additional authors not shown)
Abstract:
Medical image challenges have played a transformative role in advancing the field, catalyzing algorithmic innovation and establishing new performance standards across diverse clinical applications. Image registration, a foundational task in neuroimaging pipelines, has similarly benefited from the Learn2Reg initiative. Building on this foundation, we introduce the Large-scale Unsupervised Brain MRI Image Registration (LUMIR) challenge, a next-generation benchmark designed to assess and advance unsupervised brain MRI registration. Distinct from prior challenges that leveraged anatomical label maps for supervision, LUMIR removes this dependency by providing over 4,000 preprocessed T1-weighted brain MRIs for training without any label maps, encouraging biologically plausible deformation modeling through self-supervision. In addition to evaluating performance on 590 held-out test subjects, LUMIR introduces a rigorous suite of zero-shot generalization tasks, spanning out-of-domain imaging modalities (e.g., FLAIR, T2-weighted, T2*-weighted), disease populations (e.g., Alzheimer's disease), acquisition protocols (e.g., 9.4T MRI), and species (e.g., macaque brains). A total of 1,158 subjects and over 4,000 image pairs were included for evaluation. Performance was assessed using both segmentation-based metrics (Dice coefficient, 95th percentile Hausdorff distance) and landmark-based registration accuracy (target registration error). Across both in-domain and zero-shot tasks, deep learning-based methods consistently achieved state-of-the-art accuracy while producing anatomically plausible deformation fields. The top-performing deep learning-based models demonstrated diffeomorphic properties and inverse consistency, outperforming several leading optimization-based methods, and showing strong robustness to most domain shifts, the exception being a drop in performance on out-of-domain contrasts.
Submitted 29 May, 2025;
originally announced May 2025.
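Of the segmentation-based metrics named in the LUMIR abstract above, the Dice coefficient is the simplest to state: twice the overlap of two label masks divided by the sum of their sizes. The sketch below represents each mask as a set of voxel coordinates. The function name and the convention of returning 1.0 for two empty masks are assumptions, not details from the challenge.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two masks given as collections of voxel coordinates."""
    voxels_a, voxels_b = set(mask_a), set(mask_b)
    if not voxels_a and not voxels_b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    overlap = len(voxels_a & voxels_b)
    return 2.0 * overlap / (len(voxels_a) + len(voxels_b))


# Toy example: a warped moving-image label versus the fixed-image label.
warped_label = {(0, 0, 0), (0, 0, 1)}
fixed_label = {(0, 0, 1), (0, 1, 1)}
print(dice_coefficient(warped_label, fixed_label))
```

In a registration setting the first mask would typically be the moving image's label map after applying the predicted deformation, compared per anatomical structure against the fixed image's label map.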
-
Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles
Authors:
Miika Toikkanen,
June-Woo Kim
Abstract:
Respiratory sound datasets are limited in size and quality, making high performance difficult to achieve. Ensemble models help but inevitably increase compute cost at inference time. Soft label training distills knowledge efficiently with extra cost only at training. In this study, we explore soft labels for respiratory sound classification as an architecture-agnostic approach to distill an ensemble of teacher models into a student model. We examine different variations of our approach and find that even a single teacher, identical to the student, considerably improves performance beyond its own capability, with optimal gains achieved using only a few teachers. We achieve the new state-of-the-art Score of 64.39 on ICBHI, surpassing the previous best by 0.85 and improving average Scores across architectures by more than 1.16. Our results highlight the effectiveness of knowledge distillation with soft labels for respiratory sound classification, regardless of size or architecture.
Submitted 28 May, 2025;
originally announced May 2025.
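The soft-label distillation idea in the abstract above can be sketched as follows: average the teachers' predicted class probabilities into a soft label, then train the student with cross-entropy against it. This is a generic illustration only. The simple averaging rule, the `temperature` parameter, and all function names are assumptions, since the abstract does not describe how the soft labels are built.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def ensemble_soft_labels(teacher_logits, temperature=1.0):
    """Average the teachers' class probabilities (assumed aggregation rule)."""
    probs = [[math.exp(lp) for lp in log_softmax(logits, temperature)]
             for logits in teacher_logits]
    return [sum(column) / len(probs) for column in zip(*probs)]


def soft_cross_entropy(student_logits, soft_labels):
    """Cross-entropy between the soft labels and the student's prediction."""
    return -sum(q * lp for q, lp in zip(soft_labels, log_softmax(student_logits)))


# Two hypothetical teachers that disagree on a two-class example.
soft = ensemble_soft_labels([[2.0, 0.0], [0.0, 2.0]])
print(soft, soft_cross_entropy([0.0, 0.0], soft))
```

Since the soft labels depend only on teacher outputs, the student can have any architecture, which matches the architecture-agnostic framing in the abstract. The teachers are needed only during training.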
-
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
Authors:
Jeongsoo Choi,
Jaehun Kim,
Joon Son Chung
Abstract:
This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance.
Submitted 27 May, 2025;
originally announced May 2025.
-
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
Authors:
Jeongsoo Choi,
Zhikang Niu,
Ji-Hoon Kim,
Chunhui Wang,
Joon Son Chung,
Xie Chen
Abstract:
The goal of this paper is to optimize the training process of diffusion-based text-to-speech models. While recent studies have achieved remarkable advancements, their training demands substantial time and computational costs, largely due to the implicit guidance of diffusion models in learning complex intermediate representations. To address this, we propose A-DMA, an effective strategy for Accelerating training with Dual Modality Alignment. Our method introduces a novel alignment pipeline leveraging both text and speech modalities: text-guided alignment, which incorporates contextual representations, and speech-guided alignment, which refines semantic representations. By aligning hidden states with discriminative features, our training scheme reduces the reliance on diffusion models for learning complex representations. Extensive experiments demonstrate that A-DMA doubles the convergence speed while achieving superior performance over baselines. Code and demo samples are available at: https://github.com/ZhikangNiu/A-DMA
Submitted 30 May, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Stack Less, Repeat More: A Block Reusing Approach for Progressive Speech Enhancement
Authors:
Jangyeon Kim,
Ui-Hyeop Shin,
Jaehyun Ko,
Hyung-Min Park
Abstract:
This paper presents an efficient speech enhancement (SE) approach that reuses a processing block repeatedly instead of conventional stacking. Rather than increasing the number of blocks for learning deep latent representations, repeating a single block leads to progressive refinement while reducing parameter redundancy. We also minimize domain transformation by keeping the encoder and decoder shallow and reusing a single sequence modeling block. Experimental results show that the number of processing stages is more critical to performance than the number of blocks with different weights. We also observed that the proposed method gradually refines a noisy input within a single block. Furthermore, with the block reuse method, we demonstrate that deepening the encoder and decoder can be redundant for learning deep, complex representations. Therefore, the experimental results confirm that the proposed block reusing enables progressive learning and provides an efficient alternative for SE.
Submitted 25 May, 2025;
originally announced May 2025.
-
GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor
Authors:
Seokgi Lee,
Jungjun Kim
Abstract:
We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.
Submitted 25 May, 2025;
originally announced May 2025.
-
Accelerating Battery Material Optimization through iterative Machine Learning
Authors:
Seon-Hwa Lee,
Insoo Ye,
Changhwan Lee,
Jieun Kim,
Geunho Choi,
Sang-Cheol Nam,
Inchul Park
Abstract:
The performance of battery materials is determined by their composition and the processing conditions employed during commercial-scale fabrication, where raw materials undergo complex processing steps with various additives to yield final products. As the complexity of these parameters expands with the development of industry, conventional one-factor-at-a-time (OFAT) experiment becomes old fashioned. While domain expertise aids in parameter optimization, this traditional approach becomes increasingly vulnerable to cognitive limitations and anthropogenic biases as the complexity of factors grows. Herein, we introduce an iterative machine learning (ML) framework that integrates active learning to guide targeted experimentation and facilitate incremental model refinement. This method systematically leverages comprehensive experimental observations, including both successful and unsuccessful results, effectively mitigating human-induced biases and alleviating data scarcity. Consequently, it significantly accelerates exploration within the high-dimensional design space. Our results demonstrate that active-learning-driven experimentation markedly reduces the total number of experimental cycles necessary, underscoring the transformative potential of ML-based strategies in expediting battery material optimization.
Submitted 12 May, 2025;
originally announced May 2025.
-
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Authors:
Seungeun Oh,
Jinhyuk Kim,
Jihong Park,
Seung-Woo Ko,
Jinho Choi,
Tony Q. S. Quek,
Seong-Lyun Kim
Abstract:
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM's uncertainty and LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
Submitted 16 May, 2025;
originally announced May 2025.
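The opportunistic, compressed transmission described in the CU-HLM abstract above can be illustrated with a minimal sketch. The abstract does not state the uncertainty measure, the threshold, or the truncation rule, so the normalized-entropy measure, the fixed `threshold`, the top-`k` truncation, and all function names below are assumptions made for illustration only.

```python
import math

def uncertainty(probs):
    """Assumed uncertainty measure: Shannon entropy normalized to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def truncate_top_k(probs, k):
    """Keep the k most likely tokens and renormalize; returns {token_id: prob}."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def slm_message(probs, threshold, k):
    """Return None (skip the upload) for a confident draft token,
    otherwise the truncated distribution to send to the remote LLM."""
    if uncertainty(probs) <= threshold:
        return None
    return truncate_top_k(probs, k)

# Hypothetical draft-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.40, 0.35, 0.15, 0.10]

print(slm_message(confident, threshold=0.5, k=2))  # upload skipped
print(slm_message(uncertain, threshold=0.5, k=2))  # truncated upload
```

In this toy setting the confident draft token is never uploaded, while the uncertain one is sent with only its two most likely entries, which is the source of both the skipped transmissions and the vocabulary compression the abstract reports.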
-
On the Sharp Input-Output Analysis of Nonlinear Systems under Adversarial Attacks
Authors:
Jihun Kim,
Yuchen Fang,
Javad Lavaei
Abstract:
This paper is concerned with learning the input-output mapping of general nonlinear dynamical systems. While the existing literature focuses on Gaussian inputs and benign disturbances, we significantly broaden the scope of admissible control inputs and allow correlated, nonzero-mean, adversarial disturbances. With our reformulation as a linear combination of basis functions, we prove that the $l_1$-norm estimator overcomes the challenges as long as the probability that the system is under adversarial attack at a given time is smaller than a certain threshold. We provide an estimation error bound that decays with the input memory length and prove its optimality by constructing a problem instance that suffers from the same bound under adversarial attacks. Our work provides a sharp input-output analysis for a generic nonlinear and partially observed system under significantly generalized assumptions compared to existing works.
Submitted 16 May, 2025;
originally announced May 2025.
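Why an $l_1$-norm estimator tolerates arbitrary attacks when they occur rarely enough can be seen in a toy scalar case. The sketch below is not the estimator from the paper, which works with a linear combination of basis functions; it assumes a single unknown coefficient `theta` in `y = theta * x`, for which the $l_1$ minimizer is a weighted median of the ratios `y / x`. The data and function names are hypothetical.

```python
def l1_estimate(xs, ys):
    """Minimize sum_i |y_i - theta * x_i| over scalar theta.

    The objective equals sum_i |x_i| * |y_i / x_i - theta|, so a minimizer
    is the weighted median of the ratios y_i / x_i with weights |x_i|.
    """
    pairs = sorted((y / x, abs(x)) for x, y in zip(xs, ys) if x != 0)
    if not pairs:
        raise ValueError("need at least one nonzero input")
    half = sum(weight for _, weight in pairs) / 2.0
    running = 0.0
    for ratio, weight in pairs:
        running += weight
        if running >= half:
            return ratio

def l2_estimate(xs, ys):
    """Ordinary least squares for the same scalar model, for comparison."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

theta_true = 2.0
xs = [1.0, -2.0, 3.0, -1.0, 2.0, -3.0, 1.0, 2.0]
ys = [theta_true * x for x in xs]
ys[2] = 50.0    # adversarially corrupted measurement
ys[5] = -40.0   # adversarially corrupted measurement

print(l1_estimate(xs, ys))  # unaffected by the two corrupted samples
print(l2_estimate(xs, ys))  # pulled far away from theta_true
```

Because the corrupted samples carry less than half of the total weight, the weighted median still lands on the true coefficient, whereas least squares is dragged away by the attack values.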
-
DRL-Based Injection Molding Process Parameter Optimization for Adaptive and Profitable Production
Authors:
Joon-Young Kim,
Jecheon Yu,
Heekyu Kim,
Seunghwa Ryu
Abstract:
Plastic injection molding remains essential to modern manufacturing. However, optimizing process parameters to balance product quality and profitability under dynamic environmental and economic conditions remains a persistent challenge. This study presents a novel deep reinforcement learning (DRL)-based framework for real-time process optimization in injection molding, integrating product quality and profitability into the control objective. A profit function was developed to reflect real-world manufacturing costs, incorporating resin, mold wear, and electricity prices, including time-of-use variations. Surrogate models were constructed to predict product quality and cycle time, enabling efficient offline training of DRL agents using soft actor-critic (SAC) and proximal policy optimization (PPO) algorithms. Experimental results demonstrate that the proposed DRL framework can dynamically adapt to seasonal and operational variations, consistently maintaining product quality while maximizing profit. Compared to traditional optimization methods such as genetic algorithms, the DRL models achieved comparable economic performance with up to 135x faster inference speeds, making them well-suited for real-time applications. The framework's scalability and adaptability highlight its potential as a foundation for intelligent, data-driven decision-making in modern manufacturing environments.
Submitted 16 May, 2025;
originally announced May 2025.
-
Wearable Tracking of Eye and Body Movements During Breaching Training: Towards Real-Time Blast Injury Monitoring
Authors:
Jeremy P. Kemmerer,
James R. Williamson,
Joseph Kim,
Elizabeth Halford,
Hrishikesh M. Rao,
Christopher J. Smalt
Abstract:
Repeated exposure to blast overpressure in occupational settings has been associated with changes in cognitive and psychological health, as well as deficits in neurosensory subsystems. In this work, we describe a wearable system to simultaneously monitor physiology and blast exposure levels and demonstrate how this system can identify individualized exposure levels corresponding to acute physiological response to blast exposure. Machine learning was used to develop a dose-response model that fused multiple physiological measures (electrooculography, gait, and balance) into a single risk score by predicting the level of blast exposure on held-out subjects (Fused model, R = 0.60). We found that blast events with peak pressure levels as low as 0.25 psi could be related to physiological changes and hence may contribute to blast injury. We also identified an individual subject with deteriorating reaction time scores that consistently showed a rapid and anomalous change in physiology-based risk scores after exposure to low-level blast events. Our results suggest that the wearable approach to blast monitoring is viable in weapons training environments as a complement to more direct but sparsely administered brain health assessments, potentially viable in austere environments, and that fusing multiple physiological signals can improve sensitivity.
Submitted 14 May, 2025;
originally announced May 2025.
-
Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge
Authors:
Chao-Han Huck Yang,
Sreyan Ghosh,
Qing Wang,
Jaeyeon Kim,
Hengyi Hong,
Sonal Kumar,
Guirui Zhong,
Zhifeng Kong,
S Sakshi,
Vaibhavi Lokegaonkar,
Oriol Nieto,
Ramani Duraiswami,
Dinesh Manocha,
Gunhee Kim,
Jun Du,
Rafael Valle,
Bryan Catanzaro
Abstract:
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the world effectively.
Submitted 12 May, 2025;
originally announced May 2025.
-
SwinLSTM Autoencoder for Temporal-Spatial-Frequency Domain CSI Compression in Massive MIMO Systems
Authors:
Aakash Saini,
Yunchou Xing,
Jee Hyun Kim,
Amir Ahmadian Tehrani,
Wolfgang Gerstacker
Abstract:
This study presents a parameter-light, low-complexity artificial intelligence/machine learning (AI/ML) model that enhances channel state information (CSI) feedback in wireless systems by jointly exploiting temporal, spatial, and frequency (TSF) domain correlations. While traditional frameworks use autoencoders for CSI compression at the user equipment (UE) and reconstruction at the network (NW) side in the spatial-frequency (SF) domain, massive multiple-input multiple-output (mMIMO) systems in low-mobility scenarios exhibit strong temporal correlation alongside frequency and spatial correlations. An autoencoder architecture alone is insufficient to exploit the TSF domain correlation in CSI; a recurrent element is also required. To address the vanishing gradients problem, recent works have proposed state-of-the-art TSF domain CSI compression architectures that combine recurrent networks for temporal correlation exploitation with deep pre-trained autoencoders that handle SF domain CSI compression. However, this approach increases the number of parameters and computational complexity. To jointly utilize correlations across the TSF domain, we propose a novel, parameter-light, low-complexity AI/ML-based recurrent autoencoder architecture to compress CSI at the UE side and reconstruct it on the NW side while minimizing CSI feedback overhead.
Submitted 7 May, 2025;
originally announced May 2025.
-
MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction
Authors:
Andrew Zhang,
Hao Wang,
Shuchang Ye,
Michael Fulham,
Jinman Kim
Abstract:
Patient motion during medical image acquisition causes blurring, ghosting, and organ distortion, which makes image interpretation challenging. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods, with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss, effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY), which first characterizes motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM) to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced, and (b) introducing the Variance-Selective SSIM (VS-SSIM) loss, which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.
Submitted 8 May, 2025; v1 submitted 6 May, 2025;
originally announced May 2025.
-
Multi-Antenna Users in Cell-Free Massive MIMO: Stream Allocation and Necessity of Downlink Pilots
Authors:
Eren Berk Kama,
Junbeom Kim,
Emil Björnson
Abstract:
We consider a cell-free massive multiple-input multiple-output (MIMO) system with multiple antennas on the users and access points (APs). In previous works, the downlink spectral efficiency (SE) has been evaluated using the hardening bound that requires no downlink pilots. This approach works well for single-antenna users. In this paper, we show that much higher SEs can be achieved if downlink pilots are sent when having multi-antenna users. The reason is that the effective channel matrix does not harden. We propose a pilot-based downlink estimation scheme, derive a new SE expression, and show numerically that it yields substantially higher performance when having correlated Rayleigh fading channels.
In cases with multi-antenna users, the APs can either transmit the same or different data streams. The latter reduces the fronthaul signaling but comes with a SE loss. We propose precoding and combining schemes for these cases and consider whether channel knowledge is shared between the APs. Finally, we show numerically how the number of users, APs, and the number of antennas on users and APs affect the SE.
Submitted 5 May, 2025;
originally announced May 2025.
-
Stabilization by Controllers Having Integer Coefficients
Authors:
Joowon Lee,
Donggil Lee,
Junsoo Kim
Abstract:
The system property of ``having integer coefficients,'' that is, a transfer function has an integer monic polynomial as its denominator, is significant in the field of encrypted control as it is required for a dynamic controller to be realized over encrypted data. This paper shows that there always exists a controller with integer coefficients stabilizing a given discrete-time linear time-invariant plant. A constructive algorithm to obtain such a controller is provided, along with numerical examples. Furthermore, the proposed method is applied to converting a pre-designed controller to have integer coefficients, while the original performance is preserved in the sense that the transfer function of the closed-loop system remains unchanged.
Submitted 1 May, 2025;
originally announced May 2025.
-
Efficient COLREGs-Compliant Collision Avoidance using Turning Circle-based Control Barrier Function
Authors:
Changyu Lee,
Jinwook Park,
Jinwhan Kim
Abstract:
This paper proposes a computationally efficient collision avoidance algorithm using turning circle-based control barrier functions (CBFs) that comply with international regulations for preventing collisions at sea (COLREGs). Conventional CBFs often lack explicit consideration of turning capabilities and avoidance direction, which are key elements in developing a COLREGs-compliant collision avoidance algorithm. To overcome these limitations, we introduce two CBFs derived from left and right turning circles. These functions establish safety conditions based on the proximity between the traffic ships and the centers of the turning circles, effectively determining both avoidance directions and turning capabilities. The proposed method formulates a quadratic programming problem with the CBFs as constraints, ensuring safe navigation without relying on computationally intensive trajectory optimization. This approach significantly reduces computational effort while maintaining performance comparable to model predictive control-based methods. Simulation results validate the effectiveness of the proposed algorithm in enabling COLREGs-compliant, safe navigation, demonstrating its potential for reliable and efficient operation in complex maritime environments.
Submitted 27 April, 2025;
originally announced April 2025.
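The abstract above formulates collision avoidance as a quadratic program with CBFs as constraints. As a minimal sketch of that idea, and not the authors' formulation, the filter below assumes a single affine constraint `a . u >= b` on the control input `u` and a nominal command `u_nom`; with one constraint the QP has a closed-form solution. The names `cbf_qp_filter`, `u_nom`, `a`, and `b` are hypothetical, and the values that a turning-circle CBF would supply for `a` and `b` are not given in the abstract.

```python
def cbf_qp_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a . u >= b in closed form.

    If the nominal input already satisfies the constraint it is returned
    unchanged; otherwise it is projected onto the constraint boundary.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible: a is zero and b > 0")
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]

# Hypothetical example: u = [surge speed, yaw rate]; the assumed constraint
# demands a yaw rate of at least 0.5 to stay on the safe side.
print(cbf_qp_filter([1.0, 0.0], a=[0.0, 1.0], b=0.5))  # nominal input is corrected
print(cbf_qp_filter([1.0, 0.8], a=[0.0, 1.0], b=0.5))  # nominal input is kept
```

With two CBFs, as in the abstract, more than one constraint can be active and a general QP solver would replace this closed-form projection.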
-
Documentation on Encrypted Dynamic Control Simulation Code using Ring-LWE based Cryptosystems
Authors:
Yeongjun Jang,
Joowon Lee,
Junsoo Kim
Abstract:
Encrypted controllers offer secure computation by employing modern cryptosystems to execute control operations directly over encrypted data without decryption. However, incorporating cryptosystems into dynamic controllers significantly increases the computational load. This paper aims to provide an accessible guideline for running encrypted controllers using the open-source library Lattigo, which supports an efficient implementation of Ring-Learning With Errors (LWE) based encrypted controllers, and our explanations are assisted with example codes that are fully available at https://github.com/CDSL-EncryptedControl/CDSL.
Submitted 17 April, 2025;
originally announced April 2025.
-
OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation
Authors:
Justin Namuk Kim,
Yiqiao Liu,
Rajath Soans,
Keith Persson,
Sarah Halek,
Michal Tomaszewski,
Jianda Yuan,
Gregory Goldmacher,
Antong Chen
Abstract:
Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-art (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.
Submitted 24 April, 2025; v1 submitted 13 April, 2025;
originally announced April 2025.
-
Asymptotic stabilization under homomorphic encryption: A re-encryption free method
Authors:
Shuai Feng,
Qian Ma,
Junsoo Kim,
Shengyuan Xu
Abstract:
In this paper, we propose methods to encrypt a pre-given dynamic controller with homomorphic encryption, without re-encrypting the control inputs. We first present a preliminary result showing that the coefficients in a pre-given dynamic controller can be scaled up into integers by the zooming-in factor in dynamic quantization, without utilizing re-encryption. However, a sufficiently small zooming-in factor may not always exist because it requires that the convergence speed of the pre-given closed-loop system should be sufficiently fast. Then, as the main result, we design a new controller approximating the pre-given dynamic controller, in which the zooming-in factor is decoupled from the convergence rate of the pre-given closed-loop system. Therefore, there always exist a (sufficiently small) zooming-in factor of dynamic quantization scaling up all the controller's coefficients to integers, and a finite modulus preventing overflow in cryptosystems. The process is asymptotically stable and the quantizer is not saturated.
Submitted 12 April, 2025;
originally announced April 2025.
-
System Identification from Partial Observations under Adversarial Attacks
Authors:
Jihun Kim,
Javad Lavaei
Abstract:
This paper is concerned with the partially observed linear system identification, where the goal is to obtain reasonably accurate estimation of the balanced truncation of the true system up to the order $k$ from output measurements. We consider the challenging case of system identification under adversarial attacks, where the probability of having an attack at each time is $Θ(1/k)$ while the value of the attack is arbitrary. We first show that the $l_1$-norm estimator exactly identifies the true Markov parameter matrix for nilpotent systems under any type of attack. We then build on this result to extend it to general systems and show that the estimation error exponentially decays as $k$ grows. The estimated balanced truncation model accordingly shows an exponentially decaying error for the identification of the true system up to the similarity transformation. This work is the first to provide the input-output analysis of the system with partial observations under arbitrary attacks.
Submitted 31 March, 2025;
originally announced April 2025.
-
Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach
Authors:
Zhen Lin,
Hongyu Yuan,
Richard Barcus,
Qing Lyu,
Sucheta Chakravarty,
Megan E. Lipford,
Carol A. Shively,
Suzanne Craft,
Mohammad Kawas,
Jeongchul Kim,
Christopher T. Whitlow
Abstract:
Non-human primates (NHPs) serve as critical models for understanding human brain function and neurological disorders due to their close evolutionary relationship with humans. Accurate brain tissue segmentation in NHPs is critical for understanding neurological disorders, but challenging due to the scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain, the limited resolution of available imaging data and the anatomical differences between human and NHP brains. To address these challenges, we propose a novel approach utilizing STU-Net with transfer learning to leverage knowledge transferred from human brain MRI data to enhance segmentation accuracy in the NHP brain MRI, particularly when training data is limited. The combination of STU-Net and transfer learning effectively delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains. Notably, our method demonstrated improvement in segmenting small subcortical structures such as putamen and thalamus that are challenging to resolve with limited spatial resolution and tissue contrast, and achieved DSC of over 0.88, IoU over 0.8 and HD95 under 7. This study introduces a robust method for multi-class brain tissue segmentation in NHPs, potentially accelerating research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.
Submitted 1 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Turning Circle-based Control Barrier Function for Efficient Collision Avoidance of Nonholonomic Vehicles
Authors:
Changyu Lee,
Kiyong Park,
Jinwhan Kim
Abstract:
This paper presents a new control barrier function (CBF) designed to improve the efficiency of collision avoidance for nonholonomic vehicles. Traditional CBFs typically rely on the shortest Euclidean distance to obstacles, overlooking the limited heading change ability of nonholonomic vehicles. This often leads to abrupt maneuvers and excessive speed reductions, which is not desirable and reduces the efficiency of collision avoidance. Our approach addresses these limitations by incorporating the distance to the turning circle, considering the vehicle's limited maneuverability imposed by its nonholonomic constraints. The proposed CBF is integrated with model predictive control (MPC) to generate more efficient trajectories compared to existing methods that rely solely on Euclidean distance-based CBFs. The effectiveness of the proposed method is validated through numerical simulations on unicycle vehicles and experiments with underactuated surface vehicles.
Submitted 26 March, 2025;
originally announced March 2025.
-
Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Authors:
Taejin Jeong,
Joohyeok Kim,
Jaehoon Joo,
Yeonwoo Jung,
Hyeonmin Kim,
Seong Jae Hwang
Abstract:
Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This stems from glaucoma being a multifaceted disease that is influenced by various factors. As a result, glaucoma diagnosis is highly subjective, emphasizing the necessity of calibration, which aligns predicted probabilities with actual disease likelihood. Proper calibration is essential to prevent overdiagnosis or misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, overconfidence in models has worsened calibration performance. Recent studies have begun focusing on calibration for glaucoma. Nevertheless, previous studies have not fully considered glaucoma's systemic nature and the high subjectivity in its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multifaceted nature of glaucoma diagnosis. Additionally, we introduce an MC dropout-based Voting System to address high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that our proposed methods are effective in addressing calibration issues. We validate our method using a custom dataset including binocular data.
Submitted 24 March, 2025;
originally announced March 2025.
-
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Authors:
Ji-Hoon Kim,
Jeongsoo Choi,
Jaehun Kim,
Chaeyoung Jung,
Joon Son Chung
Abstract:
The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages -- content, timbre, and prosody modeling. In each stage, we align visual factors -- lip movements, face identity, and facial expressions -- with corresponding acoustic counterparts to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Submitted 21 March, 2025;
originally announced March 2025.
-
PCLA: A Framework for Testing Autonomous Agents in the CARLA Simulator
Authors:
Masoud Jamshidiyan Tehrani,
Jinhan Kim,
Paolo Tonella
Abstract:
Recent research on testing autonomous driving agents has grown significantly, especially in simulation environments. The CARLA simulator is often the preferred choice, and the autonomous agents from the CARLA Leaderboard challenge are regarded as the best-performing agents within this environment. However, researchers who test these agents, rather than training their own ones from scratch, often face challenges in utilizing them within customized test environments and scenarios. To address these challenges, we introduce PCLA (Pretrained CARLA Leaderboard Agents), an open-source Python testing framework that includes nine high-performing pre-trained autonomous agents from the Leaderboard challenges. PCLA is the first infrastructure specifically designed for testing various autonomous agents in arbitrary CARLA environments/scenarios. PCLA provides a simple way to deploy Leaderboard agents onto a vehicle without relying on the Leaderboard codebase, it allows researchers to easily switch between agents without requiring modifications to CARLA versions or programming environments, and it is fully compatible with the latest version of CARLA while remaining independent of the Leaderboard's specific CARLA version. PCLA is publicly accessible at https://github.com/MasoudJTehrani/PCLA.
Submitted 13 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Merry-Go-Round: Safe Control of Decentralized Multi-Robot Systems with Deadlock Prevention
Authors:
Wonjong Lee,
Joonyeol Sim,
Joonkyung Kim,
Siwon Jo,
Wenhao Luo,
Changjoo Nam
Abstract:
We propose a hybrid approach for decentralized multi-robot navigation that ensures both safety and deadlock prevention. Building on a standard control formulation, we add a lightweight deadlock prevention mechanism by forming temporary "roundabouts" (circular reference paths). Each robot relies only on local, peer-to-peer communication and a controller for base collision avoidance; a roundabout is generated or joined on demand to avert deadlocks. Robots in the roundabout travel in one direction until an escape condition is met, allowing them to return to goal-oriented motion. Unlike classical decentralized methods that lack explicit deadlock resolution, our roundabout maneuver ensures system-wide forward progress while preserving safety constraints. Extensive simulations and physical robot experiments show that our method consistently outperforms or matches the success and arrival rates of other decentralized control approaches, particularly in cluttered or high-density scenarios, all with minimal centralized coordination.
Submitted 7 March, 2025;
originally announced March 2025.
-
A Risk-aware Bi-level Bidding Strategy for Virtual Power Plant with Power-to-Hydrogen System
Authors:
Jaehyun Yoo,
Jip Kim
Abstract:
This paper presents a risk-aware bi-level bidding strategy for a Virtual Power Plant (VPP) that integrates a Power-to-Hydrogen (P2H) system, addressing the challenges posed by renewable energy variability and market volatility. By incorporating Conditional Value at Risk (CVaR) within the bi-level optimization framework, the proposed strategy enables VPPs to mitigate financial risks associated with uncertain market conditions. The upper-level problem seeks to maximize revenue through optimal bidding, while the lower-level problem ensures market-clearing compliance. The integration of the P2H system allows surplus renewable energy to be stored as hydrogen, which is utilized as an energy carrier, thereby increasing market profitability and enhancing resilience against financial risks. The effectiveness of the proposed strategy is validated through a modified IEEE 14 bus system, demonstrating that the inclusion of the P2H system and CVaR-based risk aversion enhances both revenue and financial hedging capability under volatile market conditions. This paper underscores the strategic role of hydrogen storage in VPP operations, contributing to improved profitability and the efficacy of a risk-aware bidding strategy.
Submitted 7 March, 2025;
originally announced March 2025.
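The abstract above relies on Conditional Value at Risk (CVaR). For readers unfamiliar with the measure, the following is a minimal sketch of an empirical CVaR over sampled scenario losses, i.e. the average of the worst (1 - alpha) fraction of outcomes. It is not the paper's bi-level formulation; the function name `empirical_cvar` and the scenario values are made up for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Average of the worst (1 - alpha) fraction of the sampled losses."""
    if not losses or not 0.0 < alpha < 1.0:
        raise ValueError("need at least one loss and 0 < alpha < 1")
    ordered = sorted(losses, reverse=True)  # largest losses first
    # Number of tail scenarios; the small epsilon guards against float round-off.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k


scenario_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up loss per market scenario
cvar_80 = empirical_cvar(scenario_losses, 0.8)      # mean of the two worst losses
```

In an optimization model, CVaR is usually expressed through auxiliary variables and linear constraints instead of sorting, but the quantity being controlled is the same tail average.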
-
Community Energy Management System for Fast Frequency Response: A Hierarchical Control Approach
Authors:
Joonsung Jung,
Hyunjoong Kim,
Hyunghwan Shin,
Jip Kim
Abstract:
The increase in renewable energy sources (RES) has reduced power system inertia, making frequency stabilization more challenging and highlighting the need for fast frequency response (FFR) resources. While building energy management systems (BEMS) equipped with distributed energy resources (DERs) can provide FFR, individual BEMS alone cannot fully meet demand. To address this, we propose a community energy management system (CEMS) operational model that minimizes energy costs and generates additional revenue by providing FFR through coordinated DERs and building loads under photovoltaic (PV) generation uncertainty. The model incorporates a hierarchical control framework with three levels: Level 1 allocates maximum FFR capacity, Level 2 employs scenario-based stochastic model predictive control (SMPC) to adjust DER operations and ensure FFR provision despite PV uncertainties, and Level 3 performs rapid load adjustments in response to frequency fluctuations detected by a frequency meter. Simulation results on a campus building cluster demonstrate the effectiveness of the proposed model, achieving a 10% reduction in energy costs and a 24% increase in FFR capacity, all while maintaining occupant comfort and enhancing frequency stabilization.
Submitted 7 March, 2025;
originally announced March 2025.
-
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Authors:
Sreyan Ghosh,
Zhifeng Kong,
Sonal Kumar,
S Sakshi,
Jaehyeon Kim,
Wei Ping,
Rafael Valle,
Dinesh Manocha,
Bryan Catanzaro
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
Submitted 5 March, 2025;
originally announced March 2025.
-
Deep learning approaches to surgical video segmentation and object detection: A Scoping Review
Authors:
Devanish N. Kamtam,
Joseph B. Shrager,
Satya Deepya Malla,
Nicole Lin,
Juan J. Cardona,
Jake J. Kim,
Clarence Hu
Abstract:
Introduction: Computer vision (CV) has had a transformative impact in biomedical fields such as radiology, dermatology, and pathology. Its real-world adoption in surgical applications, however, remains limited. We review the current state-of-the-art performance of deep learning (DL)-based CV models for segmentation and object detection of anatomical structures in videos obtained during surgical procedures.
Methods: We conducted a scoping review of studies on semantic segmentation and object detection of anatomical structures published between 2014 and 2024 from 3 major databases - PubMed, Embase, and IEEE Xplore. The primary objective was to evaluate the state-of-the-art performance of semantic segmentation in surgical videos. Secondary objectives included examining DL models, progress toward clinical applications, and the specific challenges with segmentation of organs/tissues in surgical videos.
Results: We identified 58 relevant published studies. These focused predominantly on procedures from general surgery [20(34.4%)], colorectal surgery [9(15.5%)], and neurosurgery [8(13.8%)]. Cholecystectomy [14(24.1%)] and low anterior rectal resection [5(8.6%)] were the most common procedures addressed. Semantic segmentation [47(81%)] was the primary CV task. U-Net [14(24.1%)] and DeepLab [13(22.4%)] were the most widely used models. Larger organs such as the liver (Dice score: 0.88) had higher accuracy compared to smaller structures such as nerves (Dice score: 0.49). Models demonstrated real-time inference potential ranging from 5-298 frames-per-second (fps).
Conclusion: This review highlights the significant progress made in DL-based semantic segmentation for surgical videos with real-time applicability, particularly for larger organs. Addressing challenges with smaller structures, data availability, and generalizability remains crucial for future advancements.
Submitted 23 February, 2025;
originally announced February 2025.
-
Structure-from-Sherds++: Robust Incremental 3D Reassembly of Axially Symmetric Pots from Unordered and Mixed Fragment Collections
Authors:
Seong Jong Yoo,
Sisung Liu,
Muhammad Zeeshan Arshad,
Jinhyeok Kim,
Young Min Kim,
Yiannis Aloimonos,
Cornelia Fermuller,
Kyungdon Joo,
Jinwook Kim,
Je Hyeong Hong
Abstract:
Reassembling multiple axially symmetric pots from fragmentary sherds is crucial for cultural heritage preservation, yet it poses significant challenges due to thin and sharp fracture surfaces that generate numerous false positive matches and hinder large-scale puzzle solving. Existing global approaches, which optimize all potential fragment pairs simultaneously, and data-driven models are prone to local minima and face scalability issues when multiple pots are intermixed. Motivated by Structure-from-Motion (SfM) for 3D reconstruction from multiple images, we propose an efficient reassembly method for axially symmetric pots based on iterative registration of one sherd at a time, called Structure-from-Sherds++ (SfS++). Our method extends beyond simple replication of incremental SfM and leverages multi-graph beam search to explore multiple registration paths. This allows us to effectively filter out indistinguishable false matches and simultaneously reconstruct multiple pots without requiring prior information such as the base or the number of mixed objects. Our approach achieves 87% reassembly accuracy on a dataset of 142 real fragments from 10 different pots, outperforming other methods in handling complex fracture patterns with mixed datasets and achieving state-of-the-art performance. Code and results can be found on our project page https://sj-yoo.info/sfs/.
Submitted 18 February, 2025;
originally announced February 2025.
-
Anomaly Detection with LWE Encrypted Control
Authors:
Rijad Alisic,
Junsoo Kim,
Henrik Sandberg
Abstract:
Detecting attacks using encrypted signals is challenging since encryption hides their information content. We present a novel mechanism for anomaly detection over Learning with Errors (LWE) encrypted signals without using decryption, secure channels, or complex communication schemes. Instead, the detector exploits the homomorphic property of LWE encryption to perform hypothesis tests on transformations of the encrypted samples. The specific transformations are determined by solutions to a hard lattice-based minimization problem. While the test's sensitivity deteriorates with suboptimal solutions, similar to the exponential deterioration of the (related) test that breaks the cryptosystem, we show that the deterioration is polynomial for our test. This rate gap can be exploited to pick parameters that lead to somewhat weaker encryption but large gains in detection capability. Finally, we conclude the paper by presenting a numerical example that simulates anomaly detection, demonstrating the effectiveness of our method in identifying attacks.
Submitted 14 February, 2025;
originally announced February 2025.
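The abstract above builds on the homomorphic property of LWE encryption. The sketch below illustrates only that property (adding two ciphertexts adds the underlying messages) with a toy symmetric-key LWE scheme. It is not the paper's detector, the parameters `N`, `Q`, `T` are illustrative assumptions chosen for readability, and they offer no real security.

```python
import random

N, Q, T = 16, 2 ** 15, 16  # toy dimension, ciphertext modulus, plaintext modulus (not secure)
DELTA = Q // T             # scaling factor that lifts the message above the noise


def keygen(rng):
    """Sample a uniformly random secret vector in Z_Q^N."""
    return [rng.randrange(Q) for _ in range(N)]


def encrypt(m, s, rng):
    """Return (a, b) with b = <a, s> + e + DELTA * m (mod Q) for small noise e."""
    a = [rng.randrange(Q) for _ in range(N)]
    e = rng.choice([-1, 0, 1])
    b = (sum(ai * si for ai, si in zip(a, s)) + e + DELTA * (m % T)) % Q
    return a, b


def add(c1, c2):
    """Component-wise ciphertext addition; the messages (and the noise) add up."""
    (a1, b1), (a2, b2) = c1, c2
    return [(x + y) % Q for x, y in zip(a1, a2)], (b1 + b2) % Q


def decrypt(c, s):
    """Remove the mask <a, s>, then round to the nearest multiple of DELTA."""
    a, b = c
    noisy = (b - sum(ai * si for ai, si in zip(a, s))) % Q
    return ((noisy + DELTA // 2) // DELTA) % T


rng = random.Random(0)  # fixed seed so the example is reproducible
s = keygen(rng)
c_sum = add(encrypt(3, s, rng), encrypt(5, s, rng))
result = decrypt(c_sum, s)  # the sum was computed without decrypting the inputs
```

Decryption stays correct only while the accumulated noise remains below `DELTA / 2`, which is why the noise level and the modulus have to be chosen together.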
-
Rate-Splitting Multiple Access for 6G: Prototypes, Experimental Results and Link/System level Simulations
Authors:
Sundar Aditya,
Yong Jin Daniel Kim,
David Vargas,
David Redgate,
Onur Dizdar,
Neil Bhushan,
Xinze Lyu,
Sibo Zhang,
Stephen Wang,
Bruno Clerckx
Abstract:
Rate-Splitting Multiple Access (RSMA) is a powerful and versatile physical layer multiple access technique that generalizes and has better interference management capabilities than 5G-based Space Division Multiple Access (SDMA). It is also a rapidly maturing technology, all of which makes it a natural successor to SDMA in 6G. In this article, we describe RSMA's suitability for 6G by presenting: (i) link- and system-level simulations of RSMA's performance gains over SDMA in realistic environments, and (ii) pioneering experimental results that demonstrate RSMA's gains over SDMA for key use cases like enhanced Mobile Broadband (eMBB) and Integrated Sensing and Communications (ISAC). We also comment on the status of standardization activities for RSMA.
Submitted 17 February, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Improving Lesion Segmentation in Medical Images by Global and Regional Feature Compensation
Authors:
Chuhan Wang,
Zhenghao Chen,
Jean Y. H. Yang,
Jinman Kim
Abstract:
Automated lesion segmentation of medical images has made tremendous improvements in recent years due to deep learning advancements. However, accurately capturing fine-grained global and regional feature representations remains a challenge. Many existing methods obtain suboptimal performance on complex lesion segmentation due to information loss during typical downsampling operations and the insufficient capture of either regional or global features. To address these issues, we propose the Global and Regional Compensation Segmentation Framework (GRCSF), which introduces two key innovations: the Global Compensation Unit (GCU) and the Region Compensation Unit (RCU). The proposed GCU addresses resolution loss in the U-shaped backbone by preserving global contextual features and fine-grained details during multiscale downsampling. Meanwhile, the RCU introduces a self-supervised learning (SSL) residual map generated by Masked Autoencoders (MAE), obtained as pixel-wise differences between reconstructed and original images, to highlight regions with potential lesions. These SSL residual maps guide precise lesion localization and segmentation through a patch-based cross-attention mechanism that integrates regional spatial and pixel-level features. Additionally, the RCU incorporates patch-level importance scoring to enhance feature fusion by leveraging global spatial information from the backbone. Experiments on two publicly available medical image segmentation datasets, including brain stroke lesion and coronary artery calcification datasets, demonstrate that our GRCSF outperforms state-of-the-art methods, confirming its effectiveness across diverse lesion types and its potential as a generalizable lesion segmentation solution.
△ Less
Submitted 11 February, 2025;
originally announced February 2025.
-
Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge
Authors:
Muhammad Imran,
Jonathan R. Krebs,
Vishal Balaji Sivaraman,
Teng Zhang,
Amarjeet Kumar,
Walker R. Ueland,
Michael J. Fassler,
Jinlong Huang,
Xiao Sun,
Lisheng Wang,
Pengcheng Shi,
Maximilian Rokuss,
Michael Baumgartner,
Yannick Kirchhof,
Klaus H. Maier-Hein,
Fabian Isensee,
Shuolin Liu,
Bing Han,
Bong Thanh Nguyen,
Dong-jin Shin,
Park Ji-Woo,
Mathew Choi,
Kwang-Hyun Uhm,
Sung-Jea Ko,
Chanwoong Lee
, et al. (38 additional authors not shown)
Abstract:
Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org.
Submitted 7 February, 2025;
originally announced February 2025.
-
Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning
Authors:
SiYeoul Lee,
SeonHo Kim,
Minkyung Seo,
SeongKyu Park,
Salehin Imrus,
Kambaluru Ashok,
DongEon Lee,
Chunsu Park,
SeonYeong Lee,
Jiye Kim,
Jae-Heung Yoo,
MinWoo Kim
Abstract:
This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits the critical regions, such as fully-developed speckle area or high-echogenic tissue area within successive ultrasound images to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D
Submitted 5 February, 2025;
originally announced February 2025.
-
Enhancing Feature Tracking Reliability for Visual Navigation using Real-Time Safety Filter
Authors:
Dabin Kim,
Inkyu Jang,
Youngsoo Han,
Sunwoo Hwang,
H. Jin Kim
Abstract:
Vision sensors are extensively used for localizing a robot's pose, particularly in environments where global localization tools such as GPS or motion capture systems are unavailable. In many visual navigation systems, localization is achieved by detecting and tracking visual features or landmarks, which provide information about the sensor's relative pose. For reliable feature tracking and accurate pose estimation, it is crucial to maintain visibility of a sufficient number of features. This requirement can sometimes conflict with the robot's overall task objective. In this paper, we approach it as a constrained control problem. By leveraging the invariance properties of visibility constraints within the robot's kinematic model, we propose a real-time safety filter based on quadratic programming. This filter takes a reference velocity command as input and produces a modified velocity that minimally deviates from the reference while ensuring the information score from the currently visible features remains above a user-specified threshold. Numerical simulations demonstrate that the proposed safety filter preserves the invariance condition and ensures the visibility of more features than the required minimum. We also validated its real-world performance by integrating it into a visual simultaneous localization and mapping (SLAM) algorithm, where it maintained high estimation quality in challenging environments, outperforming a simple tracking controller.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
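The abstract above describes a quadratic-programming safety filter that minimally modifies a reference velocity. When there is a single affine constraint, such a QP has a closed-form solution (a projection onto a half-space), sketched below. This is a simplified illustration and not the authors' filter: the constraint form `a . u >= b`, the function name `filter_velocity`, and the numbers are assumptions, and the paper's actual constraint on the information score is not given in the abstract.

```python
def filter_velocity(u_ref, a, b):
    """Solve min ||u - u_ref||^2 subject to a . u >= b for one affine constraint."""
    # How far the reference command is from satisfying the constraint.
    slack = b - sum(ai * ui for ai, ui in zip(a, u_ref))
    if slack <= 0:
        return list(u_ref)  # reference already satisfies the constraint: pass it through
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0:
        raise ValueError("constraint is infeasible: a is zero but b > 0")
    # Smallest correction, applied along the constraint normal a.
    scale = slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_ref, a)]


# Reference command violates the (hypothetical) constraint u[1] >= 0.5.
u_safe = filter_velocity([1.0, 0.0], [0.0, 1.0], 0.5)
```

With several constraints or input bounds, a numerical QP solver would be needed in place of this projection.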