Search | arXiv e-print repository

arXiv:2506.21174 [pdf]

Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4

Authors: Jongyeon Park, Joonhee Lee, Do-Hyeon Lim, Hong Kook Kim, Hyeongcheol Geum, Jeong Eun Lim

Abstract: This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. This model incorporates additional audio features (spectral roll-off and chroma features) into the embedding feature extracted from the mel-spectral feature to improve the classification capabilities of an audio-tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel-spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent-based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class-aware signal-to-distortion ratio improvement (CA-SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low-performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out-of-class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches relatively improve CA-SDRi by up to 14.7% compared to the baseline of DCASE 2025 Challenge Task 4.

Submitted 26 June, 2025; originally announced June 2025.

Comments: DCASE 2025 challenge Task4, 5 pages

arXiv:2506.01947 [pdf, ps, other]

RAW Image Reconstruction from RGB on Smartphones. NTIRE 2025 Challenge Report

Authors: Marcos V. Conde, Radu Timofte, Radu Berdan, Beril Besbinar, Daisuke Iso, Pengzhou Ji, Xiong Dun, Zeying Fan, Chen Wu, Zhansheng Wang, Pengbo Zhang, Jiazi Huang, Qinglin Liu, Wei Yu, Shengping Zhang, Xiangyang Ji, Kyungsik Kim, Minkyung Kim, Hwalmin Lee, Hekun Ma, Huan Zheng, Yanyan Wei, Zhao Zhang, Jing Fang, Meilin Gao, et al. (8 additional authors not shown)

Abstract: Numerous low-level vision tasks operate in the RAW domain due to its linear properties, bit depth, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public sRGB datasets. For this reason, many approaches try to generate realistic RAW images using sensor information and sRGB images. This paper covers the second challenge on RAW Reconstruction from sRGB (Reverse ISP). We aim to recover RAW sensor images from smartphones given the corresponding sRGB images without metadata and, by doing this, "reverse" the ISP transformation. Over 150 participants joined this NTIRE 2025 challenge and submitted efficient models. The proposed methods and benchmark establish the state-of-the-art for generating realistic RAW data.

Submitted 2 June, 2025; originally announced June 2025.

Comments: CVPR 2025 - New Trends in Image Restoration and Enhancement (NTIRE)

arXiv:2505.21607 [pdf]

A Comb-based Colorless Coherent WDM Transmitter

Authors: Di Che, Brian Stern, Kwangwoong Kim, Cagri Ozdilek, Timofey Shpakovsky, John D. Jost, Maxim Karpov

Abstract: We propose a comb-based WDM transmitter capable of modulating independent signals to comb lines without demultiplexing them and prove its concept and potential scalability in a WDM transmitter consisting of a Kerr microcomb and a silicon I/Q modulator array.

Submitted 27 May, 2025; originally announced May 2025.

Comments: Published in OFC'2025, Postdeadline Th4B.4 (updated version)

arXiv:2505.15932 [pdf, ps, other]

doi 10.1109/LCSYS.2025.3583748

Constant-Sum High-Order Barrier Functions for Safety Between Parallel Boundaries

Authors: Kwang Hak Kim, Mamadou Diagne, Miroslav Krstić

Abstract: This paper takes a step towards addressing the difficulty of constructing Control Barrier Functions (CBFs) for parallel safety boundaries. A single CBF for both boundaries has been reported to be difficult to validate for safety, and we identify why this challenge is inherent. To overcome this, the proposed method constructs separate CBFs for each boundary. We begin by presenting results for the relative degree one case and then extend these to higher relative degrees using the CBF backstepping technique, establishing conditions that guarantee safety. Finally, we showcase our method by applying it to a unicycle system, deriving a simple, verifiable condition to validate the target CBFs for direct implementation of our results.

Submitted 21 May, 2025; originally announced May 2025.

Comments: Submitted to IEEE L-CSS and 2025 Conference on Decision and Control (CDC)

arXiv:2504.18157 [pdf, other]

DOSE : Drum One-Shot Extraction from Music Mixture

Authors: Suntae Hwang, Seonghyeon Kang, Kyungsu Kim, Semin Ahn, Kyogu Lee

Abstract: Drum one-shot samples are crucial for music production, particularly in sound design and electronic music. This paper introduces Drum One-Shot Extraction, a task in which the goal is to extract drum one-shots that are present in the music mixture. To facilitate this, we propose the Random Mixture One-shot Dataset (RMOD), comprising large-scale, randomly arranged music mixtures paired with corresponding drum one-shot samples. Our proposed model, Drum One-Shot Extractor (DOSE), leverages neural audio codec language models for end-to-end extraction, bypassing traditional source separation steps. Additionally, we introduce a novel onset loss, designed to encourage accurate prediction of the initial transient of drum one-shots, which is essential for capturing timbral characteristics. We compare this approach against a source separation-based extraction method as a baseline. The results, evaluated using Frechet Audio Distance (FAD) and Multi-Scale Spectral loss (MSS), demonstrate that DOSE, enhanced with onset loss, outperforms the baseline, providing more accurate and higher-quality drum one-shots from music mixtures. The code, model checkpoint, and audio examples are available at https://github.com/HSUNEH/DOSE

Submitted 25 April, 2025; originally announced April 2025.

Comments: Published in IEEE ICASSP 2025

arXiv:2503.19735 [pdf]

InterSliceBoost: Identifying Tissue Layers in Three-dimensional Ultrasound Images for Chronic Lower Back Pain (cLBP) Assessment

Authors: Zixue Zeng, Matthew Cartier, Xiaoyan Zhao, Pengyu Chen, Xin Meng, Zhiyu Sheng, Maryam Satarpour, John M Cormack, Allison C. Bean, Ryan P. Nussbaum, Maya Maurer, Emily Landis-Walkenhorst, Kang Kim, Ajay D. Wasan, Jiantao Pu

Abstract: Available studies on chronic lower back pain (cLBP) typically focus on one or a few specific tissues rather than conducting a comprehensive layer-by-layer analysis. Since three-dimensional (3-D) images often contain hundreds of slices, manual annotation of these anatomical structures is both time-consuming and error-prone. We aim to develop and validate a novel approach called InterSliceBoost to enable the training of a segmentation model on a partially annotated dataset without compromising segmentation performance. The architecture of InterSliceBoost includes two components: an inter-slice generator and a segmentation model. The generator utilizes residual block-based encoders to extract features from adjacent image-mask pairs (IMPs). Differential features are calculated and input into a decoder to generate inter-slice IMPs. The segmentation model is trained on partially annotated datasets (e.g., skipping 1, 2, 3, or 7 images) and the generated inter-slice IMPs. To validate the performance of InterSliceBoost, we utilized a dataset of 76 B-mode ultrasound scans acquired on 29 subjects enrolled in an ongoing cLBP study. InterSliceBoost, trained on only 33% of the image slices, achieved a mean Dice coefficient of 80.84% across all six layers on the independent test set, with Dice coefficients of 73.48%, 61.11%, 81.87%, 95.74%, 83.52% and 88.74% for segmenting dermis, superficial fat, superficial fascial membrane, deep fat, deep fascial membrane, and muscle. This performance is significantly higher than that of the conventional model trained on fully annotated images (p<0.05). InterSliceBoost can effectively segment the six tissue layers depicted on 3-D B-mode ultrasound images in settings with partial annotations.

Submitted 25 March, 2025; originally announced March 2025.

arXiv:2502.17528 [pdf, other]

Temperature Compensation Method of Six-Axis Force/Torque Sensor Using Gated Recurrent Unit

Authors: Hyun-Bin Kim, Seokju Lee, Byeong-Il Ham, Kyung-Soo Kim

Abstract: This study aims to enhance the accuracy of a six-axis force/torque sensor compared to existing approaches that utilize Multi-Layer Perceptron (MLP) and the Least Square Method. The sensor used in this study is based on a photo-coupler and operates with infrared light, making it susceptible to dark current effects, which cause drift due to temperature variations. Additionally, the sensor is compact and lightweight (45g), resulting in a low thermal capacity. Consequently, even small amounts of heat can induce rapid temperature changes, affecting the sensor's performance in real time. To address these challenges, this study compares the conventional MLP approach with the proposed Gated Recurrent Unit (GRU)-based method. Experimental results demonstrate that the GRU approach, leveraging sequential data, achieves superior performance.

Submitted 24 February, 2025; originally announced February 2025.

Comments: 8 pages, 9 figures

arXiv:2502.00619 [pdf, other]

Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective

Authors: Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li

Abstract: Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available. The source code is available at https://github.com/tvseg/dMoE.

Submitted 27 May, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

Comments: ICML 2025 spotlight, see https://openreview.net/forum?id=BUONdewsBa

arXiv:2501.05085 [pdf, other]

End-to-End Deep Learning for Interior Tomography with Low-Dose X-ray CT

Authors: Yoseob Han, Dufan Wu, Kyungsang Kim, Quanzheng Li

Abstract: Objective: There exist several X-ray computed tomography (CT) scanning strategies to reduce a radiation dose, such as (1) sparse-view CT, (2) low-dose CT, and (3) region-of-interest (ROI) CT (called interior tomography). To further reduce the dose, the sparse-view and/or low-dose CT settings can be applied together with interior tomography. Interior tomography has various advantages in terms of reducing the number of detectors and decreasing the X-ray radiation dose. However, a large patient or small field-of-view (FOV) detector can cause truncated projections, and then the reconstructed images suffer from severe cupping artifacts. In addition, although the low-dose CT can reduce the radiation exposure dose, analytic reconstruction algorithms produce image noise. Recently, many researchers have utilized image-domain deep learning (DL) approaches to remove each artifact and demonstrated impressive performances, and the theory of deep convolutional framelets supports the reason for the performance improvement. Approach: In this paper, we found, based on deep convolutional framelets, that the image-domain convolutional neural network (CNN) has difficulty solving coupled artifacts. Significance: To address the coupled problem, we decouple it into two sub-problems: (i) image domain noise reduction inside truncated projection to solve low-dose CT problem and (ii) extrapolation of projection outside truncated projection to solve the ROI CT problem. The decoupled sub-problems are solved directly with a novel proposed end-to-end learning using dual-domain CNNs. Main results: We demonstrate that the proposed method outperforms the conventional image-domain deep learning methods, and a projection-domain CNN shows better performance than the image-domain CNNs which are commonly used by many researchers.

Submitted 9 January, 2025; originally announced January 2025.

Comments: Published by Physics in Medicine & Biology (2022.5)

arXiv:2501.01094 [pdf, other]

MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions

Authors: Suhwan Choi, Kyu Won Kim, Myungjoo Kang

Abstract: We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zeroshot tasks, highlighting the potential of valence and arousal predictions in downstream applications.

Submitted 2 January, 2025; originally announced January 2025.

Comments: Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025

arXiv:2412.19763 [pdf, other]

Multi-population Differential Evolution for RSS based Cooperative Localization in Wireless Sensor Networks with Limited Communication Range

Authors: Lismer Andres Caceres Najarro, Iickho Song, Muhammad Salman, Kiseon Kim

Abstract: This paper presents a novel approach to deal with the cooperative localization problem in wireless sensor networks based on received signal strength measurements. In cooperative scenarios, the cost function of the localization problem becomes increasingly nonlinear and nonconvex due to the heightened interaction between sensor nodes, making the estimation of the positions of the target nodes more challenging. Although most existing cooperative localization algorithms assure acceptable localization accuracy, their computational complexity increases dramatically, which may restrict their applicability. To reduce the computational complexity and provide competitive localization accuracy at the same time, we propose a localization algorithm based on the differential evolution with multiple populations, opposite-based learning, redirection, and anchoring. In this work, the cooperative localization cost function is split into several simpler cost functions, each of which accounts only for one individual target node. Then, each cost function is solved by a dedicated population of the proposed algorithm. In addition, an enhanced version of the proposed algorithm which incorporates the population midpoint scheme for further improvement in the localization accuracy is devised. Simulation results demonstrate that the proposed algorithms provide comparable localization accuracy with much lower computational complexity compared with the state-of-the-art algorithms.

Submitted 27 December, 2024; originally announced December 2024.

arXiv:2412.09109 [pdf, other]

evS2CP: Real-time Simultaneous Speed and Charging Planner for Connected Electric Vehicles

Authors: Minwoo Gwon, Jiwon Kim, Seungjun Yoo, Kwang-Ki K. Kim

Abstract: This paper presents evS2CP, an optimization-based framework for simultaneous speed and charging planning designed for connected electric vehicles (EVs). With EVs emerging as competitive alternatives to internal combustion engine vehicles, overcoming challenges such as limited charging infrastructure is crucial. evS2CP addresses these issues by minimizing the travel time, charging time, and energy consumption, providing practical solutions for both human-operated and autonomous vehicles. This framework leverages V2X communication to integrate essential EV planning data, including route geometry, real-time traffic conditions, and charging station availability, while simulating dynamic driving environments using open-web API services. The speed and charging planning problem was initially formulated as a nonlinear programming model, which was then convexified into a quadratic programming model without charging-stop constraints. Additionally, a mixed-integer programming approach was employed to optimize charging station selection and minimize the frequency of charging events. A mixed-integer quadratic programming implementation exhibited exceptional computational efficiency and scalability, effectively solving trip plans over distances exceeding 700 km in a few seconds. Simulations conducted using open-source and commercial solvers validated the framework's near-global optimality, demonstrating its robustness and feasibility for real-world applications in connected EV ecosystems.

Submitted 12 December, 2024; originally announced December 2024.

Comments: 15 pages, 9 figures, 2 tables

MSC Class: 93C85; 49M37; 65K05; 90C29; 68T40; 70B15 ACM Class: G.1.6; G.1.2; H.5.2; G.4; H.4.3; J.6

arXiv:2412.08895 [pdf, other]

Fully Bayesian Wideband Direction-of-Arrival Estimation and Detection via RJMCMC

Authors: Kyurae Kim, Philip T. Clemson, James P. Reilly, Jason F. Ralph, Simon Maskell

Abstract: We propose a fully Bayesian approach to wideband, or broadband, direction-of-arrival (DoA) estimation and signal detection. Unlike previous works in wideband DoA estimation and detection, where the signals were modeled in the time-frequency domain, we directly model the time-domain representation and treat the non-causal part of the source signal as latent variables. Furthermore, our Bayesian model allows for closed-form marginalization of the latent source signals by leveraging conjugacy. To further speed up computation, we exploit the sparse "stripe matrix structure" of the considered system, which stems from the circulant matrix representation of linear time-invariant (LTI) systems. This drastically reduces the time complexity of computing the likelihood from $\mathcal{O}(N^3 k^3)$ to $\mathcal{O}(N k^3)$, where $N$ is the number of samples received by the array and $k$ is the number of sources. These computational improvements allow for efficient posterior inference through reversible jump Markov chain Monte Carlo (RJMCMC). We use the non-reversible extension of RJMCMC (NRJMCMC), which often achieves lower autocorrelation and faster convergence than the conventional reversible variant. Detection, estimation, and reconstruction of the latent source signals can then all be performed in a fully Bayesian manner through the samples drawn using NRJMCMC. We evaluate the detection performance of the procedure by comparing against generalized likelihood ratio testing (GLRT) and information criteria.

Submitted 11 December, 2024; originally announced December 2024.

arXiv:2412.00150 [pdf, other]

Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise

Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee

Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMnist, and OrgancMnist datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.

Submitted 29 November, 2024; originally announced December 2024.

Comments: Accepted at NeurIPS 2024

arXiv:2411.00274 [pdf, other]

Adaptive Residual Transformation for Enhanced Feature-Based OOD Detection in SAR Imagery

Authors: Kyung-hwan Lee, Kyung-tae Kim

Abstract: Recent advances in deep learning architectures have enabled efficient and accurate classification of pre-trained targets in Synthetic Aperture Radar (SAR) images. Nevertheless, the presence of unknown targets in real battlefield scenarios is unavoidable, resulting in misclassification and reducing the accuracy of the classifier. Over the past decades, various feature-based out-of-distribution (OOD) approaches have been developed to address this issue, yet defining the decision boundary between known and unknown targets remains challenging. Additionally, unlike optical images, detecting unknown targets in SAR imagery is further complicated by high speckle noise, the presence of clutter, and the inherent similarities in back-scattered microwave signals. In this work, we propose transforming feature-based OOD detection into a class-localized feature-residual-based approach, demonstrating that this method can improve stability across varying unknown targets' distribution conditions. Transforming feature-based OOD detection into a residual-based framework offers a more robust reference space for distinguishing between in-distribution (ID) and OOD data, particularly within the unique characteristics of SAR imagery. This adaptive residual transformation method standardizes feature-based inputs into distributional representations, enhancing OOD detection in noisy, low-information images. Our approach demonstrates promising performance in real-world SAR scenarios, effectively adapting to the high levels of noise and clutter inherent in these environments. These findings highlight the practical relevance of residual-based OOD detection for SAR applications and suggest a foundation for further advancements in unknown target detection in complex, operational settings.

Submitted 31 October, 2024; originally announced November 2024.

arXiv:2410.06493 [pdf, other]

BiC-MPPI: Goal-Pursuing, Sampling-Based Bidirectional Rollout Clustering Path Integral for Trajectory Optimization

Authors: Minchan Jung, Kwangki Kim

Abstract: This paper introduces the Bidirectional Clustered MPPI (BiC-MPPI) algorithm, a novel trajectory optimization method aimed at enhancing goal-directed guidance within the Model Predictive Path Integral (MPPI) framework. BiC-MPPI incorporates bidirectional dynamics approximations and a new guide cost mechanism, improving both trajectory planning and goal-reaching performance. By leveraging forward and backward rollouts, the bidirectional approach ensures effective trajectory connections between initial and terminal states, while the guide cost helps discover dynamically feasible paths. Experimental results demonstrate that BiC-MPPI outperforms existing MPPI variants in both 2D and 3D environments, achieving higher success rates and competitive computation times across 900 simulations on a modified BARN dataset for autonomous navigation. GitHub: https://github.com/i-ASL/BiC-MPPI

Submitted 8 October, 2024; originally announced October 2024.

Comments: 7 pages, 1 figure

MSC Class: 68T40; 13P25 ACM Class: I.2.9; I.2.8; G.1.6; G.4

arXiv:2410.00184 [pdf, other]

Volumetric Conditional Score-based Residual Diffusion Model for PET/MR Denoising

Authors: Siyeop Yoon, Rui Hu, Yuang Wang, Matthew Tivnan, Young-don Son, Dufan Wu, Xiang Li, Kyungsang Kim, Quanzheng Li

Abstract: PET imaging is a powerful modality offering quantitative assessments of molecular and physiological processes. The necessity for PET denoising arises from the intrinsic high noise levels in PET imaging, which can significantly hinder the accurate interpretation and quantitative analysis of the scans. With advances in deep learning techniques, diffusion model-based PET denoising techniques have shown remarkable performance improvement. However, these models often face limitations when applied to volumetric data. Additionally, many existing diffusion models do not adequately consider the unique characteristics of PET imaging, such as its 3D volumetric nature, leading to the potential loss of anatomic consistency. Our Conditional Score-based Residual Diffusion (CSRD) model addresses these issues by incorporating a refined score function and 3D patch-wise training strategy, optimizing the model for efficient volumetric PET denoising. The CSRD model significantly lowers computational demands and expedites the denoising process. By effectively integrating volumetric data from PET and MRI scans, the CSRD model maintains spatial coherence and anatomical detail. Lastly, we demonstrate that the CSRD model achieves superior denoising performance in both qualitative and quantitative evaluations while maintaining image details and outperforms existing state-of-the-art methods.

Submitted 30 September, 2024; originally announced October 2024.

Comments: Accepted to MICCAI 2024

arXiv:2410.00046 [pdf, other]

Mixture of Multicenter Experts in Multimodal Generative AI for Advanced Radiotherapy Target Delineation

Authors: Yujin Oh, Sangjoon Park, Xiang Li, Wang Yi, Jonathan Paly, Jason Efstathiou, Annie Chan, Jun Won Kim, Hwa Kyung Byun, Ik Jae Lee, Jaeho Cho, Chan Woo Wee, Peng Shu, Peilong Wang, Nathan Yu, Jason Holmes, Jong Chul Ye, Quanzheng Li, Wei Liu, Woong Sub Koom, Jin Sung Kim, Kyungsang Kim

Abstract: Clinical experts employ diverse philosophies and strategies in patient care, influenced by regional patient populations. However, existing medical artificial intelligence (AI) models are often trained on data distributions that disproportionately reflect highly prevalent patterns, reinforcing biases and overlooking the diverse expertise of clinicians. To overcome this limitation, we introduce the Mixture of Multicenter Experts (MoME) approach. This method strategically integrates specialized expertise from diverse clinical strategies, enhancing the AI model's ability to generalize and adapt across multiple medical centers. The MoME-based multimodal target volume delineation model, trained with few-shot samples including images and clinical notes from each medical center, outperformed baseline methods in prostate cancer radiotherapy target delineation. The advantages of MoME were most pronounced when data characteristics varied across centers or when data availability was limited, demonstrating its potential for broader clinical applications. Therefore, the MoME framework enables the deployment of AI-based target volume delineation models in resource-constrained medical facilities by adapting to specific preferences of each medical center only using a few sample data, without the need for data sharing between institutions.
Expanding the number of multicenter experts within the MoME framework will significantly enhance the generalizability, while also improving the usability and adaptability of clinical AI applications in the field of precision radiation oncology.

Submitted 26 October, 2024; v1 submitted 27 September, 2024; originally announced October 2024.

Comments: 39 pages

arXiv:2409.19834 [pdf, ps, other]

Utilizing Priors in Sampling-based Cost Minimization

Authors: Yuan-Yao Lou, Jonathan Spencer, Kwang Taik Kim, Mung Chiang

Abstract: We consider an autonomous vehicle (AV) agent performing a long-term cost-minimization problem in the elapsed time $T$ over sequences of states $s_{1:T}$ and actions $a_{1:T}$ for some fixed, known (though potentially learned) cost function $C(s_t,a_t)$, approximate system dynamics $P$, and distribution over initial states $d_0$. The goal is to minimize the expected cost-to-go of the driving trajectory $τ= s_1, a_1, ..., s_T, a_T$ from the initial state.

Submitted 29 September, 2024; originally announced September 2024.
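
The objective described in the abstract of arXiv:2409.19834 above can be written compactly. The following is only a sketch in the abstract's own symbols ($T$, $s_t$, $a_t$, $C$, $P$, $d_0$); the policy $\pi$, the objective name $J$, and the conditional form $P(\cdot \mid s_t, a_t)$ are notation assumed here for illustration and do not appear in the abstract.

$$
J(\pi) = \mathbb{E}_{\,s_1 \sim d_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)} \left[ \sum_{t=1}^{T} C(s_t, a_t) \right], \qquad \pi^\star \in \arg\min_{\pi} J(\pi)
$$

Under this reading, a sampling-based method would estimate the expectation by averaging $\sum_{t=1}^{T} C(s_t, a_t)$ over trajectories rolled out through the approximate dynamics $P$.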

arXiv:2409.14726 [pdf, other]

doi 10.1109/TWC.2025.3578010

Semantic Communication Enabled 6G-NTN Framework: A Novel Denoising and Gateway Hop Integration Mechanism

Authors: Loc X. Nguyen, Sheikh Salman Hassan, Yan Kyaw Tun, Kitae Kim, Zhu Han, Choong Seon Hong

Abstract: The sixth-generation (6G) non-terrestrial networks (NTNs) are crucial for real-time monitoring in critical applications like disaster relief. However, limited bandwidth, latency, rain attenuation, long propagation delays, and co-channel interference pose challenges to efficient satellite communication. Therefore, semantic communication (SC) has emerged as a promising solution to improve transmission efficiency and address these issues. In this paper, we explore the potential of SC as a bandwidth-efficient, latency-minimizing strategy specifically suited to 6G satellite communications. While existing SC methods have demonstrated efficacy in direct satellite-terrestrial transmissions, they encounter limitations in satellite networks due to distortion accumulation across gateway hop-relays. Additionally, certain ground users (GUs) experience poor signal-to-noise ratios (SNR), making direct satellite communication challenging. To address these issues, we propose a novel framework that optimizes gateway hop-relay selection for GUs with low SNR and integrates gateway-based denoising mechanisms to ensure high-quality-of-service (QoS) in satellite-based SC networks. This approach directly mitigates distortion, leading to significant improvements in satellite service performance by delivering customized services tailored to the unique signal conditions of each GU. Our findings represent a critical advancement in reliable and efficient data transmission from the Earth observation satellites, thereby enabling fast and effective responses to urgent events.
Simulation results demonstrate that our proposed strategy significantly enhances overall network performance, outperforming conventional methods by offering tailored communication services based on specific GU conditions.

Submitted 23 September, 2024; originally announced September 2024.

Comments: 13 pages, 8 figures, 2 tables

Journal ref: in IEEE Transactions on Wireless Communications, Jun. 2025

arXiv:2409.13281 [pdf, other]

Wireless Interconnection Network (WINE) for Post-Exascale High-Performance Computing

Authors: Hong Ki Kim, Yong Hun Jang, Hee Soo Kim, Won Young Kang, Young-Chai Ko, Sang Hyun Lee

Abstract: Interconnection networks, or 'interconnects,' play a crucial role in administering the communication among computing units of high-performance computing (HPC) systems. Efficient provisioning of interconnects minimizes the processing delay wherein computing units await information sharing between each other, thereby enhancing the overall computation efficiency. Ideally, interconnects are designed with topologies tailored to match specific workflows, requiring diverse structures for different applications. However, since modifying their structures mid-operation is impractical, indirect communication occurs across distant units. In managing numerous long-routed data deliveries, heavy burdens on the network side may lead to the under-utilization of computing resources. In view of state-of-the-art HPC paradigms that solicit dense interconnections for diverse computation-hungry applications, this article presents a versatile wireless interconnecting framework, coined as Wireless Interconnection NEtwork (WINE). The framework exploits cutting-edge wireless technologies that promote workload adaptability and scalability of modern interconnects. Design and implementation of wirelessly reliable links are strategized under network-oriented scrutiny of HPC architectures. A virtual HPC platform is developed to assess WINE's feasibility, verifying its practicality for integration into modern HPC infrastructures.

Submitted 20 September, 2024; originally announced September 2024.

Comments: 20 pages, 5 figures, to be published in IEEE Wireless Communications Magazine

arXiv:2409.00078 [pdf, other]

SGP-RI: A Real-Time-Trainable and Decentralized IoT Indoor Localization Model Based on Sparse Gaussian Process with Reduced-Dimensional Inputs

Authors: Zhe Tang, Sihao Li, Zichen Huang, Guandong Yang, Kyeong Soo Kim, Jeremy S. Smith

Abstract: As Internet of Things (IoT) devices are deployed in the field, there is an enormous amount of untapped potential in local computing on those IoT devices. Harnessing this potential for indoor localization, therefore, becomes an exciting research area. Conventionally, the training and deployment of indoor localization models are based on centralized servers with substantial computational resources. This centralized approach faces several challenges, including the database's inability to accommodate the dynamic and unpredictable nature of the indoor electromagnetic environment, the model retraining costs, and the susceptibility of centralized servers to security breaches. To mitigate these challenges, we aim to amalgamate the offline and online phases of traditional indoor localization methods using a real-time-trainable and decentralized IoT indoor localization model based on Sparse Gaussian Process with Reduced-dimensional Inputs (SGP-RI), where the number and dimension of the input data are reduced through reference point and wireless access point filtering, respectively. The experimental results, based on a multi-building and multi-floor static database as well as a single-building and single-floor dynamic database, demonstrate that the proposed SGP-RI model with less than half the training samples as inducing inputs can produce comparable localization performance to the standard Gaussian Process model with the whole training samples.
The SGP-RI model enables the decentralization of indoor localization, facilitating its deployment to resource-constrained IoT devices, and thereby could provide enhanced security and privacy, reduced costs, and reduced network dependency. Also, the model's capability of real-time training makes it possible to quickly adapt to the time-varying indoor electromagnetic environment.

Submitted 24 August, 2024; originally announced September 2024.

Comments: 10 pages, 4 figures, under review for journal publication

arXiv:2408.12860 [pdf, other]

Active STAR-RIS Empowered Edge System for Enhanced Energy Efficiency and Task Management

Authors: Pyae Sone Aung, Kitae Kim, Yan Kyaw Tun, Zhu Han, Choong Seon Hong

Abstract: The proliferation of data-intensive and low-latency applications has driven the development of multi-access edge computing (MEC) as a viable solution to meet the increasing demands for high-performance computing and storage capabilities at the network edge. Despite the benefits of MEC, challenges such as obstructions cause non-line-of-sight (NLoS) communication to persist. Reconfigurable intelligent surfaces (RISs) and the more advanced simultaneously transmitting and reflecting (STAR)-RISs have emerged to address these challenges; however, practical limitations and multiplicative fading effects hinder their efficacy. We propose an active STAR-RIS-assisted MEC system to overcome these obstacles, leveraging the advantages of active STAR-RIS. The main contributions consist of formulating an optimization problem to minimize energy consumption with task queue stability by jointly optimizing the partial task offloading, amplitude, phase shift coefficients, amplification coefficients, transmit power of the base station (BS), and admitted tasks. Furthermore, we decompose the non-convex problem into manageable sub-problems, employing sequential fractional programming for transmit power control, convex optimization technique for task offloading, and Lyapunov optimization with double deep Q-network (DDQN) for joint amplitude, phase shift, amplification, and task admission. Extensive performance evaluations demonstrate the superiority of the proposed system over benchmark schemes, highlighting its potential for enhancing MEC system performance.
Numerical results indicate that our proposed system outperforms the conventional STAR-RIS-assisted system by 18.64% and the conventional RIS-assisted system by 30.43%.

Submitted 23 August, 2024; originally announced August 2024.

Comments: 13 pages, 10 figures

arXiv:2408.05961 [pdf, other]

Cross-Spectral Analysis of Bivariate Graph Signals

Authors: Kyusoon Kim, Hee-Seok Oh

Abstract: With the advancements in technology and monitoring tools, we often encounter multivariate graph signals, which can be seen as the realizations of multivariate graph processes, and revealing the relationship between their constituent quantities is one of the important problems. To address this issue, we propose a cross-spectral analysis tool for bivariate graph signals. The main goal of this study is to extend the scope of spectral analysis of graph signals to multivariate graph signals. In this study, we define joint weak stationarity graph processes and introduce graph cross-spectral density and coherence for multivariate graph processes. We propose several estimators for the cross-spectral density and investigate the theoretical properties of the proposed estimators. Furthermore, we demonstrate the effectiveness of the proposed estimators through numerical experiments, including simulation studies and a real data application. Finally, as an interesting extension, we discuss robust spectral analysis of graph signals in the presence of outliers.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2407.19862 [pdf, other]

Wavespace: A Highly Explorable Wavetable Generator

Authors: Hazounne Lee, Kihong Kim, Sungho Lee, Kyogu Lee

Abstract: Wavetable synthesis generates quasi-periodic waveforms of musical tones by interpolating a list of waveforms called a wavetable. As generative models that utilize latent representations offer various methods in waveform generation for musical applications, studies in wavetable generation with invertible architecture have also arisen recently. While they are promising, it is still challenging to generate wavetables with detailed controls in disentangling factors within the latent representation. In response, we present Wavespace, a novel framework for wavetable generation that empowers users with enhanced parameter controls. Our model allows users to apply pre-defined conditions to the output wavetables. We employ a variational autoencoder and completely factorize its latent space to different waveform styles. We also condition the generator with auxiliary timbral and morphological descriptors. This way, users can create unique wavetables by independently manipulating each latent subspace and descriptor parameters. Our framework is efficient enough for practical use; we prototyped an oscillator plug-in as a proof of concept for real-time integration of Wavespace within digital audio workstations (DAWs).

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.09434 [pdf, other]

Foundation Models for the Electric Power Grid

Authors: Hendrik F. Hamann, Thomas Brunschwiler, Blazhe Gjorgiev, Leonardo S. A. Martins, Alban Puech, Anna Varbella, Jonas Weiss, Juan Bernabe-Moreno, Alexandre Blondin Massé, Seong Choi, Ian Foster, Bri-Mathias Hodge, Rishabh Jain, Kibaek Kim, Vincent Mai, François Mirallès, Martin De Montigny, Octavio Ramos-Leaños, Hussein Suprême, Le Xie, El-Nasser S. Youssef, Arnaud Zinflou, Alexander J. Belyi, Ricardo J. Bessa, Bishnu Prasad Bhattarai, et al. (2 additional authors not shown)

Abstract: Foundation models (FMs) currently dominate news headlines. They employ advanced deep learning architectures to extract structural information autonomously from vast datasets through self-supervision. The resulting rich representations of complex systems and dynamics can be applied to many downstream applications. Therefore, FMs can find uses in electric power grids, challenged by the energy transition and climate change. In this paper, we call for the development of, and state why we believe in, the potential of FMs for electric grids. We highlight their strengths and weaknesses amidst the challenges of a changing grid. We argue that an FM learning from diverse grid data and topologies could unlock transformative capabilities, pioneering a new approach in leveraging AI to redefine how we manage complexity and uncertainty in the electric grid. Finally, we discuss a power grid FM concept, namely GridFM, based on graph neural networks and show how different downstream tasks benefit.

Submitted 12 November, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

Comments: Major equal contributors: H.F.H., T.B., B.G., L.S.A.M., A.P., A.V., J.W.; Significant equal contributors: J.B., A.B.M., S.C., I.F., B.H., R.J., K.K., V.M., F.M., M.D.M., O.R., H.S., L.X., E.S.Y., A.Z.; Other equal contributors: A.J.B., R.J.B., B.P.B., J.S., S.S; Lead contact: H.F.H

arXiv:2407.09030 [pdf, other]

CAMP: Continuous and Adaptive Learning Model in Pathology

Authors: Anh Tien Nguyen, Keunho Byeon, Kyungeun Kim, Boram Song, Seoung Wan Chae, Jin Tae Kwak

Abstract: There exist numerous diagnostic tasks in pathology. Conventional computational pathology formulates and tackles them as independent and individual image classification problems, thereby resulting in computational inefficiency and high costs. To address the challenges, we propose a generic, unified, and universal framework, called a continuous and adaptive learning model in pathology (CAMP), for pathology image classification. CAMP is a generative, efficient, and adaptive classification model that can continuously adapt to any classification task by leveraging pathology-specific prior knowledge and learning task-specific knowledge with minimal computational cost and without forgetting the knowledge from the existing tasks. We evaluated CAMP on 22 datasets, including 1,171,526 patches and 11,811 pathology slides, across 17 classification tasks. CAMP achieves state-of-the-art classification performance on a wide range of datasets and tasks at both patch- and slide-levels and reduces up to 94% of computation time and 85% of storage memory in comparison to the conventional classification models. Our results demonstrate that CAMP can offer a fundamental transformation in pathology image classification, paving the way for the fully digitized and computerized pathology practice.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Under review

arXiv:2407.08503 [pdf, other]

DIOR-ViT: Differential Ordinal Learning Vision Transformer for Cancer Classification in Pathology Images

Authors: Ju Cheon Lee, Keunho Byeon, Boram Song, Kyungeun Kim, Jin Tae Kwak

Abstract: In computational pathology, cancer grading has been mainly studied as a categorical classification problem, which does not utilize the ordering nature of cancer grades such as the higher the grade is, the worse the cancer is. To incorporate the ordering relationship among cancer grades, we introduce a differential ordinal learning problem in which we define and learn the degree of difference in the categorical class labels between pairs of samples by using their differences in the feature space. To this end, we propose a transformer-based neural network that simultaneously conducts both categorical classification and differential ordinal classification for cancer grading. We also propose a tailored loss function for differential ordinal learning. Evaluating the proposed method on three different types of cancer datasets, we demonstrate that the adoption of differential ordinal learning can improve the accuracy and reliability of cancer grading, outperforming conventional cancer grading approaches. The proposed approach should be applicable to other diseases and problems as they involve ordinal relationship among class labels.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2406.12721 [pdf]

Sound event detection based on auxiliary decoder and maximum probability aggregation for DCASE Challenge 2024 Task 4

Authors: Sang Won Son, Jongyeon Park, Hong Kook Kim, Sulaiman Vesal, Jeong Eun Lim

Abstract: In this report, we propose three novel methods for developing a sound event detection (SED) model for the DCASE 2024 Challenge Task 4. First, we propose an auxiliary decoder attached to the final convolutional block to improve feature extraction capabilities while reducing dependency on embeddings from pre-trained large models. The proposed auxiliary decoder operates independently from the main decoder, enhancing performance of the convolutional block during the initial training stages by assigning a different weight strategy between main and auxiliary decoder losses. Next, to address the time interval issue between the DESED and MAESTRO datasets, we propose maximum probability aggregation (MPA) during the training step. The proposed MPA method enables the model's output to be aligned with soft labels of 1 s in the MAESTRO dataset. Finally, we propose a multi-channel input feature that employs various versions of logmel and MFCC features to generate time-frequency patterns. The experimental results demonstrate the efficacy of these proposed methods in terms of improving SED performance by achieving a balanced enhancement across different datasets and label types. Ultimately, this approach presents a significant step forward in developing more robust and flexible SED models.

Submitted 24 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: DCASE 2024 challenge Task4, 4 pages
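
The abstract of arXiv:2406.12721 above says that maximum probability aggregation (MPA) aligns the model's output with 1 s soft labels, but it does not spell out the operation. One plausible reading, sketched below purely as an assumption, is a per-class maximum over the frames that fall inside each 1 s segment. The function name, the list-of-lists layout, and the `frames_per_segment` parameter are hypothetical and are not taken from the report.

```python
from typing import List


def max_probability_aggregation(
    frame_probs: List[List[float]], frames_per_segment: int
) -> List[List[float]]:
    """Collapse frame-level class probabilities into segment-level ones.

    frame_probs[t][c] is the probability of class c at frame t.
    Each output row holds, per class, the maximum over one segment.
    Assumed interpretation of MPA, not the authors' confirmed method.
    """
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    segments = []
    for start in range(0, len(frame_probs), frames_per_segment):
        window = frame_probs[start:start + frames_per_segment]
        # zip(*window) transposes the window so each item holds one class's values.
        segments.append([max(class_probs) for class_probs in zip(*window)])
    return segments
```

With this layout, the aggregated rows can be compared directly against segment-level soft labels; a trailing partial segment is aggregated over whatever frames remain.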

arXiv:2406.11248 [pdf]

Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9

Authors: Do Hyun Lee, Yoonah Song, Hong Kook Kim

Abstract: We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.

Submitted 26 November, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: DCASE 2024 Challenge Task 9, 4 pages

arXiv:2406.09345 [pdf, other]

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Authors: Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

Abstract: The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel frequency Cepstral Coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.02000 [pdf, other]

Advancing Ultra-Reliable 6G: Transformer and Semantic Localization Empowered Robust Beamforming in Millimeter-Wave Communications

Authors: Avi Deb Raha, Kitae Kim, Apurba Adhikary, Mrityunjoy Gain, Zhu Han, Choong Seon Hong

Abstract: Advancements in 6G wireless technology have elevated the importance of beamforming, especially for attaining ultra-high data rates via millimeter-wave (mmWave) frequency deployment. Although promising, mmWave bands require substantial beam training to achieve precise beamforming. While initial deep learning models that use RGB camera images demonstrated promise in reducing beam training overhead, their performance suffers due to sensitivity to lighting and environmental variations. Due to this sensitivity, Quality of Service (QoS) fluctuates, eventually affecting the stability and dependability of networks in dynamic environments. This emphasizes a critical need for robust solutions. This paper proposes a robust beamforming technique to ensure consistent QoS under varying environmental conditions. An optimization problem has been formulated to maximize users' data rates. To solve the formulated NP-hard optimization problem, we decompose it into two subproblems: the semantic localization problem and the optimal beam selection problem. To solve the semantic localization problem, we propose a novel method that leverages the K-means clustering and YOLOv8 model. To solve the beam selection problem, we propose a novel lightweight hybrid architecture that combines a lightweight transformer with a CNN architecture through a weighted entropy mechanism. This hybrid architecture utilizes multimodal data sources to dynamically predict the optimal beams. A novel metric, Accuracy-Complexity Efficiency (ACE), has been proposed to quantify this.
Six testing scenarios have been developed to evaluate the robustness of the proposed model. Finally, the simulation result demonstrates that the proposed model outperforms several state-of-the-art baselines regarding beam prediction accuracy, received power, and ACE in the developed test scenarios.

Submitted 30 July, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.19771 [pdf, other]

Data Service Maximization in Space-Air-Ground Integrated 6G Networks

Authors: Nway Nway Ei, Kitae Kim, Yan Kyaw Tun, Zhu Han, Choong Seon Hong

Abstract: Integrating terrestrial and non-terrestrial networks has emerged as a promising paradigm to fulfill the constantly growing demand for connectivity, low transmission delay, and quality of service (QoS). This integration brings together the reliability of terrestrial networks and the broad coverage and service continuity of non-terrestrial networks such as low Earth orbit satellites (LEOSats). In this work, we study a data service maximization problem in a space-air-ground integrated network (SAGIN) where the ground base stations (GBSs) and LEOSats cooperatively serve the coexisting aerial users (AUs) and ground users (GUs). Then, by considering the spectrum scarcity, interference, and QoS requirements of the users, we jointly optimize the user association, AU's trajectory, and power allocation. To tackle the formulated mixed-integer non-convex problem, we decompose it into two subproblems: 1) user association problem and 2) trajectory and power allocation problem. We formulate the user association problem as a binary integer programming problem and solve it by using the Gurobi optimizer. Meanwhile, the trajectory and power allocation problem is solved by the deep deterministic policy gradient (DDPG) method to cope with the problem's non-convexity and dynamic network environments. Then, the two subproblems are alternately solved by the proposed block coordinate descent algorithm. Extensive simulations are conducted to evaluate the performance of the proposed framework against baselines from the existing literature.

Submitted 19 July, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: 5 pages, 4 figures
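
The alternating scheme named in the abstract above (block coordinate descent over two subproblems) can be illustrated with a toy problem. This is only a minimal sketch: the objective f(x, y) below, the closed-form block updates, and the function name are assumptions chosen for illustration. The paper solves its blocks with the Gurobi optimizer and DDPG, neither of which is used here.

```python
def block_coordinate_descent(x=0.0, y=0.0, iterations=50):
    """Alternately minimize the toy objective
    f(x, y) = (x - 1)**2 + (y + 2)**2 + x*y
    over one block at a time, holding the other block fixed."""
    for _ in range(iterations):
        # Block 1: set df/dx = 2(x - 1) + y = 0 with y fixed.
        x = 1.0 - y / 2.0
        # Block 2: set df/dy = 2(y + 2) + x = 0 with x fixed.
        y = -2.0 - x / 2.0
    return x, y


x_opt, y_opt = block_coordinate_descent()
```

Because this toy objective is strictly convex, the alternation converges to its unique minimizer at x = 8/3, y = -10/3. The mixed-integer non-convex problem described in the abstract carries no such guarantee.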

arXiv:2405.05787 [pdf, other]

Autonomous Robotic Ultrasound System for Liver Follow-up Diagnosis: Pilot Phantom Study

Authors: Tianpeng Zhang, Sekeun Kim, Jerome Charton, Haitong Ma, Kyungsang Kim, Na Li, Quanzheng Li

Abstract: The paper introduces a novel autonomous robot ultrasound (US) system targeting liver follow-up scans for outpatients in local communities. Given a computed tomography (CT) image with specific target regions of interest, the proposed system carries out the autonomous follow-up scan in three steps: (i) initial robot contact to surface, (ii) coordinate mapping between CT image and robot, and (iii) target US scan. Utilizing 3D US-CT registration and deep learning-based segmentation networks, we can achieve precise imaging of 3D hepatic veins, facilitating accurate coordinate mapping between CT and the robot. This enables the automatic localization of follow-up targets within the CT image, allowing the robot to navigate precisely to the target's surface. Evaluation of the ultrasound phantom confirms the quality of the US-CT registration and shows the robot reliably locates the targets in repeated trials. The proposed framework holds the potential to significantly reduce time and costs for healthcare providers, clinicians, and follow-up patients, thereby addressing the increasing healthcare burden associated with chronic disease in local communities.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.03905 [pdf, other]

doi 10.1109/TCASAI.2024.3507694

DeltaKWS: A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM

Authors: Qinyu Chen, Kwantae Kim, Chang Gao, Sheng Zhou, Taekwang Jang, Tobi Delbruck, Shih-Chii Liu

Abstract: This paper introduces DeltaKWS, to the best of our knowledge, the first $Δ$RNN-enabled fine-grained temporal sparsity-aware KWS IC for voice-controlled devices. The 65 nm prototype chip features a number of techniques to enhance performance, area, and power efficiencies, specifically: 1) a bio-inspired delta-gated recurrent neural network ($Δ$RNN) classifier leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses; 2) an IIR BPF-based FEx that leverages mixed-precision quantization, low-cost computing structure and channel selection; 3) a 24 kB 0.6 V near-$V_\text{TH}$ weight SRAM that achieves 6.6X lower read power than the foundry-provided SRAM. From chip measurement results, we show that the DeltaKWS achieves an 11/12-class GSCD accuracy of 90.5%/89.5% respectively and energy consumption of 36 nJ/decision in 65 nm CMOS process. At 87% temporal sparsity, computing latency and energy/inference are reduced by 2.4X/3.4X, respectively. The IIR BPF-based FEx, $Δ$RNN accelerator, and 24 kB near-$V_\text{TH}$ SRAM blocks occupy 0.084 mm$^{2}$, 0.319 mm$^{2}$, and 0.381 mm$^{2}$ respectively (0.78 mm$^{2}$ in total).

Submitted 26 November, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: This paper has been accepted for publication in the IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI)
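
The delta-gating idea in the DeltaKWS abstract above (skipping work when neighboring feature vectors are similar) can be sketched in software as follows. This is a hedged illustration only: the function names, the threshold value, and the sample frames are hypothetical, and the code does not describe the chip's actual datapath or quantization.

```python
def delta_gate(frames, threshold):
    """Return per-frame delta vectors, zeroing changes below the threshold.

    A zero delta means the downstream multiply-accumulate work and the
    matching weight fetch for that feature could be skipped.
    """
    reference = [0.0] * len(frames[0])  # last propagated value per feature
    deltas = []
    for frame in frames:
        out = []
        for i, value in enumerate(frame):
            change = value - reference[i]
            if abs(change) >= threshold:
                out.append(change)
                reference[i] = value  # update only when the change is propagated
            else:
                out.append(0.0)
        deltas.append(out)
    return deltas


def temporal_sparsity(deltas):
    """Fraction of delta entries that are exactly zero."""
    total = sum(len(d) for d in deltas)
    zeros = sum(1 for d in deltas for v in d if v == 0.0)
    return zeros / total


# Hypothetical two-feature input over three frames.
frames = [[1.0, 0.0], [1.05, 0.0], [1.5, 0.3]]
deltas = delta_gate(frames, threshold=0.1)
sparsity = temporal_sparsity(deltas)
```

Keeping the reference at the last propagated value, rather than the last observed value, prevents small sub-threshold changes from accumulating unnoticed across frames.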

arXiv:2404.07021 [pdf, other]

A 4x32Gb/s 1.8pJ/bit Collaborative Baud-Rate CDR with Background Eye-Climbing Algorithm and Low-Power Global Clock Distribution

Authors: Jihee Kim, Jia Park, Jiwon Shin, Hanseok Kim, Kahyun Kim, Haengbeom Shin, Ha-Jung Park, Woo-Seok Choi

Abstract: This paper presents design techniques for an energy-efficient multi-lane receiver (RX) with baud-rate clock and data recovery (CDR), which is essential for high-throughput low-latency communication in high-performance computing systems. The proposed low-power global clock distribution not only significantly reduces power consumption across multi-lane RXs but is capable of compensating for the frequency offset without any phase interpolators. To this end, a fractional divider controlled by CDR is placed close to the global phase locked loop. Moreover, in order to address the sub-optimal lock point of conventional baud-rate phase detectors, the proposed CDR employs a background eye-climbing algorithm, which optimizes the sampling phase and maximizes the vertical eye margin (VEM). Fabricated in a 28nm CMOS process, the proposed 4x32Gb/s RX shows a low integrated fractional spur of -40.4dBc at a 2500ppm frequency offset. Furthermore, it improves bit-error-rate performance by increasing the VEM by 17%. The entire RX achieves the energy efficiency of 1.8pJ/bit with the aggregate data rate of 128Gb/s.

Submitted 22 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.04096 [pdf, other]

Machine Learning-Aided Cooperative Localization under Dense Urban Environment

Authors: Hoon Lee, Hong Ki Kim, Seung Hyun Oh, Sang Hyun Lee

Abstract: Future wireless network technology provides automobiles with the connectivity feature to consolidate the concept of vehicular networks that collaborate on conducting cooperative driving tasks. The full potential of connected vehicles, which promises road safety and quality driving experience, can be leveraged if machine learning models guarantee robustness in performing core functions including localization and controls. Location awareness, in particular, lends itself to the deployment of location-specific services and the improvement of the operation performance. The localization entails direct communication to the network infrastructure, and the resulting centralized positioning solutions readily become intractable as the network scales up. As an alternative to the centralized solutions, this article addresses the decentralized principle of vehicular localization reinforced by machine learning techniques in dense urban environments with frequent inaccessibility to reliable measurement. As such, the collaboration of multiple vehicles enhances the positioning performance of machine learning approaches. A virtual testbed is developed to validate this machine learning model for real-map vehicular networks. Numerical results demonstrate universal feasibility of cooperative localization, in particular, for dense urban area configurations.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.01517 [pdf, other]

Addressing Heterogeneity in Federated Load Forecasting with Personalization Layers

Authors: Shourya Bose, Yu Zhang, Kibaek Kim

Abstract: The advent of smart meters has enabled pervasive collection of energy consumption data for training short-term load forecasting models. In response to privacy concerns, federated learning (FL) has been proposed as a privacy-preserving approach for training, but the quality of trained models degrades as client data becomes heterogeneous. In this paper we propose the use of personalization layers for load forecasting in a general framework called PL-FL. We show that PL-FL outperforms FL and purely local training, while requiring lower communication bandwidth than FL. This is done through extensive simulations on three different datasets from the NREL ComStock repository.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.01464 [pdf, other]

Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images

Authors: JungEun Kim, Hangyul Yoon, Geondo Park, Kyungsu Kim, Eunho Yang

Abstract: 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances, not only is data acquisition challenging, but increasing the frame rate for each dataset also proves difficult. To address this challenge, this paper proposes a simple yet effective Unsupervised Volumetric Interpolation framework, UVI-Net. This framework facilitates temporal interpolation without the need for any intermediate frames, distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to unsupervised and supervised baselines. Remarkably, our approach achieves this superior performance even when trained with a dataset as small as one, highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions UVI-Net as a compelling alternative for 4D medical imaging, particularly in settings where data availability is limited. The source code is available at https://github.com/jungeun122333/UVI-Net.

Submitted 1 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2402.17790 [pdf, other]

EEG classifier cross-task transfer to avoid training sessions in robot-assisted rehabilitation

Authors: Niklas Kueper, Su Kyoung Kim, Elsa Andrea Kirchner

Abstract: Background: For an individualized support of patients during rehabilitation, learning of individual machine learning models from the human electroencephalogram (EEG) is required. Our approach allows labeled training data to be recorded without the need for a specific training session. For this, the planned exoskeleton-assisted rehabilitation enables bilateral mirror therapy, in which movement intentions can be inferred from the activity of the unaffected arm. During this therapy, labeled EEG data can be collected to enable movement predictions of only the affected arm of a patient. Methods: A study was conducted with 8 healthy subjects and the performance of the classifier transfer approach was evaluated. Each subject performed 3 runs of 40 self-intended unilateral and bilateral reaching movements toward a target while EEG data was recorded from 64 channels. A support vector machine (SVM) classifier was trained under both movement conditions to make predictions for the same type of movement. Furthermore, the classifier was evaluated to predict unilateral movements by only being trained on the data of the bilateral movement condition. Results: The results show that the performance of the classifier trained on selected EEG channels evoked by bilateral movement intentions is not significantly reduced compared to a classifier trained directly on EEG data including unilateral movement intentions. Moreover, the results show that our approach also works with only 8 or even 4 channels. Conclusion: It was shown that the proposed classifier transfer approach enables motion prediction without explicit collection of training data. Since the approach can be applied even with a small number of EEG channels, this speaks for the feasibility of the approach in real therapy sessions with patients and motivates further investigations with stroke patients.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 11 pages, 6 figures, 1 table

MSC Class: 68

arXiv:2401.15313 [pdf, other]

Multi-Robot Relative Pose Estimation in SE(2) with Observability Analysis: A Comparison of Extended Kalman Filtering and Robust Pose Graph Optimization

Authors: Kihoon Shin, Hyunjae Sim, Seungwon Nam, Yonghee Kim, Jae Hu, Kwang-Ki K. Kim

Abstract: In this study, we address multi-robot localization issues, with a specific focus on cooperative localization and observability analysis of relative pose estimation. Cooperative localization involves enhancing each robot's information through a communication network and message passing. If odometry data from a target robot can be transmitted to the ego robot, observability of their relative pose estimation can be achieved through range-only or bearing-only measurements, provided both robots have non-zero linear velocities. In cases where odometry data from a target robot are not directly transmitted but estimated by the ego robot, both range and bearing measurements are necessary to ensure observability of relative pose estimation. For ROS/Gazebo simulations, we explore four sensing and communication structures. We compare extended Kalman filtering (EKF) and pose graph optimization (PGO) estimation using different robust loss functions (filtering and smoothing with varying batch sizes of sliding windows) in terms of estimation accuracy. In hardware experiments, two Turtlebot3 robots equipped with UWB modules are used for real-world inter-robot relative pose estimation, applying both EKF and PGO and comparing their performance.

Submitted 4 February, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

Comments: 20 pages, 21 figures

MSC Class: 93C85; 93E11; 93E24; 90C26; 93E10; 62M20;

arXiv:2401.08962 [pdf, other]

DOO-RE: A dataset of ambient sensors in a meeting room for activity recognition

Authors: Hyunju Kim, Geon Kim, Taehoon Lee, Kisoo Kim, Dongman Lee

Abstract: With the advancement of IoT technology, recognizing user activities with machine learning methods is a promising way to provide various smart services to users. High-quality data with privacy protection is essential for deploying such services in the real world. Data streams from surrounding ambient sensors are well suited to the requirement. Existing ambient sensor datasets only support constrained private spaces and those for public spaces have yet to be explored despite growing interest in research on them. To meet this need, we build a dataset collected from a meeting room equipped with ambient sensors. The dataset, DOO-RE, includes data streams from various ambient sensor types such as Sound and Projector. Each sensor data stream is segmented into activity units and multiple annotators provide activity labels through a cross-validation annotation process to improve annotation quality. We finally obtain 9 types of activities. To the best of our knowledge, DOO-RE is the first dataset to support the recognition of both single and group activities in a real meeting room with reliable annotations.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.08835 [pdf, other]

Improving ASR Contextual Biasing with Guided Attention

Authors: Jiyang Tang, Kwangyoun Kim, Suwon Shon, Felix Wu, Prashant Sridhar, Shinji Watanabe

Abstract: In this paper, we propose a Guided Attention (GA) auxiliary training loss, which improves the effectiveness and robustness of automatic speech recognition (ASR) contextual biasing without introducing additional parameters. A common challenge in previous literature is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we employ a GA loss as an additional training objective besides the Transducer loss. The proposed GA loss aims to teach the cross attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement. Through extensive experiments based on Conformer Transducer with Contextual Adapter, we demonstrate that the proposed method not only leads to a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER of rare vocabularies by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and up to 49.3% compared to a vanilla Transducer.

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2312.14939 [pdf, other]

Large-scale Graph Representation Learning of Dynamic Brain Connectome with Transformers

Authors: Byung-Hoon Kim, Jungwon Choi, EungGu Yun, Kyungsang Kim, Xiang Li, Juho Lee

Abstract: Graph Transformers have recently been successful in various graph representation learning tasks, providing a number of advantages over message-passing Graph Neural Networks. Utilizing Graph Transformers for learning the representation of the brain functional connectivity network is also gaining interest. However, studies to date have overlooked the temporal dynamics of functional connectivity, which fluctuates over time. Here, we propose a method for learning the representation of dynamic functional connectivity with Graph Transformers. Specifically, we define the connectome embedding, which holds the position, structure, and time information of the functional connectivity graph, and use Transformers to learn its representation across time. We perform experiments with over 50,000 resting-state fMRI samples obtained from three datasets, which is by far the largest number of fMRI samples used in studies to date. The experimental results show that our proposed method outperforms other competitive baselines in gender classification and age regression tasks based on the functional connectivity extracted from the fMRI data.

Submitted 4 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023 Temporal Graph Learning Workshop

arXiv:2312.09895 [pdf, other]

Generative Context-aware Fine-tuning of Self-supervised Speech Models

Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, an LLM could generate a prediction of the next sentence or abstractive text like titles or topics. In this paper, we study the use of LLM-generated context information and propose an approach to distill the generated information during fine-tuning of self-supervised speech models, which we refer to as generative context-aware fine-tuning. This approach allows the fine-tuned model to make improved predictions without access to the true surrounding segments or to the LLM at inference time, while requiring only a very small additional context module. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis. The results show that generative context-aware fine-tuning outperforms a context injection fine-tuning approach that accesses the ground-truth previous text, and is competitive with a generative context injection fine-tuning approach that requires the LLM at inference time.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.01004 [pdf, other]

Learning-based Ecological Adaptive Cruise Control of Autonomous Electric Vehicles: A Comparison of ADP, DQN and DDPG Approaches

Authors: Sunwoo Kim, Kwang-Ki K. Kim

Abstract: This paper presents model-based and model-free learning methods for economic and ecological adaptive cruise control (Eco-ACC) of connected and autonomous electric vehicles. For model-based optimal control of Eco-ACC, we considered longitudinal vehicle dynamics and a quasi-steady-state powertrain model including the physical limits of a commercial electric vehicle. We used adaptive dynamic programming (ADP), in which the value function was trained using data obtained from IPG CarMaker simulations. For real-time implementation, forward multi-step look-ahead prediction and optimization were executed in a receding horizon scheme to maximize the energy efficiency of the electric machine while avoiding rear-end collisions and satisfying the powertrain, speed, and distance-gap constraints. For model-free optimal control of Eco-ACC, we applied two reinforcement learning methods, Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG), in which deep neural networks were trained in IPG CarMaker simulations. For performance demonstrations, the HWFET, US06, and WLTP Class 3b driving cycles were used to simulate the front vehicle, and the energy consumptions of the host vehicle and front vehicle were compared. In high-fidelity IPG CarMaker simulations, the proposed learning-based Eco-ACC methods demonstrated approximately 3-5% and 10-14% efficiency improvements in highway and city-highway driving scenarios, respectively, compared with the front vehicle. A video of the CarMaker simulation is available at https://youtu.be/DIXzJxMVig8.

Submitted 1 December, 2023; originally announced December 2023.

MSC Class: 93E20; 68T20; 49M37; 90-08

arXiv:2311.10224 [pdf, other]

CV-Attention UNet: Attention-based UNet for 3D Cerebrovascular Segmentation of Enhanced TOF-MRA Images

Authors: Syed Farhan Abbas, Nguyen Thanh Duc, Yoonguu Song, Kyungwon Kim, Ekta Srivastava, Boreom Lee

Abstract: Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke.

Submitted 19 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2310.07663 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747073

Deep Video Inpainting Guided by Audio-Visual Self-Supervision

Authors: Kyuyeon Kim, Junsik Jung, Woo Jae Kim, Sung-Eui Yoon

Abstract: Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train the audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guider that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through our proposed two novel losses: audio-visual attention loss and audio-visual pseudo-class consistency loss. These two losses further improve the performance of the video inpainting by encouraging the inpainting result to have a high correspondence to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially blinded.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted at ICASSP 2022

arXiv:2310.02467 [pdf]

Dual-Polarization Phase Retrieval Receiver in Silicon Photonics

Authors: Brian Stern, Hanzi Huang, Haoshuo Chen, Kwangwoong Kim, Mohamad Hossein Idjadi

Abstract: We demonstrate a silicon photonic dual-polarization phase retrieval receiver. The receiver recovers phase from intensity-only measurements without a local oscillator or transmitted carrier. We design silicon waveguides providing long delays and microring resonators with large dispersion to enable symbol-to-symbol interference and dispersive projection in the phase retrieval algorithm. We retrieve the full field of polarization-division multiplexed 30-GBd QPSK and 20-GBd 8QAM signals over 80 km of SSMF.

Submitted 3 October, 2023; originally announced October 2023.

Comments: 11 pages, 7 figures

arXiv:2309.13539 [pdf, other]

MediViSTA: Medical Video Segmentation via Temporal Fusion SAM Adaptation for Echocardiography

Authors: Sekeun Kim, Pengfei Jin, Cheng Chen, Kyungsang Kim, Zhiliang Lyu, Hui Ren, Sunghwan Kim, Zhengliang Liu, Aoxiao Zhong, Tianming Liu, Xiang Li, Quanzheng Li

Abstract: Despite achieving impressive results in general-purpose semantic segmentation with strong generalization on natural images, the Segment Anything Model (SAM) has shown less precision and stability in medical image segmentation. In particular, the original SAM architecture is designed for 2D natural images and therefore does not support three-dimensional information, which is particularly important for medical imaging modalities that are often volumetric or video data. In this paper, we introduce MediViSTA, a parameter-efficient fine-tuning method designed to adapt the vision foundation model for medical video, with a specific focus on echocardiographic segmentation. To achieve spatial adaptation, we propose a frequency feature fusion technique that injects spatial frequency information from a CNN branch. For temporal adaptation, we integrate temporal adapters within the transformer blocks of the image encoder. Using a fine-tuning strategy, only a small subset of pre-trained parameters is updated, allowing efficient adaptation to echocardiographic data. The effectiveness of our method has been comprehensively evaluated on three datasets, comprising two public datasets and one multi-center in-house dataset. Our method consistently outperforms various state-of-the-art approaches without using any prompts. Furthermore, our model exhibits strong generalization capabilities on unseen datasets, surpassing the second-best approach by 2.15% in Dice and 0.09 in temporal consistency. The results demonstrate the potential of MediViSTA to significantly advance echocardiographical video segmentation, offering improved accuracy and robustness in cardiac assessment applications.

Submitted 6 November, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

Showing 1–50 of 186 results for author: Kim, K